

Privacy-Enhancing Technologies

Results from the First Meeting of the Synthetic Data Working Group

Lewis Hotchkiss, Research Officer


Dementias Platform UK (DPUK) and the AI Risk Evaluation Group hosted a workshop to bring together Trusted Research Environments (TREs), researchers, and public members to discuss the role of synthetic data and how it can be effectively evaluated for release from TREs.


The DPUK Data Portal is a TRE which enables access to pseudonymised sensitive datasets within a secure environment, protecting privacy while allowing important research to take place. As a TRE, we are interested in how synthetic data can be used for a number of different purposes, such as researcher training, while ensuring the continued protection of privacy in sensitive datasets.

 

The AI Risk Evaluation Group was established with funding from DARE UK to investigate the risks posed by AI models developed on TRE data and how they can be mitigated. From this group, we found that synthetic data was one of the better methods for preserving privacy in AI models, and the method researchers felt most comfortable using. The Synthetic Data Working Group was established as an extension of this group to help develop guidance and standards around the use of synthetic data in TREs.


What is Synthetic Data?


Synthetic data is artificially generated data, typically created by generative models trained on real-world datasets. Synthesiser models aim to learn the statistical properties, distributions and correlations of the real data, so that they can create realistic artificial samples which are statistically similar to the data they were trained on.


There are two common methods for generating synthetic data – Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs), both of which aim to create realistic artificial samples by learning the properties of the real data.

 

GANs are the most common approach and contain two neural networks which compete with each other – a generator which learns how to create realistic new samples, and a discriminator which learns to differentiate between real and fake samples. The aim of the GAN architecture is to generate samples realistic enough that the discriminator can no longer tell the difference between what's real and what's synthetic. This often leads to high-quality data which is very similar to the real data.
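
To make the generator/discriminator loop concrete, here is a minimal, purely illustrative sketch that trains a toy GAN on a one-dimensional Gaussian dataset. It assumes PyTorch is available and is not the architecture of any specific synthesiser discussed in the workshop.

```python
import torch
import torch.nn as nn

real_data = torch.randn(1000, 1) * 2.0 + 5.0  # toy stand-in for a sensitive "real" dataset

# Generator: maps random noise to a synthetic sample
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
# Discriminator: outputs the probability that a sample is real
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = real_data[torch.randint(0, 1000, (64,))]
    fake = G(torch.randn(64, 8))

    # Discriminator learns to separate real from synthetic samples
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator learns to fool the discriminator
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

synthetic = G(torch.randn(1000, 8)).detach()  # artificial samples resembling real_data
```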

GAN Architecture

VAE Architecture

The other common method, VAEs, also contains two components – an encoder which converts the input into a smaller, dense representation called the latent space, and a decoder which takes a sample from the latent space and reconstructs it back into the original data format. Essentially, the model tries to generate data that looks like the input data by sampling from the latent space.
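
A correspondingly minimal VAE sketch on the same kind of toy data (again assuming PyTorch, and again purely illustrative) looks like this:

```python
import torch
import torch.nn as nn

real_data = torch.randn(1000, 1) * 2.0 + 5.0  # toy stand-in for real data

# Encoder outputs the mean and log-variance of a 2-dimensional latent distribution
enc = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 4))
# Decoder reconstructs a sample from a point in the latent space
dec = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))

opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(2000):
    x = real_data[torch.randint(0, 1000, (64,))]
    mu, logvar = enc(x).chunk(2, dim=1)                   # latent distribution parameters
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample the latent space (reparameterisation)
    recon = dec(z)                                        # reconstruct back into the data format

    recon_loss = ((recon - x) ** 2).mean()
    kl_loss = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()  # keep the latent close to N(0, 1)
    loss = recon_loss + kl_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# New synthetic samples come from decoding random points drawn from the latent prior
synthetic = dec(torch.randn(1000, 2)).detach()
```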

Both of these synthesiser methods aim to create high quality generated data which is statistically similar to the real data. However, if we want higher privacy protections, there may be some additional steps we need to take to improve the privacy guarantees of the generated data.

The Privacy/Utility Trade-Off


When generating synthetic data, there is a privacy/utility trade-off: the more private the data becomes, the less utility it has.


It is common in data privacy practices that there is a trade-off between how private the data is and how useful it is. This is because the more the data is anonymised, the more it loses the ability to provide any meaningful information. The same holds true for synthetic data: the further we move away from what's represented in reality, the less likely a synthetic sample is to represent the real data.

Privacy/Utility Trade-Off

The figure below shows the space of real and synthetic data points, representing how similar they are to each other. When we're generating highly private synthetic data, we don't want inauthentic high-quality samples which are so similar to the real data points that there is no privacy protection. If we're generating high-utility data, we don't want lots of noisy samples which don't represent the real space and therefore lose any meaningful correlations and statistical properties. What we ideally want when generating any synthetic data are authentic high-quality samples which represent the real space, but are different enough that they protect privacy.

Real & Synthetic Data Similarity

Differential Privacy


Differential Privacy is a privacy guarantee which ensures that the presence or absence of a single individual's data does not affect the output of an analysis of that data.


This mathematical framework ensures that an algorithm's output is not noticeably affected by any one individual, and therefore you can't determine whether an individual was included in the original dataset or not. In a differentially private algorithm, the output of the function doesn't vary depending on whether an individual is present or not, no matter how unique that individual is. A key feature of this framework is that it holds true regardless of the knowledge or resources an adversary may have, and it is therefore future-proof against new methods or knowledge that become available.

Differential Privacy Diagram

The level of this guarantee, i.e. how much the algorithm's output distribution for a given dataset can change when an individual is present or not, is controlled by epsilon, the "privacy budget". This is measured as the maximum distance between a query on a dataset with participant x and the same query on a dataset without participant x.
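
For reference, the standard formal statement of this guarantee says that a randomised mechanism M is ε-differentially private if, for any two datasets D and D′ differing in one individual's record, and any set of possible outputs S:

```latex
\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

The smaller epsilon is, the closer these two probabilities must be.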


Selecting the value for epsilon depends on the dataset, but generally:

 

  • A smaller epsilon will provide more similar outputs, and therefore provide higher privacy

  • A larger epsilon will allow the outputs to differ more, and therefore provide less privacy


So a smaller epsilon will generally provide higher privacy but less accurate outputs.


There are several mechanisms which exist for achieving differential privacy, with the most common being to add carefully calibrated noise to mask the contribution of any possible individual in the data, while still preserving the overall accuracy of the analysis. There are two ways of doing this – global differential privacy, where noise is added to the output of the algorithm, and local differential privacy where noise is added to the inputs.
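
As an illustration of the global approach, the sketch below adds Laplace noise, calibrated to the query's sensitivity and epsilon, to the mean of a bounded numeric column. The column, bounds and epsilon value are invented for the example.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean of a bounded numeric column (Laplace mechanism)."""
    n = len(values)
    clipped = np.clip(values, lower, upper)    # bound each individual's possible contribution
    sensitivity = (upper - lower) / n          # max change one record can cause to the mean
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

ages = np.random.randint(40, 90, size=500)              # toy "real" column, values invented
print(dp_mean(ages, lower=40, upper=90, epsilon=1.0))   # smaller epsilon -> more noise, more privacy
```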


We can incorporate differential privacy into synthetic data generation models to ensure the samples which are generated are private. There are several DP-GAN methods for doing this, which essentially add noise to the gradients during the training process to prevent the models from learning too much about the training data. Additionally, other methods exist such as PATE-GAN, which applies the Private Aggregation of Teacher Ensembles (PATE) framework to the discriminator so that the feedback the generator receives is differentially private. Generally, the PATE-GAN method offers robust privacy guarantees with good levels of utility compared to other DP generative methods.
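
The gradient-noising idea behind DP-GAN training can be sketched in isolation as follows. This is a simplified NumPy illustration on a plain linear model rather than an actual DP-GAN, and the clipping and noise parameters are arbitrary; it only shows the mechanism of bounding and then noising each individual's contribution to the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))                                    # toy features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=256)

w = np.zeros(5)
clip_norm, noise_scale, lr = 1.0, 0.5, 0.1

for step in range(200):
    per_example_grads = 2 * (X @ w - y)[:, None] * X             # gradient of squared error per record
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)  # bound each individual's influence
    noisy_grad = clipped.mean(axis=0) + rng.normal(scale=noise_scale * clip_norm / len(X), size=5)
    w -= lr * noisy_grad                                          # update using the noised gradient
```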

Differential Privacy GAN Architecture Diagram

We can adjust the differential privacy parameters of the model, such as the privacy budget (epsilon), to control the level of privacy in the model. Additionally, we can adjust delta, which accounts for the small probability that the privacy guarantee fails; allowing this slightly looser guarantee can significantly improve the utility. Another parameter that can be adjusted is lambda, which controls the scale of the noise that is added to the model.
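
For completeness, the relaxed (ε, δ) form of the guarantee makes delta's role explicit: it is the small probability with which the pure epsilon bound is allowed to fail.

```latex
\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S] + \delta
```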


Use Cases of Synthetic Data

 

Synthetic data has lots of potential in sensitive healthcare datasets to help preserve privacy for a number of different purposes where we may want to release data from controlled environments. This may be to create datasets for researcher training, to test workflows before gaining access to the real data, or for training private AI models. However, all of these purposes have the same general governance considerations when it comes to effectively judging privacy to ensure it is safe enough to be released from that environment.


The level of privacy and utility we need will change according to the purpose that the synthetic data will be used for. If we're generating synthetic data for the purpose of training private AI models, we need to make sure there is high utility, otherwise that model will be useless. Alternatively, if we want to use synthetic data for researchers to learn how to analyse data before accessing the real data, then utility may not be as much of a consideration, as meaningful analysis isn't needed at that point. We can categorise these two cases of synthetic data as high fidelity (where the data is good quality and statistically useful) and low fidelity (where the data is structurally similar, but has lost statistical meaning).


The examples shown below are the use cases highlighted in the workshop. These use cases require different levels of fidelity, so it is important to consider context when generating synthetic data.

Fidelity Levels of Use Cases for Synthetic Data

For high fidelity purposes, it was agreed that differentially private methods with robust privacy guarantees offer relatively high privacy in synthetic data, while retaining a good level of utility for analysis or developing AI models. Therefore, any methods for generating synthetic data for these purposes should be differentially private and thoroughly evaluated (as discussed in the evaluation section later).


In the case of low fidelity purposes, it was discussed that the use of artificial data created from metadata, rather than synthetic data trained on real data, may be more beneficial. For these purposes, we only care about the basic structure of the data such as the data types of columns, and the kind of values which exist within them. The correlations and statistical properties of the data don’t matter as much.

 

So instead of training a generative model to create low fidelity synthetic samples, we can use publicly available metadata about the dataset to sample the range of values for each variable. The table below shows an example of the standardised metadata that DPUK holds for each dataset in the Data Portal.

 

For each variable, we have the:

  • range of values if it is a numerical variable, or the categories if it is a categorical variable

  • type of data to show whether it is a categorical, float, integer or datetime type

  • completeness to show how many records have a value for that variable (i.e. the percentage of non-empty values)

Example Metadata

We also have the size of the dataset, i.e. how many rows and columns exist. Because we hold all of this information for each dataset, we can easily use it to create artificial samples which don't use the real data at all, only publicly available information about the dataset. This can be done by drawing random values within the value range which adhere to the particular data type, as sketched below. This means that we have artificial data which represents the basic structure and data types, which can be used for low fidelity purposes such as testing queries in federation or developing algorithms.
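
A hedged sketch of this metadata-driven approach is shown below. The metadata fields and variable names are invented for illustration rather than taken from DPUK's actual schema, but the idea is the same: draw each variable from its published range or categories, respecting its data type and completeness.

```python
import numpy as np
import pandas as pd

# Illustrative metadata only: variable names, ranges and completeness are made up
metadata = {
    "age":        {"type": "integer",     "range": (40, 90),                      "completeness": 0.97},
    "diagnosis":  {"type": "categorical", "categories": ["AD", "MCI", "Control"], "completeness": 0.90},
    "visit_date": {"type": "datetime",    "range": ("2015-01-01", "2020-12-31"),  "completeness": 1.00},
}

def artificial_rows(metadata, n_rows, seed=0):
    """Draw random values per variable using only published metadata, with no real data involved."""
    rng = np.random.default_rng(seed)
    columns = {}
    for name, meta in metadata.items():
        if meta["type"] == "integer":
            values = pd.Series(rng.integers(meta["range"][0], meta["range"][1] + 1, size=n_rows), dtype="float")
        elif meta["type"] == "categorical":
            values = pd.Series(rng.choice(meta["categories"], size=n_rows))
        elif meta["type"] == "datetime":
            start, end = pd.Timestamp(meta["range"][0]), pd.Timestamp(meta["range"][1])
            seconds = rng.random(n_rows) * (end - start).total_seconds()
            values = pd.Series(start + pd.to_timedelta(seconds, unit="s"))
        # Blank out a proportion of records to match the published completeness
        values[rng.random(n_rows) > meta["completeness"]] = None
        columns[name] = values
    return pd.DataFrame(columns)

print(artificial_rows(metadata, n_rows=5))
```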

 

Therefore, it was determined that for low fidelity purposes this would be the preferred method of generating artificial samples, as it has fewer restrictions and hence makes access easier, since there are no governance considerations around the real data.

 

However, there may be cases where we require low fidelity versions of synthetic data which preserve some of the correlations and statistical properties of the data, for purposes such as researcher training. In this case, we should generate synthetic data with strong privacy guarantees.


Evaluating Privacy in Synthetic Data

 

Privacy is the biggest consideration when evaluating synthetic data for release from secure environments. GDPR anonymisation practices require the evaluation of the risks of singling out, inference and linkability when determining whether data is anonymised.

 

These three measures were determined to be the most important metrics when evaluating privacy in synthetic data, as they allow us to determine whether the synthetic data complies with GDPR anonymisation. However, it was noted that current anonymisation regulations may not be enough for determining the privacy risk of a dataset, so we can't solely trust these metrics on their own when evaluating synthetic data.


There are a couple of additional privacy metrics which are useful when evaluating synthetic data. One of these is to look for exact matches, where we examine how novel the synthetic samples are (i.e. whether there are any records where the features are the same in both the synthetic and real data). We can add a degree of tolerance to this as well, so that we can also flag values which are very similar, like 30.23 and 30.28.
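
As a rough illustration, the snippet below checks a toy pair of real and synthetic tables for exact row matches and for near matches within a small tolerance; the column names, values and tolerance are made up for the example.

```python
import pandas as pd

real = pd.DataFrame({"age": [71, 65, 58], "bmi": [30.23, 27.10, 24.80]})
synthetic = pd.DataFrame({"age": [71, 80, 58], "bmi": [30.28, 22.00, 31.00]})

# Exact matches: synthetic rows identical to a real row across every column
exact = synthetic.merge(real, how="inner")
print(f"exact matches: {len(exact)}")

# Near matches: every value within a small tolerance (e.g. 30.23 vs 30.28)
tolerance = 0.1
near = [
    (i, j)
    for i, s in synthetic.iterrows()
    for j, r in real.iterrows()
    if (abs(s - r) <= tolerance).all()
]
print(f"near matches: {len(near)}")
```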

 

Another metric we can use is to measure the similarity of data points by looking at nearest neighbours. Before training the synthesiser model, we can split the real data into a training set and a holdout set, so we have a separate dataset to compare against which hasn't been used in the generation process.
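
One simple way to operationalise this check is sketched below, assuming purely numeric data and using random toy arrays in place of real records: if synthetic points sit systematically closer to the training records than to the holdout records, the generator may be memorising individuals.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))      # real records used to fit the synthesiser
holdout = rng.normal(size=(500, 4))    # real records never seen by the synthesiser
synthetic = rng.normal(size=(500, 4))  # stand-in for generated records

def nearest_distances(a, b):
    """Distance from each row of a to its nearest neighbour in b."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

d_train = nearest_distances(synthetic, train)
d_holdout = nearest_distances(synthetic, holdout)

# Fraction of synthetic records closer to a training record than to any holdout record;
# values well above ~0.5 suggest the synthetic data leaks information about training individuals.
print((d_train < d_holdout).mean())
```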


All of these metrics allow us to effectively evaluate privacy in synthetic data and provide quantifiable risk scores for determining whether synthetic data is safe enough for release. However, a key challenge highlighted in the workshop was the lack of common thresholds for deciding whether the scores are acceptable, so there's still a degree of subjectivity when determining risk and interpreting the results.


Ensuring Transparency

 

To assess synthetic data, the generation and evaluation process needs to be as transparent as possible so that data providers can make informed decisions.

 

During the workshop, it was discussed that we need ways to make the process of creating synthetic data transparent, and to explain the results in a simple and meaningful way for people who may not have the technical knowledge to interpret them. It was suggested that there is a need for a common framework for documenting the generation process of synthetic data and explaining the evaluation metrics in a standardised way. This would ensure that data providers are fully aware of how the synthetic data has been created and whether the privacy risks have been adequately reduced.

 

This standardised documentation for the generation process should contain key details including the following (an illustrative example is sketched after the list):

 

  • The purpose/context for generating the synthetic data

  • Original dataset used in the training process

  • Data pre-processing steps

  • Selected model

  • Model parameters
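
As a purely hypothetical illustration, and not an agreed standard, such a record could be captured as a simple structured object alongside the release; all field names and values below are invented.

```python
# Hypothetical documentation record for a synthetic data release; contents are illustrative only
generation_report = {
    "purpose": "Low fidelity training dataset for analysis workshops",
    "source_dataset": "Cohort X baseline assessment (v2.1)",
    "preprocessing": ["dropped free-text columns", "winsorised outliers", "one-hot encoded categoricals"],
    "model": "PATE-GAN",
    "model_parameters": {"epsilon": 1.0, "delta": 1e-5, "lambda": 0.001, "epochs": 200},
    "evaluation": {"exact_matches": 0, "nearest_neighbour_ratio": 0.51, "singling_out_risk": "low"},
}
```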


For the evaluation of the data, the key metrics mentioned should be adequately demonstrated and explained in a way which is understandable to a lay person. As common threshold values don't exist, these values should be presented in context, indicating what would typically be considered high and low values and where the calculated metric sits on that spectrum.

 

All of this should additionally be accompanied by the code developed.


Future Work

 

This work provides the foundational background for progressing guidance and recommendations on synthetic data for privacy.

 

Future work of this group will:

  • Investigate whether we can determine common thresholds for evaluation metrics

  • Develop a standardised framework for documenting the generation and evaluation process

  • Decide for which purposes synthetic data may not be justified, and in which contexts we actually require high fidelity data

  • Create learning materials to provide an understanding of what synthetic data is and how it is evaluated

  • Work on balancing the privacy/utility trade-off

  • Engage with data controllers to understand what level of evidence is needed for them to feel comfortable with the release of synthetic versions of their data
