Synthetic Data
Synthetic data has considerable potential to preserve privacy in sensitive healthcare datasets across a range of purposes where we may want to release data from controlled environments. These include creating datasets for researcher training, testing workflows before access to the real data is granted, and training privacy-preserving AI models. All of these purposes share the same general governance consideration: judging privacy effectively enough to ensure the data is safe to release from that environment.

The level of privacy and utility we need changes with the purpose the synthetic data will serve. If we are generating synthetic data to train privacy-preserving AI models, the data must have high utility, otherwise the resulting model will be useless. Alternatively, if the synthetic data is for researchers to learn how to analyse data before accessing the real data, utility is less of a consideration, as meaningful analysis is not needed at that point.

The DPUK Data Portal team has developed a framework for generating synthetic data at different privacy levels, allowing the privacy/utility trade-off to be optimised for each requirement. We have implemented differentially private synthesiser models that generate high-utility data while preserving the privacy of individuals, with the amount of noise calibrated according to need. The framework also includes comprehensive, easily interpretable evaluation of privacy, utility and fidelity.
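To make the noise-calibration idea concrete, the sketch below shows a minimal differentially private synthesiser for a single categorical column: it adds Laplace noise (scaled by the privacy parameter epsilon) to the column's histogram and samples synthetic records from the noised distribution, alongside a simple fidelity metric. This is an illustrative toy, not the DPUK Data Portal implementation; the function names and the choice of total variation distance as the fidelity measure are assumptions for the example.

```python
import numpy as np

def dp_synthesise(values, epsilon, n_synthetic, rng=None):
    """Illustrative single-column DP synthesiser (not the DPUK framework).

    Builds a histogram of the real values, perturbs each count with
    Laplace noise, and samples synthetic values from the noisy
    distribution. A count histogram has sensitivity 1 (adding or
    removing one record changes one cell by 1), so Laplace noise with
    scale 1/epsilon gives epsilon-differential privacy.
    """
    rng = rng or np.random.default_rng()
    categories, counts = np.unique(np.asarray(values), return_counts=True)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))
    noisy = np.clip(noisy, 0, None)          # counts cannot be negative
    probs = noisy / noisy.sum()
    return rng.choice(categories, size=n_synthetic, p=probs)

def total_variation(real, synth):
    """Simple fidelity metric: total variation distance between the
    real and synthetic category distributions (0 = identical, 1 = disjoint)."""
    real, synth = np.asarray(real), np.asarray(synth)
    cats = np.union1d(real, synth)
    p = np.array([(real == c).mean() for c in cats])
    q = np.array([(synth == c).mean() for c in cats])
    return 0.5 * np.abs(p - q).sum()

# Smaller epsilon -> larger noise scale -> more privacy, lower fidelity;
# larger epsilon -> less noise -> higher fidelity, weaker privacy.
rng = np.random.default_rng(0)
real = rng.choice(["A", "B", "C"], size=1000, p=[0.5, 0.3, 0.2])
synth = dp_synthesise(real, epsilon=1.0, n_synthetic=1000, rng=rng)
print(total_variation(real, synth))
```

Sweeping epsilon and plotting the fidelity metric at each value is one way to pick an operating point on the privacy/utility trade-off for a given use case.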