Guidance for Developing Private AI Models on Sensitive Healthcare Data
This guidance explains the risks and concerns associated with developing AI models on private healthcare data, and the privacy-preserving techniques that can be used to mitigate them.
Risks to Privacy in AI Models
Data memorisation, closely linked to overfitting, is a common challenge when training AI models. It occurs when the model latches onto the specific details of the training data rather than learning the underlying patterns, and it is most likely when there are too many features and/or too few participants in the training data. Inversion and inference attacks can exploit this vulnerability to reveal information about specific participants.
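As a rough illustration, the sketch below (using scikit-learn and a made-up dataset, not real healthcare data) trains a flexible model on far more features than participants: it fits its own training records almost perfectly while doing no better than chance on unseen records, and it is this gap that inference attacks exploit.

    # Minimal sketch of how memorisation shows up in practice.
    # Entirely synthetic data: 60 "participants", 500 features, random labels.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 500))           # far more features than participants
    y = rng.integers(0, 2, size=60)          # random labels: no real signal to learn

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0)

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    print("train accuracy:", model.score(X_train, y_train))  # close to 1.0: memorised
    print("test accuracy:", model.score(X_test, y_test))     # close to 0.5: chance level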
Differential Privacy
Differential privacy works by adding noise either to the data or to the responses of the model, so that an adversary cannot determine with confidence whether information about any individual is present in the data. The amount of noise is governed by epsilon, also known as the privacy budget: smaller values of epsilon give a stronger privacy guarantee but require more noise. Because added noise can reduce the accuracy of an AI model, differential privacy involves a trade-off between privacy and utility, and researchers have to consider carefully what level of noise is suitable.
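As a minimal sketch of the idea, the example below applies the Laplace mechanism to a simple count query with sensitivity 1 (the cohort and epsilon values are illustrative only): a smaller epsilon means more noise and stronger privacy, at the cost of utility.

    # Minimal sketch of the Laplace mechanism for a count query (sensitivity 1).
    import numpy as np

    def noisy_count(flags, epsilon, sensitivity=1.0):
        """Return a differentially private count of True entries."""
        true_count = sum(flags)
        noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
        return true_count + noise

    has_condition = [True, False, True, True, False]   # hypothetical cohort flags
    print(noisy_count(has_condition, epsilon=0.5))     # stronger privacy, noisier answer
    print(noisy_count(has_condition, epsilon=5.0))     # weaker privacy, closer to the truth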
Synthetic Data
Synthetic data is artificially generated data that replicates the statistical properties and patterns of the real data. It is usually produced by training a generative model on real data so that it learns the characteristics and structure of that data and can create new samples from it. Analysis of synthetic data should produce results similar to analysis of the original data, but this depends on the fidelity of the synthetic data, and, as with differential privacy, there is a trade-off between privacy and utility. The more closely the synthetic data mimics the real data, the more likely it is to reveal individuals' data.
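The sketch below illustrates the idea at deliberately low fidelity: a multivariate Gaussian is fitted to made-up numeric data and then sampled to create new, artificial records. Practical pipelines typically use richer generative models, but the trade-off is the same: the higher the fidelity, the higher the disclosure risk.

    # Minimal sketch of low-fidelity synthetic data generation on made-up data.
    import numpy as np

    rng = np.random.default_rng(0)
    # Pretend "real" data: 500 records of two numeric features (e.g. blood pressure readings)
    real = rng.normal(loc=[120.0, 80.0], scale=[15.0, 10.0], size=(500, 2))

    # Fit a simple generative model: a multivariate Gaussian over the real data
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)

    # Sample new, artificial records that share the real data's statistical properties
    synthetic = rng.multivariate_normal(mean, cov, size=500)
    print("real means:", mean)
    print("synthetic means:", synthetic.mean(axis=0))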
Homomorphic Encryption
Homomorphic encryption (HE) provides strong protection while retaining utility, as it enables computations to be performed on encrypted data without the need to decrypt it. Although this may sound like the ideal solution, the method currently has limited support for AI workloads and can be challenging to implement. HE is more typically used at the inference stage of an AI model, to protect the query data rather than the training data.
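As a minimal sketch of the underlying idea (not a production scheme), the example below uses a toy additively homomorphic construction in the style of Paillier, with deliberately tiny keys, to score a hypothetical linear model on an encrypted query without ever decrypting the query values.

    # Toy additively homomorphic encryption (simplified Paillier-style scheme).
    # Illustrative only: tiny hard-coded primes, no security hardening.
    from math import gcd
    import random

    def lcm(a, b):
        return a * b // gcd(a, b)

    # Key generation with small demo primes (real deployments use much larger keys)
    p, q = 293, 433
    n = p * q
    n2 = n * n
    g = n + 1
    lam = lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)            # modular inverse of lambda mod n

    def encrypt(m):
        r = random.randrange(1, n)
        while gcd(r, n) != 1:
            r = random.randrange(1, n)
        return (pow(g, m, n2) * pow(r, n, n2)) % n2

    def decrypt(c):
        return ((pow(c, lam, n2) - 1) // n) * mu % n

    # Inference on encrypted query data: the server computes a linear score
    # without ever seeing the raw feature values.
    weights = [3, 1, 4]                       # hypothetical plaintext model weights
    query = [2, 7, 5]                         # patient's private query features
    enc_query = [encrypt(x) for x in query]

    enc_score = 1                             # encryption of 0
    for w, c in zip(weights, enc_query):
        enc_score = (enc_score * pow(c, w, n2)) % n2   # adds w * x homomorphically

    assert decrypt(enc_score) == sum(w * x for w, x in zip(weights, query))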
Releasing AI Models Safely
If an AI model is ready to deploy, one option is to host the model with restricted access and queries. The model would stay within the Trusted Research Environment (TRE) and could only be queried through a web interface or an API. Access and query controls mean that the model can only be used by approved users, and restrictions on queries limit the scope for attacks.
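A minimal sketch of what such an endpoint might look like is given below; the Flask app, API key list, query limit and stand-in model are all illustrative placeholders rather than a recommended design.

    # Minimal sketch of hosting a model behind access and query controls inside a TRE.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    APPROVED_KEYS = {"researcher-123"}   # hypothetical approved users
    QUERY_LIMIT = 100                    # hypothetical per-user query budget
    query_counts = {}

    def model_predict(features):
        # Stand-in for the real hosted model kept inside the TRE.
        return sum(features)

    @app.route("/predict", methods=["POST"])
    def predict():
        key = request.headers.get("X-API-Key")
        if key not in APPROVED_KEYS:                      # access control
            return jsonify(error="access denied"), 403

        query_counts[key] = query_counts.get(key, 0) + 1
        if query_counts[key] > QUERY_LIMIT:               # query control
            return jsonify(error="query limit reached"), 429

        features = request.get_json().get("features", [])
        score = model_predict(features)
        return jsonify(prediction=round(score, 2))        # return only a coarse output

    if __name__ == "__main__":
        app.run()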
If an adversary did manage to query the model, they would only be able to run black-box attacks, since they would not have direct access to the model itself. This makes attacks more difficult to perform, as the adversary only has the model's outputs to work with, limiting their capabilities.