Guide
Appendix A. Brief introduction per PET
For traditional analysis, we assume that all data is locally available (centralized data), which allows us to focus on the system efficiency, model performance, and applicability of the analysis of interest. Additionally, we assume that the end user of the output has no malicious intentions; for example, in the context of machine learning we assume that the end user has no incentive to reverse-engineer the training data. Although this setting suffices in many cases, it also severely limits collaboration on sensitive data. PETs enable such collaborations.
First, consider the assumption of centralized data and instead assume that the data is distributed over different organizations or even different devices (e.g., smartphones). In this case, it may be time-consuming or even infeasible to send all the distributed data to a central location due to constrained resources, such as high latency or limited bandwidth. These constraints motivate the development of techniques that allow for analysis on distributed data. Below, we briefly describe federated learning, a type of learning on distributed data with lower latency, lower power consumption, and enhanced privacy for end users.
Another reason why not all data may be available in a central place is a reluctance to share data, presumably motivated by its private or confidential nature. This is easily perceived as a reason not to collaborate; however, in many machine learning scenarios only the raw data is sensitive, not the trained model. Several techniques therefore aim to compute (an approximation of) the trained model while respecting the privacy of the underlying data. Differential privacy and secure multi-party computation are designed for this purpose and provide mathematical guarantees of the type of privacy that they provide. Federated learning was originally introduced to address issues that arise in the context of distributed analysis involving many devices. One of its improvements over centralized analyses is improved privacy, since the data of any device is no longer shared in raw form. However, unlike the mathematically guaranteed privacy of secure multi-party computation protocols, the privacy benefits of federated learning are often hard to quantify. Federated learning is therefore regularly combined with secure multi-party computation, differential privacy, or both to boost privacy.
We now describe the techniques in a bit more detail. Note that all techniques are relatively novel and are being improved upon rapidly. The attacks that we refer to are described in more detail in Appendix C. Organizations and individuals that act independently of each other are all referred to as parties.
Federated Learning
Federated learning specifically targets the issue of learning on distributed data. The core concept is that every party obtains a partial model by training on the data that is locally available to it. The partial models are then aggregated (by a dedicated aggregator party or in a peer-to-peer architecture) into a global model that captures the information of all data. In doing so, instead of sharing all raw data, only the local models are shared (Konečný et al. 2016; McMahan et al. 2017).
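To make the train-locally-then-aggregate idea concrete, the following minimal sketch implements one federated averaging round for a simple linear model using numpy. The setup, the function names (`local_update`, `federated_round`), and the toy data are illustrative assumptions and are not taken from any specific federated learning framework.

```python
# Minimal sketch of federated averaging for a linear model (illustrative only).
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Train locally for a few gradient descent steps and return the new weights."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        w -= lr * grad
    return w

def federated_round(global_weights, local_datasets):
    """One round: each party trains locally, then the local models are averaged."""
    local_models = [local_update(global_weights, X, y) for X, y in local_datasets]
    sizes = np.array([len(y) for _, y in local_datasets], dtype=float)
    # Weighted average of the local models; only model parameters are exchanged,
    # the raw data never leaves the parties.
    return np.average(local_models, axis=0, weights=sizes)

# Toy data held by three parties (never pooled centrally).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
parties = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    parties.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, parties)
print(w)   # close to [2, -1], learned without any party sharing its raw data
```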
Communicating these local models is less demanding than communicating the raw data, which was one of the main objectives. Complementary to (1) limited communication resources, federated learning problems are characterized by (2) systems heterogeneity, (3) statistical heterogeneity, and (4) privacy concerns (T. Li et al. 2019). Note that these characteristics also set federated learning apart from distributed learning on multiple servers in a data center.
Regarding the privacy concerns, we note that federated learning achieves some level of privacy because the local models only represent aggregated information about the raw data. In practice, however, federated learning is known to be susceptible to several types of attacks, including backdooring (Bagdasaryan et al. 2019) and reconstruction attacks (Z. Li et al. 2020; Zhu, Liu, and Han 2019), and it is generally difficult to quantify the privacy that is obtained.
While the field of federated learning is evolving, its scope is also becoming broader. In the past years, the term federated learning has often been used to describe any type of learning where the data is partitioned among parties. In this document and the accompanying decision tree, we mainly consider the original federated architecture proposed by Google, and hence our claims and the paths chosen in the tree are based on that. A next step of our work should be to broaden the scope and account for the different federated solutions that have been proposed and may be of interest when considering the choice of a PET.
Secure Multi-Party Computation (MPC)
MPC also considers multiple parties that collaboratively wish to evaluate a function, for example to train a machine learning model or to compute the intersection of records in two databases. The promise of MPC is that, from a functional perspective, it is indistinguishable from an ideal trusted third party who receives the data from all parties, performs the computation, and returns the result (Lindell 2020; Yao 1982). A direct consequence of this functionality is that a party’s data is in no form revealed to any other party, apart from what can be deduced from deliberately shared information (e.g., the result of the computation).
MPC has been investigated in the academic world for several decades. In contrast to federated learning, MPC is a set of cryptographic techniques that focus on mathematically verifiable security guarantees, usually at the cost of (considerably) higher system requirements. The first MPC solutions were too involved to be applied to real-world challenges, but advancements in cryptography and in computing and networking capabilities have reached a point where several MPC solutions have been applied successfully. Currently, MPC solutions are tailored to a specific problem. This means that it is relatively time-consuming to implement an MPC solution, but it also implies that an MPC solution can be used only for the purpose that it was designed for. From a legal or compliance perspective, this can be quite advantageous.
There are MPC protocols for various security models. Some of them only prevent honest parties from deducing information that they should not obtain; others additionally provide security against colluding (or hacked) parties and even parties that maliciously deviate from the protocol. The different security models are briefly described in Appendix B. Do note that the guarantees given by an MPC protocol, e.g. that no data is revealed to any other party even if that party acts maliciously, always relate to the computation phase and not to the result: any information that can be deduced from the output is not protected. For example, if the intersection of two databases is securely computed and the result is revealed, then the result surely reveals membership of all records in the intersection. However, in an MPC protocol different choices can be made regarding what happens with the output, for instance that only one party is allowed to see it, or that it is only revealed if it satisfies certain criteria.
There exist different types of MPC protocols. Two important categories are Homomorphic Encryption and Secret Sharing. Within these protocols, there are different roles that each party can play. These can best be described as:
- Input party: a party that provides input data for the computation (encrypted or not)
- Compute party: a party that performs the computation (in an encrypted domain or not)
- Output party: a party that receives the result of the joint computation
Each role (or combination of roles) needs to be separately considered when researching the best solution and its complications.
Homomorphic Encryption
A homomorphic encryption protocol uses a public and a private key. The public key is known to everyone and can be used by the parties to encrypt data. The encryption protects the underlying data and can only be lifted using the private key. All parties can therefore encrypt their own data and share the encrypted data with one another. The homomorphic property ensures that analyses can also be performed on the encrypted data. Only when the analyses have been performed is the encryption lifted using the private key. During all of the intermediate steps, the data remains encrypted and no secrets are revealed. Please note that the party that holds the private key can decrypt all of the encrypted data. This key is therefore very powerful and a potential privacy risk. It is crucial that this key be handled correctly. A common approach is to divide the private key into pieces so that no single party has access to the entire key.
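As a concrete illustration of the homomorphic property, the sketch below implements a toy version of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The tiny hard-coded primes are purely for illustration; any real application would rely on a vetted cryptographic library and much larger keys.

```python
# Toy Paillier-style additively homomorphic encryption (illustrative only).
# The tiny hard-coded primes make this completely insecure; a real deployment
# would use a vetted library and keys of thousands of bits.
import math
import random

p, q = 251, 241                  # insecure toy primes
n = p * q
n_sq = n * n
g = n + 1                        # public key: (n, g)
lam = math.lcm(p - 1, q - 1)     # private key part
mu = pow(lam, -1, n)             # private key part

def encrypt(m):
    """Encrypt m (0 <= m < n) with the public key (n, g)."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    """Decrypt with the private key (lam, mu)."""
    L = (pow(c, lam, n_sq) - 1) // n
    return (L * mu) % n

# Homomorphic property: multiplying ciphertexts adds the plaintexts.
a, b = 1200, 345
c_sum = (encrypt(a) * encrypt(b)) % n_sq
print(decrypt(c_sum))            # 1545, computed without decrypting a or b individually
```

In a setting along the lines described above, the private key (`lam`, `mu`) would itself be split over multiple parties, so that no single party can decrypt on its own.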
Secret Sharing
Secret sharing involves dividing secret data into pieces (shares) in such a way that a single share does not contain any information about the secret data. The shares can therefore be spread among the participating parties without revealing the secret data; despite its name, secret sharing does not mean that a secret is shared with other parties. All parties distribute the shares of their own input data in this way. The second step is to carry out the analysis. Instead of one party performing the analysis on all data, every party performs the same analysis on the shares it received from the other parties. Each party thereby obtains a different intermediate result from which nothing meaningful can be derived. Only when the parties combine these local, intermediate results can the analysis result be revealed. This is the third and final step of the MPC approach. This three-step approach is also called the share-compute-reveal approach.
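The following minimal sketch shows the share-compute-reveal pattern with additive secret sharing, where three parties jointly compute the sum of their private inputs. The modulus, the number of parties, and the input values are arbitrary choices for illustration.

```python
# Minimal sketch of additive secret sharing (illustrative only).
# Three parties jointly compute the sum of their secret inputs without revealing them.
import random

MOD = 2**61 - 1          # all arithmetic is done modulo this number
NUM_PARTIES = 3

def share(secret):
    """Split a secret into NUM_PARTIES random shares that sum to the secret."""
    shares = [random.randrange(MOD) for _ in range(NUM_PARTIES - 1)]
    shares.append((secret - sum(shares)) % MOD)
    return shares

# Step 1 (share): each party splits its input and sends one share to every party.
secrets = [15, 27, 8]                      # private inputs of parties A, B, C
all_shares = [share(s) for s in secrets]   # all_shares[i][j] goes to party j

# Step 2 (compute): each party locally adds the shares it received.
partial_sums = [sum(all_shares[i][j] for i in range(NUM_PARTIES)) % MOD
                for j in range(NUM_PARTIES)]

# Step 3 (reveal): only now are the partial results combined into the output.
print(sum(partial_sums) % MOD)             # 50, the sum of all secret inputs
```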
Differential Privacy
Differential privacy is a mathematical framework that limits the amount of information about the input data that can be deduced from the result of a computation. The protection that it provides thus focuses on the output of a computation rather than the computation itself, which sets it apart from federated learning and MPC.
As discussed, neither federated learning nor MPC prevents parties from learning something that can be deduced from the result of the computation. For example, knowing the average annual salary of your department in 2018 and in 2019, combined with the fact that there was only one change in the department members, allows you to deduce the salary of the newest colleague. Differential privacy strategically introduces specific mathematical uncertainty (noise) somewhere in a computation such that, given the (perturbed) result, it is impossible to make high-confidence deductions about the data of an individual (Dwork 2008). That is, if the average annual salary is computed with a differentially private mechanism, we will only be able to deduce a range of salaries that probably includes the actual salary of our newest colleague. However, the addition of noise to the computation (to improve privacy) often reduces the usefulness of the outcome (e.g., reduced accuracy). A tuning parameter allows the user to trade off privacy against the quality of the model.
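The salary example can be made concrete with the Laplace mechanism, one of the standard building blocks of differential privacy. The sketch below is illustrative only: the salary figures, the clipping bounds, and the epsilon values are made-up assumptions, and the sensitivity is computed for the simple bounded-mean setting.

```python
# Sketch of the Laplace mechanism for a differentially private average salary
# (illustrative only; the salary figures and epsilon values are made up).
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng):
    """Differentially private mean of values known to lie in [lower, upper]."""
    n = len(values)
    clipped = np.clip(values, lower, upper)
    # Sensitivity of the mean in this bounded setting: one person changing their
    # value can shift the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

rng = np.random.default_rng(42)
salaries = np.array([41000, 52000, 47500, 61000, 39000, 58000], dtype=float)
for eps in (0.1, 1.0, 10.0):
    print(eps, round(dp_mean(salaries, 30000, 80000, eps, rng)))
# A smaller epsilon means more noise: stronger privacy, less accurate answer.
```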
The above example only exhibits part of the power of differential privacy. When a differentially private mechanism is used to publish several statistics about the department, it is impossible for someone outside the department to deduce with certainty whether a given individual works in the department. That is, differential privacy protects against membership inference. What makes differential privacy stand out is that this protection is independent of any background knowledge available to the attacker. Many other anonymization and pseudonymization techniques fail if the attacker has access to other (public) databases whose information can be used to infer details from the published data that were supposed to be protected.
To summarize, differential privacy limits the information about the input data that can be extracted from a computation's output. An attractive property of differential privacy is that no amount of post-processing or background knowledge can reduce the privacy achieved by a differentially private mechanism. Although it may be hard to show that a protocol is differentially private, the reward of doing so is that the protocol provides strong protection against reconstruction, membership inference, and background knowledge attacks, ideally without jeopardizing the model accuracy.
Trusted Secure Environment (TSE)
A trusted secure environment (TSE) is also known as a trusted execution environment (TEE) or a secure enclave. A TSE is a set of software and hardware features that provide an isolated execution environment to enable strong security guarantees for applications running in the TSE (Rashid 2020). Specifically, TSEs can provide confidentiality, integrity, and attestation. They enable a program to run secure computations over confidential data while providing strong isolation from other applications, the operating system, and the host (Brandão, Resende, and Martins 2021). A TSE establishes an isolated execution environment that runs in parallel with the standard operating system; its aim is to defend sensitive code and data against privileged software attacks from a potentially compromised operating system. Data is stored in the TSE, where it is impossible to view the data or the operations performed on it from the outside, even with a debugger. The TSE ensures that only authorized code can access and compute on the data. The TSE can, for example, be used to protect the data once it is on the device: while the data is protected during transmission by using encryption, the TSE protects the data once it has been decrypted on the device.
Trusted Third Party (TTP)
The traditional means of performing analyses on sensitive data from multiple stakeholders involves a trusted third party (TTP). Such a TTP collects all input data necessary for the analyses and provides the output back to the involved parties. In this case, the TTP is aware of all of the assets and is responsible for the privacy and confidentiality of the data; data processing agreements should be made between the data input parties and the TTP. Note: a TTP is not a PET, but it does appear in the decision tree as one of the options. In certain situations, a TTP may still be the most feasible solution. Hybrid situations are also possible, in which a TTP no longer has access to the data itself but can still play a role in performing the computation done via PETs.