Guide

Type of problem

The first question in both the decision tree and the questionnaire is: what type of problem is being considered?

This section briefly explains the problem types that are included:

  • Machine learning: This type covers problems that involve an actual machine learning or AI model. This can take two forms: either the parties aim to train a new machine learning model on data, or they aim to evaluate or apply an existing (already trained) model on data. Machine learning is a broad concept, covering a wide range of classification and regression models that make a prediction based on a number of features. Other forms of statistical analysis are not considered here, as they are examined separately under the Statistical Analysis type of problem. Your specific problem may fit both machine learning and statistical analysis; in that case, it may be good to traverse both paths in the tree.
  • Set intersection: This type covers problems where two or more parties each hold a list of items (e.g. persons such as patients or customers) and wish to determine the overlap between these lists (the intersection). Set Intersection can be either a subproblem of any of the other problems or a problem by itself. An example of the latter is when organizations wish to match their datasets without planning to perform a specific analysis. In this case, only the Set Intersection route should be traversed. By contrast, when additional analysis is intended, the tree should be traversed twice: once for Set Intersection and once for that analysis (Machine Learning, Statistical Analysis or Synthetic Data Generation).
  • Statistical analysis: Statistical analysis refers to cases where one or more parties wish to compute a set of statistical metrics (e.g. counts, averages, standard deviations, quantiles, histograms or frequency plots) on their data and receive the results. Other simple computations on the data, even if not strictly statistical in nature, may also be considered when traversing this path.
  • Synthetic data generation: Synthetic data generation refers to cases where one wishes to generate new (fake) data based on the distribution and characteristics of existing data, for instance to validate and test models or to train machine learning models. Here we assume that the original data used to generate the synthetic data is sensitive. If it is not, the data can be synthesized straightforwardly, without a privacy-preserving technology to protect the original data from being reconstructed from the synthetic data.
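
The traversal rules described above (Set Intersection on its own, Set Intersection followed by a further analysis, or several analysis paths when a problem fits more than one type) can be sketched in a few lines of Python. This is only an illustrative sketch: the names ProblemType, paths_to_traverse, needs_matching and analyses are hypothetical and are not part of the decision tree or the questionnaire.

```python
from enum import Enum


class ProblemType(Enum):
    """The four problem types described in this section (hypothetical names)."""
    MACHINE_LEARNING = "Machine Learning"
    SET_INTERSECTION = "Set Intersection"
    STATISTICAL_ANALYSIS = "Statistical Analysis"
    SYNTHETIC_DATA_GENERATION = "Synthetic Data Generation"


def paths_to_traverse(needs_matching, analyses=()):
    """Return the tree paths to traverse, in order, without duplicates.

    needs_matching: True if the parties must determine the overlap
        between their lists (the Set Intersection problem).
    analyses: the analysis types that fit the problem; may be empty
        when the parties only want to match their datasets.
    """
    paths = []
    if needs_matching:
        # Set Intersection is traversed once, before any further analysis.
        paths.append(ProblemType.SET_INTERSECTION)
    for analysis in analyses:
        if analysis is ProblemType.SET_INTERSECTION:
            # Assumption of this sketch: matching is signalled only via the flag.
            raise ValueError("use needs_matching for Set Intersection")
        if analysis not in paths:
            paths.append(analysis)
    return paths


# Matching only: a single traversal of the Set Intersection route.
matching_only = paths_to_traverse(needs_matching=True)

# Matching followed by training a model: the tree is traversed twice.
matching_then_ml = paths_to_traverse(True, [ProblemType.MACHINE_LEARNING])

# A problem that fits both machine learning and statistical analysis.
both_analyses = paths_to_traverse(
    False,
    [ProblemType.MACHINE_LEARNING, ProblemType.STATISTICAL_ANALYSIS],
)
```

Each entry in the returned list corresponds to one traversal of the tree, so the length of the list is the number of times the tree is traversed.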