
Option 2: Reducing data points


We identify two main families of techniques for reducing data points: clustering and instance selection.

Below, we present a number of methodologies for each family. More techniques can be found in the review paper by Olvera-López et al. (2010).

Training data can be grouped into clusters according to a similarity metric. The output can then be fed to a machine learning algorithm, which will learn to classify each cluster rather than each individual data point. A very large number of clustering techniques is available; here we cover some common ones that can be implemented with the sklearn library (a short sketch of the overall approach follows the list):

  • K-means: it splits the dataset into k clusters. Each observation belongs to the cluster with the nearest mean. A less computationally expensive variant of the algorithm is MiniBatch K-means, which uses mini-batches to reduce computation time.

  • Mean-shift: it aims to discover clusters by finding which data points can act as centroids, i.e. be the mean of the points in their cluster.

  • Hierarchical clustering: a family of clustering algorithms that creates nested clusters. Clusters can thus be represented as a tree, where the root is the set including all data points and each leaf contains a single sample. Agglomerative Clustering builds this hierarchy by successively merging pairs of clusters until all points are assigned to the same cluster.

  • DBSCAN: the Density-Based Spatial Clustering of Applications with Noise algorithm divides the data into clusters according to density. It is useful for noisy datasets as it does not require the number or the shape of the clusters to be set a priori.

  • Gaussian mixture model: it assumes that the data is generated from a mixture of k Gaussian distributions with unknown parameters, which are learned during training.
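The reduction step itself is simple once clusters are available: the cluster representatives (e.g. centroids) replace the original points as the training set. Below is a minimal sketch using sklearn's KMeans on a synthetic dataset; the number of clusters, the majority-vote labelling of centroids, and the downstream classifier are illustrative assumptions, not prescriptions from the text.

```python
# Sketch: replace the training set with cluster centroids (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

k = 100  # number of retained representative points; a tunable assumption
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Reduced dataset: one centroid per cluster, labelled by the majority class
# of the points assigned to that cluster.
X_reduced = kmeans.cluster_centers_
y_reduced = np.array([np.bincount(y[kmeans.labels_ == c]).argmax() for c in range(k)])

# The downstream model now trains on k points instead of the full dataset.
clf = LogisticRegression().fit(X_reduced, y_reduced)
print(clf.score(X, y))
```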

Since not all data points are useful for the prediction task at hand, it is possible to eliminate some individuals from the dataset. Instance selection methods are divided into two groups (a sketch using an off-the-shelf implementation follows the list):

  1. Wrapper. Methods which select instances based on the accuracy obtained by the machine learning model.

    • Most of these methods are based on the k-NN classifier: Condensed Nearest Neighbour (Hart 1968, Angiulli 2005), Selective Nearest Neighbour rule (Ritter et al. 1975), Generalized Condensed Nearest Neighbour rule (Chien-Hsing et al. 2006), Edited Nearest Neighbour (Wilson 1972), Decremental Reduction Optimization Procedure (Wilson and Martínez 2000).

    • The SVM (Support Vector Machine) can be used not only as a classifier but also as an instance selector.

      • In the method presented in Yuangui et al. 2005, an SVM is first used to obtain the support vectors, and then a wrapper method is used to select them.

      • Another method is SV-kNNC (Support Vector k-Nearest Neighbour Clustering) (Srisawat et al. 2006). First an SVM is applied; then the k-means algorithm is used to cluster the support vectors. The data points that do not belong to homogeneous clusters are discarded.

    • Evolutionary algorithms can also be used for instance selection (Kuncheva 1995, 1997; Kuncheva and Bezdek 1998; Bezdek and Kuncheva 2001; Cano et al. 2003; García et al. 2008).

  2. Filter. These methods select instances by using a separate selection function, without involving the machine learning model.

    • POP (Pattern by Ordered Projections) was introduced by Riquelme et al. 2003. This method estimates the number of times each instance is not a border point of its class. If an instance is not near instances from other classes, it is assumed that it can be safely discarded, as it does not significantly affect the decision of the algorithm.

    • Paredes and Vidal 2000 and Olvera-López et al. 2008 assign weights to data points by computing their importance. In the former case, the importance is estimated through gradient descent relative to nearest neighbours and enemies. In the latter case, importance is given by the similarity of each data point with other instances in the same class, with the most similar instances being the most important.
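As a concrete starting point, two of the k-NN-based wrapper selectors discussed above, Condensed Nearest Neighbour and Edited Nearest Neighbour, have off-the-shelf implementations in the imbalanced-learn package. The library choice and the synthetic dataset below are assumptions for illustration; the methods themselves are described in the cited papers.

```python
# Sketch: k-NN-based instance selection with the imbalanced-learn package
# (an assumed dependency; the original text does not name an implementation).
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour, EditedNearestNeighbours

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.7, 0.3],
                           random_state=0)

# Condensed Nearest Neighbour (Hart 1968): keeps a subset of points that still
# lets a 1-NN classifier label the full training set correctly.
X_cnn, y_cnn = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)

# Edited Nearest Neighbour (Wilson 1972): drops points whose class disagrees with
# the majority of their neighbours, cleaning noisy border regions.
X_enn, y_enn = EditedNearestNeighbours().fit_resample(X, y)

# Note: by default imbalanced-learn applies these selectors to the majority
# class(es) only; pass sampling_strategy='all' to consider every class.
print(len(X), "->", len(X_cnn), "(CNN),", len(X_enn), "(ENN)")
```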
