Option 2: Reducing data points

We identify two main families of techniques for reducing data points: clustering and instance selection.

Below, we present a number of methodologies for each family. More techniques can be found in the review paper by Olvera-López et al. (2010).

Training data can be grouped into clusters according to a chosen similarity metric. The output can then be fed to a machine learning algorithm, which learns to classify each cluster rather than each individual data point. A very large number of clustering techniques is available; here we cover some common ones that can be implemented with the sklearn library (a reduction sketch using one of them follows the list):

  • K-means algorithm: it splits the dataset into k clusters, where each observation belongs to the cluster with the nearest mean. A less computationally expensive variant is MiniBatch K-means, which operates on mini-batches of the data to reduce computation time.

  • Mean-shift: it discovers clusters by iteratively shifting candidate centroids toward the mean of the points in their neighbourhood, so that each final centroid is the mean of the points in its cluster.

  • Hierarchical clustering: a family of clustering algorithms that builds nested clusters. The clusters can thus be represented as a tree whose root is the set containing all data points and whose leaves are clusters of a single sample. The Agglomerative Clustering algorithm builds this hierarchy bottom-up, successively merging pairs of clusters until all points belong to a single cluster.

  • DBSCAN: Density-Based Spatial Clustering of Applications with Noise divides the data into clusters according to density. The algorithm is useful for noisy datasets, as it requires neither the number nor the shape of the clusters to be set a priori.

  • Gaussian mixture model: it assumes that the data is generated from a mixture of k Gaussian distributions with unknown parameters, which are learned during training.
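To make the reduction step concrete, below is a minimal sketch of cluster-based data reduction with sklearn: the training set is clustered with MiniBatch K-means, and each cluster is replaced by its centroid, labelled with the majority label of its members. The synthetic dataset, the choice of k, and the downstream classifier are illustrative assumptions, not part of any specific pipeline.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative labelled dataset (an assumption for this sketch).
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Step 1: cluster the training data; k is an illustrative choice.
k = 100
kmeans = MiniBatchKMeans(n_clusters=k, random_state=0).fit(X)

# Step 2: replace each (non-empty) cluster by its centroid.
labels = kmeans.labels_
nonempty = np.unique(labels)
X_reduced = kmeans.cluster_centers_[nonempty]

# Step 3: label each centroid with the majority class of its members.
y_reduced = np.array(
    [np.bincount(y[labels == c]).argmax() for c in nonempty]
)

# Step 4: train the downstream model on ~100 points instead of 10,000.
clf = LogisticRegression().fit(X_reduced, y_reduced)
print(clf.score(X, y))  # sanity check against the full training set
```

Any of the other estimators listed above could fill the clustering step, with the caveat that algorithms such as DBSCAN do not produce explicit centroids, so a representative point per cluster (e.g. the medoid) would have to be computed instead.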
