Step 1: Understanding the data minimization principle

According to Article 5(1)(c) of the General Data Protection Regulation (GDPR), we should ensure that the personal data we use is:

  • adequate – sufficient to properly fulfil the stated purpose;

  • relevant – has a rational link to that purpose;

  • limited to what is necessary – we do not hold more than we need for that purpose.

Since the performance of an algorithm usually improves as the amount of data grows, it is common for data scientists to collect and use as much data as possible. According to the data minimization principle, however, we should always consider whether the same goal can be achieved with less data.

We should strive to identify the minimum amount of personal data needed for our task, and use no more than that. Note that, under the GDPR, we should also not hold data that our algorithms do not currently need. The fact that some data may become useful in the future is not, by itself, reason enough to justify collecting or retaining it. Furthermore, we should periodically check whether the personal or sensitive data we hold is still relevant, and delete it if it is not.
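As an illustration, a periodic review of held data could look like the minimal sketch below. The field names (`age_band`, `region`, `collected_on`), the one-year retention period, and the helper functions are all hypothetical, chosen only for this example. The GDPR does not prescribe them, and the right values depend on the stated purpose.

```python
from datetime import date, timedelta

# Hypothetical: the only fields the stated purpose actually requires.
REQUIRED_FIELDS = {"age_band", "region"}
# Hypothetical retention period; the real value depends on the purpose.
RETENTION = timedelta(days=365)
# The collection date is kept so that later reviews remain possible.
KEPT_FIELDS = REQUIRED_FIELDS | {"collected_on"}


def minimize(record: dict) -> dict:
    """Keep only the fields needed for the stated purpose."""
    return {k: v for k, v in record.items() if k in KEPT_FIELDS}


def still_needed(collected_on: date, today: date) -> bool:
    """Return True while the record is within the retention period."""
    return today - collected_on <= RETENTION


def review(records: list[dict], today: date) -> list[dict]:
    """Drop expired records and strip unneeded fields from the rest."""
    return [
        minimize(r) for r in records
        if still_needed(r["collected_on"], today)
    ]


records = [
    {"name": "Alice", "email": "alice@example.org", "age_band": "30-39",
     "region": "EU", "collected_on": date(2024, 1, 10)},
    {"name": "Bob", "email": "bob@example.org", "age_band": "40-49",
     "region": "EU", "collected_on": date(2022, 3, 5)},
]

kept = review(records, today=date(2024, 6, 1))
print(kept)
```

In this sketch, Bob's record is dropped because it is older than the assumed retention period, and Alice's record loses its `name` and `email` fields because the assumed purpose does not need them. In a real system the same review would also have to reach backups and derived datasets, which this example does not cover.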

In the following pages, we will introduce a few techniques to minimize the amount of data needed by our algorithm. We recommend applying these techniques from the beginning of the design process.
