Step 1: Understanding dataset shifts

We define dataset shift as a change in the data that degrades model performance over time. Broadly, the main causes of dataset shift fall into four phenomena:

  • Feature drift, when the distribution of the input data, p(X), changes

  • Concept drift, when the actual relationship between the input and the output target variable, p(Y | X), changes

  • Prediction drift, when the distribution of the model’s predictions, p(ŷ | X), shifts

  • Label drift, when the label distribution, p(Y), changes

You could encounter feature drift, for example, when applying your algorithm to a new market. The distribution of the new data is likely to differ from that of the training data.
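One common way to check for feature drift of this kind is to compare the distribution of a feature in the training data with its distribution in the new data. The sketch below does this with a two-sample Kolmogorov-Smirnov statistic, written in plain Python for illustration. The feature values, the market names, and the helper `ks_statistic` are all hypothetical and are generated synthetically here; in practice you would more likely rely on a statistics library and choose a threshold suited to your data.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs
    of the two samples (0.0 = identical, 1.0 = fully separated).
    """
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed point.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


rng = random.Random(0)

# Hypothetical feature (e.g. basket value) seen at training time.
train = [rng.gauss(50.0, 10.0) for _ in range(2000)]
# Fresh data from the same market: same p(X).
same_market = [rng.gauss(50.0, 10.0) for _ in range(2000)]
# Data from a new market (assumption: the mean is shifted upwards).
new_market = [rng.gauss(65.0, 10.0) for _ in range(2000)]

ks_same = ks_statistic(train, same_market)
ks_new = ks_statistic(train, new_market)

print(f"KS vs same market: {ks_same:.3f}")
print(f"KS vs new market:  {ks_new:.3f}")
```

A statistic close to zero suggests that p(X) has not changed, while a large value points to feature drift on that feature. Note that this check only looks at one feature at a time and says nothing about whether p(Y | X) has changed.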

You may run into concept drift, for example, when trying to predict customer behaviour. If a competitor introduces a new product to the market, customers may gradually shift their behaviour, modifying the statistical properties of the target variable. A sudden change like the COVID-19 pandemic may also occur and drastically alter the properties of the target.
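To make the difference from feature drift concrete, the following toy simulation keeps p(X) fixed and changes only p(Y | X). Everything in it is an assumption made for illustration: a single synthetic feature, a threshold rule standing in for a trained model, and a shift of the true decision threshold from 0.5 to 0.7 standing in for the change in customer behaviour.

```python
import random


def label(x, threshold):
    """Binary outcome: 1 if the feature exceeds the threshold, else 0."""
    return 1 if x > threshold else 0


def accuracy(model_threshold, xs, ys):
    """Share of samples where the threshold model matches the true label."""
    correct = sum(1 for x, y in zip(xs, ys) if label(x, model_threshold) == y)
    return correct / len(xs)


rng = random.Random(42)

# The input distribution p(X) is the same before and after the drift.
xs_before = [rng.random() for _ in range(5000)]
xs_after = [rng.random() for _ in range(5000)]

# The relationship p(Y | X) changes: the true threshold moves from 0.5 to 0.7
# (hypothetical effect of a competitor entering the market).
ys_before = [label(x, 0.5) for x in xs_before]
ys_after = [label(x, 0.7) for x in xs_after]

# Stand-in for a model fitted on pre-drift data.
model_threshold = 0.5

acc_before = accuracy(model_threshold, xs_before, ys_before)
acc_after = accuracy(model_threshold, xs_after, ys_after)

mean_before = sum(xs_before) / len(xs_before)
mean_after = sum(xs_after) / len(xs_after)

print(f"mean of X before/after: {mean_before:.3f} / {mean_after:.3f}")
print(f"accuracy before/after:  {acc_before:.3f} / {acc_after:.3f}")
```

In this sketch the inputs look unchanged, so a check on p(X) alone would raise no alarm, yet the model's accuracy falls because the mapping from input to target has moved. Detecting this kind of drift generally requires ground-truth labels for recent data.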
