Step 3: Handling dataset shifts
Once drift has been detected, you need to address it in order to recover the desired performance.
The most direct way to address model decay is to retrain the model. Retraining can be triggered by one of the monitored variables, for example when performance falls below a specified threshold. Retraining can also be periodic: you can set the frequency based on how often new data is received, or you can estimate how long the model takes to decay by using older data as a proxy.
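The two triggers described above can be sketched as a small decision function. This is a minimal illustration rather than a prescribed implementation: the `RetrainPolicy` class, the `should_retrain` function, the 0.85 accuracy threshold, and the 30-day period are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RetrainPolicy:
    """Hypothetical policy combining a performance trigger and a schedule."""
    min_accuracy: float = 0.85                      # assumed threshold, tune per use case
    max_model_age: timedelta = timedelta(days=30)   # assumed retraining period


def should_retrain(
    policy: RetrainPolicy,
    current_accuracy: float,
    last_trained: datetime,
    now: datetime,
) -> Optional[str]:
    """Return the reason for retraining, or None if no retraining is needed."""
    # Performance trigger: the monitored metric fell below the threshold.
    if current_accuracy < policy.min_accuracy:
        return "performance"
    # Periodic trigger: the model is older than the allowed age.
    if now - last_trained >= policy.max_model_age:
        return "schedule"
    return None
```

The performance check comes first so that a decayed model is retrained even if the scheduled date has not yet arrived.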
When retraining, you may use a combination of new and old data, perhaps assigning higher weights to the new data points, or you may even have to discard the old dataset. It is possible that retraining alone is not enough to compensate for the drift. In these cases, you may have to change or update your model, for example by modifying the architecture or by creating an ensemble of new and old models. You can find an example of how to handle dataset shift in our notebook.
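One possible way to give newer data points more weight is an exponential decay by age, sketched below. The function name `recency_weights` and the 30-day half-life are assumptions made for illustration. Many training APIs accept per-sample weights (for example, the `sample_weight` argument of `fit` in many scikit-learn estimators), but check what your own library supports.

```python
def recency_weights(ages_in_days, half_life_days=30.0):
    """Return one weight per data point; a point's weight halves every half-life.

    ages_in_days: age of each data point, in days (0 = newest).
    half_life_days: hypothetical tuning parameter, not a recommended value.
    """
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return [0.5 ** (age / half_life_days) for age in ages_in_days]


# A point from today, one from 30 days ago, and one from 60 days ago.
print(recency_weights([0, 30, 60], half_life_days=30.0))  # [1.0, 0.5, 0.25]
```

A very short half-life approaches the other extreme mentioned above, where the old dataset is effectively discarded.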
If it is not possible to retrain, these are some alternative options:
Feature dropping. By removing one feature at a time, you can discover which features are driving the drift, and you can then remove these from the dataset.
Importance reweighting. You can assign higher weights to the existing data points that most closely resemble the new data. This is particularly useful if the new data is unlabeled.
Reframing the problem or business use. For example, changing the prediction horizon of your model or simply running it more frequently.
Online machine learning. This type of algorithm learns from one sample at a time. The model is automatically updated with every new data point received and can thus intrinsically handle drift.
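As an illustration of the importance reweighting option above, the sketch below weights each old data point by an estimate of how much more likely its value is under the new data than under the old data. It assumes a single numeric feature and uses a simple histogram ratio as a stand-in; in practice a classifier trained to distinguish old from new data is often used instead. The function name `importance_weights` and the `bin_width` parameter are hypothetical. No labels are needed for the new data.

```python
import math
from collections import Counter


def importance_weights(old_values, new_values, bin_width=1.0):
    """Weight each old point by the ratio p_new(bin) / p_old(bin)."""
    if not old_values or not new_values:
        raise ValueError("both samples must be non-empty")
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")

    def bin_of(value):
        return math.floor(value / bin_width)

    old_counts = Counter(bin_of(v) for v in old_values)
    new_counts = Counter(bin_of(v) for v in new_values)
    n_old, n_new = len(old_values), len(new_values)

    weights = []
    for v in old_values:
        b = bin_of(v)
        p_old = old_counts[b] / n_old  # never zero: v itself falls in bin b
        p_new = new_counts[b] / n_new  # zero if no new data falls in this bin
        weights.append(p_new / p_old)
    return weights


old = [0.1, 0.2, 1.1, 1.2]   # evenly split between bins 0 and 1
new = [0.3, 1.3, 1.4, 1.5]   # new data has shifted towards bin 1
print(importance_weights(old, new))  # [0.5, 0.5, 1.5, 1.5]
```

Old points in regions that the new data no longer visits receive a weight of zero, so very coarse or very fine bins can distort the result.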
You can find an implementation of various techniques for handling dataset shift in our notebook, which can be accessed here or downloaded as a file: