Step 1: Validation

Within the machine learning context, (model) “validation” describes the process whereby a trained model is evaluated over a testing dataset; that is, the performance of the trained model is computed over data that was not used during training. The purpose of validation is to estimate the generalisation performance of the trained model, so as to give the model owner an understanding of how the model might perform in ‘real-world’ use. It is the performance measures gathered via a validation process that are used during hyperparameter optimisation to compare different hyperparameter settings. As such, the hyperparameter settings that exhibit the best (estimated) generalisation performance are the ones selected by the optimisation process.

A wide range of validation methods is available in the literature. Here, however, we focus on the three methods most commonly used for hyperparameter optimisation: i) the three-way holdout method; ii) k-fold cross-validation; and iii) leave-one-out cross-validation.

The Three-Way Holdout Method

The three-way holdout method splits the dataset available to the model owner into a training, validation and test dataset, with the test dataset being “held out” until a final set of hyperparameters has been determined by training a model for each successive hyperparameter setting over the training dataset and validating each trained model over the validation dataset. Generalisation performance for the best performing model is estimated by validating that final model over the test dataset. The method can be summarised as follows (a short code sketch follows the list):

  1. Split the dataset into three parts: a training set for training (or ‘fitting’) a model for each set of hyperparameter settings, a validation set for estimating generalisation performance for each set of hyperparameter settings, and a test set for the final validation of the best performing model.

  2. For each set of hyperparameter settings to be investigated, perform the following loop:

    1. Train the model over the training dataset for the set of hyperparameter settings;

    2. Validate the trained model over the validation dataset;

    3. If the trained model performance (from step 2, sub-step 2) exceeds the best performance seen previously, record the hyperparameter settings as the ‘best hyperparameter settings’ and record the model as the ‘best model’.

  3. We now have the ‘best hyperparameter settings’ and ‘best model’ from step 2. At this step, however, the model owner may choose to retrain a model with the ‘best hyperparameter settings’ over a merged training and validation dataset. This increases the amount of data shown to the model during training and can improve model performance (especially in the case of small training datasets) by decreasing ‘pessimistic bias’ (the situation where a model underperforms because it has not received enough data during training) or by pushing the model towards ‘capacity’ (the point at which any further training data is assumed not to generate a better model). If the model is indeed retrained over this merged dataset, the new model becomes the ‘best model’.

  4. The ‘best model’ is then validated over the test dataset to generate an estimate for the model’s generalisation performance.

  5. Finally, we can repeat a step similar to step 3. A model with the ‘best hyperparameter settings’ can now be trained over a merged training, validation and test dataset in order to maximise the amount of data being shown to the model. Although the expected generalisation performance for this model cannot be ascertained, this model would be expected to exhibit performance at least as good as the validation performance computed in step 4.
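
As a rough illustration, the sketch below walks through steps 1 to 4 using scikit-learn; the Iris dataset, the logistic regression classifier, the candidate values of C and the 60/20/20 split proportions are all assumptions made purely for illustration.

```python
# Minimal sketch of the three-way holdout method (steps 1-4 above).
# The dataset, classifier, hyperparameter grid and split proportions are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: split into training (60%), validation (20%) and test (20%) sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Step 2: train and validate a model for each candidate hyperparameter setting.
best_score, best_params = -np.inf, None
for C in (0.01, 0.1, 1.0, 10.0):                          # assumed hyperparameter grid
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)                           # step 2.1: train
    score = accuracy_score(y_val, model.predict(X_val))   # step 2.2: validate
    if score > best_score:                                # step 2.3: keep the best
        best_score, best_params = score, {"C": C}

# Step 3 (optional): retrain the best settings over the merged training + validation data.
best_model = LogisticRegression(max_iter=1000, **best_params)
best_model.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))

# Step 4: estimate generalisation performance over the held-out test set.
test_score = accuracy_score(y_test, best_model.predict(X_test))
print(best_params, round(test_score, 3))
```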

Scikit-learn provides a helpful implementation in Python for splitting datasets into two pieces via the train_test_split() function.

The dataset split for the three-way holdout method can be effected via two calls to the train_test_split() function: for example, the first call can split the dataset into an initial training set and a test set, and the second call can then split the initial training set into a final training set and a validation set. The train_test_split() function can generate stratified splits (please see the ‘Validation Methods Summary’ below for a brief discussion of stratification) via its stratify argument, which is set to the array of class labels rather than a simple True/False flag.
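
A minimal sketch of such a two-call, stratified split is given below; the synthetic imbalanced dataset and the 60/20/20 proportions are assumptions for illustration.

```python
# Sketch of a stratified three-way split via two calls to train_test_split().
# The synthetic imbalanced dataset and the 60/20/20 proportions are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
                           random_state=0)

# First call: hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Second call: split the remainder into training (75%) and validation (25%) sets,
# giving a 60/20/20 split overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Each subset preserves (approximately) the original 90/10 class proportions.
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, np.bincount(labels) / len(labels))
```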

k-Fold Cross-Validation

‘Cross-validation’ captures the idea that every sample in the available dataset, other than the samples in a held-out test dataset, is used for validation at some point. In k-fold cross-validation, the available dataset is split into a training dataset and a held-out test dataset, and the training dataset is then iterated over k times. The method can be summarised as follows (a short code sketch follows the list):

  1. Split the dataset into two parts: a training set for performing training and validation for each set of hyperparameter settings, and a test set for the final validation of the best performing model.

  2. For each set of hyperparameter settings to be investigated, perform the following sub-loop:

    1. Split the training set into k parts or ‘folds’;

    2. Loop through the following k times:

      1. Take one fold (a different fold on each pass of the loop) as the validation set, with the remaining k-1 folds merged into a single training subset;

      2. Train the model over the training subset using the hyperparameter settings;

      3. Validate the model over the validation set, recording the model performance;

    3. Compute the mean of the k performance measurements from step 2, sub-step 2;

    4. If the mean model performance (from step 2, sub-step 3) exceeds the best mean performance seen previously, record the hyperparameter settings as the ‘best hyperparameter settings’.

  3. Using the ‘best hyperparameter settings’, re-train the model over the entire training set (i.e. without any folds being removed for validation) to give the ‘best model’.

  4. The ‘best model’ is then validated over the test dataset to generate an estimate for the model’s generalisation performance.

  5. Finally, we can repeat a step similar to step 3. A model with the ‘best hyperparameter settings’ can now be trained over the merged training and test dataset in order to maximise the amount of data being shown to the model. Although the expected generalisation performance for this model cannot be ascertained, this model would be expected to exhibit performance at least as good as the validation performance computed in step 4.
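
As with the holdout method, the sketch below is a rough illustration of steps 1 to 4 using scikit-learn’s KFold() splitter; the Iris dataset, the logistic regression classifier, the choice of k=5 and the candidate values of C are assumptions for illustration.

```python
# Minimal sketch of hyperparameter selection with k-fold cross-validation (steps 1-4 above).
# The dataset, classifier, hyperparameter grid and k=5 are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: split into a training set and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 2: for each candidate setting, average the validation performance over k folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
best_mean, best_params = -np.inf, None
for C in (0.01, 0.1, 1.0, 10.0):                        # assumed hyperparameter grid
    fold_scores = []
    for train_idx, val_idx in kfold.split(X_train):     # steps 2.1-2.2: iterate over the k folds
        model = LogisticRegression(C=C, max_iter=1000)
        model.fit(X_train[train_idx], y_train[train_idx])        # step 2.2.2: train
        fold_scores.append(accuracy_score(                       # step 2.2.3: validate
            y_train[val_idx], model.predict(X_train[val_idx])))
    mean_score = np.mean(fold_scores)                   # step 2.3: mean over folds
    if mean_score > best_mean:                          # step 2.4: keep the best
        best_mean, best_params = mean_score, {"C": C}

# Step 3: retrain the best settings over the entire training set.
best_model = LogisticRegression(max_iter=1000, **best_params).fit(X_train, y_train)

# Step 4: estimate generalisation performance over the held-out test set.
test_score = accuracy_score(y_test, best_model.predict(X_test))
print(best_params, round(test_score, 3))
```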

Scikit-learn provides an implementation for the creation of folds via the KFold() splitter, together with a stratified variant, StratifiedKFold().

Leave-One-Out Cross-Validation

Leave-One-Out Cross-Validation (LOOCV) is a special case of k-fold cross-validation, with k set to n, the number of samples in the training dataset. LOOCV seeks to maximise the amount of data over which the model is trained throughout the cross-validation process and is of particular benefit when only a small dataset is available to the model owner.

The Scikit-learn implementation of LOOCV can be treated as a special case of the k-fold implementation, with the n_splits argument of the relevant splitter set to match the number of samples in the training dataset; scikit-learn also provides a dedicated LeaveOneOut() splitter.
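
The sketch below illustrates this equivalence; the Iris data (standing in for a small training set) and the logistic regression classifier are assumptions for illustration, with scikit-learn’s dedicated LeaveOneOut() splitter shown as an alternative.

```python
# Sketch of leave-one-out cross-validation over a small training set.
# The dataset and classifier are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut

X_train, y_train = load_iris(return_X_y=True)

splitter = KFold(n_splits=len(X_train))   # one fold per sample
# splitter = LeaveOneOut()                # equivalent dedicated splitter

scores = []
for train_idx, val_idx in splitter.split(X_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])
    # Each validation "fold" contains a single sample.
    scores.append(model.predict(X_train[val_idx])[0] == y_train[val_idx][0])

print("LOOCV accuracy:", np.mean(scores))
```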

Validation Methods Summary

The three validation methods outlined above are listed in order of increasing computational intensity. For each set of hyperparameter settings, the three-way holdout method trains over the training set only once (in step 2, sub-step 1), whereas k-fold cross-validation trains k times (in step 2, sub-step 2, sub-sub-step 2), and LOOCV trains n times, where n is the number of samples in the training set. However, it should be noted that the holdout method uses a single fixed split between the training and validation datasets, so the data samples within the validation dataset are not used for training until later in the process, once the ‘best hyperparameter settings’ have already been found. As such, the holdout method tends only to suffice for large datasets, where withholding that much data from training has minimal impact. The k-fold cross-validation and LOOCV methods, through their iterations, do see all of the non-test data during training and thus are well suited to smaller datasets. The most computationally intensive methods therefore tend to be needed only for smaller datasets, which helps to contain the computational resources required. It should also be noted that k=5 is a common choice for k-fold cross-validation, and k=10 has been cited as a good setting for hyperparameter optimisation.

Finally, we mention ‘stratification’. Stratification is an additional treatment used for classification tasks which seeks to ensure that any splitting of a dataset (for example, into training, validation and test sets, or into folds) yields subsets where each data class is represented in the same proportions as the original dataset. This treatment can be used within any of the validation methods outlined in this section. Stratification is particularly useful for smaller, less balanced datasets (i.e. where classes are not evenly distributed).
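
The short sketch below contrasts plain and stratified folds on a synthetic imbalanced dataset (an assumption for illustration): the stratified folds preserve the original class proportions, whereas the plain folds need not.

```python
# Sketch contrasting KFold and StratifiedKFold on an imbalanced synthetic dataset.
# The dataset and the 90/10 class imbalance are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

X, y = make_classification(n_samples=200, n_classes=2, weights=[0.9, 0.1],
                           random_state=0)

splitters = [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
             ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]

for name, splitter in splitters:
    # Class proportions within the first validation fold produced by each splitter.
    _, val_idx = next(iter(splitter.split(X, y)))
    print(name, np.bincount(y[val_idx]) / len(val_idx))
```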

For a worked example of hyperparameter optimisation, please see our GitHub page.
