Datasheets were first proposed by Gebru et al. (2018) as an effort to bring a standardised process for documenting datasets to the machine learning community.
Machine learning relies heavily on data, with most models being trained on static datasets. This often creates issues during the deployment phase, where it is not unusual to see drops in performance, particularly when the deployment context does not match the data used during training (see bias roadmap). It is therefore crucial to bring more attention and transparency to data.
Datasheets aim to document important information about a specific dataset. This helps introduce more transparency and accountability around data usage, and helps ensure compliance with current legislation. It also makes results easier to reproduce, and it is useful for mitigating bias.
A datasheet is roughly divided into seven sections. We will now list the sections and provide some example questions for each:
Motivation
For what purpose was the dataset created?
Who created the dataset?
Who funded the creation of the dataset?
Composition
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
How many instances are there in total?
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
Is there a label or target associated with each instance?
Are there recommended data splits (e.g., training, development/validation, testing)?
Are there any errors, sources of noise, or redundancies in the dataset?
Collection process
How was the data associated with each instance acquired?
If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?
Were any ethical review processes conducted (e.g., by an institutional review board)?
Did the individuals in question consent to the collection and use of their data?
Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted?
Preprocessing/cleaning/labeling
Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)?
Is the software that was used to preprocess/clean/label the data available?
Uses
Has the dataset been used for any tasks already?
Is there a repository that links to any or all papers or systems that use the dataset?
What (other) tasks could the dataset be used for?
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
Distribution
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
How will the dataset be distributed (e.g., tarball on website, API, GitHub)?
When will the dataset be distributed?
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?
Maintenance
Who will be supporting/hosting/maintaining the dataset?
Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?
Will older versions of the dataset continue to be supported/hosted/maintained?
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?
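The seven sections above can be thought of as a structured template: each section is a set of questions to be answered by the dataset creators. As a minimal sketch of how this structure might be represented programmatically (the class and method names here are illustrative assumptions, not part of any official datasheet tooling):

```python
from dataclasses import dataclass, field

# The seven sections of a datasheet, in the order given above.
SECTIONS = [
    "Motivation",
    "Composition",
    "Collection process",
    "Preprocessing/cleaning/labeling",
    "Uses",
    "Distribution",
    "Maintenance",
]


@dataclass
class Datasheet:
    """Minimal container: each section maps questions to answers."""
    dataset_name: str
    sections: dict = field(default_factory=lambda: {s: {} for s in SECTIONS})

    def answer(self, section: str, question: str, answer: str) -> None:
        """Record the answer to one question in a given section."""
        if section not in self.sections:
            raise ValueError(f"Unknown section: {section}")
        self.sections[section][question] = answer

    def unanswered_sections(self) -> list:
        """Sections with no answers recorded yet, a crude completeness check."""
        return [s for s, qa in self.sections.items() if not qa]


# Hypothetical usage: start a datasheet and answer one motivation question.
sheet = Datasheet(dataset_name="example-dataset")
sheet.answer(
    "Motivation",
    "For what purpose was the dataset created?",
    "To illustrate the datasheet structure.",
)
print(sheet.unanswered_sections())
```

A simple completeness check like `unanswered_sections` makes it easy to see which parts of the documentation still need attention before a dataset is released.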
A full template of the datasheet can be found here, and two worked examples of datasheets can be found in the appendix of Gebru et al. (2021).