High-quality data is vital for training deep learning models. While many machine learning techniques can help improve data quality, collecting human data well still demands attention to detail and careful execution. The value of high-quality data is widely acknowledged, yet there is a perception that people would rather work on the models than on the data. Collecting human data involves designing the task, selecting and training annotators, and aggregating their labels.

The idea of “Vox populi,” or the wisdom of the crowd, has long been used in crowdsourcing tasks such as evaluating machine translation or creating new gold reference translations. Rater agreement and label aggregation can be handled with methods such as majority voting, raw agreement, Cohen’s Kappa, and probabilistic graph modeling.

Two paradigms for data annotation, descriptive and prescriptive, each come with their own pros and cons. Disagreement among annotators can itself provide valuable signal, especially in subjective tasks, and approaches such as disagreement deconvolution and multi-annotator models have been proposed to handle it. Jury learning, inspired by the jury process, models individual annotators’ behavior and aggregates their predicted labels to make a decision.

Finally, methods that assess data quality during model training can help identify incorrectly labeled samples.
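As a minimal illustration of the agreement and aggregation measures mentioned above, the sketch below computes raw agreement, Cohen’s Kappa for two raters, and a per-item majority vote. The toy labels and function names are illustrative assumptions, not code from the post.

```python
from collections import Counter


def raw_agreement(a, b):
    """Fraction of items on which two raters assign the same label."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's Kappa = (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(a)
    p_o = raw_agreement(a, b)
    # Chance agreement: sum over labels of P(rater A picks label) * P(rater B picks label).
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum((counts_a[k] / n) * (counts_b[k] / n) for k in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0


def majority_vote(labels_per_item):
    """Aggregate each item's labels by taking the most common one."""
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_item]


if __name__ == "__main__":
    rater_1 = ["pos", "neg", "pos", "pos", "neg"]
    rater_2 = ["pos", "neg", "neg", "pos", "neg"]
    print("raw agreement:", raw_agreement(rater_1, rater_2))
    print("Cohen's kappa:", cohens_kappa(rater_1, rater_2))
    print("majority vote:", majority_vote(zip(rater_1, rater_2)))
```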
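The multi-annotator and jury-learning ideas can be sketched as a classifier that conditions on an annotator embedding to predict that annotator’s label, with a sampled jury’s predictions aggregated by majority vote at decision time. This is a simplified PyTorch sketch with assumed dimensions and names, not the architecture from the cited work; training would pair each item with an annotator ID and that annotator’s own label under a standard cross-entropy loss.

```python
import torch
import torch.nn as nn


class AnnotatorConditionedClassifier(nn.Module):
    """Predicts an individual annotator's label from item features plus an annotator embedding."""

    def __init__(self, feature_dim, num_annotators, num_classes, annotator_dim=16, hidden=64):
        super().__init__()
        self.annotator_emb = nn.Embedding(num_annotators, annotator_dim)
        self.net = nn.Sequential(
            nn.Linear(feature_dim + annotator_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, features, annotator_ids):
        x = torch.cat([features, self.annotator_emb(annotator_ids)], dim=-1)
        return self.net(x)  # per-annotator logits


@torch.no_grad()
def jury_decision(model, features, jury_ids):
    """Aggregate a sampled jury's predicted labels for one item by majority vote."""
    feats = features.unsqueeze(0).expand(len(jury_ids), -1)
    preds = model(feats, jury_ids).argmax(dim=-1)
    return torch.mode(preds).values.item()


if __name__ == "__main__":
    model = AnnotatorConditionedClassifier(feature_dim=32, num_annotators=100, num_classes=2)
    item = torch.randn(32)
    jury = torch.randint(0, 100, (12,))  # sample 12 jurors
    print("jury verdict:", jury_decision(model, item, jury))
```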
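For the last point, one common family of techniques tracks each example’s training dynamics, for instance the margin between the logit of its assigned label and the largest other logit averaged over epochs (the idea behind the Area Under the Margin method); examples with persistently low margins are candidates for relabeling or removal. The tracker below is an illustrative sketch with assumed names, meant to be updated from inside a training loop.

```python
import torch


def logit_margins(logits, labels):
    """Margin = assigned-label logit minus the largest other-class logit, per example."""
    assigned = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
    masked = logits.clone()
    masked.scatter_(1, labels.unsqueeze(1), float("-inf"))
    return assigned - masked.max(dim=1).values


class MarginTracker:
    """Accumulates per-example margins across epochs; low average margin suggests a suspect label."""

    def __init__(self, num_examples):
        self.sum = torch.zeros(num_examples)
        self.count = torch.zeros(num_examples)

    def update(self, example_ids, logits, labels):
        # Call after each forward pass with the batch's dataset indices, logits, and labels.
        m = logit_margins(logits.detach(), labels)
        self.sum.index_add_(0, example_ids, m)
        self.count.index_add_(0, example_ids, torch.ones_like(m))

    def most_suspect(self, k):
        avg = self.sum / self.count.clamp(min=1)
        return torch.topk(-avg, k).indices  # lowest average margin first
```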
https://lilianweng.github.io/posts/2024-02-05-human-data-quality/