Learning from Crowds with Crowd-Kit

This paper presents Crowd-Kit, a general-purpose computational quality control toolkit for crowdsourcing. Crowd-Kit provides efficient and convenient Python implementations of popular quality control algorithms, including methods for truth inference, deep learning from crowds, and data quality estimation. Our toolkit supports multiple modalities of answers and provides dataset loaders and example notebooks for faster prototyping. We extensively evaluated our toolkit on several datasets of different natures, enabling the benchmarking of computational quality control methods in a uniform, systematic, and reproducible way using the same codebase. We release our code and data under the Apache License 2.0 at https://github.com/Toloka/crowd-kit.

```python
from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset

df, gt = load_dataset('relevance-2')  # binary relevance sample dataset

# run the Dawid-Skene categorical aggregation method
agg_ds = DawidSkene(n_iter=10).fit_predict(df)  # same format as gt
```

We implemented all the methods in Crowd-Kit from scratch in Python. Although, unlike spark-crowd (Rodrigo et al., 2019), our library does not provide a means for running on a distributed computational cluster, it leverages efficient implementations of numerical algorithms in the underlying libraries widely used in the research community. In addition to categorical aggregation methods, Crowd-Kit offers non-categorical aggregation methods, dataset loaders, and annotation quality estimators.

Maintenance and governance
Crowd-Kit is not bound to any specific crowdsourcing platform, allowing analysis of data from any crowdsourcing marketplace (as long as one can download the labeled data from that platform). Crowd-Kit is an open-source library that works under most operating systems and is available under the Apache License 2.0 both on GitHub and the Python Package Index (PyPI). All Crowd-Kit code has strict type annotations for additional safety and clarity. By the time of submission, our library had a test coverage of 93%.
We built Crowd-Kit on top of established open-source frameworks and best practices. We make extensive use of continuous integration via GitHub Actions for two purposes. First, every patch (commit in git terminology) triggers unit testing with coverage, type checking, linting, and a documentation and packaging dry run. Second, every release is automatically submitted to PyPI directly from GitHub Actions via the trusted publishing mechanism to avoid potential side effects on individual developer machines. Besides commit checks, every code change (pull request on GitHub) goes through a code review by the Crowd-Kit developers. We accept bug reports via GitHub Issues.

Functionality
Crowd-Kit implements a selection of popular methods for answer aggregation and learning from crowds, dataset loaders, and annotation quality characteristics.

Aggregating and learning with Crowd-Kit
Crowd-Kit features aggregation methods suitable for most kinds of crowdsourced responses, including categorical, pairwise, sequential, and image segmentation answers (see the summary in Table 1). Methods for categorical aggregation, which are the most widespread in practice, assume that there is only one correct objective label per task and aim at recovering the latent true label from the observed noisy data. Some of these methods, such as Dawid-Skene and GLAD, also estimate latent parameters of the workers, known as skills. Where the task design does not meet the latent label assumption, Crowd-Kit offers methods for aggregating pairwise comparisons, which are essential for subjective opinion gathering. Crowd-Kit also provides specialized methods for aggregating sequences (such as texts) and image segmentations. All these aggregation methods are implemented purely with NumPy, SciPy, pandas, and scikit-learn, without any deep learning framework. Last but not least, Crowd-Kit offers methods for deep learning from crowds, which learn an end-to-end machine learning model from the raw responses submitted by the workers without aggregation; they are available as ready-to-use modules for PyTorch (Paszke et al., 2019).
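As a brief illustration of the non-categorical interfaces, a segmentation aggregation call might look as follows. This is a minimal sketch: the SegmentationMajorityVote class name follows the Crowd-Kit documentation, but exact signatures and input column names should be checked against the installed version.

```python
from crowdkit.aggregation import SegmentationMajorityVote
from crowdkit.datasets import load_dataset

# aggregate pixel-wise segmentation masks by a per-pixel majority vote
df, gt = load_dataset('mscoco_small')
masks = SegmentationMajorityVote().fit_predict(df)
```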
One can easily add a new aggregation method to Crowd-Kit. For example, without loss of generality, to create a new categorical aggregator, one should extend the base class BaseClassificationAggregator and implement two methods, fit() and fit_predict(), filling the instance variable labels_ with the aggregated labels. Similarly, to add a new method for learning from crowds, one has to create a subclass of torch.nn.Module and implement the forward() method.
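For illustration, here is a minimal sketch of such a categorical aggregator. The FirstAnswerAggregator class and its toy aggregation rule are our own example rather than part of the library, and the base-class import path follows the current Crowd-Kit layout, which may change between versions.

```python
import pandas as pd

from crowdkit.aggregation.base import BaseClassificationAggregator


class FirstAnswerAggregator(BaseClassificationAggregator):
    """Toy aggregator that keeps each task's first observed label."""

    def fit(self, data: pd.DataFrame) -> 'FirstAnswerAggregator':
        # data is expected in the standard Crowd-Kit input format
        # with the columns: task, worker, label
        self.labels_ = data.groupby('task')['label'].first()
        return self

    def fit_predict(self, data: pd.DataFrame) -> pd.Series:
        return self.fit(data).labels_
```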

Dataset loaders
Crowd-Kit offers convenient dataset loaders for several popular or demonstrative datasets (see Table 2), allowing one to download them from the Internet in a ready-to-use form with a single line of code. It is possible to add new datasets in a declarative way and, if necessary, add the corresponding code to load the data as pandas data frames and series.
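For instance, the segmentation sample described later in the Evaluation section can be fetched with a single call; as in the earlier example, load_dataset returns the worker responses and the ground truth as pandas objects.

```python
from crowdkit.datasets import load_dataset

# download a dataset in a ready-to-use form with one line of code;
# df holds the raw worker responses, gt the ground-truth annotations
df, gt = load_dataset('mscoco_small')
```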

Evaluation
We extensively evaluate Crowd-Kit methods for answer aggregation and learning from crowds.
When possible, we compare our results with those reported by other authors; in either case, we show how the currently implemented methods perform on well-known datasets with noisy crowdsourced data, indicating the correctness of our implementations.

Evaluation of aggregation methods
Categorical. To ensure the correctness of our implementations, we compared the observed aggregation quality with the already available implementations by Zheng et al. (2017) and Rodrigo et al. (2019). Table 3 shows the evaluation results, indicating a level of quality similar to theirs: D_Product, D_PosSent, S_Rel, and S_Adult are real-world datasets from Zheng et al. (2017), and binary1 and binary2 are synthetic datasets from Rodrigo et al. (2019). Our implementation of M-MSR could not process the D_Product dataset in a reasonable time, KOS can be applied to binary datasets only, and none of our implementations handled the binary3 and binary4 synthetic datasets, which require a distributed computing cluster.

Pairwise. We compared the Bradley-Terry and noisyBT methods implemented in Crowd-Kit to the random baseline on the graded readability dataset by Chen et al. (2013) and a larger people age dataset by Pavlichenko & Ustalov (2021); Table 4 shows the results.

Sequence. We used two datasets, CrowdWSA (Li & Fukumoto, 2019) and CrowdSpeech (Pavlichenko et al., 2021). As the typical application for sequence aggregation in crowdsourcing is audio transcription, we used the word error rate (Fiscus, 1997) as the quality criterion in Table 5.

Segmentation. On the Toloka crowdsourcing platform, we annotated a sample of 2,000 images from the MS COCO dataset (Lin et al., 2014) covering four object labels. For each image, nine workers submitted segmentations, yielding 18,000 responses in total. The dataset is available in Crowd-Kit as mscoco_small. Table 6 compares the methods on this dataset using the intersection over union (IoU) criterion.
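Such comparisons take only a few lines of code to reproduce. Below is a minimal sketch of a benchmarking loop on the bundled relevance-2 sample; the loop structure is illustrative rather than our exact evaluation script, and it assumes accuracy is computed on the tasks with released ground truth.

```python
from crowdkit.aggregation import DawidSkene, MajorityVote
from crowdkit.datasets import load_dataset

df, gt = load_dataset('relevance-2')  # worker responses and ground truth

for method in (MajorityVote(), DawidSkene(n_iter=100)):
    labels = method.fit_predict(df)  # one aggregated label per task
    accuracy = (labels.loc[gt.index] == gt).mean()
    print(f'{type(method).__name__}: {accuracy:.3f}')
```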

Evaluation of methods for learning from crowds
To demonstrate the impact of learning from raw annotator labels compared to answer aggregation in crowdsourcing, we compared the implemented methods for learning from crowds with two classical aggregation algorithms, Majority Vote (MV) and Dawid-Skene (DS). We picked the two most common machine learning tasks for which ground truth datasets are available: text classification and image classification. For text classification, we used the IMDB Movie Reviews dataset (Maas et al., 2011), and for image classification, we chose CIFAR-10 (Krizhevsky, 2009). In each dataset, every object was annotated by three different annotators; 100 objects were used as golden tasks.
We compared how different methods for learning from crowds impact test accuracy. We picked two backbone networks for text classification, LSTM (Hochreiter & Schmidhuber, 1997) and RoBERTa (Liu et al., 2019), and one backbone network for image classification, VGG-16 (Simonyan & Zisserman, 2015). Then, we trained each backbone in three scenarios: a fully connected layer after the backbone that does not take any specifics of crowdsourcing into account (Base), the CrowdLayer method by Rodrigues & Pereira (2018), and the CoNAL method by Chu et al. (2021). Table 7 shows the evaluation results. It is crucial to make a well-informed model selection to achieve optimal results. We believe that Crowd-Kit can reliably and easily integrate these methods into machine learning pipelines that use crowdsourced data.
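To make the setup concrete, here is a minimal PyTorch sketch of the CrowdLayer idea underlying these experiments: one learned confusion-style transform per worker maps the backbone's class logits to worker-specific answer logits, and the loss is computed against the raw, non-aggregated worker labels. This is an illustrative reimplementation of the general technique, not Crowd-Kit's exact module; the backbone stand-in, tensor shapes, and training step are assumptions.

```python
import torch


class CrowdLayerSketch(torch.nn.Module):
    """Per-worker linear transform of class logits (CrowdLayer-style)."""

    def __init__(self, n_workers: int, n_classes: int) -> None:
        super().__init__()
        # initialize each worker's transform to the identity matrix
        self.transforms = torch.nn.Parameter(
            torch.eye(n_classes).repeat(n_workers, 1, 1))

    def forward(self, logits: torch.Tensor, workers: torch.Tensor) -> torch.Tensor:
        # logits: (batch, n_classes); workers: (batch,) worker indices
        return torch.einsum('bij,bj->bi', self.transforms[workers], logits)


# a single (hypothetical) training step against raw worker labels
backbone = torch.nn.Linear(512, 10)    # stand-in for LSTM/RoBERTa/VGG-16
crowd_layer = CrowdLayerSketch(n_workers=100, n_classes=10)
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(crowd_layer.parameters()))

features = torch.randn(32, 512)        # batch of backbone inputs
workers = torch.randint(0, 100, (32,)) # which worker gave each label
labels = torch.randint(0, 10, (32,))   # raw worker labels, no aggregation

loss = torch.nn.functional.cross_entropy(
    crowd_layer(backbone(features), workers), labels)
loss.backward()
optimizer.step()
```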

Conclusion
Our experience running Crowd-Kit in production for processing crowdsourced data at Toloka shows that it successfully handles industry-scale datasets without needing a large compute cluster. We believe that the availability of computational quality control techniques in a standardized way will open new avenues for reliable improvement of crowdsourcing quality beyond the traditional well-known methods and pipelines.

Table 1: Summary of the implemented methods in Crowd-Kit.

Table 2: Summary of the datasets provided by Crowd-Kit.

Table 3: Comparison of the implemented categorical aggregation methods (accuracy is used).

Table 4: Comparison of the implemented pairwise aggregation methods (Spearman's ρ is used).

Table 5: Comparison of the implemented sequence aggregation methods (average word error rate is used).

Table 6: Comparison of the implemented image aggregation algorithms (IoU is used).

Table 7: Comparison of different methods for deep learning from crowds with traditional answer aggregation methods (test set accuracy is used). Our experiment shows the feasibility of training a deep learning model directly from the raw annotated data, skipping trivial aggregation methods like MV. However, specialized methods like CoNAL and CrowdLayer or non-trivial aggregation methods like DS can significantly enhance prediction accuracy.