With thousands of annotators making millions of evaluations across hundreds of tasks every day, Toloka is a major source
of human-labeled training data. Toloka supports academic research and innovation by sharing large amounts of
accurate data applicable to machine learning in a variety of areas.
Please note: These public datasets are available for non-commercial use only, with a clear reference to Toloka as the source of the data.
If you plan to use any of these datasets for commercial purposes, please contact us to obtain our consent.
ZIP archive, 10.8 GB
Labels: texts.tsv
Photos: images/
ZIP archive, 19.5 GB
Labels: data.tsv
Photos: photos/
ZIP archive, 981 MB
Photos: images/
Masks: masks/
Collages: collage/
Raw data: general-new.tsv
In each folder:
Labels: full-annotation-result-new.tsv
Demonstrative examples with their expected labels: train.csv
Development set: RuBQ_2.0_dev.json
Test set: RuBQ_2.0_test.json
Paragraphs: RuBQ_2.0_paragraphs.json
Development set: RuBQ_1.0_dev.json
Test set: RuBQ_1.0_test.json
ZIP archive, 8.19 MB
Profiles: profile.tsv
Dialogues: dialogues.tsv
ZIP archive, 95.6 KB
Training data: task2_ru_train.tsv
Validation data: task2_ru_validation.tsv
Testing data: task2_ru_test.tsv
Script for downloading tweets: download_tweets.py
Description and script instructions: Readme.md
ZIP archive, 2.01 MB
Input data: lrwc-1.1-assignments.tsv
Training tasks: toloka-isa-50-skip-300-train-hit.tsv
Aggregated results: lrwc-1.1-aggregated.tsv
ZIP archive, 0.45 MB
Ground truth: golden_labels.tsv
Features: features.tsv
Crowd labels: crowd_labels.tsv
ZIP archive, 2.23 MB
Crowd labels: assignments_01-12-2017.tsv
Ground truth: report-curated.tsv.xz
Aggregated results: bts-rnc-crowd.tsv
ZIP archive, 2.6 MB
crowdspeech-dev-clean:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
crowdspeech-dev-other:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
crowdspeech-test-clean:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
crowdspeech-test-other:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
ZIP archive, 3.08 MB
Crowd labels: crowd_labels.tsv
Ground truth: golden_labels.tsv
ZIP archive, 7.17 MB
Crowd labels: crowd_labels.tsv
Ground truth: golden_labels.tsv
Banned users: bans.tsv
ZIP archive, 1.07 GB
Completed tasks: assignments.tsv
Project data: projects.tsv
Anonymized user data: users.tsv
Task selection sessions: visits.tsv
ZIP archive, 9 MB
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
Have a dataset that you are ready to share? Submit it for publication on this page.
Use the Toloka platform to prepare a dataset that meets your needs.