Public datasets

With thousands of annotators making millions of evaluations in hundreds of tasks every day, Toloka is a major source
of human-marked training data. Toloka supports academic research and innovation by sharing large amounts of
accurate data applicable to machine learning in a variety of areas.

Please note: These public datasets are only available for non-commercial use with a clear reference to Toloka as the source of data.
If you plan to use any of these datasets for commercial purposes, please contact us to obtain our consent.

    • Handwritten Text Datasets
      This dataset contains over 7,000 images of handwritten text from more than 100 unique contributors in 3 languages: Spanish, French, and Arabic. The dataset works well for training and testing recognition models for handwritten text. The images contain text with punctuation and characters unique to each language that don't exist in the Latin alphabet, which makes the recognition task more challenging compared to other open-source benchmark datasets available for text recognition.

    ZIP archive, 10.8 GB
    Labels: texts.tsv
    Photos: images/

    • Toloka Business ID Recognition
      This dataset, commissioned by the Yandex Business Directory, contains 10,000 photos of organization information signs shot in the Russian Federation along with the INN (taxpayer ID) and OGRN (Primary State Registration Number) codes shown on these signs. Toloka was used for both capturing photos and recognizing INN and OGRN codes.

    ZIP archive, 19.5 GB
    Labels: data.tsv
    Photos: photos/

    • Toloka WaterMeters
      This dataset, collected by Roman Kucev, contains 1,244 images of hot and cold water meters along with their readings and the coordinates of the displays showing those readings. Each image contains exactly one water meter. The archive also includes the segmentation results as masks and collages. Toloka was used for capturing the photos, segmentation, and recognizing the readings.

    ZIP archive, 981 MB
    Photos: images/
    Masks: masks/
    Collages: collage/

    • Human Evaluation of Generated Stories
      Collected for the paper “Crowdsourced Human Evaluation Data in Plot Writing From Pre-Trained Language Models”, this dataset evaluates generated stories from various baselines on multiple aspects: naturalness, interestingness, cohesiveness, and story ending. Separate evaluation tasks were run for each aspect of naturalness, interestingness, and cohesiveness in 50 generated stories. An additional task evaluated story endings in 50 randomly selected pairs (story, ending) as pairwise comparisons.

    Raw data: general-new.tsv
    In each folder:
    Labels: full-annotation-result-new.tsv
    Demonstrative examples with their expected labels: train.csv

    • RuBQ 2.0: An Innovated Russian Question Answering Dataset
      RuBQ 2.0 is the second version of RuBQ. It contains 2,910 questions along with the answers and SPARQL queries. The dataset can be used for the evaluation of KBQA and machine reading comprehension, paragraph retrieval, end-to-end open-domain question answering and experiments in hybrid QA, where KBQA and text-based QA can enrich and complement each other.

    Development set: RuBQ_2.0_dev.json
    Test set: RuBQ_2.0_test.json
    Paragraphs: RuBQ_2.0_paragraphs.json

    • RuBQ 1.0: A Russian Dataset for Question Answering over Wikidata
      RuBQ 1.0 (Russian Knowledge Base Questions, pronounced [‘rubik]) is the first Russian dataset for Knowledge Base Question Answering (KBQA). It consists of 1,500 questions of varying complexity along with their English machine translations, corresponding SPARQL queries, answers, and a subset of Wikidata covering entities with Russian labels. The dataset is intended for use as development and test sets in cross-lingual transfer, few-shot learning, or learning with synthetic data scenarios.

    Development set:
    Test set: RuBQ_1.0_test.json

    • Toloka Persona Chat Rus
      This dataset of 10,000 dialogues was gathered by MIPT's Neural Networks and Deep Learning Lab for conversational AI research. It contains profiles of imaginary personalities with descriptions, and dialogues between participants who were given a random profile and instructed to imitate the described personality.

    ZIP archive, 8.19 MB
    Profiles: profile.tsv
    Dialogues: dialogues.tsv

    • The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)
      Created as part of the Social Media Mining for Health Applications (#SMM4H '20) shared tasks, this dataset consists of 9,515 tweets describing health issues. Each tweet is labeled for whether it contains information about an adverse side effect that occurred when taking a drug. The dataset was a joint effort with the UPenn HLP Center and the Chemoinformatics and Molecular Modeling Research Laboratory at Kazan Federal University.

    ZIP archive, 95.6 KB
    Training data: task2_ru_train.tsv
    Validation data: task2_ru_validation.tsv
    Testing data: task2_ru_test.tsv
    Script for downloading tweets:
    Description and script instructions:

    • Lexical Relations from the Wisdom of the Crowd (LRWC)
      This dataset, assembled by Dmitry Ustalov in 2017 for the Watlink method, contains the opinions of Russian native speakers about the relationship between a generic term (hypernym) and a specific instance of this term (hyponym) in 10,600 word pairs. It is based on the nouns from the Russian National Corpus and relationships from the RuThes and RuWordNet lexical ontologies.

    ZIP archive, 2.01 MB
    Input data: lrwc-1.1-assignments.tsv
    Training tasks: toloka-isa-50-skip-300-train-hit.tsv
    Aggregated results: lrwc-1.1-aggregated.tsv

    • Toloka Aggregation Features
      This dataset contains about 60,000 crowdsourced labels gathered on Toloka for 1,000 tasks and ground truth labels for almost all of them. The task was to classify websites into five categories based on the presence of adult content. Additionally, each task has 52 real-valued features that can be used to predict the category.

    ZIP archive, 0.45 MB
    Ground truth: golden_labels.tsv
    Features: features.tsv
    Crowd labels: crowd_labels.tsv

  • Download

    ZIP archive, 2.23 MB
    Crowd labels:
    Ground truth: report-curated.tsv.xz
    Aggregated results: bts-rnc-crowd.tsv

  • Download

    ZIP archive, 2.6 MB
    Crowd labels: crowd_labels.csv
    Ground truth: gt.csv

    • Toloka Aggregation Relevance 2
      This dataset, designed for evaluating answer aggregation methods in crowdsourcing, contains around 0.5 million anonymized crowdsourced labels collected in the Relevance 2 Gradations project in 2016 at Yandex. In this project, query-document pairs are provided with binary labels: relevant or non-relevant. The dataset also contains gold labels for comparing aggregation methods.

    ZIP archive, 3.08 MB
    Crowd labels: crowd_labels.tsv
    Ground truth: golden_labels.tsv

    • Toloka Aggregation Relevance 5
      This dataset was designed for evaluating answer aggregation methods in crowdsourcing. It contains around 1 million anonymized crowdsourced labels collected in the Relevance 5 Gradations project in 2016 at Yandex. In this project, query-document pairs are labeled on a scale of 1 to 5, from most relevant to least relevant. The dataset also contains gold labels for comparing aggregation methods.

    ZIP archive, 7.17 MB
    Crowd labels: crowd_labels.tsv
    Ground truth: golden_labels.tsv
    Banned users: bans.tsv

    • Toloka Users & Tasks
      Collected for the KDD '20 paper "Prediction of Hourly Earnings and Completion Time on a Crowdsourcing Platform", this dataset contains user activity sessions recorded in 18 million tasks performed by 161,377 users in Toloka over a three-month period (September-November 2018). It includes timestamps, anonymized project and user identifiers, reward information, number of microtasks, instructions, data schema description, responses, and various descriptive task properties.

    ZIP archive, 1.07 GB
    Completed tasks: assignments.tsv
    Project data: projects.tsv
    Anonymized user data: users.tsv
    Task selection sessions: visits.tsv

    • IMDB-WIKI-SbS
      This dataset, described in the NeurIPS '20 Data-Centric AI Workshop paper "IMDB-WIKI-SbS: An Evaluation Dataset for Crowdsourced Pairwise Comparisons", contains 9,150 images appearing in 250,249 paired comparisons annotated on the Toloka crowdsourcing platform. It has balanced distributions of age and gender, using the well-known IMDB-WIKI dataset as ground truth.

    ZIP archive, 9 MB
    Crowd labels: crowd_labels.csv
    Ground truth: gt.csv
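
Several of the datasets above pair crowd labels with ground truth so that answer aggregation methods can be compared. As a minimal sketch of the simplest such baseline, majority vote, the Python below aggregates (worker, task, label) triples and scores the result against gold labels. The function names, the triple layout, and the toy data are assumptions made for illustration. This page does not specify the column layout of crowd_labels.tsv or golden_labels.tsv, so check the files before mapping their rows into this shape.

```python
from collections import Counter, defaultdict


def majority_vote(crowd_rows):
    """Aggregate (worker, task, label) triples into one label per task."""
    votes = defaultdict(Counter)
    for _worker, task, label in crowd_rows:
        votes[task][label] += 1
    # Pick the most frequent label; ties go to the label that sorts first,
    # so the result is deterministic.
    return {
        task: min(counts, key=lambda lab: (-counts[lab], lab))
        for task, counts in votes.items()
    }


def accuracy(predicted, gold):
    """Share of gold-labeled tasks whose aggregated label matches the gold label."""
    scored = [task for task in gold if task in predicted]
    if not scored:
        return 0.0
    return sum(predicted[task] == gold[task] for task in scored) / len(scored)


# Toy data (hypothetical, not taken from any dataset on this page).
crowd = [
    ("w1", "t1", "relevant"),
    ("w2", "t1", "relevant"),
    ("w3", "t1", "non-relevant"),
    ("w1", "t2", "non-relevant"),
    ("w2", "t2", "non-relevant"),
    ("w3", "t3", "relevant"),
]
gold = {"t1": "relevant", "t2": "relevant"}

aggregated = majority_vote(crowd)
print(aggregated)
print(accuracy(aggregated, gold))
```

Tasks without a gold label are left out of the score, which suits datasets where ground truth covers most but not all tasks. More elaborate aggregation methods can be compared by swapping out majority_vote while keeping the same accuracy check.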

Have a dataset that you are ready to share? Submit it for publication on this page.
