Last week Yandex, the Russian NASDAQ-listed digital tech giant, launched an online course on data labeling, presenting the initiative as a world first.
This five-week course aims to teach “efficient and scalable data labeling for ML and various business processes” based on a crowdsourcing approach. Students are taught how to “split complex challenges into small tasks and distribute them among a vast cloud of performers,” while “integrating this on-demand workforce directly into [business] processes and build human-in-the-loop processes.”
The course also covers how to “control the quality and accuracy of data labeling to develop high performing ML models,” as well as the design and running of “a full-cycle crowdsourcing project, from planning to getting labeled data.”
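To make the quality-control idea more concrete, here is a minimal sketch of one widely used approach: giving the same item to several performers and keeping the most frequent answer (majority voting). It is not taken from the course or from Toloka's tooling; the function name, the data layout and the sample votes are all hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_majority(votes):
    """Aggregate crowd labels by majority vote (hypothetical helper).

    votes: iterable of (task_id, worker_id, label) tuples.
    Returns {task_id: (winning_label, share_of_votes_for_that_label)}.
    Ties are not handled specially: the label seen first wins.
    """
    labels_by_task = defaultdict(list)
    for task_id, _worker_id, label in votes:
        labels_by_task[task_id].append(label)

    result = {}
    for task_id, labels in labels_by_task.items():
        winner, count = Counter(labels).most_common(1)[0]
        result[task_id] = (winner, count / len(labels))
    return result


# Made-up example: three performers label each of two images.
votes = [
    ("img1", "w1", "cat"), ("img1", "w2", "cat"), ("img1", "w3", "dog"),
    ("img2", "w1", "dog"), ("img2", "w2", "dog"), ("img2", "w3", "dog"),
]
aggregated = aggregate_majority(votes)
for task_id, (label, agreement) in aggregated.items():
    print(task_id, label, f"{agreement:.0%}")
```

A low agreement share can flag items that may need more votes or a clearer task description. Real projects often go further and weight performers by their track record.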
‘Practical Crowdsourcing for Efficient Machine Learning’ is available free of charge on the Coursera platform. No background knowledge is required, since the course is intended both for professionals in the fields of ML development, data analysis and research, and for students who are just considering these careers.
The course material is backed by in-depth expertise and real-life case studies from Toloka, an international crowdsourced data-labeling platform with Russian roots.
Russian data know-how for global companies
Praising the “wisdom of the global crowd for high-quality data,” Toloka is a Yandex spin-off. Back in 2014, it started as an internal service to address in-house data labeling needs. Olga Megorskaya, a Yandex data quality assessor working remotely, noticed that the company’s data labeling process was far from satisfactory and suggested using a flexible, distributed workforce.
From an internal data markup tool, the project grew into a commercial solution for IT, e-commerce and retail companies in Russia, where it holds a leading position in the domestic market. Toloka eventually began providing services to foreign companies in Europe, Asia and the United States.
Earlier this year, Toloka registered a separate entity in Switzerland. Today the company serves some 2,400 businesses worldwide and continues to expand its reach. It boasts a base of around 9 million workers, referred to as ‘Tolokers,’ located all across the world.
The company’s press service told East-West Digital News that it is in the process of obtaining a global data safety certification (ISO) and moving its servers to the Azure cloud.
A $12 billion market by 2025
AI and machine learning still rely on human intelligence at the data preparation stage, explains Yandex: people are needed to label the data. This preparation often involves text or image classification or checking audio transcriptions.
Thus, as the use of AI becomes more commonplace, the global data labeling market is growing considerably. McKinsey has estimated it at $2 billion in 2019 and expects it to reach some $12.1 billion by 2025 (cited by Toloka).
More specifically, the market for third-party data labeling could exceed $1 billion by 2023, up from $150 million in 2018, according to Cognilytica forecasts (cited by Yandex).
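For readers who want to compare the two forecasts, the implied yearly growth can be worked out with the standard compound annual growth rate (CAGR) formula. The sketch below is only illustrative arithmetic on the figures quoted above; the function name is hypothetical, and "could exceed $1 billion" is treated as exactly $1 billion, so the second rate is a lower bound.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures in billions of US dollars, as quoted in the article.
overall = cagr(2.0, 12.1, 2025 - 2019)       # whole data labeling market
third_party = cagr(0.15, 1.0, 2023 - 2018)   # third-party labeling, lower bound

print(f"Overall market: {overall:.1%} per year")
print(f"Third-party labeling: {third_party:.1%} per year")
```

On these numbers, the overall market would be growing by roughly 35% a year and the third-party segment by at least about 46% a year.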