Collecting Data for ASR

Effective methods for gathering speech data: from Common Voice contributions to scripted recordings.

Automatic Speech Recognition (ASR) is the technology that powers voice assistants, automatic transcription, and voice commands. The performance of any ASR system is directly tied to the quantity and quality of the speech data it was trained on. This guide covers the best ways to collect this vital resource for your language.

The Gold Standard: Mozilla Common Voice

The best place to start and focus your community's efforts is Mozilla Common Voice. It is a global, open-source initiative to build the world's largest public domain voice dataset.

Why Common Voice is the Best Choice

  • High Visibility: It's the first place AI researchers and major tech companies look for speech data. Being on Common Voice makes your language visible to the world.
  • Structured Data: It doesn't just collect audio; it collects structured, validated pairs of audio clips and their corresponding text transcriptions, which is exactly what's needed for training.
  • Community-Driven: It provides a ready-made platform for your community to contribute by reading sentences (speaking) and listening to others' recordings to validate them.

A More Flexible Platform: The New Common Voice API

In the past, some communities found the Common Voice website inflexible. That has changed: in response to community feedback, Mozilla has launched a Public API for Common Voice, a major step forward for data collection.

What the API Allows You to Do

The API means you are no longer limited to the official website. You can now build your own applications and tools that send data directly to the Common Voice database. This opens up incredible possibilities:

  • Create a simple mobile app to encourage on-the-go contributions.
  • Build a Telegram bot for a gamified contribution experience.
  • Develop tools that support offline contributions in areas with poor internet access.
  • Design interfaces that support multiple writing systems or dialects for your language.
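As a concrete illustration, a custom tool ultimately just packages a recording with its prompt sentence and sends it to the Common Voice backend. The sketch below shows the general shape of such a contribution payload; the endpoint URL and field names are assumptions for illustration only, so check the official Common Voice API documentation for the real schema and authentication flow.

```python
# Minimal sketch of a custom contribution tool.
# NOTE: API_URL and the payload fields are hypothetical placeholders,
# not the real Common Voice API schema.
import base64
import json

API_URL = "https://example.org/common-voice/clips"  # hypothetical endpoint

def build_clip_payload(audio_bytes, sentence, locale):
    """Package one recording and its prompt sentence for upload."""
    return json.dumps({
        "locale": locale,
        "sentence": sentence,
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
    })

payload = build_clip_payload(b"\x00\x01", "Example sentence.", "kk")
# An offline-first app could queue payloads like this locally and POST
# them (e.g. with urllib.request) once connectivity returns.
```

This queue-then-upload pattern is exactly what makes the offline contribution tools mentioned above feasible in areas with poor internet access.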

Get Funding for Your Project

To support the adoption of this new API, Mozilla has also launched a Developer Fund to help projects bring Common Voice closer to their communities. If you have an idea for a custom tool, you may be eligible for funding. You can contact their team directly with questions at commonvoice@mozilla.com.

How to Drive Contributions: Our Contest Method

Simply having a platform isn't enough; you need to motivate people to use it. At Homai, we've had great success turning data collection into a community-wide contest.

1. Organize a Competition

Announce a data collection drive with a clear goal (e.g., "Let's record 100 hours of speech in one month!"). Secure desirable prizes for the most active contributors.

2. Offer Meaningful Incentives

We've found that prizes like smartphones, laptops, or gift certificates generate significant excitement and participation. A small investment in prizes can yield a massive return in high-quality data.

3. Promote and Gamify

Use social media to promote the contest. Post regular updates on leaderboards to foster friendly competition. This gamification makes contributing feel rewarding and fun.
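A leaderboard can be as simple as tallying validated clips per contributor. This toy sketch (the contributor names are invented for illustration) shows the idea:

```python
# Toy leaderboard: tally validated clips per contributor and rank them.
from collections import Counter

def leaderboard(validated_clips):
    """validated_clips: list of contributor usernames, one per clip."""
    return Counter(validated_clips).most_common()

clips = ["aigerim", "bolat", "aigerim", "dana", "aigerim", "bolat"]
print(leaderboard(clips))
# → [('aigerim', 3), ('bolat', 2), ('dana', 1)]
```

Posting a ranking like this weekly on social media is usually enough to spark the friendly competition described above.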

Diversity is Key! It's crucial that as many different people as possible contribute. An ASR model needs to learn from a wide variety of voices—young and old, male and female, with different accents and speaking styles. A contest helps achieve this by broadening participation beyond a small group of experts.

How Much Data Do You Need?

The simple answer is: the more, the better. However, modern ASR techniques have made it possible to get started with surprisingly little data.

You can train the first useful ASR models with as little as 10 hours of validated speech data. This is an achievable goal for a motivated community and provides a clear target for your first data collection drive.
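To track progress toward that first 10-hour goal, you only need to sum the durations of your validated clips. A minimal sketch (the 4-second average clip length here is an illustrative assumption):

```python
# Track progress toward a first 10-hour goal by summing clip durations.
def validated_hours(clip_durations_sec):
    """Total validated speech, in hours, from per-clip durations in seconds."""
    return sum(clip_durations_sec) / 3600

# e.g. 9,000 validated clips averaging 4 seconds each:
hours = validated_hours([4.0] * 9000)
print(hours)
# → 10.0
```

Framed this way, the 10-hour target translates into a concrete, countable goal for a contest: on the order of ten thousand short validated clips.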

Finding Data in the Wild

In addition to new recordings, you can find valuable speech data in existing sources. This audio usually lacks transcriptions and requires extra work, but it's a great way to supplement your dataset.

  • Public radio and television broadcasts
  • YouTube channels, podcasts, and vlogs in your language
  • Audiobooks and other recorded media

Advanced: Improving "Wild" Data with AI

You can use an iterative, AI-assisted process to create transcriptions for this "found" audio:

  1. Initial ASR Pass: Run the audio through a preliminary, even low-quality, ASR model to get a rough first draft of the transcription.
  2. LLM Refinement: Feed the rough transcription and the audio context to a powerful LLM. Using your digitized dictionary as a guide, ask the LLM to correct the text.
  3. Human Verification: The final, AI-cleaned text must be reviewed by a human to catch any remaining errors. This process is much faster than transcribing from scratch.
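The three steps above can be sketched as a small pipeline. Here `rough_asr` and `llm_correct` are placeholders: in practice you would call your preliminary ASR model and an LLM API respectively, and the dictionary-based word substitution below only stands in for real LLM correction.

```python
# Sketch of the three-step refinement loop for "found" audio.
# rough_asr() and llm_correct() are illustrative stubs, not real models.

def rough_asr(audio_path):
    # Placeholder for a first draft from a preliminary, low-quality ASR model.
    return "the quick brovn focs"

def llm_correct(draft, dictionary):
    # Placeholder: ask an LLM to fix the draft, guided by your digitized
    # dictionary. Here we just substitute known misspellings word by word.
    return " ".join(dictionary.get(word, word) for word in draft.split())

def transcribe(audio_path, dictionary):
    draft = rough_asr(audio_path)             # 1. initial ASR pass
    cleaned = llm_correct(draft, dictionary)  # 2. LLM refinement
    return cleaned  # 3. queue for human review before adding to the dataset

dictionary = {"brovn": "brown", "focs": "fox"}
print(transcribe("broadcast_001.wav", dictionary))
# → the quick brown fox
```

The key design point is the order of the steps: the cheap ASR pass gives the LLM something concrete to correct, and the human reviewer only touches near-final text, which is far faster than transcribing from scratch.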

Resources

Mozilla Common Voice

The central platform for community-driven speech data collection.

Start Contributing

YouTube

A vast, searchable source of video and audio content in many languages.

Search YouTube

Hugging Face Datasets

Once collected, you can host your ASR dataset here for global visibility.

Explore ASR Datasets

Ready to Train Your Model?

With a growing dataset of clean, transcribed audio, you're ready to train an Automatic Speech Recognition model.

Learn How to Train an ASR Model