Text-to-Speech (TTS), or speech synthesis, is the technology that gives a voice to our devices. Creating a natural, high-quality synthetic voice requires a very special kind of dataset. Unlike data for speech recognition (ASR), which benefits from many different speakers and noisy environments, TTS data must be pristine, consistent, and recorded under studio-like conditions.
TTS vs. ASR Data: Quality over Quantity
For ASR, you want thousands of voices in different environments. For TTS, you want one perfect voice in one perfect environment. Every background noise, every change in tone, every cough or lip smack in your dataset can be learned by the model and reproduced in the final synthetic voice.
The Three Pillars of Quality TTS Data
Creating a great TTS voice rests on three critical components: the speaker, the space, and the equipment.
1. The Speaker
You are not just recording a voice; you are cloning it. The speaker you choose will define the final synthetic voice. Look for someone with:
- A Clear and Consistent Voice: The speaker should have excellent articulation and be able to maintain a consistent volume, pitch, and pace for long periods.
- Stamina and Patience: Recording 10+ hours of audio is a marathon. The speaker needs the physical and mental stamina to read thousands of sentences without their voice degrading.
- A Neutral Tone: For a general-purpose voice, the default recording style should be a neutral, declarative tone—like a newscaster. Expressive styles can be recorded later.
2. The Recording Space
A professional microphone is useless in a bad room. Your recording space must be:
- Quiet: Free from external noise (traffic, birds) and internal noise (refrigerators, air conditioning, computer fans).
- Non-Reverberant ("Dead"): Free from echo and reverberation. Hard, flat surfaces such as bare walls and windows reflect sound back into the microphone. Treat the space with sound-absorbing materials such as acoustic foam or heavy blankets, or record inside a closet full of clothes, to dampen those reflections.
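One way to check the "quiet" requirement is to record a few seconds of silence in the room (often called room tone) and measure its level. The sketch below is a minimal illustration, assuming the audio has already been loaded as floating-point samples between -1.0 and 1.0. The function names and the -60 dBFS threshold are illustrative assumptions, not values prescribed by this guide.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples (range -1.0..1.0) in dBFS."""
    if not samples:
        raise ValueError("need at least one sample")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square)


def room_is_quiet(room_tone, threshold_dbfs=-60.0):
    """True if the room-tone recording sits at or below the threshold.

    threshold_dbfs is an assumed, adjustable value; choose one that
    suits your microphone and gain settings.
    """
    return rms_dbfs(room_tone) <= threshold_dbfs
```

If the check fails, switch off one noise source at a time and measure again to find the source that matters most.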
3. The Equipment
- Professional Microphone: A large-diaphragm condenser microphone is the industry standard. It captures the rich detail and nuance of the human voice.
- Audio Interface: This device connects the microphone to your computer. It supplies the phantom power that a condenser microphone needs and converts the analog signal to digital audio.
- Pop Filter: A screen placed between the speaker and the microphone. It is essential to prevent "plosives"—the harsh bursts of air from 'p' and 'b' sounds—from creating loud, unpleasant thumps in the recording.
- Headphones: Used to monitor the recording in real time and catch noise or other issues as they happen.
The Recording Process
How much data is needed?
A good starting point for a high-quality voice is around 10 hours of clean, recorded audio. The exact amount can vary depending on the phonetic complexity of your language. For more natural and expressive voices, 20+ hours may be required.
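For planning, the target duration can be converted into an approximate sentence count. The sketch below is simple arithmetic. The average sentence length of 6 seconds is a placeholder assumption, not a figure from this guide, so measure your own average on a short pilot recording first.

```python
import math


def sentences_needed(target_hours, avg_sentence_seconds):
    """Estimate how many script sentences fill target_hours of audio."""
    if avg_sentence_seconds <= 0:
        raise ValueError("avg_sentence_seconds must be positive")
    return math.ceil(target_hours * 3600 / avg_sentence_seconds)


# Assumed average of 6 seconds per sentence (placeholder value).
baseline = sentences_needed(10, 6.0)    # 10-hour starting point
expressive = sentences_needed(20, 6.0)  # 20-hour expressive voice
```

Under that assumption, the 10-hour target works out to 6,000 sentences and the 20-hour target to 12,000.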
Prepare a Phonetically Balanced Script
You can't just read a random book. The recording script must be carefully constructed to include all the sounds (phonemes) and adjacent sound pairs (diphones) of your language, each multiple times. This ensures the model has enough examples to learn how to pronounce anything.
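One common way to build such a script is greedy selection: from a large pool of candidate sentences, repeatedly pick the one that adds the most diphones still below a target count. The sketch below is one possible approach under stated assumptions, not the only way to balance a script. It assumes each candidate has already been converted to a list of phoneme symbols by a pronunciation lexicon or grapheme-to-phoneme tool of your choice. The function names and the `min_count` default are hypothetical.

```python
from collections import Counter


def diphones(phonemes):
    """Return the adjacent phoneme pairs in a phoneme sequence."""
    return list(zip(phonemes, phonemes[1:]))


def _gain(phonemes, target, covered):
    """Count the still-needed diphone occurrences a sentence would add."""
    gain = 0
    for pair, n in Counter(diphones(phonemes)).items():
        need = target[pair] - covered[pair]
        if need > 0:
            gain += min(n, need)
    return gain


def select_script(candidates, min_count=2):
    """Greedily choose sentences until every diphone found in the pool
    appears min_count times (or as often as the pool allows).

    candidates: dict mapping sentence text -> list of phoneme symbols.
    """
    available = Counter()
    for phonemes in candidates.values():
        available.update(diphones(phonemes))
    # A diphone cannot be required more often than it occurs in the pool.
    target = {pair: min(min_count, n) for pair, n in available.items()}

    covered = Counter()
    remaining = dict(candidates)
    selected = []
    while remaining:
        best = max(remaining, key=lambda t: _gain(remaining[t], target, covered))
        if _gain(remaining[best], target, covered) == 0:
            break  # nothing left adds coverage
        covered.update(diphones(remaining.pop(best)))
        selected.append(best)
    return selected
```

Sentences that add no new coverage are left out, which keeps the script short. In practice you would keep adding sentences afterwards until the script also reaches the target recording duration.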
Guide the Speaker
During the session, the speaker should maintain a consistent distance from the microphone. They should read in a steady, neutral tone unless a specific emotion is requested. Frequent breaks are essential to prevent vocal fatigue.
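A drifting distance from the microphone usually shows up as a drifting recording level, so comparing clip levels is a cheap consistency check. The sketch below flags clips whose level differs from the session median by more than a tolerance. It assumes clips are already loaded as float samples between -1.0 and 1.0. The 3 dB tolerance and the function names are illustrative assumptions.

```python
import math
from statistics import median


def level_db(samples):
    """RMS level in dBFS, floored to avoid log(0) on digital silence."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(max(mean_square, 1e-12))


def flag_level_outliers(clips, tolerance_db=3.0):
    """Return the names of clips whose level is far from the median.

    clips: dict mapping clip name -> list of float samples.
    tolerance_db: assumed threshold; tune it for your speaker and setup.
    """
    levels = {name: level_db(samples) for name, samples in clips.items()}
    reference = median(levels.values())
    return sorted(
        name for name, level in levels.items()
        if abs(level - reference) > tolerance_db
    )
```

Flagged clips are candidates for a retake rather than automatic rejection, since a large level difference can also come from a very short or unusually soft sentence.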
Record Different Intonations (Advanced)
Once you have a solid neutral base, you can record smaller, supplementary datasets with different intonations (e.g., questions, exclamations). This allows you to train a more expressive, multi-style TTS model.
Post-Processing
After recording, the long audio files must be precisely cut into individual sentences and matched with their corresponding text from the script. The audio should also be normalized to a consistent volume level and meticulously checked for any remaining noise.
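The cutting and normalization steps can be sketched as follows. This is a simplified illustration, assuming float samples between -1.0 and 1.0. It finds sentence boundaries by looking for runs of low-amplitude frames and applies peak normalization, which is only one of several ways to reach a consistent level. All names, frame sizes, and thresholds are assumptions to be tuned, and the resulting cuts should still be checked by ear.

```python
def split_on_silence(samples, frame_len=1024, silence_peak=0.01,
                     min_silence_frames=10):
    """Return (start, end) sample indices of the non-silent regions.

    A frame counts as silent when its peak is below silence_peak.
    A region ends once min_silence_frames silent frames follow it.
    """
    segments = []
    start = None           # where the current region began
    last_voiced_end = None
    silent_run = 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        voiced = max(abs(s) for s in frame) >= silence_peak
        if voiced:
            if start is None:
                start = i
            last_voiced_end = i + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                segments.append((start, last_voiced_end))
                start = None
    if start is not None:
        segments.append((start, last_voiced_end))
    return segments


def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)  # nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Each returned segment can then be normalized, saved as its own file, and paired with the matching sentence from the script.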
Resources
LJ Speech Dataset
A famous public domain English TTS dataset. Its structure is a great example to follow.
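LJ Speech stores its transcripts in a pipe-delimited metadata file with three fields per line: the clip ID, the transcription, and a normalized transcription with numbers and abbreviations spelled out. The sketch below writes and reads rows in that layout. The helper names and the sample row are hypothetical and are not taken from the dataset.

```python
def format_row(clip_id, text, normalized_text):
    """Build one pipe-delimited metadata line: id|text|normalized text."""
    for field in (clip_id, text, normalized_text):
        if "|" in field or "\n" in field:
            raise ValueError("fields must not contain '|' or newlines")
    return f"{clip_id}|{text}|{normalized_text}"


def parse_row(line):
    """Split one metadata line back into its three fields."""
    clip_id, text, normalized_text = line.rstrip("\n").split("|")
    return clip_id, text, normalized_text


# Hypothetical example row for a clip stored as wavs/voice_0001.wav
row = format_row(
    "voice_0001",
    "Dr. Smith paid $5.",
    "Doctor Smith paid five dollars.",
)
```

Keeping the clip ID identical to the audio file name (without the extension) makes it easy to pair every transcript line with its recording.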
Sweetwater's Mic Guide
A comprehensive guide to different types of microphones, including condensers for vocals.
Audacity (Free Software)
A free, open-source audio editor perfect for recording and post-processing your TTS data.