After creating your parallel corpus, you're left with thousands of sentence pairs. But how accurate are they? Both automated alignment and AI-generated translations can introduce errors, and a model trained on flawed pairs will reproduce those flaws. Human validation is therefore a step you cannot skip.
Manually checking every single pair is slow, tedious, and can burn out even the most dedicated experts. A much better approach is to turn this task into a simple, engaging activity for your entire language community.
Our Solution: Rapid Validation with a Telegram Bot
At Homai, we use a custom Telegram bot to crowdsource the validation process. Telegram is a secure, popular messaging app that works great on mobile devices, making it easy for volunteers to contribute from anywhere, anytime. The process is designed to be as simple as a game.
How the Validation Bot Works
A Task is Sent
The bot randomly selects an unverified sentence pair from your database and sends it to an active volunteer in your community group.
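As a rough illustration of this selection step, here is a minimal sketch using SQLite. The table name `pairs`, its columns, and the rule that "unverified" means "no votes yet" are assumptions made for the example, not a description of Homai's actual database.

```python
import sqlite3

def pick_unverified_pair(conn):
    """Return one random (id, source, target) row with no votes yet, or None."""
    return conn.execute(
        "SELECT id, source, target FROM pairs "
        "WHERE likes = 0 AND dislikes = 0 "
        "ORDER BY RANDOM() LIMIT 1"
    ).fetchone()

# Demo with an in-memory database (hypothetical schema and sample rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pairs ("
    "id INTEGER PRIMARY KEY, source TEXT, target TEXT, "
    "likes INTEGER DEFAULT 0, dislikes INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO pairs (source, target, likes, dislikes) VALUES (?, ?, ?, ?)",
    [
        ("Good morning.", "Доброе утро.", 2, 0),        # already has votes
        ("Thank you.", "Спасибо.", 0, 0),               # unverified
        ("See you soon.", "До скорой встречи.", 0, 0),  # unverified
    ],
)

pair = pick_unverified_pair(conn)
print(pair)  # one of the two unverified rows, chosen at random
```

`ORDER BY RANDOM()` is simple but scans the whole table, so a very large corpus may need a different sampling strategy.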
Simple, Clear Interface
The volunteer sees the original sentence in their native language and the proposed translation below it. There are only two options: approve the translation or reject it.
Example Bot Message:
Please check this translation:
Original (English): The quick brown fox jumps over the lazy dog.
Translation (Russian): Быстрая коричневая лиса прыгает через ленивую собаку.
Is this translation correct?
[👍 Yes] [👎 No]
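A message like the one above could be described to the Telegram Bot API as a `sendMessage` payload with an inline keyboard. The sketch below only builds that payload; it does not send anything. The function name, the `vote:<pair_id>:<yes|no>` callback format, and the sample IDs are assumptions chosen for illustration.

```python
import json

def build_validation_message(chat_id, pair_id, source, target,
                             source_lang="English", target_lang="Russian"):
    """Build a sendMessage payload with two inline buttons (hypothetical helper)."""
    text = (
        "Please check this translation:\n"
        f"Original ({source_lang}): {source}\n"
        f"Translation ({target_lang}): {target}\n"
        "Is this translation correct?"
    )
    # One row with two buttons; callback_data comes back to the bot when tapped.
    keyboard = {
        "inline_keyboard": [[
            {"text": "👍 Yes", "callback_data": f"vote:{pair_id}:yes"},
            {"text": "👎 No", "callback_data": f"vote:{pair_id}:no"},
        ]]
    }
    return {"chat_id": chat_id, "text": text, "reply_markup": keyboard}

message = build_validation_message(
    chat_id=123456789,  # placeholder chat ID
    pair_id=42,         # placeholder pair ID
    source="The quick brown fox jumps over the lazy dog.",
    target="Быстрая коричневая лиса прыгает через ленивую собаку.",
)
print(json.dumps(message, ensure_ascii=False, indent=2))
```

Telegram limits `callback_data` to 64 bytes, which is why the sketch carries only a short pair ID and the answer rather than the sentences themselves.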
Instant Feedback
The volunteer taps either "👍 Yes" (like) or "👎 No" (dislike). Their response is instantly recorded in the database, and the bot immediately sends them a new sentence pair. This creates a fast, engaging loop that encourages participation.
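Recording a response can be as small as incrementing a counter. The following sketch assumes each button tap reaches the bot as a string of the form `vote:<pair_id>:<yes|no>` and that votes are stored in `likes` and `dislikes` columns of a `pairs` table; both the string format and the schema are hypothetical.

```python
import sqlite3

def record_vote(conn, callback_data):
    """Parse 'vote:<pair_id>:<yes|no>' and increment the matching counter."""
    prefix, pair_id, answer = callback_data.split(":")
    if prefix != "vote" or answer not in ("yes", "no"):
        raise ValueError(f"unexpected callback data: {callback_data!r}")
    # The column name comes from a fixed pair of values, never from user input.
    column = "likes" if answer == "yes" else "dislikes"
    with conn:  # commits on success, rolls back on error
        conn.execute(
            f"UPDATE pairs SET {column} = {column} + 1 WHERE id = ?",
            (int(pair_id),),
        )

# Demo with an in-memory database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pairs ("
    "id INTEGER PRIMARY KEY, source TEXT, target TEXT, "
    "likes INTEGER DEFAULT 0, dislikes INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO pairs (source, target) VALUES ('Thank you.', 'Спасибо.')")

record_vote(conn, "vote:1:yes")
record_vote(conn, "vote:1:yes")
record_vote(conn, "vote:1:no")
counts = conn.execute("SELECT likes, dislikes FROM pairs WHERE id = 1").fetchone()
print(counts)  # (2, 1)
```

A production bot would probably also store who voted, so that one volunteer cannot vote on the same pair twice; that is left out here to keep the sketch short.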
Data Segregation
After thousands of these interactions, your dataset is automatically sorted into two groups:
- High-Confidence Data: Sentence pairs that received one or more "likes" and no "dislikes." This data is considered clean and ready for training.
- Low-Confidence Data: Pairs that received at least one "dislike," no matter how many "likes" they also have. These are flagged for review by expert linguists.
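The sorting rule above fits in a few lines of code. This is a minimal sketch with made-up sample counts; the third label, `unverified`, for pairs nobody has voted on yet, is an assumption added so that every pair gets a label.

```python
def classify(likes, dislikes):
    """Label a pair from its vote counts, following the two rules above."""
    if dislikes > 0:
        return "low_confidence"   # any dislike sends the pair to expert review
    if likes > 0:
        return "high_confidence"  # at least one like and no dislikes
    return "unverified"           # assumption: no votes yet, keep in the queue

# Made-up vote counts for illustration.
pairs = [
    {"id": 1, "likes": 3, "dislikes": 0},
    {"id": 2, "likes": 2, "dislikes": 1},
    {"id": 3, "likes": 0, "dislikes": 0},
]

groups = {}
for p in pairs:
    groups.setdefault(classify(p["likes"], p["dislikes"]), []).append(p["id"])

print(groups)
# {'high_confidence': [1], 'low_confidence': [2], 'unverified': [3]}
```

Requiring more than one "like" before a pair counts as high-confidence would be a stricter variant; the threshold is a project decision, not something the rule above prescribes.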
Why is this so effective?
- Speed: 100 volunteers checking 50 sentences each can validate 5,000 pairs in a single afternoon, assuming each pair is checked once.
- Community Engagement: It gives speakers a tangible way to contribute to their language's future and fosters a sense of collective ownership.
- Accessibility: No special software is needed, just a smartphone with Telegram.
- Focuses Expert Time: It allows your most skilled linguists to focus their time on fixing the difficult, problematic sentences instead of manually approving thousands of correct ones.
Your Data is Clean and Verified
With a high-quality, validated parallel corpus, you have everything you need to train your first machine translation model.
Learn How to Train a Translation Model