Training Machine Translation (MT) Models

Learn the basics of neural machine translation (NMT), with detailed guides and code resources.

With a clean parallel corpus in hand, you are ready to train a Neural Machine Translation (NMT) model. This model will learn the statistical patterns between your source and target languages, enabling it to translate new sentences. There are two primary paths you can take: fine-tuning a massive multilingual model or training your own model from scratch.

Two Paths to Translation

  • Fine-tuning (Recommended Start): Take a huge, pre-trained model like Meta's NLLB-200 and continue training it on your specific language pair. This is often faster and requires less data. We have a guide for this here, and a minimal loading sketch follows this list.
  • Training from Scratch: Use a toolkit like Sockeye to build a model from the ground up. This gives you more control but is more resource-intensive. The rest of this guide walks you through that process.
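If you want a feel for the fine-tuning path before opening the linked guide, the sketch below loads NLLB-200 with the Hugging Face transformers library (not installed in Step 1 of this guide). It is an illustrative starting point, not code from our fine-tuning guide; the checkpoint name is one of the publicly available NLLB-200 variants.

# Minimal sketch: load NLLB-200 for fine-tuning (illustrative, not from the linked guide)
# Requires the `transformers` package, which this guide does not otherwise install.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"  # smallest public NLLB-200 variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# From here you would tokenize your parallel corpus and continue training on
# your language pair, for example with transformers' Seq2SeqTrainer.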

Step-by-Step: Training a Mari-Russian Translator with Sockeye

We will demonstrate how to build a translator from Mari to Russian using Sockeye, an open-source NMT framework from Amazon built on PyTorch. It powers Amazon Translate and is optimized for both training and inference. The full code is available as a Jupyter Notebook.

Step 1: Installing Dependencies

First, we install Sockeye and the `datasets` library from Hugging Face, which makes it easy to download our parallel corpus.

!pip install sockeye datasets
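Optionally, confirm the install and check that PyTorch can see a GPU; training without one will be very slow. This quick sanity check is not part of the original notebook.

# Optional sanity checks
!pip show sockeye
!python3 -c "import torch; print(torch.cuda.is_available())"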

Step 2: Downloading and Splitting the Corpus

We load our Mari-Russian parallel corpus from the Hugging Face Hub, shuffle it with a fixed seed (so the split is reproducible), and then split it into a training set (99% of the data) and a development/validation set (1%). The development set is used to monitor the model's performance during training.

from datasets import load_dataset
import os

dataset = load_dataset("AigizK/mari-russian-parallel-corpora")
shf_dataset = dataset.shuffle(seed=42)
split_dataset = shf_dataset['train'].train_test_split(test_size=0.01)

# Create directories to store the data (makedirs also creates the parent folders)
os.makedirs('content/data/train', exist_ok=True)
os.makedirs('content/data/dev', exist_ok=True)

def writelines(data, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        for item in data:
            f.write(f"{item}\n")

# Write the data to text files
writelines([l['mhr'] for l in split_dataset['train']], 'content/data/train/train.mhr.txt')
writelines([l['rus'] for l in split_dataset['train']], 'content/data/train/train.rus.txt')
writelines([l['mhr'] for l in split_dataset['test']], 'content/data/dev/dev.mhr.txt')
writelines([l['rus'] for l in split_dataset['test']], 'content/data/dev/dev.rus.txt')
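As a quick sanity check, you can print how many sentence pairs ended up in each split:

# Sanity check: sizes of the 99% / 1% split
print(f"train: {len(split_dataset['train'])} sentence pairs")
print(f"dev:   {len(split_dataset['test'])} sentence pairs")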

Step 3: Preparing the Data with Subword Tokenization

Languages have vast vocabularies. To manage this, we break words down into common "subwords" using a technique called Byte Pair Encoding (BPE). This allows the model to handle rare or new words by constructing them from smaller, known pieces. We use the subword-nmt library for this.

# Clone the subword-nmt repository
!git clone https://github.com/rsennrich/subword-nmt.git

# Define language codes
src = 'mhr'
tgt = 'rus'

# Learn a joint BPE vocabulary from the training data
!python subword-nmt/subword_nmt/learn_joint_bpe_and_vocab.py \
    --input content/data/train/train.{src}.txt content/data/train/train.{tgt}.txt \
    -s 10000 -o content/data/bpe.codes \
    --write-vocabulary content/data/bpe.vocab.{src} content/data/bpe.vocab.{tgt}

# Apply BPE to all our data files
!python subword-nmt/subword_nmt/apply_bpe.py -c content/data/bpe.codes --vocabulary content/data/bpe.vocab.{src} \
    < content/data/train/train.{src}.txt > content/data/train/train.{src}.bpe
!python subword-nmt/subword_nmt/apply_bpe.py -c content/data/bpe.codes --vocabulary content/data/bpe.vocab.{tgt} \
    < content/data/train/train.{tgt}.txt > content/data/train/train.{tgt}.bpe
!python subword-nmt/subword_nmt/apply_bpe.py -c content/data/bpe.codes --vocabulary content/data/bpe.vocab.{src} \
    < content/data/dev/dev.{src}.txt > content/data/dev/dev.{src}.bpe
!python subword-nmt/subword_nmt/apply_bpe.py -c content/data/bpe.codes --vocabulary content/data/bpe.vocab.{tgt} \
    < content/data/dev/dev.{tgt}.txt > content/data/dev/dev.{tgt}.bpe
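To see what the segmentation looks like, peek at the first few lines of the BPE-encoded training file. Subwords that do not end a word carry an @@ marker, which we will strip again after translation in Step 5.

# Inspect a few BPE-encoded lines
!head -n 3 content/data/train/train.{src}.bpe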

Step 4: Training the Model

With our data preprocessed, we can begin training. First, we run `sockeye.prepare_data`, which converts our text data into an optimized binary format for faster loading. Then, we launch the main training process with `sockeye.train`. This command specifies the model architecture (Transformer), size, and other critical hyperparameters.

# Prepare data for Sockeye
!python3 -m sockeye.prepare_data \
    -s content/data/train/train.{src}.bpe \
    -t content/data/train/train.{tgt}.bpe \
    --shared-vocab -o content/{src}_{tgt}_data

# Start training!
!python3 -m sockeye.train \
    -d content/{src}_{tgt}_data \
    -vs content/data/dev/dev.{src}.bpe \
    -vt content/data/dev/dev.{tgt}.bpe \
    --encoder transformer --decoder transformer \
    --transformer-model-size 512 \
    --transformer-feed-forward-num-hidden 2048 \
    --num-embed 512 \
    --max-seq-len 100 \
    --decode-and-evaluate 500 \
    -o {src}_{tgt}_model \
    --batch-size 1024 \
    --optimized-metric bleu \
    --max-num-checkpoint-not-improved 7

This process will take several hours, depending on your GPU. Sockeye will periodically save checkpoints and evaluate the model's performance (BLEU score) on the development set.
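While training runs (or after it finishes), you can peek inside the output directory. Sockeye writes its checkpoints and a plain-text metrics log there; the exact file names can differ between Sockeye versions, so treat this as a rough progress check rather than a documented interface.

# Rough progress check (file names may vary by Sockeye version)
!ls {src}_{tgt}_model
!tail -n 5 {src}_{tgt}_model/metrics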

Step 5: Testing the Translation

Once training is complete, you have a model ready for translation. We can pass a Mari sentence through our BPE processor and then into the Sockeye translator to get the Russian output. The final `sed` command simply cleans up the BPE subword markers.

echo "Икана кеҥежым лӱдшӧ мераҥ кужу кож тӱҥыштӧ шоген." | \
  python3 subword-nmt/subword_nmt/apply_bpe.py -c content/data/bpe.codes --vocabulary content/data/bpe.vocab.mhr | \
  python -m sockeye.translate -m {src}_{tgt}_model 2>/dev/null | \
  sed -r 's/@@( |$)//g'
# Expected output: Однажды летом сидел трусливый заяц у высокой ели.
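If you prefer to translate from Python inside the notebook, you can wrap the same pipeline in a small helper. This is a convenience sketch that assumes the paths and model directory created above (bpe.codes, bpe.vocab.mhr, mhr_rus_model); it simply shells out to the commands you just ran.

import subprocess

def translate_mhr_to_rus(sentence: str) -> str:
    # Same pipeline as above: BPE-encode, translate with Sockeye, strip @@ markers
    pipeline = (
        "python3 subword-nmt/subword_nmt/apply_bpe.py "
        "-c content/data/bpe.codes --vocabulary content/data/bpe.vocab.mhr | "
        "python3 -m sockeye.translate -m mhr_rus_model 2>/dev/null | "
        "sed -r 's/@@( |$)//g'"
    )
    result = subprocess.run(pipeline, input=sentence, shell=True,
                            capture_output=True, text=True)
    return result.stdout.strip()

print(translate_mhr_to_rus("Икана кеҥежым лӱдшӧ мераҥ кужу кож тӱҥыштӧ шоген."))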

Resources

Sockeye NMT Toolkit

The official GitHub repository for the Sockeye project.

View on GitHub

Fine-tuning NLLB-200

Our guide and code for fine-tuning Meta's powerful multilingual model.

View on GitHub

Mari-Russian Example

The complete Jupyter Notebook for the example in this guide.

View Notebook

Subword-NMT

The library used for Byte Pair Encoding (BPE) subword tokenization.

View on GitHub

You've Built a Translator! What's Next?

Now that you can translate text, the next frontier is working with speech. Learn how to train a model that can create a synthetic voice for your language.

Learn About Training TTS Models