Language Technology Glossary
Essential terms and concepts for language digitization, AI models, and linguistic technology
A
API (Application Programming Interface)
A set of protocols and tools that allows different software applications to communicate with each other. APIs enable developers to integrate language technology into their applications.
ASR (Automatic Speech Recognition)
Technology that converts spoken language into written text. Modern ASR systems use deep learning to model speech patterns and can work in real time.
B
BERT (Bidirectional Encoder Representations from Transformers)
A foundational language model that represents each word using the context on both its left and its right. Many language-specific models are based on BERT.
BLEU (Bilingual Evaluation Understudy)
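A metric for evaluating machine translation quality by measuring n-gram overlap between machine output and one or more human reference translations. Scores range from 0 to 1 and are often reported on a 0 to 100 scale; higher is better.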
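As a rough illustration, the sketch below computes clipped n-gram precision, which is one ingredient of BLEU. It is not a full BLEU implementation: the brevity penalty and the averaging across n-gram orders are left out, and the function names and example sentences are made up for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears
    in the reference ("clipping").
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

# Hypothetical machine output and human reference translation.
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

p1 = clipped_precision(candidate, reference, 1)  # 5 of 6 unigrams match
p2 = clipped_precision(candidate, reference, 2)  # 3 of 5 bigrams match
print(round(p1, 3), round(p2, 3))
```

For real evaluations an established BLEU implementation is the safer choice, since scores are sensitive to tokenization and smoothing choices.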
C
Corpus (plural: Corpora)
A large, structured collection of texts used for linguistic research and training language models. Can be monolingual (one language) or multilingual.
Colab (Google Colaboratory)
A cloud-based platform for running Python notebooks in the browser, with a free tier that offers limited access to GPUs. Useful for training models without buying expensive hardware.
D
Dataset
A structured collection of data used for training, validating, or testing machine learning models. For language technology, this includes text, audio, or aligned text-audio pairs.
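The training, validation, and test roles mentioned above usually come from splitting one dataset into three parts. A minimal sketch, assuming an 80/10/10 split and a hypothetical list of example sentences:

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle items reproducibly and split them into train/val/test lists.

    Whatever is left after the train and validation shares goes to test.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]
    return train_set, val_set, test_set

# Placeholder data standing in for real text or audio records.
examples = [f"sentence {i}" for i in range(10)]
train_set, val_set, test_set = split_dataset(examples)
print(len(train_set), len(val_set), len(test_set))
```

Fixing the seed keeps the split stable between runs, so results on the test portion stay comparable.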
F
Fine-tuning
The process of taking a pre-trained model and training it further on specific data to adapt it to a particular task or language. This is key to making LLMs work for minority languages.
G
Git/GitHub
Version control system (Git) and platform (GitHub) for collaborating on code and documents. Essential for managing language technology projects and sharing resources.
GPT (Generative Pre-trained Transformer)
A family of large language models developed by OpenAI. GPT models can generate human-like text and perform various language tasks.
H
Hugging Face
A platform and community for sharing machine learning models, datasets, and applications. The go-to place for finding and sharing language models and datasets.
J
Jupyter Notebook
An interactive computing environment where you can combine code, visualizations, and documentation. Popular for data science and ML experiments.
K
Keyboard Layout
The arrangement of keys on a keyboard for typing in a specific language. Essential for digital communication in any language.
L
Lemmatization
The process of reducing words to their base or dictionary form (lemma). Essential for dictionary lookup and text analysis.
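The simplest form of lemmatization is a lookup table from inflected forms to lemmas. The sketch below uses a tiny hand-written English table purely for illustration; real lemmatizers rely on much larger lexicons or morphological rules.

```python
# Hypothetical mini-lexicon mapping inflected forms to their lemmas.
LEMMAS = {
    "ran": "run",
    "running": "run",
    "mice": "mouse",
    "better": "good",
}

def lemmatize(word):
    """Return the lemma for a word, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMAS.get(key, key)

print([lemmatize(w) for w in ["Mice", "ran", "table"]])
```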
LLM (Large Language Model)
An AI model trained on vast amounts of text data that can interpret and generate human-like text. LLMs like GPT, Claude, and Gemini have changed language technology, in part by making it more practical to work with low-resource languages.
M
Morphology
The study of word structure and how words are formed. Important for languages with complex word formation rules.
MOS (Mean Opinion Score)
A measure of voice quality in TTS systems based on human listener ratings, typically on a scale of 1-5.
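Computing a MOS is simple arithmetic: average the listener ratings. A minimal sketch, with made-up ratings on the 1-5 scale:

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average listener ratings, rejecting values outside the 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)

# Hypothetical ratings from five listeners for one synthesized utterance.
ratings = [4, 5, 3, 4, 4]
mos = mean_opinion_score(ratings)
print(mos)
```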
MT (Machine Translation)
Automated translation between languages using AI. Given sufficient training data, modern neural MT systems can approach human quality for some language pairs and text types.
N
NER (Named Entity Recognition)
The task of identifying and classifying named entities (people, places, organizations) in text. Important for information extraction and translation.
NLP (Natural Language Processing)
The field of AI focused on enabling computers to understand, interpret, and generate human language. NLP encompasses everything from basic text processing to advanced language understanding.
NMT (Neural Machine Translation)
Machine translation using neural networks, which has largely replaced older statistical methods. With enough training data, NMT generally produces more fluent and accurate translations than those earlier systems.
O
OCR (Optical Character Recognition)
Technology that converts images of text (from scanned documents, photos, etc.) into machine-readable text. Modern OCR uses AI for better accuracy.
P
Parallel Corpus
A collection of texts in two or more languages where each text is a translation of the other. Essential for training machine translation systems.
POS Tagging (Part-of-Speech Tagging)
The process of marking words in text with their grammatical categories (noun, verb, adjective, etc.). Foundation for many NLP tasks.
PyTorch
An open-source machine learning framework widely used for developing and training neural networks. Many language models are built using PyTorch.
S
Spell Checker
Software that identifies and suggests corrections for misspelled words. Modern spell checkers use context and AI to provide better suggestions.
STT (Speech-to-Text)
Another term for ASR (Automatic Speech Recognition). Converts spoken words into written text.
T
Tacotron
A neural text-to-speech synthesis model that predicts spectrograms from text, which a separate vocoder step then converts into audio. Known for producing natural-sounding speech.
TensorFlow
Google's open-source platform for machine learning. Alternative to PyTorch, with strong production deployment capabilities.
Tokenization
The process of breaking text into smaller units (tokens) like words, subwords, or characters. Critical for processing text in machine learning models.
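To make the units concrete, here is a small sketch of word-level and character-level tokenization using only the Python standard library. The regular expression is one simple choice among many; subword tokenizers are learned from data and are not shown.

```python
import re

def word_tokens(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    """Split text into individual characters."""
    return list(text)

text = "Language tech, digitized."
print(word_tokens(text))    # ['Language', 'tech', ',', 'digitized', '.']
print(char_tokens("tech"))  # ['t', 'e', 'c', 'h']
```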
Transfer Learning
A machine learning technique where knowledge gained from training on one task is applied to a different but related task. Crucial for low-resource languages.
Transformer
A neural network architecture that revolutionized NLP. Transformers use attention mechanisms to process sequences and are the foundation of models like BERT and GPT.
TTS (Text-to-Speech)
Technology that converts written text into spoken audio. Modern TTS systems can create natural-sounding speech that closely mimics human voice patterns and intonation.
U
Unicode
The international standard for encoding text in all writing systems. Ensures that characters from any language can be consistently represented and displayed across different devices and platforms.
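In practice, Unicode assigns every character a numbered code point and an official name. A short sketch with Python's standard `unicodedata` module; the sample characters are arbitrary:

```python
import unicodedata

# 'a', 'e' with acute accent, and schwa, written as escapes for clarity.
sample = "a\u00e9\u0259"

code_points = [f"U+{ord(ch):04X}" for ch in sample]
names = [unicodedata.name(ch) for ch in sample]

for cp, name in zip(code_points, names):
    print(cp, name)
```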
UTF-8
The most common character encoding for Unicode. UTF-8 can represent any character in the Unicode standard while being backward compatible with ASCII.
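The ASCII compatibility and variable width of UTF-8 can be seen directly by encoding a few characters. The characters below are arbitrary examples:

```python
# ASCII characters encode to the same single byte in ASCII and UTF-8.
ascii_bytes = "A".encode("ascii")
utf8_bytes = "A".encode("utf-8")

# Characters outside ASCII take two to four bytes in UTF-8.
e_acute = "\u00e9".encode("utf-8")     # e with acute accent
euro = "\u20ac".encode("utf-8")        # euro sign
emoji = "\U0001F600".encode("utf-8")   # grinning face emoji

print(ascii_bytes == utf8_bytes)            # True
print(len(e_acute), len(euro), len(emoji))  # 2 3 4

# Decoding restores the original text.
print(e_acute.decode("utf-8") == "\u00e9")  # True
```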
V
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
A modern TTS model that produces high-quality, natural-sounding speech. Can work with relatively small amounts of training data.
Voice Assistant
AI-powered systems that understand spoken commands and respond with speech. Combines ASR, NLP, and TTS technologies.
W
Wav2Vec2
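Facebook/Meta's self-supervised speech representation model, commonly fine-tuned for speech recognition. Particularly useful for low-resource languages because it can learn from unlabeled audio and then needs comparatively little transcribed data.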
WER (Word Error Rate)
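The standard metric for evaluating ASR systems. Calculated as the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of words in the reference. Lower is better, and the value can exceed 100% when the output contains many extra words.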
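WER is usually computed as a word-level edit distance divided by the reference length. The following is a minimal sketch under simple assumptions (whitespace tokenization, no text normalization); the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(score, 3))

# Extra words in the output can push WER above 1.0 (100%).
print(word_error_rate("hello", "hello there big world"))
```

Real evaluations typically normalize casing and punctuation before scoring, which this sketch leaves out.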
Whisper
OpenAI's speech recognition model that works across many languages. Known for its robustness and ability to work with accented speech and background noise.