Glossary | Homai Knowledge Base

A

Technical Infrastructure

API (Application Programming Interface)

A set of protocols and tools that allows different software applications to communicate with each other. APIs enable developers to integrate language technology into their applications.

Example: Google Translate API allows developers to add translation features to their apps.

AI & Machine Learning

ASR (Automatic Speech Recognition)

Technology that converts spoken language into written text. Modern ASR systems use deep learning to model speech patterns and can run in real time.

Example: When you speak to your phone's voice assistant, ASR converts your words into text that the system can process.

B

Models & Frameworks

BERT (Bidirectional Encoder Representations from Transformers)

A foundational language model that understands context by looking at words from both directions. Many language-specific models are based on BERT.

Example: mBERT (multilingual BERT) supports 104 languages and can be fine-tuned for specific tasks.

Metrics & Evaluation

BLEU (Bilingual Evaluation Understudy)

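A metric for evaluating machine translation quality by measuring n-gram overlap between machine output and one or more human reference translations. Scores range from 0 to 1 and are often reported on a 0 to 100 scale; higher is better.

Example: As a rough rule of thumb, a BLEU score above 30 (on the 0 to 100 scale) often indicates understandable translations and a score above 50 suggests high quality, though scores are only comparable on the same test set and language pair.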
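To make the idea of n-gram overlap concrete, here is a minimal sketch of a simplified, sentence-level BLEU. It is not the full metric: standard BLEU is usually computed over a whole test set with up to 4-grams, whereas this version assumes a single reference, whitespace tokenization, and only unigrams and bigrams. The function names `ngrams` and `simple_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, hypothesis, max_n=2):
    """Simplified BLEU on a 0 to 1 scale (assumption: one reference, n <= max_n)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clip each n-gram count by how often it appears in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize output that is shorter than the reference.
    penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return penalty * math.exp(sum(log_precisions) / max_n)


score = simple_bleu("the cat sat on the mat", "the cat sat on a mat")
print(f"{score * 100:.1f}")  # multiply by 100 for the 0 to 100 scale
```

For real evaluations, an established BLEU implementation is a safer choice than a hand-written one, because tokenization and smoothing choices change the score.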

C

Language Data

Corpus (plural: Corpora)

A large, structured collection of texts used for linguistic research and training language models. Can be monolingual (one language) or multilingual.

Example: A corpus of 1 million words of Navajo text collected from books, websites, and transcribed speeches.

Development Tools

Colab (Google Colaboratory)

A cloud-based platform for running Python notebooks, with a free tier that offers limited access to GPUs. Useful for training small models without buying expensive hardware.

Example: Training a translation model for your language using Colab's free GPU resources.

D

Language Data

Dataset

A structured collection of data used for training, validating, or testing machine learning models. For language technology, this includes text, audio, or aligned text-audio pairs.

Example: A TTS dataset containing 5 hours of audio recordings with corresponding text transcriptions.

F

AI & Machine Learning

Fine-tuning

The process of taking a pre-trained model and training it further on specific data to adapt it to a particular task or language. This is key to making LLMs work for minority languages.

Example: Taking a multilingual BERT model and fine-tuning it on Cherokee text to create a Cherokee language model.

G

Development Tools

Git/GitHub

Version control system (Git) and platform (GitHub) for collaborating on code and documents. Essential for managing language technology projects and sharing resources.

Example: Storing your language's keyboard layout files on GitHub so others can contribute improvements.

AI & Machine Learning

GPT (Generative Pre-trained Transformer)

A family of large language models developed by OpenAI. GPT models can generate human-like text and perform various language tasks.

Example: Using ChatGPT (based on GPT) to help translate documents or create language learning materials.

H

Technical Infrastructure

Hugging Face

A platform and community for sharing machine learning models, datasets, and applications. The go-to place for finding and sharing language models and datasets.

Example: Uploading your language's speech dataset to Hugging Face so researchers worldwide can access it.

J

Development Tools

Jupyter Notebook

An interactive computing environment where you can combine code, visualizations, and documentation. Popular for data science and ML experiments.

Example: Creating a notebook that documents your language's tokenization rules with examples.

K

Practical Applications

Keyboard Layout

The arrangement of keys on a keyboard for typing in a specific language. Essential for digital communication in any language.

Example: Creating a mobile keyboard that includes special characters like ñ, č, or ą specific to your language.

L

Language Processing

Lemmatization

The process of reducing words to their base or dictionary form (lemma). Essential for dictionary lookup and text analysis.

Example: "running", "runs", and "ran" would all be lemmatized to "run".
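The simplest form of lemmatization is a lookup table from inflected forms to lemmas. The sketch below assumes a tiny hand-written table (`LEMMA_TABLE`, a hypothetical name) covering only the words in the example; real lemmatizers rely on much larger dictionaries or on morphological rules.

```python
# Hypothetical lookup table: inflected form -> lemma.
LEMMA_TABLE = {
    "running": "run",
    "runs": "run",
    "ran": "run",
}


def lemmatize(word):
    """Return the lemma if the word is in the table, else the lowercased word."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


for form in ["running", "runs", "ran", "Run"]:
    print(form, "->", lemmatize(form))
```

A table like this is easy to build from an existing dictionary, which makes it a practical starting point for a language that has no lemmatizer yet.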

AI & Machine Learning

LLM (Large Language Model)

An AI model trained on vast amounts of text that can interpret and generate human-like text. LLMs such as GPT, Claude, and Gemini have changed language technology, in part by making it practical to work with low-resource languages.

Example: Using ChatGPT to translate text or using Claude to help digitize a dictionary from images.

M

Language Processing

Morphology

The study of word structure and how words are formed. Important for languages with complex word formation rules.

Example: In agglutinative languages like Turkish, one word can contain what would be an entire sentence in English.

Metrics & Evaluation

MOS (Mean Opinion Score)

A measure of voice quality in TTS systems based on human listener ratings, typically on a scale of 1-5.

Example: A TTS system with a MOS of 4.2 is rated by listeners as close to natural human speech.
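As the name suggests, a MOS is just the average of the listener ratings. A minimal sketch, assuming each rating is a whole or fractional number on the 1 to 5 scale (the function name and the sample ratings are made up for illustration):

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings, each expected to lie on the 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= rating <= 5 for rating in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)


# Illustrative ratings from five listeners for one synthesized sentence.
listener_ratings = [4, 5, 4, 4, 4]
print(round(mean_opinion_score(listener_ratings), 2))
```

In practice, scores are collected from many listeners over many sentences, and the average is usually reported together with a confidence interval.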

Practical Applications

MT (Machine Translation)

Automated translation between languages using AI. Modern neural MT systems can produce near-human quality translations with sufficient training data.

Example: Building a translator app that converts between your indigenous language and the national language.

N

Language Processing

NER (Named Entity Recognition)

The task of identifying and classifying named entities (people, places, organizations) in text. Important for information extraction and translation.

Example: Identifying "Microsoft" as an organization and "Seattle" as a location in a text.

AI & Machine Learning

NLP (Natural Language Processing)

The field of AI focused on enabling computers to understand, interpret, and generate human language. NLP encompasses everything from basic text processing to advanced language understanding.

Example: Spam filters, machine translation, sentiment analysis, and chatbots all use NLP techniques.

AI & Machine Learning

NMT (Neural Machine Translation)

Machine translation using neural networks, which has largely replaced older statistical methods. NMT generally produces more fluent and accurate translations than those methods.

Example: Google Translate's modern system uses NMT to provide translations between languages.

O

Language Data

OCR (Optical Character Recognition)

Technology that converts images of text (from scanned documents, photos, etc.) into machine-readable text. Modern OCR uses AI for better accuracy.

Example: Converting a photographed page from an old dictionary into editable text using Google Vision or Tesseract.

P

Language Data

Parallel Corpus

A collection of texts in two or more languages where each text is a translation of the other. Essential for training machine translation systems.

Example: The Bible in English and Swahili, with verses aligned between languages.

Language Processing

POS Tagging (Part-of-Speech Tagging)

The process of marking words in text with their grammatical categories (noun, verb, adjective, etc.). Foundation for many NLP tasks.

Example: In "The cat sleeps", tagging would identify "The" as determiner, "cat" as noun, "sleeps" as verb.

Development Tools

PyTorch

An open-source machine learning framework widely used for developing and training neural networks. Many language models are built using PyTorch.

Example: Training a custom TTS model for your language using PyTorch's neural network libraries.

S

Practical Applications

Spell Checker

Software that identifies and suggests corrections for misspelled words. Modern spell checkers use context and AI to provide better suggestions.

Example: A keyboard app that underlines misspelled words and suggests corrections in your language.

AI & Machine Learning

STT (Speech-to-Text)

Another term for ASR (Automatic Speech Recognition). Converts spoken words into written text.

Example: Dictation features in smartphones that let you speak instead of type.

T

Models & Frameworks

Tacotron

A neural text-to-speech synthesis model that predicts spectrograms directly from text; a separate vocoder or signal-processing step then converts them into audio. Known for producing natural-sounding speech.

Example: Training Tacotron 2 on your language's audio data to create a TTS system.

Development Tools

TensorFlow

Google's open-source platform for machine learning. Alternative to PyTorch, with strong production deployment capabilities.

Example: Deploying your language's spell checker model as a TensorFlow Lite app on mobile devices.

Language Processing

Tokenization

The process of breaking text into smaller units (tokens) like words, subwords, or characters. Critical for processing text in machine learning models.

Example: Breaking "Hello world!" into word-level tokens ["Hello", "world", "!"] or into smaller subword tokens such as ["Hell", "o", " ", "world", "!"].
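Word-level and character-level tokenization can be sketched with the Python standard library. This is a minimal illustration that assumes words are separated by spaces and that each punctuation mark is its own token; writing systems without spaces, or orthographies that rely on combining marks, would need different rules. The function names are hypothetical.

```python
import re


def word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    """Split text into individual characters, including spaces."""
    return list(text)


print(word_tokenize("Hello world!"))
print(char_tokenize("Hi!"))
```

Subword tokenizers sit between these two extremes: they learn frequent character sequences from a corpus rather than following fixed rules.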

AI & Machine Learning

Transfer Learning

A machine learning technique where knowledge gained from training on one task is applied to a different but related task. Crucial for low-resource languages.

Example: A model trained on 100 languages can be adapted to work with a new language using just a small amount of data.

AI & Machine Learning

Transformer

A neural network architecture that revolutionized NLP. Transformers use attention mechanisms to process sequences and are the foundation of models like BERT and GPT.

Example: All modern LLMs like ChatGPT, Claude, and Gemini are based on the transformer architecture.

AI & Machine Learning

TTS (Text-to-Speech)

Technology that converts written text into spoken audio. Modern TTS systems can create natural-sounding speech that closely mimics human voice patterns and intonation.

Example: Screen readers for visually impaired users, or navigation apps that speak directions aloud.

U

Technical Infrastructure

Unicode

The international standard for encoding text in all writing systems. Ensures that characters from any language can be consistently represented and displayed across different devices and platforms.

Example: Unicode includes characters for Latin, Cyrillic, Arabic, the Cherokee syllabary, and well over 100 other scripts.
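Every Unicode character has a numeric code point and an official name. The short sketch below looks both up with Python's built-in `unicodedata` module; the helper name `describe` and the choice of sample characters are only for illustration.

```python
import unicodedata


def describe(char):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


# One sample character each from Latin, Cyrillic, Arabic, and Cherokee,
# written as escapes so the source stays readable in any editor.
for char in ["A", "\u0416", "\u0628", "\u13a0"]:
    print(describe(char))
```

Checking code points this way is a quick test of whether the characters your language needs are already encoded.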

Technical Infrastructure

UTF-8

The most common character encoding for Unicode. UTF-8 can represent any character in the Unicode standard while being backward compatible with ASCII.

Example: Web pages use UTF-8 encoding to correctly display text in any language.
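UTF-8 is a variable-length encoding: ASCII characters take one byte and other characters take two to four. The sketch below shows this with Python's built-in `str.encode`; the helper name `utf8_bytes` and the sample characters are chosen only for illustration.

```python
def utf8_bytes(char):
    """Encode one character as UTF-8 bytes."""
    return char.encode("utf-8")


samples = {
    "Latin A": "A",
    "n with tilde": "\u00f1",
    "Cherokee letter A": "\u13a0",
}

for label, char in samples.items():
    encoded = utf8_bytes(char)
    print(f"{label}: U+{ord(char):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# ASCII text produces identical bytes in ASCII and UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))
```

The last line illustrates the backward compatibility mentioned above: any valid ASCII file is also a valid UTF-8 file.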

V

Models & Frameworks

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)

A modern TTS model that produces high-quality, natural-sounding speech. Can work with relatively small amounts of training data.

Example: Creating a voice for your language with just 2-3 hours of recordings using VITS.

Practical Applications

Voice Assistant

AI-powered systems that understand spoken commands and respond with speech. Combines ASR, NLP, and TTS technologies.

Example: A smart speaker that understands commands in Cherokee and responds appropriately.

W

Models & Frameworks

Wav2Vec2

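Facebook/Meta's self-supervised speech model, commonly fine-tuned for speech recognition. Particularly useful for low-resource languages because it can be pre-trained on unlabeled audio and then fine-tuned on a small amount of transcribed speech.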

Example: Training Wav2Vec2 on just 10 hours of transcribed audio to create a functional ASR system.

Metrics & Evaluation

WER (Word Error Rate)

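The standard metric for evaluating ASR systems. It is the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of words in the reference. Lower is better, and the value can exceed 100% when the output contains many extra words.

Example: An ASR system with 5% WER makes about 5 word errors for every 100 words in the reference transcript.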
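WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the system output, divided by the number of reference words. Below is a minimal sketch that assumes whitespace tokenization and no text normalization; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.1%}")  # one substitution and one deletion over six reference words
```

Because insertions count as errors, an output with many extra words can push the result above 100%.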

Models & Frameworks

Whisper

OpenAI's speech recognition model that works across many languages. Known for its robustness and ability to work with accented speech and background noise.

Example: Using Whisper to transcribe interviews in indigenous languages for documentation projects.
