Homai Knowledge Base

Introduction

Why Digitize Your Language?

Understand the significance of digital presence for preserving and revitalizing indigenous languages in today’s connected world.

Read

Digitization Process Overview

Discover how your language goes digital: from concept to practical tools and apps.

Read

Glossary

Not familiar with ASR, TTS, corpora, or Hugging Face? Here's your guide to essential terms.

Read

Stage 1: Digital Foundations, Fonts and Keyboards

Unicode Characters

Check if your language’s characters exist in Unicode and learn how to request additions.

Read

Creating Desktop Keyboards

Step-by-step instructions to build keyboard layouts for Windows, macOS, and Linux.

Read

Ordering Mobile Keyboards

Find out who can create mobile keyboards for iOS and Android, what you’ll need, and estimated costs.

Read

Stage 2: Data Collection and Preparation

Digitizing Texts (OCR)

How to convert printed books and documents into digital text using FineReader, vision LLMs, and other ML tools.

Read

Digitizing Dictionaries

Methods for turning printed dictionaries into structured databases. This foundational step directly determines the quality of your future corpora and AI models.

Read

Creating a Monolingual Corpus

How to clean and format digitized texts to create a high-quality text corpus.

Read

Parallel Corpora

Where to find parallel texts, plus tools that automatically align sentence pairs.

Read

Validating Alignments

Quickly check the quality of automatically aligned texts with volunteer and community help, using tools like Telegram bots.

Read

Recording Audio for TTS

Best practices for choosing equipment, recording spaces, and speaker guidelines for quality speech synthesis datasets.

Read

Collecting Data for ASR

Effective methods for gathering speech data: from Common Voice contributions to scripted recordings.

Read

Uploading Data to Hugging Face

How to format and upload your text and audio datasets to Hugging Face. (Paid automation tool available!)

Read

Stage 3: DIY Model Training

Training ASR Models (Automatic Speech Recognition)

An overview of leading models (Wav2Vec2, Whisper), including tutorials and code references for training your own ASR models.

Read

Training TTS Models (Speech Synthesis)

Understand popular frameworks (Tacotron, VITS), their strengths and weaknesses, and follow step-by-step training instructions.

Coming Soon

Training MT Models (Machine Translation)

Learn the basics of neural machine translation (NMT), with detailed guides and code resources.

Read

Creating Spellcheckers

Methods and tools to develop effective spellcheck systems for your language.

Coming Soon

Simplify Your Model Training!

We offer easy-to-use, paid tools for ASR, TTS, and MT training: no programming needed, just provide your dataset and launch your training.

Reach out to learn more

Stage 4: Applying Language Technologies

Ideas and Opportunities

Explore practical ways to put your ASR, TTS, and MT models to use: smart assistants, content translation, educational apps, and more.

Coming Soon

Complete, Ready-to-Use Solutions

Discover our turnkey products, such as smart speakers and automated video translation, that integrate your trained AI models.

View Products
