Digitizing Dictionaries

Transforming paper dictionaries into structured digital data—the cornerstone of high-quality language corpora and AI.

Digitizing a book involves turning pages into text. Digitizing a dictionary is far more complex: it's about converting pages into a structured database. A dictionary isn't just a list of words; it's a rich collection of headwords, definitions, parts of speech, and example sentences. The quality of this structured data directly determines the quality of every tool you build from it, from spell checkers to machine translation models.

The Easiest Method: Vision LLMs

Traditional OCR will only give you a flat text file, leaving you with the difficult task of writing complex code to parse each entry. Vision Language Models (like GPT-4o or Gemini) are a game-changer because they can understand the visual structure of a dictionary page and output the data directly into a machine-readable format like JSON or XML. This dramatically simplifies the process for developers.

Our Proven Workflow for Dictionary Digitization

At Homai, we have developed a robust, three-step process to digitize dictionaries accurately and efficiently using Vision LLMs. Here’s how we do it:

1

The Precision Prompt: Constrain the Alphabet

An LLM might invent or misinterpret characters not present in your language. To prevent this, the first step is to give the model strict boundaries. We explicitly list every valid character of the language in our prompt. This forces the model to only output characters from the defined set, drastically reducing errors.

Prompt Snippet:
"...The language is Navajo. The only valid characters you can use in the 'navajo_word' field are: a, á, ą, ą́, b, ch, ch', d, dl, dz, e, é, ę, ę́, g, gh, h, hw, i, í, į, į́, j, k, k', l, ł, m, n, o, ó, ǫ, ǫ́, s, sh, t, t', tł, tł', ts, ts', w, x, y, z, zh, '. Do not output any other characters."

2

The Structure Prompt: Define a Custom Format

No two dictionaries are alike. Some have example sentences, others have phonetic notations or etymologies. We first analyze the source dictionary to identify all its data fields. Then, we design a custom JSON structure and instruct the LLM to populate it. If the dictionary contains example sentences, we always ask for them in a separate field. This is incredibly valuable, as these examples can later be used to build a parallel corpus for machine translation.

Prompt Snippet:
"For each entry in the image, provide the output in the following JSON format. If a field is not present, use an empty string "":
{
  "source_word": "The word in our language",
  "part_of_speech": "e.g., noun, verb",
  "translation": "The English translation",
  "example_sentence": "An example sentence using the word",
  "example_translation": "The translation of the example sentence"
}"

3

The Verification Workflow: Trust but Verify

AI is a powerful assistant, not an infallible expert. Every single parsed entry must be verified. To make this efficient, we use a two-phase process. First, we extract all the `source_word` entries from the generated JSON and give this list to a native speaker. They can quickly scan the list and mark any words that are misspelled or don't exist. Second, we take this shorter list of "problem words" and manually check them against the original scans, correcting the data in a targeted, semi-manual way.

Crucial Step: Do not skip the verification process! A small error in your dictionary data can cascade into major problems in your downstream applications like spell checkers or language learning apps.

Resources and Tools

OpenAI's ChatGPT-4o

State-of-the-art Vision LLM. Excellent for structured data extraction from images.

Try ChatGPT

Google Gemini

A powerful alternative Vision LLM with strong multimodal capabilities.

Try Gemini

JSON Validator

A useful tool to check if the LLM's output is well-formed and valid JSON.

Visit JSONLint

Your Dictionary is Now a Database

With a high-quality, structured digital dictionary, you are ready to build the most critical asset for language technology: a digital corpus.

Learn How to Build a Digital Corpus