The first step in digitizing text from a historical handwritten manuscript is cropping text-line images out of the page images. Next, a recognition AI reads the text content of each text-line image and converts it into machine-readable text. Finally, the mistakes made by the recognition AI are automatically corrected by a second, language-aware AI.
Digitizing text from historical manuscripts lets scholars search them and trace how the usage of terms shifts over time, offering insight into the intellectual culture of a period in a way that is not otherwise possible.
At the Centre for Interdisciplinary Artificial Intelligence, FLAME University, we have used Artificial Intelligence to digitize the 500-page Sanskrit manuscript Vādakautūhala ("Delight in Dispute"). This is a text in the school of Mīmāṃsā, a discipline concerned with the analysis of Vedic statements.
Artificial Intelligence, in this application, is more specifically a combination of three AI models. The first model, which is not discussed in this blog, analyses the layout of the manuscript page and crops out text-line images. Next, the recognition AI takes in the cropped text-line images, recognizes the contents of each line, and converts them into a machine-readable format.
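The three-stage pipeline described above can be sketched as a simple composition of functions. The function names and bodies below are hypothetical stand-ins for the actual models, just to show how the stages fit together:

```python
# A minimal sketch of the three-stage digitization pipeline.
# Each function is a hypothetical placeholder, not the real model.

def crop_text_lines(page_image):
    """Layout-analysis model: split a page image into text-line images."""
    return page_image.split("\n")  # stand-in: treat the 'image' as text

def recognize(line_image):
    """Recognition model: text-line image -> raw machine-readable text."""
    return line_image  # stand-in: identity

def correct(raw_text):
    """Language model: fix missing, extra, and incorrect characters."""
    return raw_text.strip()  # stand-in: trivial cleanup

def digitize(page_image):
    """Run all three stages on one page."""
    return [correct(recognize(line)) for line in crop_text_lines(page_image)]
```

The point of the sketch is only the data flow: page image in, corrected text lines out, with the language model applied last.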
The output of the recognition AI is not perfect: there are various missing, extra, and incorrect characters. This is partly because of the scarcity of annotated data, and partly because the lines of the manuscript are so close together that mātrās and halants from the line above or below can overlap with the current line. As an analogy, one could imagine a human who knows the Devanagari script, but not the Sanskrit language, making such mistakes while reading.
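Errors of exactly these kinds (missing, extra, and incorrect characters) are conventionally measured with the Character Error Rate: the edit distance between the recognized text and the ground truth, divided by the length of the ground truth. A minimal sketch of that metric (not the project's actual evaluation code):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

def cer(recognized: str, ground_truth: str) -> float:
    """Character Error Rate of the recognized text."""
    return edit_distance(recognized, ground_truth) / len(ground_truth)
```

A lower CER after correction is the concrete sign that the language model is doing its job.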
On the other hand, Sanskrit experts, who know both the script and the language, do not make such mistakes. Because they know the language, they can intuitively tell whether a dot is an anusvāra or a sneaky mātrā straying in from the line above. This motivates the need for the language AI, which makes language-aware spelling corrections to the output of the recognition AI. More specifically, the language AI is the Sanskrit language foundation model ByT5-Sanskrit, fine-tuned (i.e., taught) to perform the specific task of spelling correction. The dataset we use to teach ByT5-Sanskrit consists of two columns: one with the output text of the recognition AI, and one with the ground-truth text.
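Such a two-column dataset maps each recognized line to its ground-truth line, giving the fine-tuning step its (input, target) pairs. As a hedged sketch of how the pairs might be read in (the column headers, file layout, and example text are hypothetical, not the project's actual format):

```python
import csv
import io

def load_correction_pairs(tsv_file):
    """Read (ocr_output, ground_truth) training pairs from a two-column TSV."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [(row["ocr_output"], row["ground_truth"]) for row in reader]

# A tiny in-memory example standing in for a real dataset file;
# the diacritic errors are illustrative only.
sample = io.StringIO(
    "ocr_output\tground_truth\n"
    "vadakautuhala\tvādakautūhala\n"
)
pairs = load_correction_pairs(sample)
```

Because ByT5 models operate directly on UTF-8 bytes rather than a fixed token vocabulary, this pair format feeds straight into fine-tuning without any Sanskrit-specific tokenization.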
This is a recurring theme in Natural Language Processing: a pre-trained foundation language model, such as ByT5-Sanskrit, is fine-tuned to perform a specific downstream task robustly.