The first step in digitizing text from a historical handwritten manuscript is cropping text-line images out of the page images. Next, a recognition AI reads the text content of each text-line image and converts it into machine-readable text. Finally, the mistakes made by the recognition AI are automatically corrected by a second, language-aware AI.
Digitizing text from historical manuscripts lets scholars search them and trace how the usage of terms shifts over time, offering insight into the intellectual culture of a period in a way that is not otherwise possible.
At the Centre for Interdisciplinary Artificial Intelligence, FLAME University, we have used Artificial Intelligence to digitize the 500-page Sanskrit manuscript Vādakautūhala ("Delight in Dispute"). This is a text in the school of Mīmāṃsā, a discipline concerned with the analysis of Vedic statements.
Artificial Intelligence, in this application, is more specifically a combination of three AI models. The first model, which is not discussed in this blog, analyses the layout of the manuscript page and crops out text-line images. Next, the recognition AI takes in the cropped text-line images, recognizes the contents of each line, and converts them into a machine-readable format.
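The three-stage pipeline described above can be sketched as a simple composition of functions. The function names and bodies below are hypothetical stand-ins for the actual models, just to show how the stages fit together:

```python
# A minimal sketch of the three-stage digitization pipeline.
# Each function is a hypothetical placeholder, not the real model.

def crop_text_lines(page_image):
    """Layout-analysis model: split a page image into text-line images."""
    return page_image.split("\n")  # stand-in: treat the 'image' as text

def recognize(line_image):
    """Recognition model: text-line image -> raw machine-readable text."""
    return line_image  # stand-in: identity

def correct(raw_text):
    """Language model: fix missing, extra, and incorrect characters."""
    return raw_text.strip()  # stand-in: trivial cleanup

def digitize(page_image):
    """Run all three stages on one page."""
    return [correct(recognize(line)) for line in crop_text_lines(page_image)]
```

The point of the sketch is only the data flow: page image in, corrected text lines out, with the language model applied last.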
The output of the recognition AI is not perfect: there are various missing, extra, and incorrect characters. This is partly because of the scarcity of annotated data, and partly because the lines of the manuscript are so close together that mātrās and halants from the line above or below can overlap with the current line. As an analogy, one could imagine a human who knows the Devanagari script, but not the Sanskrit language, making such mistakes while reading.
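Errors of exactly these kinds (missing, extra, and incorrect characters) are conventionally measured with the Character Error Rate: the edit distance between the recognized text and the ground truth, divided by the length of the ground truth. A minimal sketch of that metric (not the project's actual evaluation code):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

def cer(recognized: str, ground_truth: str) -> float:
    """Character Error Rate of the recognized text."""
    return edit_distance(recognized, ground_truth) / len(ground_truth)
```

A lower CER after correction is the concrete sign that the language model is doing its job.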
On the other hand, Sanskrit experts, who know both the script and the language, do not make such mistakes. Because they know the language, they can intuitively tell whether a dot is an anusvāra or a sneaky mātrā straying in from the line above. This motivates the need for the language AI, which makes language-aware spelling corrections to the output of the recognition AI. More specifically, the language AI is the Sanskrit language foundation model ByT5-Sanskrit, fine-tuned (i.e., taught) to perform the specific task of spelling correction. The dataset we use to teach ByT5-Sanskrit consists of two columns: one with the output text of the recognition AI, and one with the ground-truth text.
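Such a two-column dataset maps each recognized line to its ground-truth line, giving the fine-tuning step its (input, target) pairs. As a hedged sketch of how the pairs might be read in (the column headers, file layout, and example text are hypothetical, not the project's actual format):

```python
import csv
import io

def load_correction_pairs(tsv_file):
    """Read (ocr_output, ground_truth) training pairs from a two-column TSV."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [(row["ocr_output"], row["ground_truth"]) for row in reader]

# A tiny in-memory example standing in for a real dataset file;
# the diacritic errors are illustrative only.
sample = io.StringIO(
    "ocr_output\tground_truth\n"
    "vadakautuhala\tvādakautūhala\n"
)
pairs = load_correction_pairs(sample)
```

Because ByT5 models operate directly on UTF-8 bytes rather than a fixed token vocabulary, this pair format feeds straight into fine-tuning without any Sanskrit-specific tokenization.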
This is a recurring theme in Natural Language Processing: a pre-trained foundation language model, such as ByT5-Sanskrit, is fine-tuned to perform a specific downstream task robustly.