The Newsletter 81 Autumn 2018

Naval Kishore Press - digital: From hidden treasure to open access

Nicole Merkel-Hilf

The Naval Kishore Press was established in the north Indian city of Lakhnau in 1858 by Munshi Naval Kishore (1836-1895). In the following decades it grew into one of India’s most important publishing houses. During Naval Kishore’s lifetime the press published around 5,000 titles in Hindi, Urdu, Arabic, Persian and Sanskrit, on subjects as diverse as religion, education and medicine, and including school-books, popular editions of Sanskrit literature, and much more. The library of the South Asia Institute (SAI) at Heidelberg University holds a representative cross section of the Naval Kishore Press' publications, with 1,400 titles in print and around 700 titles on microfilm.

To make this treasure more visible to scholars, the Naval Kishore Press Bibliography has been set up using the open source software VuFind. The bibliography is intended as a provenance database. It aims to provide access to bibliographic records as well as digitized online editions of works issued by the Naval Kishore Press that are held in libraries worldwide, not only in the SAI library collection. Currently we are enriching the bibliography with 1,200 title records from the Bodleian Library in Oxford. The bibliography will then contain more than 3,500 entries from eight different libraries.

From the mid-19th century onwards, printing relied on wood pulp paper, which tends to be acidic, so paper deterioration is a problem for the printed part of the collection. For reasons of preservation, the Naval Kishore Press – digital project was initiated by the SAI library and Heidelberg University Library. Naval Kishore Press – digital is part of a larger, three-year project, ‘Fachinformationsdienst Asien’ (FID Asien), funded by the Deutsche Forschungsgemeinschaft (DFG) until the end of 2018. The FID Asien project is carried out jointly by the State Library in Berlin, Heidelberg University Library and the South Asia Institute. The web portal CrossAsia (https://crossasia.org/en) serves as the central access point to the project results and to scientific information in Asian studies. Within this project, selected Hindi and Sanskrit titles in Devanagari script from the Naval Kishore Press collection are being digitized, but the primary aim of Naval Kishore Press – digital is to offer scholars more than a digitized image facsimile. The goal is to produce machine-readable texts that can be further edited online using digital editing techniques.

Suitable OCR software for South Asian scripts was long unavailable because of the complexity of the writing systems, and the software that did exist often proved unsuitable for mass digitization projects. For the Naval Kishore Press – digital project two text recognition methods have been used: the OCR software for Hindi and Sanskrit developed by ind.senz and, more recently, a data model trained with Transkribus. Transkribus is a platform for the automated recognition, transcription and searching of historical documents and is part of the EU-funded READ project (https://transkribus.eu/Transkribus). To train the model, 200 pages of so-called ‘ground truth’ transcription were produced, i.e., an accurate representation of the text on the image facsimile. The ground truth transcription and the images were then used to train a recurrent neural network, yielding a data model that can automatically transcribe further texts from the Naval Kishore Press collection. With an error rate of 5.59% on a random test set, the results are very promising, and we are now applying the model to the digitized Hindi and Sanskrit texts of the Naval Kishore Press collection.
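How the error rate quoted above was measured is not spelled out here. A common measure for text recognition is the character error rate: the edit distance between the recognized text and the ground truth, divided by the length of the ground truth. The following is a minimal Python sketch under that assumption; the function names are illustrative and are not part of Transkribus.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning one string
    into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + substitution,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance divided by the number of
    characters in the ground truth (reference) text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a rate of about 5.6% would correspond to roughly one wrong, missing or superfluous character in every 18 characters of ground truth.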

The digitized images and the full texts produced with Transkribus are presented on the web with ‘DWork – Heidelberg Digitization Workflow’, software developed in-house by Heidelberg University Library. It provides a variety of functions for working with digital copies, such as thumbnail overviews, zooming in and out, full-text search, and various navigation features, as well as components for annotations.

Words or phrases from the Hindi and Sanskrit texts can be searched in Devanagari script or in Latin transliteration, and the results are highlighted in the image facsimile as well as in the recognized text. Furthermore, users can download a high-quality OCR-PDF of the facsimile from the project website; in this PDF the text is also fully searchable in both scripts.
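The article does not describe how searching in both scripts is implemented. One plausible approach is to compare a query against a romanized form of the recognized text as well as against the Devanagari original. The Python sketch below illustrates that idea only. Its mapping tables are deliberately tiny and hypothetical, it follows a Sanskrit-style romanization that keeps the inherent vowel "a" (so Hindi schwa deletion is not handled), and none of the names are taken from DWork.

```python
# Deliberately small, hypothetical mapping tables (IAST-style values).
CONSONANTS = {
    "\u0915": "k", "\u0917": "g", "\u0924": "t", "\u0926": "d",
    "\u0927": "dh", "\u0928": "n", "\u092e": "m", "\u0930": "r",
    "\u0938": "s", "\u0939": "h",
}
INDEPENDENT_VOWELS = {"\u0905": "a", "\u0906": "\u0101", "\u0907": "i", "\u0909": "u"}
VOWEL_SIGNS = {"\u093e": "\u0101", "\u093f": "i", "\u0941": "u", "\u0947": "e", "\u094b": "o"}
VIRAMA = "\u094d"  # suppresses the inherent vowel of the preceding consonant


def romanize(text: str) -> str:
    """Romanize Devanagari text, leaving unknown characters unchanged."""
    out = []
    inherent_pending = False  # last consonant still carries its inherent 'a'
    for char in text:
        if char in CONSONANTS:
            if inherent_pending:
                out.append("a")
            out.append(CONSONANTS[char])
            inherent_pending = True
        elif char in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[char])  # replaces the inherent vowel
            inherent_pending = False
        elif char == VIRAMA:
            inherent_pending = False       # consonant without a vowel
        else:
            if inherent_pending:
                out.append("a")
            inherent_pending = False
            out.append(INDEPENDENT_VOWELS.get(char, char))
    if inherent_pending:
        out.append("a")
    return "".join(out)


def matches(query: str, text: str) -> bool:
    """True if the query occurs in the text or in its romanized form."""
    return query in text or query in romanize(text)
```

With these tables, a query such as "dharma" would match the Devanagari word written as dha, ra, virama, ma, and the same word typed in Devanagari would match directly. A production system would need complete tables and Unicode normalization.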

The annotation tool implemented in DWork allows scholars worldwide to work collaboratively on a text or text corpus, independent of place and time. Each annotation can be entered conveniently via a web form, carries the name of its author, and is assigned a DOI so that it can be reliably referenced and quoted. Revisions are saved as new versions, while earlier versions remain visible and can be accessed through the revision history.

Both resources can be accessed on CrossAsia: https://themen.crossasia.org

Nicole Merkel-Hilf, Chief coordinator "South Asia" FID Asien, SAI Library.