AnalyseResolutions v1.2 is a highly tailored tool (read: a PhDWare code-sketch) designed to access the digital facsimile of a single document, the Resolutions of the States General for the year 1740; to interpret select features of the document's layout and the visual marks of the textual content; and to use that information to generate a structured full-text XML document.
The tool's tailored application of image analysis is based on heuristic rules that were derived from modelling the document and are implemented with simple concepts and methods from document image analysis and understanding. The text is recognised using the Tesseract v3.05 OCR engine.
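To give a flavour of what a heuristic layout rule feeding a structured XML output might look like, here is a minimal sketch. It is not the tool's actual code: the `Line` record, the indentation and centring thresholds, the labels, and the `<page>`, `<head>` and `<p>` element names are all assumptions made for illustration, and the OCR step is left out (the sample text stands in for what an engine such as Tesseract would return).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Line:
    """One text line: recognised text plus its horizontal extent in pixels (hypothetical input)."""
    text: str
    left: int
    width: int


def classify(line, margin, column_width, tol=10):
    """Illustrative rule: label a line from its horizontal placement alone."""
    indent = line.left - margin
    centre_offset = abs((line.left + line.width / 2) - (margin + column_width / 2))
    if indent > tol and centre_offset <= tol:
        return "heading"          # indented and roughly centred in the column
    if indent > tol:
        return "paragraph-start"  # indented first line of a paragraph
    return "continuation"         # flush with the left margin


def to_xml(lines, margin, column_width):
    """Group classified lines into <head> and <p> elements under a <page> root."""
    page = ET.Element("page")
    current = None
    for line in lines:
        label = classify(line, margin, column_width)
        if label == "heading":
            ET.SubElement(page, "head").text = line.text
            current = None
        elif label == "paragraph-start" or current is None:
            current = ET.SubElement(page, "p")
            current.text = line.text
        else:
            current.text += " " + line.text
    return page


# Made-up placeholder lines, not taken from the 1740 document.
sample = [
    Line("HEADING LINE", left=300, width=400),
    Line("First line of a paragraph", left=140, width=760),
    Line("continues here.", left=100, width=500),
]
xml_string = ET.tostring(to_xml(sample, margin=100, column_width=800), encoding="unicode")
print(xml_string)
```

Rules of this kind only hold for the layout they were modelled on, which is one reason a tool built this way does not transfer to other documents without rework.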
This tool will not work for you out-of-the-box—no, not even for you.
The work discussed here was part of a case study within Tuomo Toljamo's PhD research (2014–) at King's College London carried out within the DiXiT ITN (Digital Scholarly Editions Initial Training Network). The research explores the impact of digital facsimiles on digital scholarly editing, and focusses especially on their computational affordances. The case study work was started during a research secondment at the Huygens ING (KNAW) and relates closely to other work ongoing at the institution.
The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme FP7/2007–2013/ under REA grant agreement nº 317436.
The AnalyseResolutions v1.2 tool and the other resources that will be made available here were developed during a PhD case study. The case study was closely related to a series of pilot projects initiated at the Huygens ING to explore alternatives to traditional editing for the digital opening of a vast archival series, the Resolutions of the States General 1576–1796 (see Sluijter et al. 2016). The series is of considerable interest to historians, but its sheer size (it spans over two hundred years without a break and runs to some 200,000 pages) has posed a challenge to its comprehensive editing.
While historians are keen to access the series' textual content, the documents in the series are highly structured, and the way the texts are laid out on the pages (i.e. the graphical layout arrangement) is also a carrier of information. The specific aim of this case study was to explore whether and how the visual information encoded in the layout details and captured in the cover-to-cover facsimile could be usefully harnessed for the digital opening of the series. The means selected for this exploration was tool development.
To these ends, the case study focussed on a single document, the printed Resolutions for the year 1740, and borrowed its conceptual and methodological toolbox from the field of document image analysis and understanding. The development work itself was highly document-centric and based on extensive modelling of the 1740 document's systems of structuring.
To be added when the code and resources are released.
Please see the results page for a preliminary communication of the direct outputs and some background to them.
Release Outline and Technical Terms
To be added.
As a PhDWare product, the tools, resources and documentation are made available here both as a means of communication, in the hope that they will be useful to someone, and as a thank you for all the resources that have in turn made this work so much easier. At the same time, however, please understand that (1) no support is promised, although some may be offered; and (2) the code itself is a thicket in need of tending.
Rik Hoekstra and Peter Boot (Huygens ING); Ray Smith (Google) for Tesseract OCR; Thomas Breuel (Google) for OCRopus; Jaakko Sauvola and Matti Pietikäinen (University of Oulu) for Sauvola binarisation; Matthew Christy et al. (Texas A&M University) and the eMOP project for font training files and informative blogposts; PRImA (University of Salford) for Aletheia 3.0; Bryan Tarpley (Texas A&M University) for Franken+; Jesse de Does et al. (Instituut voor Nederlandse Lexicologie) for the Historical Dutch Lexicon; Rafael C. Carrasco (University of Alicante) for ocrevalUAtion; and the EU-funded IMPACT and SUCCEED projects for various types of communications, resources, and reports.
Hoekstra, Rik, and Ida Nijenhuis. 2012. “Enhanced Access for a Paper World.” Paper presented at the European Society for Textual Scholarship 2012, KNAW, Amsterdam, 22–24 November 2012.
Sluijter, Ronald, Marielle Scherer, Sebastiaan Derks, Ida Nijenhuis, Walter Ravenek, and Rik Hoekstra. 2016. “From Handwritten Text to Structured Data: Alternatives to Editing Large Archival Series.” Paper presented at Digital Humanities 2016, Kraków, 11–16 July 2016.
Toljamo, Tuomo. 2016. “On Building Tools: A Highly Tailored Approach to Digitally Access and Prepare the 1740 Dutch Resolutions of the States General.” Paper presented at the European Society for Textual Scholarship 2016 / DiXiT 3, University of Antwerp, 5–7 October 2016.
Last update: 31st March 2017. Tuomo Toljamo (King's College London).