Abstract:
We risk losing a connection to our past if old historical documents are not converted into a searchable, electronic form. Thousands of documents in the Māori Land Court have been scanned but are not fully utilized, because they are not stored in a text format that people can search. This research investigates the use of analytical and optimisation methodologies to convert two scanned historical Māori Land Court documents for the Parininihi ki Waitotara Incorporation into electronic text. The documents examined are degraded, with rips, tears, bends, wrinkles, fading, and smudging. Some also appear to be scans of printed-out scanned images, which makes them harder to convert. Professional recognition software was tested to no avail, so this thesis implements algorithms specific to the documents. Only minimal pre-processing is applied to the original image; instead, most of the image issues are addressed in the recognition stage. To segment the characters, we test four blob-finding techniques: the Laplacian of Gaussian (LoG), Difference of Gaussians (DoG), Determinant of Hessian (DoH), and Maximally Stable Extremal Regions (MSER). Of the four, MSER performs best. We also test the use of character contours for segmentation. As an alternative to heuristics and neural networks, optimisation techniques are used to locate the individual characters. This is accomplished with a dynamic program that exploits the regular structure of typewritten text to break the page down into lines of text and columns of characters. Template-matching heuristics are then used to recognize the located characters. While the results are significantly better than those of generic, state-of-the-art professional packages, there is still room for improvement, with character recognition rates of 76% and 58% for the two documents.
To improve these rates, we apply a dictionary-matching algorithm that ensures each string of recognized characters is an actual word or number sequence. This raises the accuracy to 86% and 67%.
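
As a rough illustration of the dictionary-matching idea, the sketch below replaces each recognized token with its closest entry in a word list and leaves number sequences untouched. It is not the algorithm used in the thesis: the similarity measure is the one from Python's standard `difflib` module, and the names `correct_token`, `correct_line`, the `cutoff` value, and the small lexicon are all hypothetical choices made for this example.

```python
import difflib


def correct_token(token, lexicon, cutoff=0.6):
    """Return the closest lexicon entry, or the token unchanged if none is close.

    `lexicon` is a list of known words (hypothetical here); `cutoff` is an
    assumed similarity threshold between 0 and 1.
    """
    # Number sequences are accepted as they are.
    if token.isdigit():
        return token
    # Tokens that are already valid words need no correction.
    if token in lexicon:
        return token
    # Otherwise pick the single most similar lexicon entry above the cutoff.
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line, lexicon):
    """Correct every whitespace-separated token in a recognized line of text."""
    return " ".join(correct_token(t, lexicon) for t in line.split())


if __name__ == "__main__":
    # Illustrative lexicon only; a real one would be built from a dictionary.
    lexicon = ["whenua", "land", "court", "block"]
    print(correct_line("land c0urt 1886", lexicon))
```

A token with no sufficiently similar lexicon entry is passed through unchanged, so a misrecognized name or an out-of-vocabulary word is never forced onto an unrelated dictionary entry.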