Optimisation Based Optical Character Recognition For Historical Māori Land Court Documents

Symons, Jonathan

Identifier: http://hdl.handle.net/2292/49304

Issue Date: 2019

Degree Grantor: The University of Auckland

Rights: Copyright: The author

Rights (URI): https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm

Abstract:

We are at risk of losing a connection to our past if we don’t convert old historical documents into a searchable, electronic form. There are thousands of documents in the Māori Land Court that have been scanned but are not fully utilized because they aren’t stored in a text format that people can search through. This research investigates the use of analytical and optimisation methodologies to enable the conversion of two scanned historical Māori Land Court documents for the Parininihi ki Waitotara Incorporation into electronic text. The documents examined were degraded, with rips, tears, bends, wrinkles, fading, and smudging. Some of the documents also appear to be a scan of a printed out scanned image, making them harder to convert. Professional recognition software was tested to no avail.
This thesis implemented algorithms specific to the documents. Minimal pre-processing was conducted on the original image. Instead, the majority of the image issues are addressed in the recognition stage. To segment the characters, we test several algorithms. Four different blob finding techniques were chosen to segment the characters: the Laplacian of Gaussian (LoG), Difference of Gaussians (DoG), Determinant of Hessian (DoH), and Maximally Stable Extremal Regions (MSER). The MSER algorithm worked the best of the four implemented. We also test the usage of character contours to segment the characters. As an alternative to heuristics and neural networks, optimisation techniques are used to locate the individual characters. This is accomplished with a dynamic program, which uses the structure of the typewriter to break the page down into the different lines of text and individual columns of characters. Template matching heuristics are used to recognize the located characters. While the results are significantly better than those from the generic state of the art professional packages, there is still room for improvement, with character recognition rates of 76% and 58%.
To improve these rates, we apply a dictionary matching algorithm to ensure that the string of recognized characters is an actual word or number sequence. This gives accuracies of 86% and 67%.
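
The dictionary matching step can be illustrated with a small sketch. The abstract does not say how the thesis matches recognized strings to words, so everything here is an assumption made for illustration: the use of Levenshtein edit distance, the hypothetical helper names `edit_distance` and `correct_token`, the `max_distance` threshold, and the tiny sample lexicon. Number sequences are passed through unchanged, and any other token is replaced by the closest lexicon word only if it is within the threshold.

```python
from __future__ import annotations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_token(token: str, lexicon: list[str], max_distance: int = 2) -> str:
    """Return the closest lexicon word, or the token unchanged.

    Digit-only tokens and exact lexicon hits are kept as they are.
    The threshold max_distance is an illustrative assumption.
    """
    if not lexicon or token.isdigit() or token in lexicon:
        return token
    best = min(lexicon, key=lambda w: edit_distance(token.lower(), w.lower()))
    if edit_distance(token.lower(), best.lower()) <= max_distance:
        return best
    return token


# Hypothetical sample lexicon built from words that appear in this record.
LEXICON = ["Parininihi", "ki", "Waitotara", "Incorporation", "Land", "Court"]

if __name__ == "__main__":
    for raw in ["Waitotora", "1886", "Incorporatlon"]:
        print(raw, "->", correct_token(raw, LEXICON))
```

A fixed distance threshold is the simplest possible design. A real system would more likely weight substitutions by how often the recognizer confuses particular character pairs.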

Description:

Full Text is available to authenticated members of The University of Auckland only.

Files in this item

Name: whole.pdf

Size: 56.26 MB

Format: PDF

This item appears in the following Collection(s)

Masters Theses - Authenticated Access [6749]