Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/178284
Title: Improving transformer for scene text and handwritten text recognition
Authors: Tan, Yew Lee
Keywords: Computer and Information Science
Issue Date: 2024
Publisher: Nanyang Technological University
Source: Tan, Y. L. (2024). Improving transformer for scene text and handwritten text recognition. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178284
Abstract: Scene text recognition (STR) involves reading text from images of natural scenes. The text in such images comes in a wide array of fonts, shapes, and orientations. Therefore, many works rely on a rectification network to rectify text images before passing them to the recognition network. However, rectifying an image that does not require it may introduce unwanted distortion, which may result in wrong predictions that would otherwise have been correct. To alleviate the adverse impact of rectification, a portmanteauing of features is presented. The method is introduced to a transformer-based model through a proposed block matrix initialisation, and it achieves competitive results. Although the transformer has achieved notable success in various fields, this study identified areas for improvement in its application to STR. Firstly, a vision transformer requires an input image to be resized to a fixed height and width before being split into patches. However, it was discovered that certain patch resolutions seem to result in better accuracy for images with particular original aspect ratios. Secondly, the first decoded character generally has lower accuracy. In view of these issues, the pure transformer with integrated experts (PTIE) is proposed. PTIE is able to process multiple patch resolutions and decode in both the original and reverse character orders, thereby capitalising on both observations and achieving state-of-the-art results. Handwritten text recognition (HTR) deals with handwritten text images that come from scanned or photographed documents. Works that employ transformer-based models often train them with additional synthetic data. However, these data are not publicly available. Furthermore, experimentation in this study seems to suggest that a transformer trained on real HTR data generalises poorly to unseen data. Therefore, within the scope of real data, a simple transformer model that outperforms related works is presented in this thesis. This is achieved by adopting attention masking, which addresses the issue of generalisation, and by introducing various pre-processing methods.
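The two STR observations in the abstract (patch resolution and decoding order) can be illustrated with a minimal sketch. This is not the thesis implementation: the image size, the patch sizes, and the helper names `patch_grid` and `decoding_targets` are hypothetical choices made only for illustration. It shows how a resized image of fixed height and width yields different patch grids under different patch resolutions, and how a label can be prepared in both original and reverse character order.

```python
from typing import List, Tuple


def patch_grid(height: int, width: int, patch_h: int, patch_w: int) -> Tuple[int, int]:
    """Return (rows, cols) of non-overlapping patches for a resized image."""
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    return height // patch_h, width // patch_w


def decoding_targets(text: str) -> Tuple[List[str], List[str]]:
    """Return the label's characters in original and in reverse order."""
    chars = list(text)
    return chars, chars[::-1]


# Assumed values for illustration only; the abstract does not state them.
IMAGE_SIZE = (32, 128)          # (height, width) after resizing
PATCH_SIZES = [(4, 8), (8, 4)]  # two hypothetical patch resolutions

for patch_h, patch_w in PATCH_SIZES:
    rows, cols = patch_grid(*IMAGE_SIZE, patch_h, patch_w)
    print(f"patch {patch_h}x{patch_w}: grid {rows}x{cols}, {rows * cols} tokens")

forward, backward = decoding_targets("scene")
print(forward, backward)
```

With these assumed sizes, both patch resolutions produce the same number of tokens while slicing the image differently, which is one way a single model could accept either resolution. In the reversed target, the last character of the word is decoded first.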
URI: https://hdl.handle.net/10356/178284
DOI: 10.32657/10356/178284
Schools: College of Computing and Data Science 
Organisations: A*STAR Institute for Infocomm Research
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Theses

Files in This Item:
File: Final_PhD_Thesis.pdf
Size: 6.64 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.