Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/178284
Title: Improving transformer for scene text and handwritten text recognition
Authors: Tan, Yew Lee
Keywords: Computer and Information Science
Issue Date: 2024
Publisher: Nanyang Technological University
Source: Tan, Y. L. (2024). Improving transformer for scene text and handwritten text recognition. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178284
Abstract: Scene text recognition (STR) involves reading text from images of natural scenes. The text in such images comes in a wide array of fonts, shapes, and orientations. Therefore, various works rely on a rectification network to rectify text images before passing them to the recognition network. However, rectifying an image that does not require it may introduce unwanted distortion, resulting in wrong predictions that would otherwise have been correct. To alleviate the adverse impact of rectification, a portmanteauing of features is presented. The method is introduced into a transformer-based model through a proposed block-matrix initialisation and achieves competitive results. Although the transformer has achieved notable success in various fields, this study identifies areas for improvement in its application to STR. First, a vision transformer requires an input image to be resized to a fixed height and width before being split into patches, yet certain patch resolutions appear to yield better accuracy for images with particular original aspect ratios. Second, the first decoded character generally has lower accuracy. In view of these issues, a pure transformer with integrated experts (PTIE) is proposed. PTIE can process multiple patch resolutions and decode in both the original and reversed character orders, thereby capitalising on the aforementioned observations and achieving state-of-the-art results.
Handwritten text recognition (HTR) deals with handwritten text images from scanned or photographed documents. Works that employ transformer-based models often train them with additional synthetic data; however, these data are not publicly available. Furthermore, experiments in this study suggest that a transformer trained on real HTR data generalises poorly to unseen data. Therefore, within the scope of real data, a simple transformer model that outperforms related works is presented in this thesis. This is achieved by adopting attention masking, which addresses the generalisation issue, and by introducing various pre-processing methods.
URI: https://hdl.handle.net/10356/178284
DOI: 10.32657/10356/178284
Schools: College of Computing and Data Science
Organisations: A*STAR Institute for Infocomm Research
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Theses
Files in This Item:
File | Description | Size | Format
---|---|---|---
Final_PhD_Thesis.pdf | | 6.64 MB | Adobe PDF
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.