Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/156461
Title: An evaluation of tokenizers on domain specific text
Authors: Tao, Yuan
Keywords: Engineering::Computer science and engineering
Issue Date: 2022
Publisher: Nanyang Technological University
Source: Tao, Y. (2022). An evaluation of tokenizers on domain specific text. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/156461
Abstract: The healthcare industry is fast realizing the value of data, collecting information from electronic health record systems (EHRs), sensors, and other sources. However, the problem of understanding the collected data has existed for years. According to big data analytics in healthcare, up to 80% of healthcare documentation is unstructured and hence generally unutilized, because mining and extracting this data is challenging and resource intensive. This is where Natural Language Processing (NLP) can come in. NLP technology services have the potential to extract meaningful insights and concepts from data that was previously considered buried in text form. In NLP studies, text preprocessing is traditionally the first step in building a Machine Learning model, and within text preprocessing, the very first and usually the most important step is tokenization. Many open-source tokenization tools are currently available, each tokenizing text according to different rules, but few studies have examined the performance of tokenizers on domain-specific text such as healthcare text. Therefore, this project aims, first, to evaluate the performance of different open-source tokenizers on medical text data and select the best-performing tokenizer, and second, to build a wrapper around that tokenizer to further improve its performance on medical text data. In this way, more accurate tokenization of medical text data can be achieved, and the results can be used in subsequent NLP steps to generate more meaningful insights. With NLP technology, physicians can enhance patient care, research efforts, and disease diagnosis methods.
URI: https://hdl.handle.net/10356/156461
Schools: School of Computer Science and Engineering
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Student Reports (FYP/IA/PA/PI)
Files in This Item:
File | Description | Size | Format
---|---|---|---
Tao Yuan_FYP Report.pdf (Restricted Access) | | 2.54 MB | Adobe PDF
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.