Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/47535
Full metadata record
DC Field | Value | Language
dc.contributor.author | Wong, Daji | en_US
dc.date.accessioned | 2011-12-27T08:36:21Z |
dc.date.available | 2011-12-27T08:36:21Z |
dc.date.copyright | 2009 | en_US
dc.date.issued | 2009 |
dc.identifier.uri | http://hdl.handle.net/10356/47535 |
dc.description | 59 p. | en_US
dc.description.abstract | In this thesis, an algorithm is presented that selects samples of documents for training text classifiers. Often the number of documents is very large and the documents are noisy. For both efficiency and accuracy, one needs good samples, not blind samples such as those produced by simple random sampling. The proposed algorithm is far superior to simple random sampling, both for small sampling ratios and in the presence of noise. It is based on the simple observation that the terms in the set of training sample documents should have approximately the same document frequency as in the whole set (not including the test set). | en_US
dc.rights | Nanyang Technological University | en_US
dc.subject | DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing | en_US
dc.title | Selecting training samples from large and noisy corpora for efficient text classification | en_US
dc.type | Thesis | en_US
dc.contributor.supervisor | Manoranjan Dash | en_US
dc.contributor.school | Wee Kim Wee School of Communication and Information | en_US
dc.description.degree | Master of Science (Information Studies) | en_US
item.fulltext | With Fulltext | -
item.grantfulltext | restricted | -
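The criterion stated in the abstract (terms in the training sample should have roughly the same document frequency as in the whole training set) can be illustrated with a small sketch. This is not the thesis's algorithm, which is not described in this record. It assumes a simple best-of-k stand-in: draw several random candidate samples and keep the one whose relative document frequencies are closest to those of the full set. The function names `relative_df`, `df_mismatch`, and `select_sample`, the mean-absolute-gap measure, and the toy corpus are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def relative_df(docs):
    """Fraction of documents in `docs` that contain each term."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.split()))  # count each term once per document
    return {term: n / len(docs) for term, n in counts.items()}


def df_mismatch(sample, corpus_df):
    """Mean absolute gap between sample and corpus relative document frequency.

    Assumed distance measure; the thesis may use a different one.
    """
    sample_df = relative_df(sample)
    gaps = (abs(sample_df.get(term, 0.0) - freq) for term, freq in corpus_df.items())
    return sum(gaps) / len(corpus_df)


def select_sample(corpus, size, candidates=50, seed=0):
    """Draw `candidates` random samples and keep the one with the smallest mismatch.

    Best-of-k random search is a stand-in, not the method from the thesis.
    """
    rng = random.Random(seed)
    corpus_df = relative_df(corpus)
    draws = [rng.sample(corpus, size) for _ in range(candidates)]
    return min(draws, key=lambda s: df_mismatch(s, corpus_df))


# Toy training set (made up for this example; the test set is excluded).
corpus = [
    "cheap loans apply now",
    "cheap flights book now",
    "match report final score",
    "final score and match highlights",
    "apply for cheap loans",
    "book flights and hotels",
]
sample = select_sample(corpus, size=3)
print(sample, df_mismatch(sample, relative_df(corpus)))
```

Because the first candidate is itself a simple random sample, the selected sample is never a worse match than that draw under this measure. Whether a closer match translates into better classifier accuracy is the claim the thesis evaluates, not something this sketch shows.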
Appears in Collections:WKWSCI Theses
Files in This Item:
File | Description | Size | Format
WKWSCI_THESES_23.pdf | Restricted Access | 6.86 MB | Adobe PDF



Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.