An Indonesian resource grammar (INDRA) and its application to a treebank (JATI)
Date of Issue: 2018-11-07
School of Humanities
This dissertation describes the creation and development of an open-source, broad-coverage Indonesian computational grammar, called the Indonesian Resource Grammar (INDRA), within the framework of Head-Driven Phrase Structure Grammar (HPSG) (Pollard & Sag, 1994; Sag et al., 2003) and Minimal Recursion Semantics (MRS) (Copestake et al., 2005), using computational tools and resources developed by the DEep Linguistic Processing with HPSG-INitiative (DELPH-IN) research consortium. As a resource grammar, INDRA was employed to build an open-source treebank, called JATI. The research on INDRA and its application to JATI was carried out over four years, from January 2014 to January 2018, during my PhD candidature. Previous work on the computational grammar of Indonesian has mainly been done in the framework of Lexical-Functional Grammar (LFG) (Kaplan & Bresnan, 1982; Dalrymple, 2001), such as Arka (2010a) and Mistica (2013). A computational grammar of Indonesian called IndoGram (Arka, 2012) was developed within the LFG-based Parallel Grammar (ParGram) framework, using the Xerox Linguistic Environment (XLE) parser. To the best of my knowledge, no previous work on Indonesian HPSG has been done. Thus, the development of INDRA can also function as an investigation of the cross-linguistic potential of HPSG and MRS. The approach taken is corpus-driven. The scope covers the analysis and computational implementation of some basic Indonesian constructions and some phenomena in Indonesian text from two sources: the Nanyang Technological University Multilingual Corpus (NTU-MC) (Tan & Bond, 2012) and definition sentences in the fifth edition of Kamus Besar Bahasa Indonesia (KBBI) (Amalia, 2016). The latter contains 2,003 sentences and was treebanked as JATI.
The lexicon was semi-automatically acquired from various sources: the English Resource Grammar (ERG) (Copestake & Flickinger, 2000) via Wordnet Bahasa (Nurril Hirfana Mohamed Noor et al., 2011; Bond et al., 2014), the NTU-MC, and the KBBI definition sentence corpus. The coverage, i.e. the quality and quantity of the corpus sentences parsed by the grammar, is evaluated using test-suites. INDRA can parse and generate complex noun phrases with clitics, determiners, numerals, classifiers, and defining relative clauses; verb phrases with auxiliaries and voice markers; major copula constructions; compounds; coordination of words and phrases with the same part-of-speech; and subordination. However, at the time of submission, INDRA still cannot handle phenomena such as equative, comparative, and superlative adjective phrases; coordination of words and phrases of different parts-of-speech; possessor topic-comment relative clauses with more than one comment; imperatives; and constructions with Wh-question words. These are left for future work. Despite its limitations, compared with IndoGram, INDRA provides more precise analyses for some phenomena and has fifteen times more sentences in the open-source treebank. In addition, INDRA has the potential to be used in various applications such as multilingual machine translation and computer-assisted language learning. Since INDRA is developed in the DELPH-IN community along with other grammars such as the ERG (Flickinger et al., 2010), using the same semantics (MRS), a semantic-transfer-based machine translation system can be easily built. In summary, INDRA serves as the first open-source computational grammar for Indonesian that covers most of the important constructions. INDRA has reached a stage where it has the potential to be applied to various applications such as treebanking, machine translation, and computer-assisted language learning.