Named entity recognition for extracting concept in ontology building on Indonesian language using end-to-end bidirectional long short term memory

Authors: Joan Santoso, Esther Irawati Setiawan, Christian Nathaniel Purwanto, Eko Mulyanto Yuniarno, Mochamad Hariadi, Mauridhi Hery Purnomo
Article number: 114856
Publication: Expert Systems with Applications
Date: 15 August 2021

Information Extraction has been widely used to extract information from text. Named Entity Recognition (NER) is one of the primary tasks of Information Extraction and extracts entities such as persons, locations, and organizations. Extraction from text collections is essential for obtaining information from unstructured text. Moreover, Named Entity Recognition is part of ontology building, which is the main objective of this research. An ontology can be built from a collection of concepts and the relations between those concepts. Concepts in an ontology usually consist of a group of entities and are obtained using Noun Phrase Extraction or Named Entity Recognition. Our main focus in this research is to extract concepts for ontology building automatically using Named Entity Recognition. In this paper, Named Entity Recognition was chosen as our approach because earlier Noun Phrase Extraction work falls short: not all of the nouns it obtains are entities. Our proposed methodology for Named Entity Recognition applies an end-to-end model using Bidirectional Long Short Term Memory (Bi-LSTM). A Bi-LSTM can perform a sequence classification task by taking the context of the input into account. Named Entity Recognition approaches in previous studies perform Part-of-Speech (POS) Tagging in the preprocessing phase using other tools or models. The Part-of-Speech tag is also used as a feature to improve the performance of Named Entity Recognition. Our proposed methodology provides an end-to-end system that can be used for both POS Tagging and Named Entity Recognition. With our proposed end-to-end model, no additional tool is needed for Part-of-Speech Tagging. This is the advantage of our model compared to other models. Experiments were conducted on news documents labeled with four types of entity classes and 35 types of part-of-speech. The target entities extracted in this study are person, location, organization, and miscellaneous. We evaluated the performance of our model using the F1-Score. The best F1-Scores we achieved were 91.79% for Part-of-Speech Tagging and 83.18% for Named Entity Recognition.

Nowadays, the Internet is growing tremendously as a source of information. The widespread data on the Internet poses the challenge of how to store and process this massive volume for various purposes. However, the most challenging task is how to obtain information and manage it in a knowledge base that a computer can process. Further, some knowledge representations have been developed to represent information from domain-specific topics. Most recent research utilizes ontology as a knowledge representation in Natural Language Processing.

Named Entity Recognition is an essential research area in Natural Language Processing (NLP) since it is the foundation of knowledge acquisition and of related tasks such as coreference resolution, machine translation, and question answering. Information Extraction is one of the tasks for populating an ontology (Humphreys, 2016), and Named Entity Recognition is a task of Information Extraction. Therefore, Named Entity Recognition is a fundamental technique in ontology construction.

An ontology can be defined as a tuple T = (C, R), where C is the set of concepts that belong to the ontology and R is the set of relationships between those concepts. Ontology building, or taxonomy induction, is the process of organizing the concepts C and relations R extracted from a text corpus into a logical hierarchy for knowledge representation (Yang & Callan, 2008). This process consists of two subtasks: concept extraction and relation formation (Yang & Callan, 2009). Concept extraction is defined as the task of finding the important entities that appear in a document. Relation extraction is defined as the task of determining the semantic relations between the entities that appear in the documents.
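The tuple definition above maps naturally onto a small data structure. The following is a minimal sketch, not the authors' implementation: the class names `Ontology` and `Relation`, the `part_of` label, and the example concepts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Relation:
    """One element of R: a labelled, directed link between two concepts."""
    head: str
    label: str
    tail: str


@dataclass
class Ontology:
    """T = (C, R): a set of concepts C and a set of relations R over them."""
    concepts: Set[str] = field(default_factory=set)
    relations: Set[Relation] = field(default_factory=set)

    def add_concept(self, name: str) -> None:
        self.concepts.add(name)

    def add_relation(self, head: str, label: str, tail: str) -> None:
        # A relation may only connect concepts that are already in C.
        for concept in (head, tail):
            if concept not in self.concepts:
                raise ValueError(f"unknown concept: {concept!r}")
        self.relations.add(Relation(head, label, tail))

    def tails(self, head: str, label: str) -> List[str]:
        """Return, sorted, every concept linked from `head` by `label`."""
        return sorted(r.tail for r in self.relations
                      if r.head == head and r.label == label)


# Hypothetical example data: concepts as an NER step might supply them.
ontology = Ontology()
for name in ("Surabaya", "Jawa Timur", "Indonesia"):
    ontology.add_concept(name)
ontology.add_relation("Surabaya", "part_of", "Jawa Timur")
ontology.add_relation("Jawa Timur", "part_of", "Indonesia")
```

In this sketch, concept extraction would fill `concepts` and relation formation would fill `relations`; requiring both endpoints of a relation to be known concepts mirrors the dependence of R on C in the definition.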

An ontology represents knowledge through concepts and the relations between them. This knowledge has been utilized in several areas of Natural Language Processing research, such as chatbots (Vegesna, Jain, & Porwal, 2018) and question answering systems (Rajendran and Sharon, 2017). Tahar et al. (2016) and Dutta, Sinha, et al. (2018) state that most ontology building or construction is done manually and that the process is expensive. With these motivations, several studies have aimed to build an ontology automatically, e.g. Yang and Callan (2008) and Vairavasundaram and Logesh (2018).

Ontology construction research has been conducted in various languages such as English (Cahyani and Wasito, 2017; Xian et al., 2018), Japanese (Joyce et al., 2017; Kawakami et al., 2017), Chinese (Lin et al., 2018; Qiu et al., 2018), and Arabic (Albukhitan, Helmy, & Alnazer, 2017). Automatic ontology development is also quite popular for Indonesian. Harjito, Cahyani, and Doewes (2016) built a bilingual ontology for tuberculosis in Indonesian and English using a corpus and an ontology design pattern. Santoso, Nugraha, Yuniarno, and Hariadi (2015) developed an ontology with a focus on hyponymy and meronymy relations. Silalahi, Cahyani, Sensuse, and Budi (2015) created a Medicinal Plant Ontology using a Socio-Technical Approach that considers the social and technical sides in all phases of ontology building. Virginia and Son (2010) automatically constructed an ontology using a Linear Model of Expert System. Further, Rahayu, Krisnadhi, Wulandari, and Sensuse (2018) developed an ontology for the competency certification domain.

Classical ontology-building techniques obtain noun phrases from text and extract the relations between them. This is challenging since a noun phrase is usually not an entity. There are cases where indefinite nouns or noun chunks do not refer to entities (Elhadad, 2009). Moreover, a decision or further analysis is needed to determine whether a multiword unit is a noun compound or a Named Entity (Nagy, Berend, & Vincze, 2011). Our previous approaches obtained entities for the Indonesian language from noun phrases using Shallow Parsing (Santoso et al., 2015) or Grammatical Induction (Hermawan et al., 2011). However, they also extracted some false entities. To overcome this problem, we utilized Named Entity Recognition (NER) to extract the concepts in the ontology.

An ontology comprises the concepts and relations of a domain in which abstractions of real-world objects cannot be readily understood by applications. Thus, concepts and expressions are essential. These can be obtained with an NER task, which identifies and classifies instances of concepts (Zhang and Ciravegna, 2011; Buitelaar and Cimiano, 2008). Many relation extraction approaches include Named Entity Recognition to extract the entities as part of the process, e.g. Pantel and Pennacchiotti (2006), Miwa and Bansal (2016), and Giorgi et al. (2019).

We focus our research on the entity extraction domain because we believe that entity extraction plays a significant role in ontology building research (Kim, 2017). Our approach uses a multitask Bidirectional Long Short Term Memory (Bi-LSTM). With this proposed model, we provide an end-to-end system that extracts entities and includes the Part-of-Speech (POS) Tagging task within our Named Entity Recognition model. Part-of-Speech information is believed to improve entity extraction performance (Feng, Zhang, Hao, & Chen, 2017), and most existing works handle this task in a preprocessing phase.
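As a rough illustration of what a multitask Bi-LSTM tagger computes, the equations below use the standard Bi-LSTM formulation. The shared encoder, the two softmax output layers, and the unweighted sum of the two losses are assumptions made for this sketch, not details confirmed by this section; the symbols x_t, W, b, and the loss terms are hypothetical notation.

```latex
\begin{align}
\overrightarrow{h}_t &= \mathrm{LSTM}_{\mathrm{fwd}}\bigl(x_t, \overrightarrow{h}_{t-1}\bigr) \\
\overleftarrow{h}_t &= \mathrm{LSTM}_{\mathrm{bwd}}\bigl(x_t, \overleftarrow{h}_{t+1}\bigr) \\
h_t &= \bigl[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\bigr] \\
\hat{y}^{\mathrm{POS}}_t &= \mathrm{softmax}\bigl(W_{\mathrm{POS}}\, h_t + b_{\mathrm{POS}}\bigr) \\
\hat{y}^{\mathrm{NER}}_t &= \mathrm{softmax}\bigl(W_{\mathrm{NER}}\, h_t + b_{\mathrm{NER}}\bigr) \\
\mathcal{L} &= \mathcal{L}_{\mathrm{POS}} + \mathcal{L}_{\mathrm{NER}}
\end{align}
```

Here x_t is the embedding of the token at position t, and h_t concatenates the forward and backward hidden states, so each tag prediction can draw on both the left and the right context. Under these assumptions, both taggers are trained jointly on the same representation, which is one way a single model could replace a separate POS Tagging tool.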