Subword Tokenization of Noisy Housing Defect Complaints for Named Entity Recognition

Kahyun Jeon; Ghang Lee

Abstract:

In domain-specific named entity recognition (NER), the out-of-vocabulary (OOV) problem arises from linguistic features and rare vocabulary. The OOV problem is particularly challenging in agglutinative languages such as Korean, where the irregular decomposition of morphemes makes it difficult to represent all of them in language model dictionaries, resulting in poor NER performance. Subword tokenization, which segments a word into atomic tokens that cannot be divided further, is one possible solution. In the construction industry, existing NER methods are not effective on housing defect complaints, which contain many rare words, including jargon, slang, and typos. To address this challenge, we propose subword tokenization algorithms that mitigate OOV problems by taking into account linguistic features and pre-trained language models (PLMs). The primary objective of this study is to identify the configuration that yields the best NER performance by comparing different subword tokenization methods across the language models used. For domain-specific NER, we defined 23 defect-specific named entity tags and used them to label the dataset. We then experimented with three state-of-the-art language models: one based on SentencePiece subword tokenization and two based on WordPiece. The results demonstrate that the SentencePiece-based Korean Bidirectional Encoder Representations from Transformers (KoBERT) outperformed the two WordPiece-based language models, multilingual-BERT and Korean Efficiently Learning an Encoder that Classifies Token Replacements Accurately (KoELECTRA), achieving an F1 score of 84.7%. The proposed method is expected to improve not only NER but also other downstream tasks that involve Korean documents in the construction industry.
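
To illustrate how subword segmentation can reduce OOV tokens, the following is a minimal sketch of greedy longest-match-first segmentation in the WordPiece style. It is not the implementation used in this study: the function name `wordpiece_tokenize`, the vocabulary `TOY_VOCAB`, and the English example words are hypothetical and chosen only for readability. SentencePiece-based models instead segment raw text with a learned subword model, which this sketch does not cover.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split a word into subwords by greedy longest-match-first lookup.

    Pieces that do not start the word carry the "##" continuation prefix.
    If some position cannot be matched, the whole word maps to `unk`.
    """
    tokens: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry covers this position: the word is OOV.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary (assumption for illustration only).
TOY_VOCAB = {"leak", "##age", "wall", "##paper", "##s", "crack", "##ed"}

if __name__ == "__main__":
    for w in ["leakage", "wallpapers", "mold"]:
        print(w, wordpiece_tokenize(w, TOY_VOCAB))
```

In this toy setting, "leakage" and "wallpapers" are covered by known pieces even though neither appears in the vocabulary as a whole word, whereas "mold" has no matching piece and falls back to the unknown token.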