Publications / 2021 Proceedings of the 38th ISARC, Dubai, UAE
In the construction industry, contractors require precise knowledge of the design restrictions that originate from regulatory documents and contract specifications. To check a building design for compliance with these rules automatically, the rules have to be converted from natural language into a machine-readable format. Carried out by human experts, this task is laborious and error-prone, so automating it is desirable. A building block of this information extraction process is finding the key terms that carry the semantic information in each design rule. Named Entity Recognition, a sub-task of Information Extraction in the field of natural language processing, aims to find these entities in unstructured text and assign each a label from a set of predefined classes. This paper presents a method based on a supervised deep learning transformer model, which is used to extract relevant terms from a corpus of German regulatory documents. It requires little training data and no user interaction, and it achieves weighted precision and recall scores of over 90 % each for 20 unbalanced classes. Additionally, the paper investigates how different tagging schemes and model variations affect the classifier's performance. To support future extensions, the class labels are chosen such that they can be linked to concepts already defined by the Industry Foundation Classes. As part of this study, a training data set of 9000 sentences from construction law documents, annotated with named entity tags, is created.
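To illustrate what a tagging scheme is in this setting, the following minimal sketch converts a tag sequence from the BIO scheme to the IOBES scheme. The abstract does not name the schemes that were compared, so BIO and IOBES are assumed here only because they are two common choices. The example sentence, the class labels (`Property`, `Value`, `Unit`) and the function name `bio_to_iobes` are likewise hypothetical and are not taken from the paper's data set.

```python
from typing import List


def bio_to_iobes(tags: List[str]) -> List[str]:
    """Convert a BIO tag sequence to IOBES (assumed schemes, not the paper's)."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag of the same class.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # Single-token entities become S-, multi-token entities keep B-.
            converted.append(f"B-{label}" if continues else f"S-{label}")
        else:
            # The last token of a multi-token entity becomes E-.
            converted.append(f"I-{label}" if continues else f"E-{label}")
    return converted


# Hypothetical German rule: "The clear width must be at least 90 cm."
tokens = ["Die", "lichte", "Breite", "muss", "mindestens", "90", "cm", "betragen"]
bio_tags = ["O", "B-Property", "I-Property", "O", "O", "B-Value", "B-Unit", "O"]

for token, tag in zip(tokens, bio_to_iobes(bio_tags)):
    print(f"{token}\t{tag}")
```

IOBES marks entity ends and single-token entities explicitly, which gives the classifier more boundary information at the cost of a larger label set. That trade-off is one reason the choice of scheme can affect performance.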