Publications / 2023 Proceedings of the 40th ISARC, Chennai, India

A pre-trained language model-based framework for deduplication of construction safety newspaper articles

Abhipraay Nevatia, Soukarya Saha, Sundar Balarka Bhagavatula, Nikhil Bugalia
Pages 387-394 (2023 Proceedings of the 40th ISARC, Chennai, India, ISBN 978-0-6458322-0-4, ISSN 2413-5844)
Abstract:

The unavailability of Occupational Health and Safety (OHS) statistics for the construction sector is a systemic hurdle to improving safety, particularly in developing countries. Online newspaper articles are a potential alternative source of OHS statistics, and Machine Learning (ML) approaches to text mining are considered essential because processing news articles is resource-intensive. However, the literature applying ML to newspaper articles is scarce, and tasks such as removing duplicate articles have not been addressed satisfactorily. To address this research gap, the current study develops and evaluates a novel framework based on pre-trained language models for deduplicating construction safety-related news articles. The study relies on the Question Answering (QA) ability of a Longformer model fine-tuned on the Stanford Question Answering Dataset (SQuAD) to identify the date and location of construction accidents from news articles. The combination of date and location is used as a key for deduplicating news articles that refer to the same accident. The comparative performance of the developed framework is evaluated on 141 accident articles systematically extracted from 7 months of construction-relevant news articles in India. With an accuracy of more than 90%, the proposed method outperforms other methods for date identification. The deduplication performance of the Longformer-based process, an F1 score of 0.79, is comparable to that of cosine similarity-based approaches. However, compared to the commonly adopted cosine similarity-based method, the method developed in this study is more reliable and consistent for periodically processing large quantities of unlabeled data.
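
The key-based deduplication idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The question wordings, the function names (`dedup_key`, `group_duplicates`), and the regex-based `toy_qa` stand-in are all assumptions made for illustration. In a real pipeline, `toy_qa` would presumably be replaced by an extractive QA model such as a Longformer fine-tuned on SQuAD, and the extracted answers would likely need stronger normalization than shown here.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a QA function maps (question, context) to an answer string.
QAFn = Callable[[str, str], str]

# Assumed question wordings; the abstract does not state the actual prompts.
DATE_QUESTION = "When did the accident happen?"
LOCATION_QUESTION = "Where did the accident happen?"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(text.lower().split())


def dedup_key(article: str, qa: QAFn) -> Tuple[str, str]:
    """Build the (date, location) key for one article from two QA answers."""
    date = normalize(qa(DATE_QUESTION, article))
    location = normalize(qa(LOCATION_QUESTION, article))
    return date, location


def group_duplicates(articles: List[str], qa: QAFn) -> Dict[Tuple[str, str], List[int]]:
    """Group article indices that share the same (date, location) key."""
    groups: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for idx, article in enumerate(articles):
        groups[dedup_key(article, qa)].append(idx)
    return dict(groups)


def toy_qa(question: str, context: str) -> str:
    """Regex stand-in for a real QA model; for demonstration only."""
    if question == DATE_QUESTION:
        pattern = r"on (\d{1,2} \w+ \d{4})"
    else:
        pattern = r"in ([A-Z]\w+)"
    match = re.search(pattern, context)
    return match.group(1) if match else ""


# Invented example articles, not taken from the study's dataset.
articles = [
    "Two workers died on 12 March 2022 in Chennai when a scaffold collapsed.",
    "Scaffold collapse in Chennai on 12 March 2022 kills two workers.",
    "A crane toppled on 3 May 2022 in Pune, injuring one.",
]
groups = group_duplicates(articles, toy_qa)
for key, indices in groups.items():
    print(key, indices)
```

Articles that land in the same group are treated as reports of the same accident. Unlike a pairwise cosine similarity comparison, this grouping needs no similarity threshold to be tuned, which is one plausible reading of why the abstract calls the approach consistent for periodic processing.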

Keywords: Construction safety, News Articles, Machine Learning, BERT, Longformer, Deduplication
Presentation Video: https://youtu.be/HPFePiJ9vfk