YAKE-Guided LDA approach for automatic classification of construction safety reports

Hrishikesh Gadekar; Nikhil Bugalia

Abstract:

Identifying efficient processes for classifying text-based safety reports using machine learning (ML) is an essential area of research. However, much of the previous work on the topic relies on supervised learning approaches, which are often manually intensive and require large volumes of pre-labeled data. To reduce the need for human intervention during the classification process, the current study tests the applicability and validity of a state-of-the-art unsupervised learning approach, i.e., Yet Another Keyword Extractor (YAKE) integrated with Guided Latent Dirichlet Allocation (GLDA). The current study is the first known application of the approach in the construction sector. Web-based, readily accessible information is used to develop a domain corpus. The keywords obtained from the domain corpus using YAKE are seeded in GLDA to classify nearly 13,000 safety reports from two different datasets into four commonly used categories. The study demonstrates that moderate to high classification performance is achievable through the YAKE-GLDA approach: an F1 score of 0.82 for the personal protective equipment category and a total F1 score of 0.62 are achievable. Furthermore, the same domain corpus helps achieve good classification performance across different datasets, highlighting the generalizability of the YAKE-GLDA approach. However, results from a novel sensitivity analysis show that the sensitivity to hyperparameters does not follow a generalizable trend. Hence, attention is warranted for potential consistency issues facing the approach. The preliminary results demonstrate strong potential for wide-ranging adoption of the YAKE-GLDA approach in the construction industry. However, future work should also focus on applications with more granular classification labels and on improving classification efficiency.
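As an illustration of how per-category and aggregate F1 scores of the kind reported above can be computed, the following is a minimal sketch in Python. The abstract does not state how the total F1 score was aggregated, so the macro average used here is an assumption. The category names other than "PPE" and all of the toy labels are hypothetical and are not taken from the study's datasets.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this category
        else:
            fp[pred] += 1    # predicted category gains a false positive
            fn[truth] += 1   # true category gains a false negative
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-category F1 scores (assumed aggregation)."""
    return sum(scores.values()) / len(scores)


# Hypothetical toy labels; "cat_a" and "cat_b" are placeholder categories.
y_true = ["PPE", "PPE", "PPE", "cat_a", "cat_a", "cat_b"]
y_pred = ["PPE", "PPE", "cat_a", "cat_a", "cat_b", "cat_b"]

scores = per_class_f1(y_true, y_pred)
total = macro_f1(scores)
print(scores)
print(round(total, 3))
```

A macro average weights every category equally regardless of how many reports it contains. A micro or support-weighted average would give a different total when the categories are imbalanced, so the aggregation choice should be checked against the full paper.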