2024 Proceedings of the 41st ISARC, Lille, France
Monitoring fatigue with computer-vision-based action recognition is challenging because fatigue alters motion patterns. This is especially true in construction, where motion patterns are unique to each trade and last longer than everyday actions. This paper aims to identify patterns that can guide the selection of optimal clip durations for aggregating the motion features specific to each task. We compare the performance of three action recognition models (I3D, MViT, and VideoMAE) on different construction tasks (excavation, masonry, plastering, etc.) at varying clip lengths. We evaluate the models on frame-wise accuracy, sequence predictability error, and normalized evaluation duration. Our results show that the transformer-based models outperform the convolutional neural network-based models, and that the model trained directly on videos performs better than those trained on images. Clip duration also affects model performance differently depending on the task type. Neither the 3s context window of the AVA dataset nor the 10s context window of the Kinetics-400 dataset is suitable for construction tasks. Instead, we suggest a variable clip duration between 5s and 7s, with the preferred value depending on the task and model architecture. Our work provides insights for developing a dynamic, context-aware duration selection system for action recognition in construction.
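To make the clip-duration sweep concrete, the following is a minimal sketch (not the authors' code): a video's frames are split into fixed-length clips, each clip receives one predicted action label, that label is broadcast to its frames, and frame-wise accuracy is compared across candidate durations. The frame rate, the `dummy_model` stand-in for I3D/MViT/VideoMAE inference, and the synthetic labels are all assumptions for illustration.

```python
import numpy as np

FPS = 25  # assumed frame rate; the study's actual capture rate may differ


def frame_accuracy_for_duration(frame_labels, predict_clip, clip_seconds):
    """Assign one predicted label per clip, then score it against every frame."""
    clip_len = int(clip_seconds * FPS)
    n_frames = len(frame_labels)
    predicted = np.empty(n_frames, dtype=int)
    for start in range(0, n_frames, clip_len):
        end = min(start + clip_len, n_frames)
        # predict_clip stands in for running a video model on frames[start:end]
        predicted[start:end] = predict_clip(start, end)
    return float(np.mean(predicted == frame_labels))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic ground truth: alternating task phases, purely for illustration.
    frame_labels = np.repeat(rng.integers(0, 3, size=20), 6 * FPS)

    def dummy_model(start, end):
        # Majority ground-truth label with occasional noise, standing in for a real model.
        counts = np.bincount(frame_labels[start:end], minlength=3)
        return int(np.argmax(counts)) if rng.random() > 0.1 else int(rng.integers(0, 3))

    for seconds in (3, 5, 7, 10):  # candidate clip durations discussed in the abstract
        acc = frame_accuracy_for_duration(frame_labels, dummy_model, seconds)
        print(f"{seconds:>2}s clips: frame-wise accuracy = {acc:.3f}")
```

In a real evaluation, `predict_clip` would wrap inference with one of the compared architectures, and the same sweep would be repeated per construction task to expose the task-dependent optimum the paper reports.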