Automatic activity recognition plays an important role in improving the efficiency of site management. In recent years, interest in vision-based activity recognition has grown, but its relatively low recognition accuracy and speed impede practical application. This paper introduces a discriminative model that combines deep activity features with contextual information to improve the recognition of activities of workers on foot in site surveillance videos. Specifically, a conditional random field (CRF) model is designed based on deep activity features, which are extracted with a single-stream deep activity recognition network, and spatial relevance, which is obtained with a tracking-by-detection multiple-object tracking method. We evaluated various deep activity features, including action features, activity features, and joint features. We also parameterized the contextual information of activities in terms of spatial relevance and represented the context with graphs of K-nearest neighbors. The experimental results show that the CRF model based on deep activity features and activity context significantly improves activity recognition performance, raising average accuracy to 98.77%, an improvement of 22.10% over the baseline of 77.67% obtained using the single-stream deep activity recognition network alone, with a small computational overhead of 0.025 ms per segment.
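The K-nearest-neighbor context graph mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the distance measure, the value of K, or how worker positions are encoded, so everything here is an assumption: the function name `knn_graph` is hypothetical, workers are represented by the centres of their tracked bounding boxes, and closeness is plain Euclidean distance in pixels. This is not the authors' implementation.

```python
import math


def knn_graph(centroids, k):
    """Map each worker index to the indices of its k nearest other workers.

    Assumption: spatial relevance is approximated by Euclidean distance
    between bounding-box centres within a single frame.
    """
    graph = {}
    for i, p in enumerate(centroids):
        # Distance from worker i to every other worker in the same frame.
        others = [(math.dist(p, q), j) for j, q in enumerate(centroids) if j != i]
        others.sort()  # nearest first; ties broken by index
        graph[i] = [j for _, j in others[:k]]
    return graph


# Hypothetical bounding-box centres (pixels) of four tracked workers.
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
graph = knn_graph(centroids, k=1)
```

In a CRF of the kind the abstract describes, edges such as these would plausibly determine which workers' activity labels are coupled by pairwise terms, while the deep activity features would supply the per-worker evidence.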
|Journal|Computer-Aided Civil and Infrastructure Engineering|
|Publication status|Accepted/In press - 1 Jan 2020|
ASJC Scopus subject areas
- Civil and Structural Engineering
- Computer Science Applications
- Computer Graphics and Computer-Aided Design
- Computational Theory and Mathematics