Abstract
Real-time video perception is often challenging on resource-constrained edge devices because of accuracy drops and hardware overhead, so saving computation is the key to improving performance. Existing methods rely on either domain-specific neural chips or pre-searched models, which require specialized optimization for the properties of each task. These limitations motivate us to design a general, task-independent methodology called Patch Automatic Skip Scheme (PASS), which supports diverse video perception settings by decoupling acceleration from tasks. The gist is to capture inter-frame correlations and skip redundant computations at the patch level, where a patch is a non-overlapping square block of the visual input. PASS equips each convolution layer with a learnable gate that selectively determines which patches can be safely skipped without degrading model accuracy. Specifically, we are the first to construct a self-supervisory procedure for gate optimization, which learns to extract contrastive representations from frame sequences. The pre-trained gates can serve as plug-and-play modules to implement patch-skippable neural backbones and automatically generate a proper skip strategy to accelerate different video-based downstream tasks, e.g., outperforming the state-of-the-art MobileHumanPose in 3D pose estimation and FairMOT in multiple object tracking by up to $9.43\times$ and $12.19\times$ speedups, respectively, on NVIDIA Jetson Nano devices.
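The patch-level skipping idea can be illustrated with a minimal sketch. It is not the PASS implementation: the abstract describes learnable gates trained with a self-supervised procedure, whereas this sketch substitutes a hand-set threshold on the mean absolute difference between consecutive frames. The function names `patch_skip_mask` and `skip_ratio`, and the parameters `patch_size` and `threshold`, are hypothetical and chosen only for illustration.

```python
def patch_skip_mask(prev_frame, curr_frame, patch_size, threshold):
    """Return a 2D list of booleans; True marks a patch that may be skipped.

    Assumption: a fixed threshold on the mean absolute inter-frame
    difference stands in for the learnable gate described in the abstract.
    Frames are 2D lists of floats (single channel) of equal size.
    """
    height, width = len(curr_frame), len(curr_frame[0])
    if height % patch_size or width % patch_size:
        raise ValueError("frame size must be a multiple of patch_size")

    mask = []
    # Walk over non-overlapping square patches.
    for top in range(0, height, patch_size):
        row = []
        for left in range(0, width, patch_size):
            diff = 0.0
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    diff += abs(curr_frame[y][x] - prev_frame[y][x])
            mean_diff = diff / (patch_size * patch_size)
            # A nearly unchanged patch is treated as redundant.
            row.append(mean_diff < threshold)
        mask.append(row)
    return mask


def skip_ratio(mask):
    """Fraction of patches marked as skippable."""
    flat = [flag for row in mask for flag in row]
    return sum(flat) / len(flat)


# Two 4x4 frames that differ only inside the top-left 2x2 patch.
prev = [[0.0] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
curr[0][0] = 1.0

mask = patch_skip_mask(prev, curr, patch_size=2, threshold=0.1)
print(mask)              # [[False, True], [True, True]]
print(skip_ratio(mask))  # 0.75
```

In this toy case three of the four patches are unchanged, so a backbone that honours the mask would recompute only the top-left patch and could reuse cached results for the rest. A learned gate would replace the threshold test with a per-layer decision trained on frame sequences.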
Original language | English |
---|---|
Article number | 10381763 |
Pages (from-to) | 3938-3954 |
Number of pages | 17 |
Journal | IEEE Transactions on Pattern Analysis and Machine Intelligence |
Volume | 46 |
Issue number | 5 |
Publication status | Published - 1 May 2024 |
Keywords
- On-device processing systems
- video perception
- visual analytics
ASJC Scopus subject areas
- Software
- Computer Vision and Pattern Recognition
- Computational Theory and Mathematics
- Artificial Intelligence
- Applied Mathematics