Temporal Enhancement for Video Affective Content Analysis

Abstract

With the popularity and advancement of the Internet and video-sharing platforms, video affective content analysis has developed rapidly. Nevertheless, existing methods often rely on simple models to extract semantic information, which may fail to capture comprehensive emotional cues in videos. In addition, these methods tend to overlook the substantial amount of irrelevant information in videos, as well as the uneven importance of modalities for emotional tasks. This can introduce noise from both temporal fragments and modalities, diminishing the model's ability to identify crucial temporal fragments and recognize emotions. To tackle these issues, in this paper we propose a Temporal Enhancement (TE) method. Specifically, we employ three encoders to extract features at different levels and sample features to enhance temporal data, thereby enriching the video representation and improving the model's robustness to noise. Subsequently, we design a cross-modal temporal enhancement module that enhances the temporal information of each modal feature. This module interacts with multiple modalities at once to emphasize critical temporal fragments while suppressing irrelevant ones. Experimental results on four benchmark datasets show that the proposed temporal enhancement method achieves state-of-the-art performance in video affective content analysis. Moreover, the effectiveness of each module is confirmed through ablation experiments.
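The abstract does not specify the internals of the cross-modal temporal enhancement module, so the following is only a minimal PyTorch sketch of the general idea it describes: re-weighting one modality's temporal fragments using attention over all modalities, so fragments supported by cross-modal evidence are emphasized and irrelevant ones are suppressed. The class name, gating mechanism, feature dimensions, and sequence lengths are all illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class CrossModalTemporalEnhancement(nn.Module):
    """Illustrative sketch (not the paper's implementation): attend from one
    modality's temporal sequence to the concatenation of all modalities, then
    gate the attended context to emphasize or suppress each fragment."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, others):
        # target: (B, T, D) temporal fragments of one modality
        # others: list of (B, T_i, D) sequences from the remaining modalities
        context = torch.cat([target] + list(others), dim=1)   # (B, T_total, D)
        attended, _ = self.attn(target, context, context)      # cross-modal attention
        weights = self.gate(attended)                           # per-fragment gates in (0, 1)
        return self.norm(target + weights * attended)           # enhanced sequence


# Toy usage with hypothetical visual, audio, and textual features
if __name__ == "__main__":
    B, D = 2, 256
    visual = torch.randn(B, 12, D)
    audio = torch.randn(B, 30, D)
    text = torch.randn(B, 8, D)
    module = CrossModalTemporalEnhancement(dim=D)
    enhanced_visual = module(visual, [audio, text])
    print(enhanced_visual.shape)  # torch.Size([2, 12, 256])
```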

Publication
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
Shangfei Wang
Professor of Artificial Intelligence

My research interests include Pattern Recognition, Affective Computing, Probabilistic Graphical Models, and Computational Intelligence.
