Integrating Facial Images, Speeches and Time for Empathy Prediction

Publication
14th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2019, Lille, France, May 14-18, 2019

We propose a multi-modal method for the One-Minute Empathy Prediction competition. First, we use a bottleneck residual network and a fully-connected network to encode the facial images and speech of the listener. Second, we propose to use the current time stage as a temporal feature and encode it into the proposed multi-modal network. Third, we select a subset of the training data based on its empathy prediction performance on the validation data. Experimental results on the test set show that the proposed method significantly outperforms the baseline according to the CCC metric (0.14 vs. 0.06).
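The sketch below illustrates the kind of fusion the abstract describes: a face encoder, a fully-connected speech branch, and the scalar time stage concatenated into one regression head. It is a minimal PyTorch sketch, not the authors' exact architecture; the feature dimensions, layer widths, and the `[-1, 1]` output range are illustrative assumptions.

```python
# Minimal sketch of the multi-modal fusion (assumed sizes, not the paper's exact config).
import torch
import torch.nn as nn

FACE_FEAT_DIM = 128   # output size of the bottleneck-residual image encoder (assumed)
AUDIO_DIM = 1582      # dimensionality of the speech descriptor (assumed)
TIME_DIM = 1          # scalar time stage of the current frame


class EmpathyRegressor(nn.Module):
    def __init__(self, face_encoder: nn.Module):
        super().__init__()
        self.face_encoder = face_encoder           # CNN over 120x120 face crops
        self.audio_encoder = nn.Sequential(        # fully-connected speech branch
            nn.Linear(AUDIO_DIM, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
        )
        self.head = nn.Sequential(                 # fuse face + speech + time stage
            nn.Linear(FACE_FEAT_DIM + 64 + TIME_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),           # empathy valence in [-1, 1] (assumed)
        )

    def forward(self, face, audio, time_stage):
        f = self.face_encoder(face)                # (B, FACE_FEAT_DIM)
        a = self.audio_encoder(audio)              # (B, 64)
        x = torch.cat([f, a, time_stage], dim=1)   # concatenate the three modalities
        return self.head(x)
```

Concatenating the time stage with the learned face and speech features is one simple way to let the regressor condition on where the current frame lies within the one-minute interaction.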

Fig. Model architecture. The inputs of the proposed multi-modal deep network are facial images, audio signals and time stamps. Specifically, we extract facial images of the listener in each frame through OpenCV, and then reshape the facial images to (120, 120). The preprocessed facial images are fed into a network with one convolution layer and six sequential bottleneck residual modules. See text for details.
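The caption's image branch can be sketched as follows: a face crop resized to (120, 120) with OpenCV, a stem convolution, and six sequential bottleneck residual modules pooled into a feature vector. Channel widths, strides, and the final feature dimension are assumptions for illustration, not the paper's specification.

```python
# Sketch of the image branch: OpenCV preprocessing + one conv + six bottleneck residual modules.
import cv2
import numpy as np
import torch
import torch.nn as nn


def preprocess_face(face_bgr: np.ndarray) -> torch.Tensor:
    """Resize an already-detected OpenCV face crop to (120, 120) and convert to a CHW float tensor."""
    face = cv2.resize(face_bgr, (120, 120))
    return torch.from_numpy(face).permute(2, 0, 1).float() / 255.0


class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity or projection shortcut."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))


def make_face_encoder(feat_dim=128):
    # Stem convolution followed by six sequential bottleneck modules (assumed channel plan),
    # then global average pooling to a fixed-size face feature.
    blocks = [nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
    plan = [(32, 16, 64, 2), (64, 16, 64, 1), (64, 32, 128, 2),
            (128, 32, 128, 1), (128, 64, 256, 2), (256, 64, 256, 1)]
    for in_ch, mid_ch, out_ch, stride in plan:
        blocks.append(Bottleneck(in_ch, mid_ch, out_ch, stride))
    blocks += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, feat_dim)]
    return nn.Sequential(*blocks)
```

An encoder built with `make_face_encoder()` can serve as the `face_encoder` argument of the fusion sketch above.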

Shi Yin
Technical Researcher
Shangfei Wang
Professor of Artificial Intelligence

My research interests include Pattern Recognition, Affective Computing, Probabilistic Graphical Models, and Computational Intelligence.
