Empathetic Response Generation Through Multi-modality

Abstract

Despite remarkable advancements in empathetic response generation (ERG), existing research has centered on achieving affective and cognitive empathy by perceiving users’ emotions and deducing contextual information from knowledge databases. Human communication combines textual, visual, and audio cues to interpret others’ intentions, yet previous ERG work has focused on text-based methods and neglected the contextual information contained in audiovisual data. To bridge this gap, we propose fostering empathy with users by integrating the audiovisual and text modalities. First, the proposed method uses a cross-modal attention mechanism to perceive users’ emotions from the multi-modal conversation. It integrates the multi-modal data with the perceived emotions during the response generation process, so that the generated responses resonate with users at the affective level by mirroring their emotions. Second, we introduce guidance text that focuses on visual context or user experiences and provides contextual information, thereby enhancing cognitive empathy. The proposed method aligns the multi-modal dialogue history and the guidance text through a multi-source attention mechanism. Finally, the proposed method produces empathetic responses by understanding users’ backgrounds and emotions. Experiments on three multi-modal datasets, namely MELD, IEMOCAP, and MEDIC, demonstrate that the proposed method outperforms state-of-the-art methods.
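The sketch below is a minimal, illustrative reading of the two attention steps the abstract mentions, not the paper’s released implementation: a cross-modal attention block in which textual utterance features attend to audiovisual features (the affective-empathy step), and a multi-source attention block in which decoder states attend jointly to the fused dialogue history and to the guidance text (the cognitive-empathy step). All module names, dimensions, and the gating scheme are assumptions made for illustration.

```python
# Illustrative sketch only; module names, sizes, and gating are assumptions.
import torch
import torch.nn as nn

D_MODEL = 256  # assumed shared hidden size after per-modality encoders


class CrossModalAttention(nn.Module):
    """Text queries attend over audiovisual keys/values (affective empathy step)."""

    def __init__(self, d_model: int = D_MODEL, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feats: torch.Tensor, av_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_utterances, d_model); av_feats: (batch, n_av_frames, d_model)
        fused, _ = self.attn(query=text_feats, key=av_feats, value=av_feats)
        return self.norm(text_feats + fused)  # residual fusion of the two modalities


class MultiSourceAttention(nn.Module):
    """Decoder states attend to the fused history and to guidance text, then mix."""

    def __init__(self, d_model: int = D_MODEL, n_heads: int = 4):
        super().__init__()
        self.hist_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.guide_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)  # assumed learned per-position mixing

    def forward(self, dec_states, history, guidance):
        h_ctx, _ = self.hist_attn(dec_states, history, history)
        g_ctx, _ = self.guide_attn(dec_states, guidance, guidance)
        g = torch.sigmoid(self.gate(torch.cat([h_ctx, g_ctx], dim=-1)))
        return g * h_ctx + (1 - g) * g_ctx


if __name__ == "__main__":
    B, T, F, G = 2, 8, 20, 12  # batch, utterances, audiovisual frames, guidance tokens
    text = torch.randn(B, T, D_MODEL)
    av = torch.randn(B, F, D_MODEL)
    guide = torch.randn(B, G, D_MODEL)
    history = CrossModalAttention()(text, av)          # emotion-aware fused history
    dec = torch.randn(B, 5, D_MODEL)                   # 5 decoding steps
    out = MultiSourceAttention()(dec, history, guide)  # context for empathetic decoding
    print(out.shape)  # torch.Size([2, 5, 256])
```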

Publication
IEEE Transactions on Affective Computing
Shangfei Wang
Professor of Artificial Intelligence

My research interests include Pattern Recognition, Affective Computing, Probabilistic Graphical Models, and Computational Intelligence.
