Large-scale text-to-image diffusion models excel at generating coherent images from text, making them valuable for content generation. However, while existing methods can preliminarily generate interactions, they fail to account for the emotions involved in human-object interaction, even though people often experience emotions when using or interacting with objects. In this work, we propose the Emotional Interaction Generation task: a novel image generation task that takes a text prompt, a human-object interaction (HOI) region, and an emotion to generate more emotionally expressive interaction images. Because no suitable dataset exists, we first construct a new emotional interaction dataset containing 117,871 training samples and 33,405 testing samples. We then propose an Emotional Interaction Diffusion Model. First, a tokenizer integrates the emotion embedding with the HOI information; the fused tokens are then injected into the latent-space denoising process through a Static-Emo-Interaction Gated Self-Attention layer and a Hierarchical Emotion-Visual Cross-Attention layer. In this way, the model can effectively learn the emotional aspects of interactions. Experimental results demonstrate superior performance over the interaction generation baseline in terms of image quality and HOI detection scores, and our method also outperforms EmoGen on emotional image generation.
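To make the injection mechanism concrete, the sketch below shows one plausible form of a gated self-attention layer that fuses emotion+HOI grounding tokens into visual latents, following the common zero-initialized-gate pattern used in grounded diffusion models. This is a minimal illustration under our own assumptions, not the authors' implementation; the class name, parameter names, and tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn

class EmoInteractionGatedSelfAttention(nn.Module):
    """Hypothetical sketch of a gated self-attention layer that injects
    fused emotion + HOI grounding tokens into visual latent tokens.
    A zero-initialized gate makes the layer a no-op at initialization,
    so the pretrained diffusion backbone is undisturbed before training."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Learnable scalar gate, zero at init (assumed design choice).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        # visual:    (B, N_v, dim) latent visual tokens
        # grounding: (B, N_g, dim) fused emotion + HOI tokens
        h = self.norm(torch.cat([visual, grounding], dim=1))
        attn_out, _ = self.attn(h, h, h)
        # Keep only the visual positions; add them back, scaled by the gate.
        return visual + torch.tanh(self.gate) * attn_out[:, : visual.size(1)]
```

In this kind of design, the visual tokens attend jointly over themselves and the grounding tokens, so emotional and spatial interaction cues can modulate the latent features, while the tanh-gated residual lets the strength of that modulation be learned from scratch.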