Most current methods for image emotion analysis suffer from the affective gap: features extracted directly from images are supervised by a single emotion label, which may not align with the emotions that users actually perceive. To address this limitation, this paper introduces a novel multi-stage perception approach inspired by the staged process of human emotion perception. The proposed approach comprises three perception modules: entity perception, attribute perception, and emotion perception. The entity perception module identifies the entities in an image, and the attribute perception module captures the attribute content associated with each entity. The emotion perception module then combines the entity and attribute information to extract emotion features. Pseudo-labels for entities and attributes are generated with image segmentation and vision-language models and provide auxiliary guidance for network learning. This progressive understanding of entities and attributes allows the network to hierarchically extract semantic-level features for emotion analysis. Comprehensive experiments on image emotion classification, regression, and distribution learning demonstrate the superior performance of our multi-stage perception network.
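To make the staged design concrete, the following is a minimal sketch of how three perception stages and pseudo-label supervision could fit together. It is an illustration under stated assumptions, not the network proposed in this paper: the class `MultiStageSketch`, the function `total_loss`, the single-layer stages, the layer sizes, the weight `aux_weight`, and the use of one entity and one attribute pseudo-label per image are all hypothetical simplifications.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length n) by weight matrix w (n rows, m columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class index."""
    return -math.log(probs[target] + 1e-12)


def init(n_in, n_out, rng):
    """Small random weight matrix with n_in rows and n_out columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]


class MultiStageSketch:
    """Hypothetical three-stage model: entity -> attribute -> emotion."""

    def __init__(self, d_img, d_hid, n_entities, n_attributes, n_emotions, seed=0):
        rng = random.Random(seed)
        self.w_entity = init(d_img, d_hid, rng)           # stage 1
        self.w_attr = init(d_img + d_hid, d_hid, rng)     # stage 2
        self.w_emotion = init(2 * d_hid, d_hid, rng)      # stage 3
        self.h_entity = init(d_hid, n_entities, rng)      # auxiliary head
        self.h_attr = init(d_hid, n_attributes, rng)      # auxiliary head
        self.h_emotion = init(d_hid, n_emotions, rng)     # main head

    def forward(self, img_feat):
        # Stage 1: entity features from the image features alone.
        f_ent = [math.tanh(v) for v in linear(img_feat, self.w_entity)]
        # Stage 2: attribute features conditioned on the entity features.
        f_att = [math.tanh(v) for v in linear(img_feat + f_ent, self.w_attr)]
        # Stage 3: emotion features from entity and attribute features combined.
        f_emo = [math.tanh(v) for v in linear(f_ent + f_att, self.w_emotion)]
        return {
            "entity": softmax(linear(f_ent, self.h_entity)),
            "attribute": softmax(linear(f_att, self.h_attr)),
            "emotion": softmax(linear(f_emo, self.h_emotion)),
        }


def total_loss(out, emotion_label, entity_pseudo, attr_pseudo, aux_weight=0.5):
    """Emotion loss plus weighted auxiliary losses on the pseudo-labels."""
    main = cross_entropy(out["emotion"], emotion_label)
    aux = cross_entropy(out["entity"], entity_pseudo)
    aux += cross_entropy(out["attribute"], attr_pseudo)
    return main + aux_weight * aux


model = MultiStageSketch(d_img=8, d_hid=4, n_entities=5, n_attributes=6, n_emotions=8)
out = model.forward([0.1] * 8)
loss = total_loss(out, emotion_label=2, entity_pseudo=1, attr_pseudo=3)
```

In this sketch the auxiliary heads exist only to receive supervision from the pseudo-labels, while the emotion head consumes the features built up by the two earlier stages. A classification loss is used for the emotion head; regression or distribution learning would require a different main loss.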