Expression descriptions provide additional information about facial behaviors regardless of pose, and pose features help a model adapt to pose variation, yet neither has been fully leveraged in facial expression recognition. This paper proposes a pose-aware, text-assisted facial expression recognition method based on cross-modality attention. The method contains three components. The pose feature extractor extracts pose-related features from facial images and works with a fully-connected layer to perform pose classification. When poses can be clearly discriminated and classified, the features obtained from the extractor can represent the corresponding poses. To eliminate bias due to appearance and illumination, cluster centers are taken as the final pose features. The text feature extractor obtains embeddings from expression descriptions. These descriptions are first passed through Intra-Exp attention to obtain preliminary embeddings. To leverage the correlations among expressions, all expression embeddings are then concatenated and passed through Inter-Exp attention. The cross-modality module learns attention maps that weight the importance of facial regions using prior knowledge about poses and expression descriptions. The image features weighted by these attention maps are used to recognize pose and expression jointly. Experiments on three benchmark datasets demonstrate the superiority of the proposed method.
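
To make the data flow concrete, the following is a minimal NumPy sketch of two ideas mentioned above: taking per-pose centers as pose features, and using a prior vector (a pose center or an expression embedding) to weight facial regions. It is an illustration under stated assumptions, not the architecture of the proposed method. The function names `pose_centers` and `cross_modal_attention`, the projection matrices `w_q` and `w_k`, all dimensions, the use of class means as a stand-in for cluster centers, and the choice of scaled dot-product attention are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pose_centers(features, pose_labels, num_poses):
    """Mean feature per pose class (assumed stand-in for cluster centers).

    features: (N, P) pose features; pose_labels: (N,) integer pose ids.
    Returns an array of shape (num_poses, P).
    """
    return np.stack([features[pose_labels == p].mean(axis=0)
                     for p in range(num_poses)])

def cross_modal_attention(regions, prior, w_q, w_k):
    """Attend over image regions using one prior vector as the query.

    regions: (R, D) region features; prior: (P,) pose or text embedding.
    w_q: (P, H) and w_k: (D, H) are hypothetical learned projections.
    Returns attention weights (R,) and the attention-weighted feature (D,).
    """
    query = prior @ w_q                              # (H,)
    keys = regions @ w_k                             # (R, H)
    scores = keys @ query / np.sqrt(keys.shape[1])   # (R,)
    attn = softmax(scores)
    return attn, attn @ regions

# Toy data: a 7x7 grid of region features and 20 pose feature vectors.
rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 64))
pose_feats = rng.normal(size=(20, 32))
pose_labels = np.arange(20) % 3          # three hypothetical pose classes
centers = pose_centers(pose_feats, pose_labels, num_poses=3)

w_q = rng.normal(size=(32, 16))
w_k = rng.normal(size=(64, 16))
attn, pooled = cross_modal_attention(regions, centers[0], w_q, w_k)
print(attn.shape, pooled.shape)
```

In this sketch the same `cross_modal_attention` call could take an expression embedding instead of a pose center as `prior`, provided `w_q` matches its dimension; how the two priors are actually combined is not specified here.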