We propose the Unified Transformer for Facial Reaction GeneratioN (UniFaRN) framework for facial reaction prediction in dyadic interactions. Given the video and audio of one participant, the task is to generate the facial reactions of the other. The challenge lies in fusing multi-modal inputs and in balancing the appropriateness and diversity of the generated reactions. We adopt the Transformer architecture to tackle this challenge, leveraging its flexibility in handling multi-modal data and its ability to control the generation process. By capturing the correlations between multi-modal inputs and outputs with unified layers, and balancing appropriateness against diversity with sampling methods, we won first place in the REACT2023 challenge. Code is available at https://github.com/lc150303/REACT23_Challenge
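To make the two ideas in the abstract concrete, the following is a minimal sketch (not the authors' released code) of how unified Transformer layers can fuse audio and video tokens in one shared stack, and how a temperature parameter at sampling time can trade off appropriateness (low temperature) against diversity (high temperature). All module names, feature dimensions, and the discretized reaction-token output are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class UnifiedReactionModel(nn.Module):
    """Hypothetical unified Transformer: one encoder stack shared by both modalities."""

    def __init__(self, d_model=256, n_heads=4, n_layers=4, n_reaction_tokens=512):
        super().__init__()
        # Project each modality into a shared embedding space (dims are assumptions).
        self.audio_proj = nn.Linear(128, d_model)   # e.g. mel-spectrogram frames
        self.video_proj = nn.Linear(136, d_model)   # e.g. facial landmark features
        # "Unified" layers: the same self-attention stack sees both modalities.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Head predicts a distribution over discretized reaction tokens (an assumption).
        self.head = nn.Linear(d_model, n_reaction_tokens)

    def forward(self, audio_feats, video_feats):
        # Concatenate modalities along time so self-attention fuses them jointly.
        tokens = torch.cat(
            [self.audio_proj(audio_feats), self.video_proj(video_feats)], dim=1
        )
        fused = self.encoder(tokens)
        return self.head(fused)  # (batch, time, n_reaction_tokens) logits


def sample_reactions(logits, temperature=1.0):
    """Temperature sampling: lower T favors appropriateness, higher T favors diversity."""
    probs = torch.softmax(logits / temperature, dim=-1)
    flat = torch.multinomial(probs.flatten(0, 1), num_samples=1)
    return flat.view(logits.shape[:2])


if __name__ == "__main__":
    model = UnifiedReactionModel()
    audio = torch.randn(2, 50, 128)   # (batch, frames, audio feature dim)
    video = torch.randn(2, 50, 136)   # (batch, frames, video feature dim)
    logits = model(audio, video)
    print(sample_reactions(logits, temperature=0.8).shape)  # torch.Size([2, 100])
```

The single shared encoder is what "unified layers" suggests: rather than fusing modalities with separate per-modality encoders and a late merge, every attention layer can attend across both streams at once, while the sampling temperature exposes a single knob for the appropriateness/diversity trade-off.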