In this paper, we propose a multi-modal deep network to predict the listener’s empathy during a conversation between two people. First, we use a bottleneck residual network proposed in prior work to learn visual representations from facial images, and adopt a fully connected network to extract audio features from the listener’s speech. Second, we propose to use the current time stage as a temporal feature and fuse it with the learned visual and audio representations. A neural network regressor then predicts the empathy level. We further select a representative subset of the training data to train the proposed multi-modal deep network. Experimental results on the One-Minute Empathy Prediction dataset demonstrate the effectiveness of the proposed method.
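The sketch below illustrates the described fusion architecture: a bottleneck residual visual branch, a fully connected audio branch, the current time stage appended as a scalar temporal feature, and a regression head over the concatenation. It is only a minimal PyTorch illustration under assumptions not stated in the paper (ResNet-50 as the bottleneck residual network, a 128-d audio feature vector, hidden size 256, and a normalized scalar time stage), not the authors' exact implementation.

```python
# Minimal sketch of the multi-modal empathy regressor described above.
# All layer sizes and the choice of ResNet-50 are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models


class EmpathyRegressor(nn.Module):
    def __init__(self, audio_dim=128, hidden_dim=256):
        super().__init__()
        # Visual branch: ResNet-50 (bottleneck residual blocks) without its
        # classification head, yielding a 2048-d embedding per face image.
        resnet = models.resnet50(weights=None)
        self.visual = nn.Sequential(*list(resnet.children())[:-1])
        # Audio branch: fully connected network over precomputed audio features.
        self.audio = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Regression head over concatenated visual + audio + temporal features.
        self.regressor = nn.Sequential(
            nn.Linear(2048 + hidden_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, face_img, audio_feat, time_stage):
        v = self.visual(face_img).flatten(1)     # (B, 2048) visual embedding
        a = self.audio(audio_feat)               # (B, hidden_dim) audio embedding
        t = time_stage.view(-1, 1)               # (B, 1) scalar time stage
        fused = torch.cat([v, a, t], dim=1)      # late fusion by concatenation
        return self.regressor(fused).squeeze(1)  # predicted empathy level


# Hypothetical usage with random tensors standing in for real data:
model = EmpathyRegressor()
faces = torch.randn(4, 3, 224, 224)   # listener face crops
audio = torch.randn(4, 128)           # listener speech features
time_stage = torch.rand(4)            # normalized position in the conversation
print(model(faces, audio, time_stage).shape)  # torch.Size([4])
```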