Recently, convolutional neural network (CNN) based stereo image quality assessment (SIQA) has been extensively researched and has achieved impressive performance. However, most SIQA methods mine features only from the distorted stereo images, neglecting the valuable features present in other image domains. Moreover, simple fusion strategies such as addition and concatenation for binocular fusion further limit the network's prediction performance. Therefore, in this paper we design a cross-domain feature interaction network (CDFINet) for SIQA, which exploits the complementarity between features from different domains and realizes binocular fusion between the left and right monocular features based on difference information. Specifically, to boost prediction ability, we design a dual-branch network with image and gradient feature extraction branches, extracting hierarchical features from both domains. Moreover, to explore more appropriate binocular information, we propose a difference information guidance based binocular fusion (DIGBF) module. Furthermore, to better achieve information compensation between the image and gradient domains, the binocular features obtained from the two domains are fused in the proposed cross-domain feature fusion (CDFF) module. In addition, inspired by the feedback mechanism of the visual cortex, in which higher-level signals are propagated back to lower-level regions, the proposed cross-layer feature interaction (CLFI) module realizes the guidance of higher-level features to lower-level features. Finally, to obtain the perceptual quality more effectively, a hierarchical multi-score quality aggregation method is proposed. Experimental results on four SIQA databases show that our CDFINet outperforms the compared mainstream metrics.