Current light-field salient object detection methods have difficulty accurately distinguishing salient objects from complex backgrounds. In this paper, we argue that this problem can be mitigated by optimizing feature fusion and enlarging the receptive field, and thus propose a novel transformer embedding network named TENet. The main idea of the network is to (1) selectively aggregate multiple features for fuller feature fusion and (2) integrate the Transformer for a larger receptive field, so as to accurately identify salient objects. For the former, a multi-modal feature fusion module (MMFF) is first designed to mine the different contributions of multi-modal features (i.e., all-in-focus image features and focal stack features). Then, a multi-level feature fusion module (MLFF) is developed to iteratively select and fuse complementary cues from multi-level features. For the latter, we integrate the Transformer for the first time and propose a transformer-based feature enhancement module (TFE) to provide a wider receptive field for each pixel of the high-level features. To validate our idea, we comprehensively evaluate the performance of TENet on three challenging datasets. Experimental results show that our method outperforms state-of-the-art methods, e.g., improving the MAE metric by 28.1%, 20.3%, and 14.9% on the three datasets, respectively.
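To make the described pipeline concrete, below is a minimal PyTorch sketch of how the three modules named in the abstract might be composed. The class names MMFF, TFE, and MLFF come from the paper, but every signature, channel size, and internal design here is an illustrative assumption rather than the authors' implementation.

```python
# Hypothetical sketch of the TENet pipeline from the abstract; module names
# (MMFF, TFE, MLFF) come from the paper, but all internals are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MMFF(nn.Module):
    """Multi-modal feature fusion: weighs all-in-focus vs. focal-stack features."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, rgb_feat, focal_feat):
        # Per-pixel weight deciding how much each modality contributes.
        w = self.gate(torch.cat([rgb_feat, focal_feat], dim=1))
        return self.fuse(w * rgb_feat + (1 - w) * focal_feat)


class TFE(nn.Module):
    """Transformer-based feature enhancement: global self-attention over high-level features."""
    def __init__(self, channels, num_layers=2, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, feat):
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)        # (B, H*W, C)
        tokens = self.encoder(tokens)                   # each pixel attends to all others
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class MLFF(nn.Module):
    """Multi-level feature fusion: top-down selection and fusion of complementary cues."""
    def __init__(self, channels):
        super().__init__()
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)
        self.head = nn.Conv2d(channels, 1, 1)

    def forward(self, feats):
        # feats are ordered from low level (high resolution) to high level (low resolution).
        x = feats[-1]
        for f in reversed(feats[:-1]):
            x = F.interpolate(x, size=f.shape[-2:], mode="bilinear", align_corners=False)
            x = self.refine(x + f)
        return self.head(x)  # single-channel saliency map
```

Under these assumptions, a forward pass would fuse the two modalities per level with MMFF, enhance the highest-level fused feature with TFE, and decode a saliency map with MLFF.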