Recently, deep learning-based methods for image matting have incorporated additional modules and complex network structures to capture more comprehensive image information and thereby achieve higher accuracy. However, these additions inevitably slow inference and increase computational resource consumption. In this paper, we propose a Transformer-based unified fusion network for image matting, denoted FormerUnify, which achieves a better balance between accuracy and efficiency than existing methods. FormerUnify is built upon the classic encoder-decoder framework, and its centerpiece is the Unified Fusion Decoder. This decoder consists of three essential layers, a unify layer, a fusion layer, and an upsampling prediction head, which work in concert to effectively unify and fuse the rich multi-scale features extracted by the encoder. Furthermore, we couple the Unified Fusion Decoder with an advanced Transformer-based encoder and optimize their integration to improve compatibility and performance. Experimental evaluations on two synthetic datasets (Composition-1K and Distinctions-646) and a real-world dataset (AIM-500) confirm that FormerUnify achieves fast inference without compromising its superior accuracy.
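To make the described decoder structure concrete, the following is a minimal PyTorch-style sketch of an encoder-decoder layout with a unify layer, a fusion layer, and an upsampling prediction head operating on multi-scale encoder features. The class names, channel widths, number of scales, and fusion scheme are illustrative assumptions for exposition, not the authors' exact implementation.

```python
# Minimal sketch of a unified fusion decoder over multi-scale encoder features.
# All module names, channel sizes, and the coarse-to-fine fusion rule below are
# hypothetical assumptions, not the paper's actual architecture details.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifyLayer(nn.Module):
    """Project each multi-scale encoder feature to a common channel width."""
    def __init__(self, in_channels, unified_channels=256):
        super().__init__()
        self.projections = nn.ModuleList(
            nn.Conv2d(c, unified_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, features):  # list of [B, C_i, H_i, W_i], finest first
        return [proj(f) for proj, f in zip(self.projections, features)]


class FusionLayer(nn.Module):
    """Fuse unified features coarse-to-fine by upsampling and adding."""
    def __init__(self, unified_channels=256):
        super().__init__()
        self.refine = nn.Conv2d(unified_channels, unified_channels, 3, padding=1)

    def forward(self, unified):  # coarsest feature last in the list
        fused = unified[-1]
        for f in reversed(unified[:-1]):
            fused = F.interpolate(fused, size=f.shape[-2:],
                                  mode="bilinear", align_corners=False)
            fused = self.refine(fused + f)
        return fused


class UpsamplePredictionHead(nn.Module):
    """Upsample the fused feature to input resolution and predict the alpha matte."""
    def __init__(self, unified_channels=256):
        super().__init__()
        self.predict = nn.Conv2d(unified_channels, 1, kernel_size=3, padding=1)

    def forward(self, fused, out_size):
        x = F.interpolate(fused, size=out_size,
                          mode="bilinear", align_corners=False)
        return torch.sigmoid(self.predict(x))  # alpha values in [0, 1]


class UnifiedFusionDecoder(nn.Module):
    """Unify, fuse, and upsample multi-scale features from a Transformer encoder."""
    def __init__(self, encoder_channels=(96, 192, 384, 768), unified_channels=256):
        super().__init__()
        self.unify = UnifyLayer(encoder_channels, unified_channels)
        self.fusion = FusionLayer(unified_channels)
        self.head = UpsamplePredictionHead(unified_channels)

    def forward(self, encoder_features, out_size):
        return self.head(self.fusion(self.unify(encoder_features)), out_size)
```

Under these assumptions, the decoder takes the list of features produced by the Transformer encoder at several resolutions and returns a single-channel alpha matte at the original image size.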