In recent years, multi-human parsing has become a focal point of research, yet prevailing methods often rely on intermediate stages and lack pixel-level analysis. Moreover, their high computational demands limit real-world efficiency. To address these challenges and enable real-time performance, a low-latency end-to-end network, WNet, is proposed. This approach combines a vision transformer and a convolutional neural network in a dual-encoder structure, featuring a lightweight Transformer-based vision encoder and a Darknet-based convolutional encoder. This combination adeptly captures both long-range dependencies and spatial relationships. A fuse block enables the seamless merging of features from the two encoders, and residual connections in the decoder design amplify information flow. Experimental validation on the Crowd Instance-level Human Parsing (CIHP) and Look Into Person (LIP) datasets showcases WNet's effectiveness, achieving high-speed multi-human parsing at 26.7 frames per second. Ablation studies further underscore WNet's capabilities, emphasizing its efficiency and accuracy in complex multi-human parsing tasks.
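The abstract describes two distinctive components: a fuse block that merges the transformer and convolutional feature streams, and decoder stages with residual connections. Below is a minimal PyTorch sketch of how such components might look. The paper's actual layer configuration is not given here, so the module names (`FuseBlock`, `ResidualDecoderBlock`), channel widths, and the concatenate-then-convolve fusion strategy are illustrative assumptions, not WNet's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FuseBlock(nn.Module):
    """Hypothetical fusion of transformer and CNN feature maps."""

    def __init__(self, vit_ch, cnn_ch, out_ch):
        super().__init__()
        # 1x1 convolutions project both streams to a common width
        self.proj_vit = nn.Conv2d(vit_ch, out_ch, kernel_size=1)
        self.proj_cnn = nn.Conv2d(cnn_ch, out_ch, kernel_size=1)
        self.mix = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_vit, f_cnn):
        # Resize the transformer features to the CNN feature resolution,
        # then concatenate and mix the two streams
        f_vit = F.interpolate(self.proj_vit(f_vit), size=f_cnn.shape[-2:],
                              mode="bilinear", align_corners=False)
        return self.mix(torch.cat([f_vit, self.proj_cnn(f_cnn)], dim=1))


class ResidualDecoderBlock(nn.Module):
    """Hypothetical upsampling decoder stage with a residual connection."""

    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)
        # Residual sum keeps a direct path for gradients and features,
        # the "amplified information flow" the abstract refers to
        return torch.relu(x + self.conv(x))


if __name__ == "__main__":
    fuse = FuseBlock(vit_ch=192, cnn_ch=256, out_ch=128)
    fused = fuse(torch.randn(1, 192, 14, 14), torch.randn(1, 256, 28, 28))
    out = ResidualDecoderBlock(128)(fused)
    print(fused.shape, out.shape)  # (1, 128, 28, 28) (1, 128, 56, 56)
```

The key design point this sketch captures is that the transformer branch, which typically runs at a coarser resolution, must be upsampled and projected before fusion, while the residual decoder preserves an identity path so upsampling stages do not attenuate encoder information.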