The Gauss-Seidel algorithm is widely used in parallel computing as an iterative solver for linear systems. However, it is challenging to exploit its fine-grained parallelism on emerging heterogeneous many-core architectures. For unstructured mesh problems, the absence of geometric information in the matrices makes classical geometry-based parallel strategies fail. Instead, a block Gauss-Seidel/Jacobi algorithm for heterogeneous many-core architectures based on an algebraic parallel strategy is proposed, which is used as the subdomain solver (preconditioner) in the DDM (Domain Decomposition Method). To balance parallel and numerical efficiency, the new algorithm combines the inherent high parallelism of the Jacobi method with the good convergence rate of the Gauss-Seidel method. The block Gauss-Seidel/Jacobi algorithm can be regarded as small Gauss-Seidel iterations at the intra-thread level and a global Jacobi iteration at the inter-thread level, which yields scalable parallelism with minimal inter-thread communication. Moreover, the unknowns are ordered in a node-by-node manner, so that all N unknowns belonging to a mesh node are coupled to form an N×N block. Such an ordering of the unknowns does not degrade numerical performance, and it improves the convergence and scalability of our solver. For problems governed by the incompressible Navier-Stokes equations, the subdomain solver takes a 4×4 node block (three velocity components and the pressure per node) as the minimum unit. The Sunway TaihuLight supercomputer, which consists of 40960 homegrown SW26010 many-core processors with more than 10 million cores in total, is based on exactly such a heterogeneous many-core architecture. In this paper, the block Gauss-Seidel/Jacobi algorithm is implemented and optimized on the SW26010 many-core architecture. Since communication is known to be the bottleneck of heterogeneous many-core architectures, a series of low-communication-complexity optimization strategies is designed to achieve better performance. To reduce the communication cost, we designed implementation-level optimizations such as packed copying of multiple rows and computation-communication overlapping, using the small but fast LDM (Local Data Memory) of the many-core processor. Moreover, a copy-reduced variant of the block Gauss-Seidel/Jacobi algorithm is proposed to alleviate the memory bandwidth bottleneck. Unlike the optimization strategies based on exact computation, the copy-reduced version yields an inexact preconditioner by neglecting a portion of the data that would otherwise need to be communicated and synchronized. Omitting the less important data hardly affects the numerical efficiency while achieving much better parallel efficiency. In this work, we adopt the RAS (Restricted Additive Schwarz) method for inter-node parallelism and the block Gauss-Seidel/Jacobi method for inter- and intra-thread parallelism. Aerodynamic simulations of a high-speed train model and a car model on unstructured meshes were carried out on the Sunway TaihuLight. Numerical results show that the proposed block Gauss-Seidel/Jacobi algorithm delivers up to a 4.16x speedup over the sequential version and achieves 61% parallel efficiency as the number of processors increases from 1040 to 33280.
Moreover, the proposed algorithms are not limited to use as subdomain solvers in DDM; they can also serve as iterative solvers for linear systems and as smoothers in multigrid methods. © 2019, Science Press. All rights reserved.
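To make the hybrid iteration concrete, the following is a minimal C sketch (not the authors' SW26010 implementation) of one block Gauss-Seidel/Jacobi sweep. It assumes a dense matrix with scalar unknowns partitioned into contiguous per-thread chunks; the function name bgsj_sweep and the chunk-based partitioning are illustrative only, and the actual solver works on a sparse matrix ordered into 4×4 node blocks with the chunks processed concurrently by the many-core threads.

```c
/* Minimal sketch of one hybrid block Gauss-Seidel/Jacobi sweep:
 * Gauss-Seidel inside each per-thread chunk, Jacobi across chunks.
 * A is a dense n x n matrix in row-major order; x_old holds the
 * previous iterate and x_new receives the updated iterate. */
#include <stddef.h>

void bgsj_sweep(size_t n, const double *A, const double *b,
                const double *x_old, double *x_new, size_t nchunks)
{
    size_t chunk = (n + nchunks - 1) / nchunks;   /* unknowns per "thread" */
    for (size_t c = 0; c < nchunks; ++c) {        /* chunks do not exchange fresh
                                                     values: Jacobi between them */
        size_t lo = c * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        for (size_t i = lo; i < hi; ++i) {
            double s = b[i];
            for (size_t j = 0; j < n; ++j) {
                if (j == i)
                    continue;
                /* Gauss-Seidel inside the chunk (rows already updated in
                 * this sweep), Jacobi elsewhere (previous iterate). */
                double xj = (j >= lo && j < i) ? x_new[j] : x_old[j];
                s -= A[i * n + j] * xj;
            }
            x_new[i] = s / A[i * n + i];          /* assumes nonzero diagonal */
        }
    }
}
```

In this sketch the only data a chunk reads from other chunks is x_old, which is fixed during the sweep; this is the property that keeps inter-thread communication low and makes the method attractive on the SW26010's LDM-based thread architecture.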