The National Institute of Standards and Technology (NIST) approved XMSS as part of the post-quantum cryptography (PQC) development effort in 2018. XMSS is currently one of only two standardized PQC algorithms, but its performance limits its use. For example, for some standardized parameter sets, even the fastest reported implementation takes more than a minute to generate a keypair. In this article, we present the first GPU implementation of XMSS and its variant XMSS^MT. The high parallelism of GPUs is especially effective for reducing key generation latency and for improving signing and verification throughput. To cover a range of application scenarios, we provide three parallel XMSS schemes: algorithmic parallelism, multi-keypair data parallelism, and single-keypair data parallelism. For these schemes, we design custom parallel strategies that use more than 10,000 cores for all parameter sets provided by NIST. In addition, we analyze whether most previous serial optimizations remain applicable and explore numerous techniques to fully exploit GPU performance. Our evaluation uses the XMSSMT-SHA2_20/2_256 parameter set on a GeForce RTX 3090. The results show a key generation latency of 3.20 ms, a speedup of 21,899x over the GPU ported version and 54x over the fastest previous work (174 ms). When 16,384 tasks are executed, the signing/verifying throughput (tasks/s) is 311,424/415,100 in the single-key case and 145,100/419,887 in the multi-key case. Compared with the signing/verifying throughput of the fastest previous work (1,695/4,000), this is a speedup of 184x/104x in the single-key case and 86x/105x in the multi-key case.
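The reported speedup factors can be re-derived from the raw latency and throughput figures quoted above. The following is a minimal sketch of that arithmetic, not part of the implementation itself. The constant and function names are hypothetical, and the latency of the GPU ported version is not stated in the text, so the value computed here is only what the 21,899x factor implies.

```python
# Figures quoted in the abstract (XMSSMT-SHA2_20/2_256 on a GeForce RTX 3090).
# All names below are illustrative; only the numbers come from the text.
KEYGEN_MS = 3.20                 # key generation latency of this work
BASELINE_KEYGEN_MS = 174.0       # fastest previous work
PORTED_SPEEDUP = 21_899          # reported speedup over the GPU ported version

BASELINE_TPUT = {"sign": 1_695, "verify": 4_000}   # tasks/s, fastest previous work
OURS_TPUT = {                                       # tasks/s with 16,384 tasks
    "single-key": {"sign": 311_424, "verify": 415_100},
    "multi-key": {"sign": 145_100, "verify": 419_887},
}


def reported_speedups():
    """Return the speedup factors, rounded to the nearest integer."""
    result = {"keygen": round(BASELINE_KEYGEN_MS / KEYGEN_MS)}
    for case, tput in OURS_TPUT.items():
        for op, value in tput.items():
            result[f"{case}/{op}"] = round(value / BASELINE_TPUT[op])
    return result


def implied_ported_latency_s():
    """Latency of the GPU ported version implied by the 21,899x factor.

    Assumption: this value is derived here, not stated in the source.
    """
    return KEYGEN_MS * PORTED_SPEEDUP / 1000.0


if __name__ == "__main__":
    for name, factor in reported_speedups().items():
        print(f"{name}: {factor}x")
    print(f"implied ported-version keygen latency: {implied_ported_latency_s():.1f} s")
```

The rounded ratios match the factors quoted in the text (54x, 184x/104x, 86x/105x), and the implied latency of about 70 s for the ported version is consistent with the statement that key generation can take more than a minute.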