MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures

Cited by: 0
Authors
Zhuang, Donglin [1 ]
Zheng, Zhen [2 ]
Xia, Haojun [1 ]
Qiu, Xiafei [2 ]
Bai, Junjie [2 ]
Lin, Wei [2 ]
Song, Shuaiwen Leon [1 ]
Affiliations
[1] University of Sydney, Camperdown, NSW, Australia
[2] Alibaba Group, Hangzhou, China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
In this work, we reveal that the kernel-by-kernel execution scheme in existing machine learning optimizing compilers is no longer effective at fully utilizing the hardware resources provided by advances in modern GPU architectures. Specifically, such a scheme suffers from severe non-computation overhead and off-chip memory traffic, greatly attenuating the optimization efforts of state-of-the-art compiler techniques on newer generations of GPUs. To address this emerging challenge, we propose MonoNN, the first machine learning optimizing compiler that enables a new monolithic design and optimization space for common static neural network (NN) inference tasks on a single GPU. MonoNN accommodates an entire neural network in a single GPU kernel, drastically reducing non-computation overhead and opening further fine-grained optimization opportunities in the newly formed monolithic optimization space. Most importantly, MonoNN identifies the resource incompatibility issue between various NN operators as the key design bottleneck for creating such a monolithic optimization space. MonoNN tackles this by systematically exploring and exploiting a parallelism compensation strategy and resource trade-offs across different types of NN computations, and by proposing a novel schedule-independent group tuning technique that significantly shrinks the extremely large tuning space. Finally, MonoNN provides a compiler implementation that incorporates the proposed optimizations and automatically generates highly efficient kernel code. Extensive evaluation on a set of popular production inference tasks demonstrates that MonoNN achieves an average speedup of 2.01x over state-of-the-art frameworks and compilers. Specifically, MonoNN outperforms TVM, TensorRT, XLA, and AStitch by up to 7.3x, 5.9x, 1.7x, and 2.9x in end-to-end inference performance, respectively. MonoNN source code is publicly available at https://github.com/AlibabaResearch/mononn.
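
To make the monolithic idea concrete, below is a minimal CUDA sketch of the underlying fusion principle at toy scale, not MonoNN's actual generated code: a matrix-vector product, a bias add, and a ReLU that would ordinarily be three kernel launches are collapsed into one, so the intermediate results stay in registers and never round-trip through off-chip memory. The kernel name fused_gemv_bias_relu and all problem sizes are illustrative assumptions.

#include <cuda_runtime.h>
#include <cstdio>

// Toy fused kernel for y = relu(W * x + b); one thread per output row.
__global__ void fused_gemv_bias_relu(const float* W, const float* x,
                                     const float* b, float* y,
                                     int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c)
        acc += W[r * cols + c] * x[c];  // matrix-vector product
    acc += b[r];                        // bias add, fused in registers
    y[r] = acc > 0.0f ? acc : 0.0f;     // ReLU, fused in registers
}

int main() {
    const int rows = 1024, cols = 1024;
    float *W, *x, *b, *y;
    cudaMallocManaged(&W, (size_t)rows * cols * sizeof(float));
    cudaMallocManaged(&x, cols * sizeof(float));
    cudaMallocManaged(&b, rows * sizeof(float));
    cudaMallocManaged(&y, rows * sizeof(float));
    for (int i = 0; i < rows * cols; ++i) W[i] = 0.001f;
    for (int i = 0; i < cols; ++i)        x[i] = 1.0f;
    for (int i = 0; i < rows; ++i)        b[i] = -0.5f;

    // One launch replaces three (GEMV, bias, ReLU): intermediates never
    // touch off-chip memory, and launch overhead is paid once, not thrice.
    fused_gemv_bias_relu<<<(rows + 255) / 256, 256>>>(W, x, b, y, rows, cols);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);  // approximately 1024 * 0.001 - 0.5 = 0.524
    return 0;
}

Compile with, for example, nvcc fused.cu -o fused. At full-network scale, fusing everything into one kernel additionally requires balancing register and shared-memory pressure between compute-intensive and memory-intensive operators, which is precisely the resource incompatibility issue the abstract describes MonoNN addressing.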
Pages: 989-1005
Page count: 17
Related Papers
1 item in total
  • [1] Zheng, Zhen; Yang, Xuanda; Zhao, Pengzhan; Long, Guoping; Zhu, Kai; Zhu, Feiwen; Zhao, Wenyi; Liu, Xiaoyong; Yang, Jun; Zhai, Jidong; Song, Shuaiwen Leon; Lin, Wei. AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures. ASPLOS '22: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2022: 359-373.