INSTalytics: Cluster Filesystem Co-design for Big-data Analytics

Authors
Sivathanu, Muthian [1]
Vuppalapati, Midhul [1]
Gulavani, Bhargav S. [1]
Rajan, Kaushik [1]
Leeka, Jyoti [1]
Mohan, Jayashree [1, 2]
Kedia, Piyus [1, 3]
Affiliations
[1] Microsoft Res India, Bengaluru, Karnataka, India
[2] Univ Texas Austin, Austin, TX 78712 USA
[3] IIIT Delhi, New Delhi, India
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We present the design, implementation, and evaluation of INSTalytics, a co-designed stack of a cluster file system and the compute layer, for efficient big-data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems: instead of traditional partitioning on one dimension, INSTalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, so that a larger fraction of queries can benefit from partition filtering and from joins without network shuffle. To achieve this, INSTalytics uses compute-awareness to customize the 3-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves the performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via co-ordinated request scheduling and selective caching at the storage nodes. We have built a prototype implementation of INSTalytics in a production analytics stack, and show that recovery performance and availability are similar to those of physical replication, while query performance improves significantly, suggesting a new approach to designing cloud-scale big-data analytics systems.
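The idea of partitioning each replica on a different dimension can be illustrated with a minimal sketch. It is not the INSTalytics implementation: the function names (`build_replica`, `build_replicas`, `filtered_scan`), the in-memory dictionaries, and the sample rows are all hypothetical, and it uses one partitioning column per replica across three replicas rather than the four-dimension layout the abstract describes. It only shows why a filter on any partitioned column can read a single partition instead of the whole dataset.

```python
from collections import defaultdict


def build_replica(rows, column):
    """Group rows into partitions keyed by the value of `column`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def build_replicas(rows, columns):
    """Build one logical replica per column; every replica holds all rows."""
    return {column: build_replica(rows, column) for column in columns}


def filtered_scan(replicas, column, value):
    """Return (matching rows, number of rows read) for `column == value`."""
    if column in replicas:
        # Partition filtering: read only the partition holding `value`.
        rows_read = replicas[column].get(value, [])
    else:
        # No replica is partitioned on this column: scan one full replica.
        any_replica = next(iter(replicas.values()))
        rows_read = [r for part in any_replica.values() for r in part]
    matches = [r for r in rows_read if r[column] == value]
    return matches, len(rows_read)


# Hypothetical sample data, for illustration only.
rows = [
    {"user": "a", "region": "eu", "day": 1, "bytes": 10},
    {"user": "b", "region": "us", "day": 1, "bytes": 20},
    {"user": "a", "region": "us", "day": 2, "bytes": 30},
    {"user": "c", "region": "eu", "day": 2, "bytes": 40},
]
replicas = build_replicas(rows, ["user", "region", "day"])
```

In this sketch, a filter on `user`, `region`, or `day` touches only one partition of the matching replica, while a filter on `bytes` falls back to a full scan. The storage cost equals that of three identical copies, since each replica still contains every row.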
Pages: 235-248
Number of pages: 14