INSTalytics: Cluster Filesystem Co-design for Big-data Analytics

Cited by: 0

Authors:

Sivathanu, Muthian [1]

Vuppalapati, Midhul [1]

Gulavani, Bhargav S. [1]

Rajan, Kaushik [1]

Leeka, Jyoti [1]

Mohan, Jayashree [1,2]

Kedia, Piyus [1,3]

Affiliations:

[1] Microsoft Res India, Bengaluru, Karnataka, India

[2] Univ Texas Austin, Austin, TX 78712 USA

[3] IIIT Delhi, New Delhi, India

Source:

PROCEEDINGS OF THE 17TH USENIX CONFERENCE ON FILE AND STORAGE TECHNOLOGIES | 2019

DOI:

Not available

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

We present the design, implementation, and evaluation of INSTalytics, a co-designed stack of a cluster file system and the compute layer, for efficient big-data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems; instead of traditional partitioning on one dimension, INSTalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and joins without network shuffle. To achieve this, INSTalytics uses compute-awareness to customize the 3-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves the performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via coordinated request scheduling and selective caching at the storage nodes. We have built a prototype implementation of INSTalytics in a production analytics stack, and show that recovery performance and availability are similar to physical replication, while providing significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.
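The partition-filtering idea in the abstract can be illustrated with a toy sketch. This is not the INSTalytics implementation: it is a minimal in-memory model, assuming each replica holds the same rows but is partitioned on a different column, so a filter on any of those columns can read a single partition instead of scanning everything. The function names (`partition_by`, `build_replicas`, `filtered_read`), the sample rows, and the use of one column per replica are all hypothetical; the sketch does not model how the paper obtains four dimensions from 3-way replication, nor recovery or the sliced-read API.

```python
from collections import defaultdict


def partition_by(rows, column):
    """Group rows into partitions keyed by the value of `column`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[column]].append(row)
    return dict(parts)


def build_replicas(rows, columns):
    """Build one logical replica per partitioning column (hypothetical layout).

    Every replica contains all rows; only the partitioning differs.
    """
    return {col: partition_by(rows, col) for col in columns}


def filtered_read(replicas, column, value):
    """Return (matching_rows, rows_scanned) for an equality filter.

    If some replica is partitioned on `column`, only that partition is read.
    Otherwise fall back to a full scan of an arbitrary replica.
    """
    if column in replicas:
        part = replicas[column].get(value, [])
        return list(part), len(part)
    any_replica = next(iter(replicas.values()))
    all_rows = [r for part in any_replica.values() for r in part]
    return [r for r in all_rows if r[column] == value], len(all_rows)


rows = [
    {"user": "a", "region": "eu", "day": 1, "device": "phone"},
    {"user": "b", "region": "us", "day": 1, "device": "laptop"},
    {"user": "a", "region": "us", "day": 2, "device": "phone"},
    {"user": "c", "region": "eu", "day": 2, "device": "laptop"},
]

# Three replicas, each partitioned on a different column (an assumption
# made for illustration only).
replicas = build_replicas(rows, ["user", "region", "day"])

pruned_rows, pruned_scanned = filtered_read(replicas, "region", "eu")
full_rows, full_scanned = filtered_read(replicas, "device", "phone")
```

In this toy model, a filter on `region` reads only the rows of the matching partition, while a filter on `device`, which no replica is partitioned on, has to scan every row. That difference in rows scanned is the benefit the abstract attributes to partitioning on more dimensions.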


Pages: 235-248

Page count: 14
