Improving the Performance of Data Mining by Using Big Data in Cloud Environment

Cited by: 5
Authors
Dahmani, Djilali [1]
Rahal, Sid Ahmed [1]
Belalem, Ghalem [2]
Affiliations
[1] Univ Sci & Technol Mohammed Boudiaf USTO, Dept Math & Comp Sci, Oran, Algeria
[2] Univ Oran 1, Dept Comp Sci, Fac Exact & Appl Sci, Oran, Algeria
Keywords
Big data; data mining; NoSQL; cloud computing; relational data
DOI
10.1142/S0219649216500386
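Chinese Library Classification
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification codes
1205; 120501
Abstract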
The volume of business data is increasing very quickly, and most of these data are relational. Extracting knowledge with data mining requires keeping all historical data. This makes processing and storage increasingly difficult and demands power and capacity beyond the ability of any single machine. Distributed environments such as cloud computing therefore become very useful for sharing storage and processing among multiple nodes. Unfortunately, data based on the relational model cannot easily be used in the cloud because of its rigidity and limited elasticity in such environments. To solve this issue, new big data systems such as NoSQL have appeared that make data easier to share and distribute in cloud environments. In theory, this is beneficial for the data mining use case. In practice, however, it must be demonstrated by evaluating the performance of both multi-node NoSQL and single-node relational systems. In the cloud case, it is also of interest to know whether performance keeps increasing in proportion to the number of nodes, and whether there is an optimum number of nodes at which performance becomes nearly steady or starts to drop off. Motivated by this topic, we propose in this paper an approach to migrate relational data to an appropriate NoSQL system in a cloud environment, and then evaluate performance to obtain some interesting results for data mining. For the experiments, we use industrial data deployed in a data mining process of an oil and gas company. After migrating these data, we perform experiments to compare and evaluate storage, processing and execution time. Our objectives are to verify data elasticity, measure run-time performance, and try to find the optimum number of nodes.
Pages: 18