Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java

With the ongoing technology revolution, big data has become a defining phenomenon of the decade, and it has a significant impact on trends in applied science. Exploring big data tools thoroughly is therefore a pressing need. Hadoop is a capable big data analysis technology, but it is slow because the job...

Bibliographic Details
Main Authors: Hoger Khayrolla Omar, Alaa Khalil Jumaa
Format: Article
Language: English
Published: Sulaimani Polytechnic University 2019-05-01
Series: Kurdistan Journal of Applied Research
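Subjects: Big data, Data analysis, Apache Spark, Hadoop HDFS, Machine learning, Spark MLlib, Resilient Distributed Datasets (RDD)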
Online Access: https://kjar.spu.edu.iq/index.php/kjar/article/view/265
author Hoger Khayrolla Omar
Alaa Khalil Jumaa
collection DOAJ
description With the ongoing technology revolution, big data has become a defining phenomenon of the decade, and it has a significant impact on trends in applied science. Exploring big data tools thoroughly is therefore a pressing need. Hadoop is a capable big data analysis technology, but it is slow because the job result of each phase must be stored before the following phase starts, and because of replication delays. Apache Spark is another tool, developed to be a practical model for analyzing big data, with an innovative in-memory processing framework and high-level programming libraries for machine learning, efficient data handling, and more. This paper compares the time performance of Scala and Java in Apache Spark MLlib. Many tests were run with supervised and unsupervised machine learning methods on big datasets. The datasets were loaded both from Hadoop HDFS and from the local disk, to identify the pros and cons of each approach and to find the loading setup that gives the best execution. The results showed that Scala performed about 10% to 20% better than Java, depending on the algorithm type. The aim of the study is to analyze big data with a more suitable programming language and, as a consequence, gain better performance.
format Article
id doaj-art-366ed084da514b2f849e09a8004eed94
institution Kabale University
issn 2411-7684
2411-7706
language English
publishDate 2019-05-01
publisher Sulaimani Polytechnic University
record_format Article
series Kurdistan Journal of Applied Research
spelling doaj-art-366ed084da514b2f849e09a8004eed94; 2025-02-09T21:00:39Z; eng; Sulaimani Polytechnic University; Kurdistan Journal of Applied Research; ISSN 2411-7684, 2411-7706; 2019-05-01; vol. 4, no. 1; DOI 10.24017/science.2019.1.2; 265
Hoger Khayrolla Omar (Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani; Kirkuk University, Kirkuk, Iraq)
Alaa Khalil Jumaa (Database Technology Department, Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Iraq)
title Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java
topic Big data, Data analysis, Apache Spark, Hadoop HDFS, Machine learning, Spark MLlib, Resilient Distributed Datasets (RDD)
url https://kjar.spu.edu.iq/index.php/kjar/article/view/265