Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java
With the ongoing technology revolution, big data has become a phenomenon of the decade, and it has a significant impact on trends in applied science. Exploring big data tools thoroughly is now a necessity. Hadoop is a good technology for analyzing big data, but it is slow because the job...
Main Authors: | Hoger Khayrolla Omar, Alaa Khalil Jumaa |
---|---|
Format: | Article |
Language: | English |
Published: | Sulaimani Polytechnic University, 2019-05-01 |
Series: | Kurdistan Journal of Applied Research |
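Subjects: | Big data, Data analysis, Apache Spark, Hadoop HDFS, Machine learning, Spark MLlib, Resilient Distributed Datasets (RDD) |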
Online Access: | https://kjar.spu.edu.iq/index.php/kjar/article/view/265 |
_version_ | 1823861323525521408 |
---|---|
author | Hoger Khayrolla Omar; Alaa Khalil Jumaa |
author_sort | Hoger Khayrolla Omar |
collection | DOAJ |
description | With the ongoing technology revolution, big data has become a phenomenon of the decade, and it has a significant impact on trends in applied science. Exploring big data tools thoroughly is now a necessity. Hadoop is a good technology for analyzing big data, but it is slow because the job result of each phase must be stored before the following phase starts, and because of replication delays. Apache Spark is another tool, developed and established as a practical model for analyzing big data, with its innovative in-memory processing framework and high-level programming libraries for machine learning, efficient data handling, and more. This paper presents comparisons of the time performance of Scala and Java in Apache Spark MLlib. Many tests were run with supervised and unsupervised machine learning methods on big datasets. The datasets were loaded both from Hadoop HDFS and from the local disk to identify the pros and cons of each approach and to find the best way of reading or loading a dataset for the best execution. The results showed that Scala performs about 10% to 20% better than Java, depending on the algorithm type. The aim of the study is to analyze big data with a more suitable programming language and, as a consequence, gain better performance.
|
format | Article |
id | doaj-art-366ed084da514b2f849e09a8004eed94 |
institution | Kabale University |
issn | 2411-7684 2411-7706 |
language | English |
publishDate | 2019-05-01 |
publisher | Sulaimani Polytechnic University |
record_format | Article |
series | Kurdistan Journal of Applied Research |
spelling | doaj-art-366ed084da514b2f849e09a8004eed94; 2025-02-09T21:00:39Z; eng; Sulaimani Polytechnic University; Kurdistan Journal of Applied Research; 2411-7684; 2411-7706; 2019-05-01; 4110.24017/science.2019.1.2265; Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java; Hoger Khayrolla Omar (Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani; Kirkuk University, Kirkuk, Iraq); Alaa Khalil Jumaa (Database Technology Department, Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Iraq); https://kjar.spu.edu.iq/index.php/kjar/article/view/265; Keywords: Big data, Data analysis, Apache Spark, Hadoop HDFS, Machine learning, Spark MLlib, Resilient Distributed Datasets (RDD). |
title | Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java |
topic | Keywords: Big data, Data analysis, Apache Spark, Hadoop HDFS, Machine learning, Spark MLlib, Resilient Distributed Datasets (RDD). |
url | https://kjar.spu.edu.iq/index.php/kjar/article/view/265 |