Thesis Defense - University of Houston

Thesis Defense

In Partial Fulfillment of the Requirements for the Degree of Master of Science

Haripriya Ayyalasomayajula

will defend her thesis

An Evaluation of the Spark Programming Model for Big Data Analytics


Abstract

Companies such as Google and Amazon seek a competitive business advantage from the insights drawn by processing petabytes of data. Big Data refers to data characterized by large volume, great variety, and the ubiquitous nature of its sources. MapReduce is a programming model that provides a highly scalable and efficient solution for analyzing massive datasets on large-scale commodity clusters. Though Hadoop, its open source implementation, became a de facto standard for parallel processing of batch workloads, it is inefficient for iterative and incremental algorithms, ad hoc queries, and stream processing.

Apache Spark is a general-purpose cluster computing framework that supports in-memory data analytics. It preserves the merits offered by Hadoop and overcomes its limitations. This thesis evaluates the performance of the Spark programming model for Big Data analytics. Code was developed to analyze historical data from an air quality dataset using both Spark and MapReduce, which involved significant development effort and tuning of configuration parameters. Spark was observed to perform up to 40% better than MapReduce.

Applying machine learning techniques to Big Data forms the core of data analytics. MLlib is a scalable machine learning library offered by the Spark ecosystem. To extend our analyses, we perform clustering on the air quality dataset and evaluate the performance, clustering quality, and usability of the K-Means clustering implementation provided by Spark MLlib against that of Apache Mahout. We also attempted to develop code to evaluate Spark's ability to integrate with HBase as a data source. Though the initial test cases ran successfully with a small dataset, the documentation currently available is insufficient, so this is reserved for future work.


Date: Tuesday, April 28, 2015
Time: 11:00 AM
Place: PGH 550
Advisor: Prof. Edgar Gabriel

Faculty, students, and the general public are invited.