In Partial Fulfillment of the Requirements for the Degree of Master of Science
Siva Uday Sampreeth Chebolu
will defend his thesis
A General Summarization Matrix for Scalable Machine Learning Model Computation in the R Language
Data analysis has become an essential task. Large datasets typically require a parallel DBMS, the Hadoop stack, or a parallel cluster to analyze. We propose an alternative approach that uses a lightweight language and system, R, to compute machine learning models on such datasets. This approach eliminates the need for cluster or parallel systems in most cases, paving the way for an average user to apply it effectively. Specifically, we aim to remove the physical memory, time, and speed limitations that current R packages face when working on a single machine. R is a powerful language and is very popular for its data analysis capabilities. However, R can be slow to a degree that does not allow flexible modifications, and it is cumbersome to make fast and efficient. To address these drawbacks, we implemented our approach in two phases. The first phase deals with the construction of a summarization matrix, Γ, from a one-time scan of the source dataset and is implemented in C++ using the Rcpp package. There are two forms of the Γ matrix, Diagonal and Non-Diagonal Gamma, each of which is efficient for computing specific models. The second phase, implemented in R itself, uses the constructed Γ matrix to compute machine learning models such as PCA, Linear Regression, Naïve Bayes, K-means, and similar ones. We bundled the whole approach into an R package, Gamma.
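The two-phase idea can be illustrated with a minimal C++ sketch. The abstract does not spell out the layout of Γ, so the sketch assumes it holds the point count n, the linear sum L, and the quadratic sum Q of the data points. The struct GammaSummary and its member names are hypothetical and are not taken from the Gamma package; the Rcpp glue is omitted so the block compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary container. The (n, L, Q) layout is an assumption;
// the abstract only states that Gamma is built in one scan of the dataset.
struct GammaSummary {
    std::size_t d;            // number of dimensions
    double n;                 // number of points seen so far
    std::vector<double> L;    // linear sum of the points, length d
    std::vector<double> Q;    // sum of outer products, d x d, row-major

    explicit GammaSummary(std::size_t dims)
        : d(dims), n(0.0), L(dims, 0.0), Q(dims * dims, 0.0) {}

    // Phase one: fold a single point into the summary.
    // Calling this once per row gives a one-pass scan of the dataset.
    void add(const std::vector<double>& x) {
        assert(x.size() == d);
        n += 1.0;
        for (std::size_t a = 0; a < d; ++a) {
            L[a] += x[a];
            for (std::size_t b = 0; b < d; ++b) {
                Q[a * d + b] += x[a] * x[b];
            }
        }
    }

    // Phase two: model statistics derived from the summary alone,
    // without touching the original dataset again.
    double mean(std::size_t a) const { return L[a] / n; }

    // Population covariance: Q_ab / n - mean_a * mean_b.
    double covariance(std::size_t a, std::size_t b) const {
        return Q[a * d + b] / n - mean(a) * mean(b);
    }
};
```

Under these assumptions, the summary occupies O(d^2) memory regardless of the number of rows, which is why the dataset itself never has to fit in physical memory. The covariance entries are the input that a model such as PCA would need.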
Date: Wednesday, April 17, 2019
Time: 9:45 AM
Place: PGH 218D
Advisors: Dr. Carlos Ordonez, Dr. Christoph F. Eick, Dr. Klaus Kaiser
Faculty, students, and the general public are invited.