In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
will defend his dissertation proposal
Incremental Data Summarization For Parallel In-database Supervised/Unsupervised Learning
Data summarization is an essential mechanism to accelerate analytic algorithms on large data sets. In this work we present a comprehensive data summarization matrix, namely the Gamma matrix, from which we can derive equivalent equations for many analytic algorithms. In this way, iterative algorithms are restructured to work in two phases: (1) incremental, parallel summarization of the data set in one pass; (2) iteration in main memory, exploiting the summarization matrix in many intermediate computations. We show that our summarization matrix captures essential statistical properties of the data set and allows iterative algorithms to run substantially faster in main memory. We show that two versions of the summarization matrix, the full Gamma and the diagonal Gamma, benefit statistical models including PCA, linear regression, variable selection, the Naïve Bayes classifier, and K-means clustering. From a systems perspective, we carefully study the efficient computation of the summarization matrix in several parallel database systems: the array DBMS SciDB, the columnar relational DBMS HP Vertica, and the in-memory row DBMS VoltDB. We propose general optimizations based on data density, as well as system-dependent optimizations for each platform. We will also present an experimental evaluation benchmarking system and algorithm performance.
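To make the two-phase idea concrete, the following is a minimal NumPy sketch (not the in-database implementations discussed in the talk) of the Gamma matrix, assuming each data point is augmented as z_i = [1, x_i, y_i] so that Gamma = sum over i of z_i z_i^T accumulates the count, linear sums, and quadratic sums in one pass; linear regression is then derived in memory from blocks of Gamma.

```python
import numpy as np

def gamma_chunk(X, y):
    """Gamma contribution of one chunk: Z^T Z with Z = [1 | X | y]."""
    n = X.shape[0]
    Z = np.hstack([np.ones((n, 1)), X, y.reshape(-1, 1)])
    return Z.T @ Z

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Phase 1: incremental one-pass summarization over chunks
# (chunks could come from parallel workers; Gamma sums are additive)
d = X.shape[1]
Gamma = np.zeros((d + 2, d + 2))
for start in range(0, 100, 25):
    Gamma += gamma_chunk(X[start:start + 25], y[start:start + 25])

# Phase 2: derive the model in main memory from Gamma blocks
n_pts = Gamma[0, 0]            # count n
Q = Gamma[:d + 1, :d + 1]      # sums of [1, x]^T [1, x]
XY = Gamma[:d + 1, d + 1]      # sums of [1, x]^T y
beta = np.linalg.solve(Q, XY)  # least-squares coefficients
```

Because each chunk's contribution is a simple matrix sum, the accumulation parallelizes naturally across database partitions, which is what the system-level work in the talk exploits.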
Date: Monday, September 12, 2016
Time: 10:00 AM
Place: PGH 550
Advisor: Carlos Ordonez
Faculty, students, and the general public are invited.