MATH 4323 - Data Science and Statistical Learning - University of Houston

# MATH 4323 - Data Science and Statistical Learning

***This is a course guideline. Students should contact the instructor for updated information on the current course syllabus, textbooks, and course content.***

Prerequisites: MATH 3339

Course Description: This course covers the theory and applications of statistical learning techniques such as maximal margin classifiers, support vector machines, and K-means and hierarchical clustering. Other topics may include algorithm performance evaluation, cluster validation, data scaling, and resampling methods. The R statistical programming language will be used throughout the course.

Textbook: While lecture notes will serve as the main source of material for the course, the following book constitutes a great reference:
• "An Introduction to Statistical Learning (with applications in R)" by James, Witten et al. ISBN: 978-1461471370

Learning Objectives: By the end of the course a successful student should:
• Have a solid conceptual grasp of the described statistical learning methods.
• Be able to correctly identify the appropriate techniques to deal with particular data sets.
• Have a working knowledge of R programming in order to apply those techniques and subsequently assess the quality of fitted models.
• Demonstrate the ability to clearly communicate the results of applying selected statistical learning methods to the data.

Tentative Course Outline:
• Review: Task of Statistical Learning. Supervised and unsupervised learning. Most ubiquitous statistical learning techniques.
• Support Vector Classifier. Maximal margin classifier: separating hyperplane, support vectors. Non-separable case: support vector classifier.
• Support Vector Machines. Non-linear decision boundaries. Kernels. One-versus-one and one-versus-all classification for K > 2 classes. Evaluating quality of classification.
• Clustering Methods: K-Means. Within-cluster variation. Computing centroids. Multiple starts. Selecting K.
• Clustering Methods: Hierarchical. Agglomerative clustering. Linkage. Interpreting dendrogram. Choice of dissimilarity measure. Data scaling.
• Evaluation of Clustering Solution. Is this a good clustering? Variance explained. Between- and within-cluster variation. Silhouette coefficient.