Department of Computer Science at UH

University of Houston

Department of Computer Science

In Partial Fulfillment of the Requirements for the Degree of
Master of Science

Vasanthi Vuppuluri

Will defend her thesis

Textractor: A new approach for extracting N-grams, Collocations and Multi-Word Expressions

Abstract

The vast amount of knowledge now available on the Internet represents a great opportunity for automatic, intelligent text processing and understanding. The major problems are finding legitimate sources of information and overcoming rate limits on search engine APIs. This thesis describes methods that combine knowledge from the World Wide Web (WWW) and the power of Internet search with knowledge extracted from dictionaries. It presents Textractor, an unsupervised, domain-independent, general-purpose tool written in Python for extracting n-grams, collocations, and multi-word expressions (MWEs). Textractor is modular and allows the user to choose from and compare different methods for identifying n-grams, collocations, and MWEs, including statistical, dictionary-based, and Internet-based methods. The thesis shows that it is very hard to identify collocations using only statistical information from the given text document (although this might seem obvious, some systems do rely on it), and that dictionary-based and Internet-based techniques, when combined properly, can be very effective sources of collocations and MWEs without their respective drawbacks. This combined method can overcome the limitations of current natural language processing techniques. Textractor can recognize collocations and MWEs even when the complete sentence is not present. It is currently designed to work with English text but can easily be extended to other languages.
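As an illustration of the purely statistical approach mentioned in the abstract, the following is a minimal sketch of n-gram extraction and bigram scoring by pointwise mutual information (PMI). It is not taken from Textractor: the function names `ngrams` and `bigram_pmi`, the whitespace tokenization, the choice of PMI as the association measure, and the sample sentence are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_pmi(tokens):
    """Score each bigram by PMI (log base 2), using counts from this text only."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    total_unigrams = len(tokens)
    total_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_xy = count / total_bigrams
        p_x = unigram_counts[w1] / total_unigrams
        p_y = unigram_counts[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical sample text, tokenized naively on whitespace.
tokens = "the new york times reported that new york is busy".split()
scores = bigram_pmi(tokens)

# Print bigrams from highest to lowest PMI.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(" ".join(pair), round(score, 2))
```

On this toy input, the pair "times reported", which occurs only once, scores higher than the repeated pair "new york". Small-sample effects of this kind are one plausible reason why statistics drawn from a single document are an unreliable signal for collocations, which is consistent with the abstract's finding.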

Date: Wednesday, April 22, 2015
Time: 9:30 AM
Place: PGH 550

Faculty, students, and the general public are invited.
Advisor: Prof. Rakesh M Verma