DSA Homework 2
Due date: 20170403 23:59:59
N-gram Search (Information Retrieval)
Outline
Problem definition
Google's Web 1T dataset is a large collection of English n-grams (with n = 2, 3, 4, and 5) that can be used for various applications, including machine translation, speech recognition, English usage correction, etc. In this assignment, we need to implement an efficient information retrieval method for querying a slim version of the Web 1T dataset.
Given an n-gram query, your program should return its frequency based on the slim Web 1T. The query can be a simple one such as "listen to music" or "in a position to". Additionally, it can contain wildcards and special characters to make the query more flexible, as explained next.
- "-" (underscore) can be used to represent a word (token).
For instance, "listen _ music"
- "*" (asterisk) can be used to represent zero or more words
For instance, "a * book"
- "?" can be used as a prefix to denote a word as optional
For instance, "discuss ?about the issue"
- "/" can be used to denote all alternatives
For instance, "in/at/on the afternoon"
To test these special characters, please try Linggle.
Please check out the recording and slides of this homework.
Suggested approach
The basic approach is based on an inverted index. In other words, for a given word, we need to be able to find which n-grams it appears in. More info can be found in the first three sections of Boolean retrieval:
Tutorials by Stanford NLP professors Dan Jurafsky & Chris Manning
To be more precise, here is a typical approach to the problem:
- Preprocess
- Extract all words from the n-gram dataset
- Create a dictionary of all sorted unique words
- Generate a posting for each word in the dictionary
- Online processing
- Expand the query until it contains only words or "_".
- Extract words from the query
- Retrieve each word (by binary search or the like) and its posting from one of the n-gram sets
- Combine postings to have the candidate output set
- Generate final output by considering ordering.
- Sort and print the output ordered by frequency
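The preprocessing and online-processing steps above can be sketched in Python as follows. This is a minimal in-memory sketch for illustration only: the function names and the representation of the dataset as a list of (n-gram, frequency) pairs are assumptions, and a real solution would need on-disk structures and faster posting merges to meet the time and memory limits.

```python
from collections import defaultdict

def build_index(ngrams):
    """Preprocess: map each word to the set of indices of n-grams containing it."""
    postings = defaultdict(set)
    for i, (ngram, _freq) in enumerate(ngrams):
        for word in ngram.split():
            postings[word].add(i)
    return postings

def match(query, ngrams, postings):
    """Answer a fully expanded query containing only words and '_'."""
    tokens = query.split()
    words = [t for t in tokens if t != "_"]
    # Intersect postings, starting from the rarest word to keep sets small.
    candidates = None
    for w in sorted(words, key=lambda w: len(postings.get(w, ()))):
        ids = postings.get(w, set())
        candidates = ids if candidates is None else candidates & ids
    if candidates is None:              # the query consists only of "_"
        candidates = range(len(ngrams))
    # Verify length and word positions against the query pattern.
    results = []
    for i in candidates:
        ngram, freq = ngrams[i]
        parts = ngram.split()
        if len(parts) == len(tokens) and all(
                t in ("_", p) for t, p in zip(tokens, parts)):
            results.append((ngram, freq))
    return results
```

Intersecting the smallest posting first keeps intermediate candidate sets small, which matters when a query mixes a rare word with very common ones.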
For query expansion, you need to follow the rules shown next (which may not be exactly the same as those used by Linggle):
- First, expand "?" (which, if exist, always applear as the first character in a token). For instance, "abc ?x/y/z _ ?*/p" can be expanded into the following 4 queries:
- abc _
- abc _ */p
- abc x/y/z _
- abc x/y/z _ */p
- Second, expand "/". For instance, "abc x/y/z _ */p" can be expanded into the following 6 queries:
- abc x _ *
- abc x _ p
- abc y _ *
- abc y _ p
- abc z _ *
- abc z _ p
- Last, expand "*" into emptiness or combinations of "_". For instance, "give * a *" can be expaneded into the following 10 queries:
- give a
- give a _
- give a _ _
- give a _ _ _
- give _ a
- give _ a _
- give _ a _ _
- give _ _ a
- give _ _ a _
- give _ _ _ a
Note that it will be easier to write a recursive function for each of the above expansions.
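As a concrete illustration of the three expansion passes above, here is a recursive Python sketch. The function names are assumptions, and the length cutoff of 5 (the largest n in the dataset) is used to discard over-long expansions:

```python
def expand_optional(tokens):
    """Pass 1: each '?'-prefixed token is either dropped or kept."""
    if not tokens:
        return [[]]
    head, rest = tokens[0], expand_optional(tokens[1:])
    if head.startswith("?"):
        return rest + [[head[1:]] + r for r in rest]
    return [[head] + r for r in rest]

def expand_alternatives(tokens):
    """Pass 2: a token 'a/b/c' yields one query per alternative."""
    if not tokens:
        return [[]]
    rest = expand_alternatives(tokens[1:])
    return [[alt] + r for alt in tokens[0].split("/") for r in rest]

def expand_star(tokens, max_len=5):
    """Pass 3: each '*' becomes zero or more '_' tokens."""
    if not tokens:
        return [[]]
    rest = expand_star(tokens[1:], max_len)
    if tokens[0] == "*":
        return [["_"] * k + r for k in range(max_len + 1) for r in rest]
    return [[tokens[0]] + r for r in rest]

def expand(query, max_len=5):
    """Apply the three passes in order; drop over-long results and duplicates."""
    out, seen = [], set()
    for q1 in expand_optional(query.split()):
        for q2 in expand_alternatives(q1):
            for q3 in expand_star(q2, max_len):
                key = " ".join(q3)
                if 0 < len(q3) <= max_len and key not in seen:
                    seen.add(key)
                    out.append(key)
    return out
```

For "give * a *" this produces exactly the 10 queries listed above.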
Input/output formats
A typical example of an input file is as follows, where each line is a query:
regarding our site to
occupation does
resolution with all
eliminated _ _ the
resulted from negligent
understand that you _ _
alprazolam international
free/on cartoon sex _
and _ states
storing large _ of/for/with data
For each query, you only need to generate at most the top-5 outputs in descending order of frequency. If there is a tie, list the tied n-grams in alphabetical order. The output file corresponding to the previous input file is shown next:
query: regarding our site to
output: 1
regarding our site to 1103
query: occupation does
output: 1
occupation does 4704
query: resolution with all
output: 1
resolution with all 6938
query: eliminated _ _ the
output: 5
eliminated most of the 6744
eliminated many of the 4848
eliminated some of the 4467
eliminated all of the 4131
eliminated much of the 3933
query: resulted from negligent
output: 1
resulted from negligent 3434
query: understand that you _ _
output: 5
understand that you do not 11830
understand that you can not 9507
understand that you are not 6594
understand that you are using 5366
understand that you want to 5175
query: alprazolam international
output: 1
alprazolam international 1414
query: free/on cartoon sex _
output: 5
free cartoon sex video 10298
free cartoon sex free 7323
free cartoon sex pics 4177
free cartoon sex movie 3575
free cartoon sex film 2412
query: and _ states
output: 5
and other states 139463
and the states 105148
and surrounding states 25879
and all states 17070
and some states 16149
query: storing large _ of/for/with data
output: 1
storing large amounts of data 2578
Each output entry and its frequency are separated by a tab.
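The ordering and format described above could be produced by something like the following sketch (the function name is an assumption; `matches` is a list of (n-gram, frequency) pairs collected for one query):

```python
def format_output(query, matches, top_k=5):
    """Render one query's result block in the required output format."""
    # Descending frequency; ties broken alphabetically by n-gram.
    top = sorted(matches, key=lambda m: (-m[1], m[0]))[:top_k]
    lines = [f"query: {query}", f"output: {len(top)}"]
    lines += [f"{ngram}\t{freq}" for ngram, freq in top]
    return "\n".join(lines)
```

Sorting by the key `(-frequency, ngram)` handles both ordering rules in one pass.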
Requirements
- Limits:
- Max time: 180 seconds.
- Max memory: 7 GB.
- Max submissions: 20 times each week, starting at "20170320 23:59:59" and ending at "20170403 23:59:59".
- Program usage: Your program should take an input argument indicating the directory that holds the n-gram files. The default file names are shown in the download links listed below. The input file (which holds the queries) can be redirected to the program. An example follows.
myProgram /tmp2/dsa2017_hw02/ < input00.txt
Datasets
- The original Google Web 1T dataset is too large to process on machines with limited memory. Hence we shall use a subset for this project, as shown next. These files are already available under /tmp2/dsa2017_hw02/ on linux1 to linux13.
- Open test sets
- Hidden test sets