DSA Homework 2
Due date: 20170403 23:59:59
N-gram Search (Information Retrieval)
Outline
Problem definition
Google's Web 1T dataset is a large collection of English n-grams (with n = 2, 3, 4, and 5) that can be used for various applications, including machine translation, speech recognition, English usage correction, etc. In this assignment, we need to implement an efficient information retrieval method for querying a slim version of the Web 1T dataset.
Given an n-gram query, your program should return its frequency based on the slim Web 1T. The query can be a simple one such as "listen to music" or "in a position to". Additionally, it can contain wildcards and special characters to make the query more flexible, as explained next.
- "-" (underscore) can be used to represent a word (token).
For instance, "listen _ music"
- "*" (asterisk) can be used to represent zero or more words
For instance, "a * book"
- "?" can be used as a prefix to denote a word as optional
For instance, "discuss ?about the issue"
- "/" can be used to denote all alternatives
For instance, "in/at/on the afternoon"
To test these special characters, please try Linggle.
Please check out the recording and slides of this homework.
Suggested approach
The basic approach is based on an inverted index. In other words, for a given word, we need to be able to find which n-grams it appears in. More info can be found in the first three sections of Boolean retrieval:
Tutorials by Stanford NLP professors Dan Jurafsky & Chris Manning
To be more precise, here is a typical approach to the problem:
- Preprocess
- Extract all words from the n-gram dataset
- Create a dictionary of all sorted unique words
- Generate a posting for each word in the dictionary
- Online processing
- Expand the query until it contains only words or "_".
- Extract words from the query
- Retrieve each word (by binary search or the like) and its posting from one of the n-gram sets
- Combine postings to have the candidate output set
- Generate final output by considering ordering.
- Sort and print the output ordered by frequency
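The preprocessing and online-processing steps above can be sketched in Python as follows. This is a minimal in-memory sketch for illustration only: the function names and the representation of the dataset as a list of (n-gram, frequency) pairs are assumptions, and a real solution would need on-disk structures and faster posting merges to meet the time and memory limits.

```python
from collections import defaultdict

def build_index(ngrams):
    """Preprocess: map each word to the set of indices of n-grams containing it."""
    postings = defaultdict(set)
    for i, (ngram, _freq) in enumerate(ngrams):
        for word in ngram.split():
            postings[word].add(i)
    return postings

def match(query, ngrams, postings):
    """Answer a fully expanded query containing only words and '_'."""
    tokens = query.split()
    words = [t for t in tokens if t != "_"]
    # Intersect postings, starting from the rarest word to keep sets small.
    candidates = None
    for w in sorted(words, key=lambda w: len(postings.get(w, ()))):
        ids = postings.get(w, set())
        candidates = ids if candidates is None else candidates & ids
    if candidates is None:              # the query consists only of "_"
        candidates = range(len(ngrams))
    # Verify length and word positions against the query pattern.
    results = []
    for i in candidates:
        ngram, freq = ngrams[i]
        parts = ngram.split()
        if len(parts) == len(tokens) and all(
                t in ("_", p) for t, p in zip(tokens, parts)):
            results.append((ngram, freq))
    return results
```

Intersecting the smallest posting first keeps intermediate candidate sets small, which matters when a query mixes a rare word with very common ones.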
For query expansion, you need to follow the rules shown next (which may not be exactly the same as those used by Linggle):
- First, expand "?" (which, if exist, always applear as the first character in a token). For instance, "abc ?x/y/z _ ?*/p" can be expanded into the following 4 queries:
- abc _
- abc _ */p
- abc x/y/z _
- abc x/y/z _ */p
- Second, expand "/". For instance, "abc x/y/z _ */p" can be expanded into the following 6 queries:
- abc x _ *
- abc x _ p
- abc y _ *
- abc y _ p
- abc z _ *
- abc z _ p
- Last, expand "*" into emptiness or combinations of "_". For instance, "give * a *" can be expaneded into the following 10 queries:
- give a
- give a _
- give a _ _
- give a _ _ _
- give _ a
- give _ a _
- give _ a _ _
- give _ _ a
- give _ _ a _
- give _ _ _ a
Note that it will be easier to write a recursive function for each of the above expansions.
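As a concrete illustration of the three expansion passes above, here is a recursive Python sketch. The function names are assumptions, and the length cutoff of 5 (the largest n in the dataset) is used to discard over-long expansions:

```python
def expand_optional(tokens):
    """Pass 1: each '?'-prefixed token is either dropped or kept."""
    if not tokens:
        return [[]]
    head, rest = tokens[0], expand_optional(tokens[1:])
    if head.startswith("?"):
        return rest + [[head[1:]] + r for r in rest]
    return [[head] + r for r in rest]

def expand_alternatives(tokens):
    """Pass 2: a token 'a/b/c' yields one query per alternative."""
    if not tokens:
        return [[]]
    rest = expand_alternatives(tokens[1:])
    return [[alt] + r for alt in tokens[0].split("/") for r in rest]

def expand_star(tokens, max_len=5):
    """Pass 3: each '*' becomes zero or more '_' tokens."""
    if not tokens:
        return [[]]
    rest = expand_star(tokens[1:], max_len)
    if tokens[0] == "*":
        return [["_"] * k + r for k in range(max_len + 1) for r in rest]
    return [[tokens[0]] + r for r in rest]

def expand(query, max_len=5):
    """Apply the three passes in order; drop over-long results and duplicates."""
    out, seen = [], set()
    for q1 in expand_optional(query.split()):
        for q2 in expand_alternatives(q1):
            for q3 in expand_star(q2, max_len):
                key = " ".join(q3)
                if 0 < len(q3) <= max_len and key not in seen:
                    seen.add(key)
                    out.append(key)
    return out
```

For "give * a *" this produces exactly the 10 queries listed above.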
Input/output formats
A typical example of an input file is as follows, where each line is a query:
regarding our site to
occupation does
resolution with all
eliminated _ _ the
resulted from negligent
understand that you _ _
alprazolam international
free/on cartoon sex _
and _ states
storing large _ of/for/with data
For each query, you only need to generate at most the top-5 outputs in descending order of frequency. If there is a tie, list the tied n-grams in alphabetical order. The output file corresponding to the previous input file is shown next:
query: regarding our site to
output: 1
regarding our site to 1103
query: occupation does
output: 1
occupation does 4704
query: resolution with all
output: 1
resolution with all 6938
query: eliminated _ _ the
output: 5
eliminated most of the 6744
eliminated many of the 4848
eliminated some of the 4467
eliminated all of the 4131
eliminated much of the 3933
query: resulted from negligent
output: 1
resulted from negligent 3434
query: understand that you _ _
output: 5
understand that you do not 11830
understand that you can not 9507
understand that you are not 6594
understand that you are using 5366
understand that you want to 5175
query: alprazolam international
output: 1
alprazolam international 1414
query: free/on cartoon sex _
output: 5
free cartoon sex video 10298
free cartoon sex free 7323
free cartoon sex pics 4177
free cartoon sex movie 3575
free cartoon sex film 2412
query: and _ states
output: 5
and other states 139463
and the states 105148
and surrounding states 25879
and all states 17070
and some states 16149
query: storing large _ of/for/with data
output: 1
storing large amounts of data 2578
Each output entry and its frequency are separated by a tab.
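The ordering and format described above could be produced by something like the following sketch (the function name is an assumption; `matches` is a list of (n-gram, frequency) pairs collected for one query):

```python
def format_output(query, matches, top_k=5):
    """Render one query's result block in the required output format."""
    # Descending frequency; ties broken alphabetically by n-gram.
    top = sorted(matches, key=lambda m: (-m[1], m[0]))[:top_k]
    lines = [f"query: {query}", f"output: {len(top)}"]
    lines += [f"{ngram}\t{freq}" for ngram, freq in top]
    return "\n".join(lines)
```

Sorting by the key `(-frequency, ngram)` handles both ordering rules in one pass.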
Requirements
- Limits:
- Max time: 180 seconds.
- Max memory: 7 GB.
- Max submissions: 20 times each week, starting at "20170320 23:59:59" and ending at "20170403 23:59:59".
- Program usage: Your program should take an input argument indicating the directory that holds the n-gram files. The default file names are shown in the download links listed below. The input file (which holds the queries) can be redirected to the program. An example follows.
myProgram /tmp2/dsa2017_hw02/ < input00.txt
Datasets
- The original Google Web 1T dataset is too large to process on machines with limited memory. Hence we shall use a subset for this project, as shown next. These files are already available under /tmp2/dsa2017_hw02/ on linux1 to linux13.
- Open test sets
- Hidden test sets