DSA Homework 
Due date: 
N-gram Search (Information Retrieval)
Problem definition
Google's Web 1T dataset is a big collection of n-grams (with n=2, 3, 4, and 5) of English that can be used for various applications, including machine translation, speech recognition, English usage correction, etc. In this assignment, we need to implement an efficient information retrieval method for querying a slim version of the Web 1T dataset.
Given an n-gram query, your program should return its frequency based on the slim Web 1T. The n-gram query can be a simple query such as "listen to music" or "in a position to". Additionally, it can contain wildcards and special characters to make the query more flexible, as explained next.
- "_" (underscore) can be used to represent exactly one word (token). For instance, "listen _ music".
- "*" (asterisk) can be used to represent zero or more words. For instance, "a * book".
- "?" can be used as a prefix to denote a word as optional. For instance, "discuss ?about the issue".
- "/" can be used to denote alternatives. For instance, "in/at/on the afternoon".
To test these special characters, please try Linggle. Please also check out the recording and slides for this homework.
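As a minimal sketch of the "_" semantics, the following hypothetical `matches` function (the name and interface are ours, not part of the assignment) checks whether an n-gram matches a pattern that contains only concrete words and "_":

```python
def matches(pattern, ngram):
    """True if the n-gram (a list of tokens) matches a pattern made of
    concrete words and "_", where "_" stands for exactly one arbitrary word."""
    return len(pattern) == len(ngram) and all(
        p == "_" or p == w for p, w in zip(pattern, ngram))

print(matches(["listen", "_", "music"], ["listen", "to", "music"]))        # True
print(matches(["listen", "_", "music"], ["listen", "to", "the", "music"])) # False
```

Note that "_" matches exactly one token, so pattern and n-gram must have the same length; "*" and the other special characters are handled by query expansion, described below.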
Suggested approach
The basic approach is based on an inverted index. In other words, for a given word, we need to be able to find which n-grams it appears in. More information can be found in the first three sections of Boolean retrieval:
Tutorials by Stanford NLP professors Dan Jurafsky & Chris Manning
To be more precise, here is a typical approach to the problem:
- Preprocess
	- Extract all words from the n-gram dataset
	- Create a sorted dictionary of all unique words
	- Generate a posting list for each word in the dictionary
- Online processing
	- Expand the query until it contains only words or "_"
	- Extract the words from the query
	- Retrieve each word (by binary search or the like) and its posting list from one of the n-gram sets
	- Combine the posting lists to obtain the candidate output set
	- Generate the final output by considering token ordering
	- Sort and print the output by frequency
For query expansion, you need to follow the rules below (which may not be exactly the same as those used by Linggle):
- First, expand "?" (which, if present, always appears as the first character of a token). For instance, "abc ?x/y/z _ ?*/p" can be expanded into the following 4 queries:
	- abc _
	- abc _ */p
	- abc x/y/z _
	- abc x/y/z _ */p
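One possible recursive implementation of this "?" expansion (a sketch; the function name is ours): each optional token is either dropped entirely or kept with its "?" stripped.

```python
def expand_optional(tokens):
    """Expand "?"-prefixed tokens: each optional token is either
    dropped or kept with its leading "?" removed."""
    if not tokens:
        return [[]]
    rest = expand_optional(tokens[1:])
    if tokens[0].startswith("?"):
        return rest + [[tokens[0][1:]] + r for r in rest]
    return [[tokens[0]] + r for r in rest]

for q in expand_optional("abc ?x/y/z _ ?*/p".split()):
    print(" ".join(q))
```

Running this prints the four queries listed above, in the same order.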
- Second, expand "/". For instance, "abc x/y/z _ */p" can be expanded into the following 6 queries:
	- abc x _ *
	- abc x _ p
	- abc y _ *
	- abc y _ p
	- abc z _ *
	- abc z _ p
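The "/" expansion can be sketched recursively in the same style (again an illustrative function name): each token contributes one branch per "/"-separated alternative.

```python
def expand_alternatives(tokens):
    """Expand "/"-separated alternatives: a token like "x/y/z"
    contributes one query per alternative."""
    if not tokens:
        return [[]]
    rest = expand_alternatives(tokens[1:])
    return [[alt] + r for alt in tokens[0].split("/") for r in rest]

for q in expand_alternatives("abc x/y/z _ */p".split()):
    print(" ".join(q))
```

A token without "/" splits into a single alternative (itself), so plain words and "_" pass through unchanged.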
- Last, expand "*" into emptiness or combinations of "_". For instance, "give * a *" can be expanded into the following 10 queries:
	- give a
	- give a _
	- give a _ _
	- give a _ _ _
	- give _ a
	- give _ a _
	- give _ a _ _
	- give _ _ a
	- give _ _ a _
	- give _ _ _ a
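The "*" expansion can be sketched recursively as well (illustrative names again): each "*" becomes zero or more copies of "_", and expansions longer than the dataset's maximum n (5) are discarded afterwards.

```python
def expand_star(tokens):
    """Expand each "*" into zero or more "_" tokens; the caller filters
    out expansions that do not fit the dataset (2 <= n <= 5)."""
    if not tokens:
        return [[]]
    rest = expand_star(tokens[1:])
    if tokens[0] == "*":
        # a "*" never needs more than 4 copies of "_" when n <= 5
        return [["_"] * k + r for k in range(5) for r in rest]
    return [[tokens[0]] + r for r in rest]

queries = [q for q in expand_star("give * a *".split()) if 2 <= len(q) <= 5]
for q in queries:
    print(" ".join(q))
```

This prints the 10 queries listed above, in the same order.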
Note that it is easier to write a recursive function for each of the above expansions.
Input/output formats
A typical example of an input file is as follows, where each line is a query:
regarding our site to
occupation does
resolution with all
eliminated _ _ the
resulted from negligent
understand that you _ _
alprazolam international
free/on cartoon sex _
and _ states
storing large _ of/for/with data
For each query, you only need to generate at most the top-5 outputs, in descending order of frequency. If there is a tie, list the tied entries in alphabetical order of the n-grams. The output file corresponding to the previous input file is shown next:
query: regarding our site to
output: 1
regarding our site to	1103
query: occupation does
output: 1
occupation does	4704
query: resolution with all
output: 1
resolution with all	6938
query: eliminated _ _ the
output: 5
eliminated most of the	6744
eliminated many of the	4848
eliminated some of the	4467
eliminated all of the	4131
eliminated much of the	3933
query: resulted from negligent
output: 1
resulted from negligent	3434
query: understand that you _ _
output: 5
understand that you do not	11830
understand that you can not	9507
understand that you are not	6594
understand that you are using	5366
understand that you want to	5175
query: alprazolam international
output: 1
alprazolam international	1414
query: free/on cartoon sex _
output: 5
free cartoon sex video	10298
free cartoon sex free	7323
free cartoon sex pics	4177
free cartoon sex movie	3575
free cartoon sex film	2412
query: and _ states
output: 5
and other states	139463
and the states	105148
and surrounding states	25879
and all states	17070
and some states	16149
query: storing large _ of/for/with data
output: 1
storing large amounts of data	2578
Each output entry and its frequency are separated by a tab.
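The ranking rule (highest frequency first, as in the sample output, with ties broken alphabetically, at most five entries, tab-separated) can be sketched as follows. The candidate list below mixes frequencies from the sample output with one invented entry:

```python
# candidate results for a query: (n-gram, frequency); "none of the" is invented
candidates = [("eliminated some of the", 4467),
              ("eliminated most of the", 6744),
              ("eliminated all of the", 4131),
              ("eliminated many of the", 4848),
              ("eliminated much of the", 3933),
              ("eliminated none of the", 1200)]

# highest frequency first; ties broken by alphabetical order of the n-gram
top5 = sorted(candidates, key=lambda c: (-c[1], c[0]))[:5]
print("output:", len(top5))
for ngram, freq in top5:
    print(f"{ngram}\t{freq}")
```

Negating the frequency in the sort key sorts by frequency descending while keeping the n-gram string ascending for ties, and Python's stable sort preserves that pair ordering.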
Requirements
- Limits:
	- Max time: 180 seconds.
	- Max memory: 7 GB.
	- Max submissions: 20 times each week, starting at "20170320 23:59:59" and ending at "20170403 23:59:59".
- Program usage: Your program should take an input argument indicating the directory that holds the n-gram files. The default file names are shown in the download links listed below. The input file (which holds the queries) can be redirected to the program's standard input. An example follows.
	myProgram /tmp2/dsa2017_hw02/ < input00.txt
Datasets
- The original Google Web 1T dataset is too big to process on machines with limited memory, so we shall use a subset for this project, as shown next. These files are already available under /tmp2/dsa2017_hw02/ on linux1 through linux13.
- Open test sets
- Hidden test sets