DSA Final Project (2016 Spring)
Due date: 2016/06/30 23:59:59
English Preposition Usage Checker
Outlines
Slides & recordings
Before proceeding to the problem definition, you might want to watch my recording first. (Note that I've updated some of the slides for clarification after the recording.)
Problem definition
It is sometimes hard for a non-native English speaker to choose the right preposition in an English sentence. To emphasize solving real-world problems, in this project we shall create an English preposition usage checker based on a slim version of Google's Web 1T dataset.
For simplicity, we shall only consider the 20 most common prepositions: of, to, in, for, with, on, at, by, from, up, about, than, after, before, down, between, under, since, without, and near. See here for more details.
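For reference, this list can be hard-coded directly. A minimal C++ snippet (the name PREPS is our own):

```cpp
#include <string>
#include <vector>

// The 20 prepositions considered in this project.
const std::vector<std::string> PREPS = {
    "of", "to", "in", "for", "with", "on", "at", "by", "from", "up",
    "about", "than", "after", "before", "down", "between", "under",
    "since", "without", "near"
};
```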
The general procedure for the preposition usage checker involves the following steps:
- Generate a candidate set of extended queries based on the given query.
- Compute the frequency of each element in the candidate set.
- List the top-10 candidates (with nonzero frequencies) in descending order of frequency. (If there are fewer than 10 such candidates, list all of them.)
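As a rough picture, the whole pipeline might be organized as in the following sketch. This is only a sketch: generateCandidates and lookupFrequency are hypothetical helpers corresponding to the candidate rules and the hash-table lookup described later on this page.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helpers, sketched later on this page:
std::vector<std::string> generateCandidates(const std::string &q); // candidate rules
long long lookupFrequency(const std::string &s);                   // hash-table lookup

void answerQuery(const std::string &query) {
    std::vector<std::pair<long long, std::string>> hits;
    for (const std::string &cand : generateCandidates(query)) {
        long long f = lookupFrequency(cand);
        if (f > 0) hits.push_back(std::make_pair(f, cand)); // keep nonzero only
    }
    // Descending frequency; ties broken alphabetically (see the requirements).
    std::sort(hits.begin(), hits.end(),
              [](const std::pair<long long, std::string> &a,
                 const std::pair<long long, std::string> &b) {
                  return a.first != b.first ? a.first > b.first
                                            : a.second < b.second;
              });
    if (hits.size() > 10) hits.resize(10);                  // top-10 at most
    std::printf("query: %s\n", query.c_str());
    std::printf("output: %zu\n", hits.size());
    for (size_t i = 0; i < hits.size(); ++i)
        std::printf("%s\t%lld\n", hits[i].second.c_str(), hits[i].first);
}
```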
We shall generate the candidate set based on the following two rules:
- If the input query does not contain any preposition, such as "go listen music", then the EDITa set is defined by insertion only:
- Insert: "x go listen music", "go x listen music", "go listen x music", and "go listen music x", where x is an inserted preposition.
Then the candidate set is defined as EDITa(EDITa(query)) + EDITa(query) + query. (A code sketch of both rules follows the second rule below.)
- If the input query does contain prepositions, such as "log in to check with", then we shall identify preposition sequences first. In the previous example, the preposition sequences are "in to" and "with". For a preposition sequence such as "in to", the EDITb set can be defined as follows:
- Insert: "x in to", "in x to", "in to x"
- Delete: "in", "to"
- Substitute: "in x", "x to"
where x is one of the 20 aforementioned prepositions. Given the EDITb sets for all preposition sequences in a query, the candidate set is defined as the Cartesian product of all the EDITb sets together with the original non-preposition words. (If you do not know what a Cartesian product is, please check out my recording.)
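To make the two rules concrete, here is a minimal C++ sketch of generating the EDITa insertions and the EDITb set for a single preposition sequence. The function names (tokenize, join, edita, editb) are our own, and the sketch assumes the PREPS table shown earlier:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a query into whitespace-separated tokens.
std::vector<std::string> tokenize(const std::string &q) {
    std::istringstream iss(q);
    std::vector<std::string> toks;
    for (std::string t; iss >> t; ) toks.push_back(t);
    return toks;
}

// Join tokens back into a space-separated string.
std::string join(const std::vector<std::string> &toks) {
    std::string s;
    for (size_t i = 0; i < toks.size(); ++i) {
        if (i) s += ' ';
        s += toks[i];
    }
    return s;
}

// Rule 1: EDITa(query) inserts one preposition at every gap,
// e.g. "go listen music" has 4 insertion positions.
std::vector<std::string> edita(const std::string &query) {
    std::vector<std::string> toks = tokenize(query), out;
    for (size_t pos = 0; pos <= toks.size(); ++pos)
        for (const std::string &x : PREPS) {
            std::vector<std::string> t = toks;
            t.insert(t.begin() + pos, x);
            out.push_back(join(t));
        }
    return out;
}

// Rule 2: EDITb(seq) for one preposition sequence such as {"in", "to"}:
// all single insertions, deletions, and substitutions.
std::vector<std::string> editb(const std::vector<std::string> &seq) {
    std::vector<std::string> out;
    for (size_t pos = 0; pos <= seq.size(); ++pos)   // Insert
        for (const std::string &x : PREPS) {
            std::vector<std::string> t = seq;
            t.insert(t.begin() + pos, x);
            out.push_back(join(t));
        }
    for (size_t i = 0; i < seq.size(); ++i) {        // Delete
        std::vector<std::string> t = seq;
        t.erase(t.begin() + i);
        out.push_back(join(t));                      // empty for a 1-word sequence
    }
    for (size_t i = 0; i < seq.size(); ++i)          // Substitute
        for (const std::string &x : PREPS) {
            if (x == seq[i]) continue;               // skip no-op substitution
            std::vector<std::string> t = seq;
            t[i] = x;
            out.push_back(join(t));
        }
    return out;
}
```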
Remember that you also need to add the original query to the candidate set. Once we have the candidate set for the input query, we can use the slim version of Web 1T to obtain the frequency of each extended query. Your program should then list the top-10 extended queries in descending order of frequency.
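Building on the editb sketch above, the following sketch composes full candidates for a query that contains prepositions: non-preposition words stay fixed, each maximal preposition sequence ranges over its EDITb set (the Cartesian product), and the original query is appended at the end. Here isPrep is an assumed membership test (e.g. against a std::set built from PREPS), and deduplication is our own choice to avoid counting the same extended query twice:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// From the previous sketch:
std::vector<std::string> tokenize(const std::string &q);
std::vector<std::string> editb(const std::vector<std::string> &seq);
// Assumed membership test against the 20 prepositions (e.g. via a std::set).
bool isPrep(const std::string &w);

// Candidates for a query that contains prepositions.
std::vector<std::string> candidatesWithPreps(const std::string &query) {
    std::vector<std::string> toks = tokenize(query);
    std::vector<std::string> partials = {""};        // growing prefixes
    for (size_t i = 0; i < toks.size(); ) {
        std::vector<std::string> options;
        if (isPrep(toks[i])) {
            size_t j = i;                            // maximal preposition run
            while (j < toks.size() && isPrep(toks[j])) ++j;
            options = editb(std::vector<std::string>(toks.begin() + i,
                                                     toks.begin() + j));
            i = j;
        } else {
            options.push_back(toks[i++]);            // fixed word
        }
        std::vector<std::string> next;               // Cartesian-product step
        for (const std::string &p : partials)
            for (const std::string &o : options) {
                if (o.empty()) next.push_back(p);    // sequence fully deleted
                else next.push_back(p.empty() ? o : p + " " + o);
            }
        partials.swap(next);
    }
    partials.push_back(query);                       // keep the original query
    std::sort(partials.begin(), partials.end());     // our choice: dedupe
    partials.erase(std::unique(partials.begin(), partials.end()),
                   partials.end());
    return partials;
}
```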
Input/output formats
A typical example of an input file is as follows:
have difficulty finding
angry at me
pleased at me
worry for cancer
The output file corresponding to the previous input file is shown next:
query: have difficulty finding
output: 3
have difficulty finding 30918
have difficulty in finding 4636
to have difficulty finding 1174
query: angry at me
output: 2
angry with me 60929
angry at me 24354
query: pleased at me
output: 3
pleased me 40402
pleased with me 10015
pleased for me 1067
query: worry for cancer
output: 1
worry about cancer 2120
Each output entry and its frequency are separated by a tab.
Requirements & suggestions
- For efficiency, the corpus should be stored in a hash table (see the sketch after this list). However, you are welcome to use any other structure for the project, provided it is efficient enough for real-time response.
- More about Google's Web 1T dataset:
- You can safely assume the given query has no more than 100 characters.
- The number of queries in the hidden test set is also 1000.
- The extended queries should be listed in descending order of frequency. Ties are broken by alphabetical order (a.k.a. dictionary order) of the extended queries; the sketch after this list includes a comparator implementing this order.
- The test set contains many queries. You will be awarded partial credit proportional to the number of queries answered correctly.
- The correctness of your answer accounts for 75% of the final score; the other 25% is based on the speed of your program. In other words, the score can be expressed as 0.75*A + 25*(1-rankRatio)*(A/100), where A is the correctness score (between 0 and 100, inclusive) and rankRatio is your speed rank divided by the number of students. (Note that the formula will be adjusted if something unexpected happens that destroys the fairness of the original formula.)
- As usual...
- Your program should take the input from the standard input and send the output to the standard output.
- You can use any STL libraries.
- Remember to use "diff" to check if your output is the same as our test dataset.
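As suggested in the list above, a hash table gives expected constant-time lookups. Below is a minimal loading-and-ordering sketch; it assumes each corpus line stores an n-gram and its frequency separated by a tab, which you should verify against the provided files:

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>

// Load one corpus file into the hash table. Assumes each line is
// "<n-gram>\t<frequency>"; check this against the provided corpora.
void loadCorpus(const std::string &path,
                std::unordered_map<std::string, long long> &freq) {
    std::ifstream fin(path);
    std::string line;
    while (std::getline(fin, line)) {
        size_t tab = line.find('\t');
        if (tab == std::string::npos) continue;
        freq[line.substr(0, tab)] = std::stoll(line.substr(tab + 1));
    }
}

// Output ordering: descending frequency, then ascending dictionary
// order of the extended query on ties.
bool outputOrder(const std::pair<std::string, long long> &a,
                 const std::pair<std::string, long long> &b) {
    if (a.second != b.second) return a.second > b.second;
    return a.first < b.first;
}
```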
Slim corpora from Google's Web 1T dataset
The original Google Web 1T dataset is too big to process on machines with limited memory. Hence we shall use a subset for this project, as shown next. These files are already available under /tmp2/dsa2016_project/ on linux1 to linux13.
Open test sets
Submission guidelines
- Please submit your code with GitHub just like homework, as described in the GitHub submission guide for DSA homework. You don't need to set any tag name for final project submission.
- Please also follow Yen-Chieh's instructions and his post on FB.
- Your directory structure (under GitHub repository) should be:
- project/*.{cpp, h}: Your source code.
- project/*.sh: Your shell scripts.
- You have to make sure your code can be compiled and run on the CSIE R217 Linux machines.