DSA Final Project (2016 Spring)
Due date: 2016/06/30 23:59:59
English Preposition Usage Checker
Outlines
Slides & recordings
Before proceeding to the problem definition, you might want to watch my recording first. (Note that I've updated some of the slides for clarification after the recording.)
Problem definition
It is sometimes hard for a non-native English speaker to choose the right preposition in an English sentence. To emphasize solving real-world problems, in this project we shall create an English preposition usage checker based on a slim version of Google's Web 1T dataset.
For simplicity, we shall only consider the 20 most common prepositions: of, to, in, for, with, on, at, by, from, up, about, than, after, before, down, between, under, since, without, and near. See here for more details.
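For reference, this list can be hard-coded directly. A minimal C++ snippet (the name PREPS is our own):

```cpp
#include <string>
#include <vector>

// The 20 prepositions considered in this project.
const std::vector<std::string> PREPS = {
    "of", "to", "in", "for", "with", "on", "at", "by", "from", "up",
    "about", "than", "after", "before", "down", "between", "under",
    "since", "without", "near"
};
```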
The general procedure for the preposition usage checker involves the following steps:
- Generate a candidate set of extended queries based on the given query.
- Compute the frequency of each element in the candidate set.
- List the top-10 candidates (with nonzero frequencies) in descending order of frequency. (If there are fewer than 10 such candidates, list all of them.)
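As a rough picture, the whole pipeline might be organized as in the following sketch. This is only a sketch: generateCandidates and lookupFrequency are hypothetical helpers corresponding to the candidate rules and the hash-table lookup described later on this page.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helpers, sketched later on this page:
std::vector<std::string> generateCandidates(const std::string &q); // candidate rules
long long lookupFrequency(const std::string &s);                   // hash-table lookup

void answerQuery(const std::string &query) {
    std::vector<std::pair<long long, std::string>> hits;
    for (const std::string &cand : generateCandidates(query)) {
        long long f = lookupFrequency(cand);
        if (f > 0) hits.push_back(std::make_pair(f, cand)); // keep nonzero only
    }
    // Descending frequency; ties broken alphabetically (see the requirements).
    std::sort(hits.begin(), hits.end(),
              [](const std::pair<long long, std::string> &a,
                 const std::pair<long long, std::string> &b) {
                  return a.first != b.first ? a.first > b.first
                                            : a.second < b.second;
              });
    if (hits.size() > 10) hits.resize(10);                  // top-10 at most
    std::printf("query: %s\n", query.c_str());
    std::printf("output: %zu\n", hits.size());
    for (size_t i = 0; i < hits.size(); ++i)
        std::printf("%s\t%lld\n", hits[i].second.c_str(), hits[i].first);
}
```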
We shall generate the candidate set based on the following two rules:
- If the input query does not contain any preposition, such as "go listen music", then the EDITa set is defined by insertion only:
- Insert: "x go listen music", "go x listen music", "go listen x music", and "go listen music x", where x is an inserted preposition.
Then the candidate set is defined as EDITa(EDITa(query)) + EDITa(query) + query. (A code sketch of both rules follows the second rule below.)
- If the input query does contain prepositions, such as "log in to check with", then we shall identify preposition sequences first. In the previous example, the preposition sequences are "in to" and "with". For a preposition sequence such as "in to", the EDITb set can be defined as follows:
- Insert: "x in to", "in x to", "in to x"
- Delete: "in", "to"
- Substitute: "in x", "x to"
where x is one of the 20 aforementioned prepositions. Given the EDITb sets for all preposition sequences in a query, the candidate set is defined as the Cartesian product of all the EDITb sets together with the original non-preposition words. (If you do not know what a Cartesian product is, please check out my recording.)
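To make the two rules concrete, here is a minimal C++ sketch of generating the EDITa insertions and the EDITb set for a single preposition sequence. The function names (tokenize, join, edita, editb) are our own, and the sketch assumes the PREPS table shown earlier:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a query into whitespace-separated tokens.
std::vector<std::string> tokenize(const std::string &q) {
    std::istringstream iss(q);
    std::vector<std::string> toks;
    for (std::string t; iss >> t; ) toks.push_back(t);
    return toks;
}

// Join tokens back into a space-separated string.
std::string join(const std::vector<std::string> &toks) {
    std::string s;
    for (size_t i = 0; i < toks.size(); ++i) {
        if (i) s += ' ';
        s += toks[i];
    }
    return s;
}

// Rule 1: EDITa(query) inserts one preposition at every gap,
// e.g. "go listen music" has 4 insertion positions.
std::vector<std::string> edita(const std::string &query) {
    std::vector<std::string> toks = tokenize(query), out;
    for (size_t pos = 0; pos <= toks.size(); ++pos)
        for (const std::string &x : PREPS) {
            std::vector<std::string> t = toks;
            t.insert(t.begin() + pos, x);
            out.push_back(join(t));
        }
    return out;
}

// Rule 2: EDITb(seq) for one preposition sequence such as {"in", "to"}:
// all single insertions, deletions, and substitutions.
std::vector<std::string> editb(const std::vector<std::string> &seq) {
    std::vector<std::string> out;
    for (size_t pos = 0; pos <= seq.size(); ++pos)   // Insert
        for (const std::string &x : PREPS) {
            std::vector<std::string> t = seq;
            t.insert(t.begin() + pos, x);
            out.push_back(join(t));
        }
    for (size_t i = 0; i < seq.size(); ++i) {        // Delete
        std::vector<std::string> t = seq;
        t.erase(t.begin() + i);
        out.push_back(join(t));                      // empty for a 1-word sequence
    }
    for (size_t i = 0; i < seq.size(); ++i)          // Substitute
        for (const std::string &x : PREPS) {
            if (x == seq[i]) continue;               // skip no-op substitution
            std::vector<std::string> t = seq;
            t[i] = x;
            out.push_back(join(t));
        }
    return out;
}
```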
Remember that you also need to add the original query to the candidate set. Once we have the candidate set for the input query, we can use the slim version of Web 1T to obtain the frequency of each extended query. Your program should then list the top-10 extended queries in descending order of frequency.
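Building on the editb sketch above, the following sketch composes full candidates for a query that contains prepositions: non-preposition words stay fixed, each maximal preposition sequence ranges over its EDITb set (the Cartesian product), and the original query is appended at the end. Here isPrep is an assumed membership test (e.g. against a std::set built from PREPS), and deduplication is our own choice to avoid counting the same extended query twice:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// From the previous sketch:
std::vector<std::string> tokenize(const std::string &q);
std::vector<std::string> editb(const std::vector<std::string> &seq);
// Assumed membership test against the 20 prepositions (e.g. via a std::set).
bool isPrep(const std::string &w);

// Candidates for a query that contains prepositions.
std::vector<std::string> candidatesWithPreps(const std::string &query) {
    std::vector<std::string> toks = tokenize(query);
    std::vector<std::string> partials = {""};        // growing prefixes
    for (size_t i = 0; i < toks.size(); ) {
        std::vector<std::string> options;
        if (isPrep(toks[i])) {
            size_t j = i;                            // maximal preposition run
            while (j < toks.size() && isPrep(toks[j])) ++j;
            options = editb(std::vector<std::string>(toks.begin() + i,
                                                     toks.begin() + j));
            i = j;
        } else {
            options.push_back(toks[i++]);            // fixed word
        }
        std::vector<std::string> next;               // Cartesian-product step
        for (const std::string &p : partials)
            for (const std::string &o : options) {
                if (o.empty()) next.push_back(p);    // sequence fully deleted
                else next.push_back(p.empty() ? o : p + " " + o);
            }
        partials.swap(next);
    }
    partials.push_back(query);                       // keep the original query
    std::sort(partials.begin(), partials.end());     // our choice: dedupe
    partials.erase(std::unique(partials.begin(), partials.end()),
                   partials.end());
    return partials;
}
```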
Input/output formats
A typical example of an input file is as follows:
have difficulty finding
angry at me
pleased at me
worry for cancer
The output file corresponding to the previous input file is shown next:
query: have difficulty finding
output: 3
have difficulty finding 30918
have difficulty in finding 4636
to have difficulty finding 1174
query: angry at me
output: 2
angry with me 60929
angry at me 24354
query: pleased at me
output: 3
pleased me 40402
pleased with me 10015
pleased for me 1067
query: worry for cancer
output: 1
worry about cancer 2120
Each output entry and its frequency are separated by a tab.
Requirements & suggestions
- For efficiency, the corpus should be stored in a hash table (see the sketch after this list). However, you are welcome to use any other structure for the project, provided it is efficient enough for real-time response.
- More about Google's Web 1T dataset:
- You can safely assume the given query has no more than 100 characters.
- The number of queries in the hidden test set is also 1000.
- The extended queries should be listed in descending order of frequency. Ties are broken by alphabetical order (a.k.a. dictionary order) of the extended queries; the sketch after this list includes a comparator implementing this order.
- The test set contains many queries. You will be awarded partial credit proportional to the number of queries answered correctly.
- The correctness of your answer accounts for 75% of the final score; the other 25% is based on the speed of your program. In other words, the score can be expressed as 0.75*A + 25*(1-rankRatio)*(A/100), where A is the correctness score (between 0 and 100, inclusive) and rankRatio is your speed rank divided by the number of students. (Note that the formula will be adjusted if something unexpected happens that destroys the fairness of the original formula.)
- As usual...
- Your program should take the input from the standard input and send the output to the standard output.
- You can use any STL libraries.
- Remember to use "diff" to check if your output is the same as our test dataset.
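As suggested in the list above, a hash table gives expected constant-time lookups. Below is a minimal loading-and-ordering sketch; it assumes each corpus line stores an n-gram and its frequency separated by a tab, which you should verify against the provided files:

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>

// Load one corpus file into the hash table. Assumes each line is
// "<n-gram>\t<frequency>"; check this against the provided corpora.
void loadCorpus(const std::string &path,
                std::unordered_map<std::string, long long> &freq) {
    std::ifstream fin(path);
    std::string line;
    while (std::getline(fin, line)) {
        size_t tab = line.find('\t');
        if (tab == std::string::npos) continue;
        freq[line.substr(0, tab)] = std::stoll(line.substr(tab + 1));
    }
}

// Output ordering: descending frequency, then ascending dictionary
// order of the extended query on ties.
bool outputOrder(const std::pair<std::string, long long> &a,
                 const std::pair<std::string, long long> &b) {
    if (a.second != b.second) return a.second > b.second;
    return a.first < b.first;
}
```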
Slim corpora from Google's Web 1T dataset
The original Google Web 1T dataset is too big to process on machines with limited memory. Hence we shall use a subset for this project, as shown next. These files are already available under /tmp2/dsa2016_project/ on linux1 to linux13.
Open test sets
Submission guidelines
- Please submit your code with GitHub just like homework, as described in the GitHub submission guide for DSA homework. You don't need to set any tag name for final project submission.
- Please also follow Yen-Chieh's instructions and his post on FB.
- Your directory structure (under GitHub repository) should be:
- project/*.{cpp, h}: Your source code.
- project/*.sh: Your shell scripts.
- You have to make sure your code can be compiled and run on the CSIE R217 Linux machines.