DSA Final Project (2016 Spring)

Roger Jang


Due date: 2016/06/30 23:59:59

English Preposition Usage Checker

Outlines

Slides & recordings

Before proceed to problem definition, you might want to see my recording first. (Note that I've updated some of the slides for clarification after the recording.)

Problem definition

Sometimes it is hard for a non-native English speaker to determine the right preposition in an English sentence. To emphasized problem solving for real-world problems, in this project, we shall create an English preposition usage checker based on a slim version of Google's Web 1T dataset.

For simplicity, we shall only consider the most common 20 prepositions, including of, to, in, for, with, on, at, by, from, up, about, than, after, before, down, between, under, since, without, near. See here for more details.

The general procedure for the task of preposition usage checker involves the following steps:

  1. Generate a candidate set of extended queries based on the given query.
  2. Compute the frequency of each element in the candidate set.
  3. List top-10 candidates (with nonzero frequencies) based on descending order of the frequency. (If the size of the candidate set is less than 10, then list all of them.)
We shall generate the candidate set based on the following two rules:
  1. If the input query does not contain any preposition, such as "go listen music", then EDITa set is defined by insertion only:
    1. Insert: "x go listen music", "go x listen music", "go listen x music", and "go listen music x", where x is an inserted preposition.
    Then the candidate set is defined by EDITa(EDITa(query))+EDITa(query)+query.
  2. If the input query does contain prepositions, such as "log in to check with", then we shall identify preposition sequences first. In the previous example, the preposition sequences are "in to" and "with". For a preposition sequence such as "in to", the EDITb set can be defined as follows:
    1. Insert: "x in to", "in x to", "in to x"
    2. Delete: "in", "to"
    3. substitute: "in x", "x to"
    where x is one of the 20 aforementioned prepositions. Given the EDITb sets for all preposition sequences in a query, then the candidate set is defined as the Cartesian product of all the EDITb sets together with the original non-preposition words. (If you do not know what Cartesian product is, please check out my recording.)

Remember that you also need to add the original query to the candidate set. Once we have the candidate set for the input query, we can use the slim version of Web 1T to obtain the corresponding frequency of each element of extended query. Our program should list the top-10 extended queries based on descending order of their frequencies.

Input/output formats

An typical example of input file is as follows:

have difficulty finding angry at me pleased at me worry for cancer

The output file corresponding to the previous input file is shown next:

query: have difficulty finding output: 3 have difficulty finding 30918 have difficulty in finding 4636 to have difficulty finding 1174 query: angry at me output: 2 angry with me 60929 angry at me 24354 query: pleased at me output: 3 pleased me 40402 pleased with me 10015 pleased for me 1067 query: worry for cancer output: 1 worry about cancer 2120 Each output entry and its frequency are separated by a tab.

Requirements & suggestions

Slim corpora from Google's Web 1T dataset

The original Google Web 1T dataset is too big to process for machines with limited memory. Hence we shall use a subset for this project, as shown next. These files are already available under /tmp2/dsa2016_project/ of linux1 to linux13.

Open test sets

Submission guidelines