For this assignment, you will use your knowledge of arrays, lists, sets, and strings to determine which two sentences are the most similar out of a collection of sentences. You will do this by determining the Jaccard similarity coefficient (another name for the Jaccard index) for each pair of sentences out of a collection of sentences. To do this, you will need to construct a "shingle set" for each sentence. Details of shingle set construction and computing the Jaccard coefficients are provided below. A suggested implementation outline is also suggested below.
Your program must accept the input file name as a command line parameter. You will not know the name of the file in advance, so you cannot hard code the file name. You must get it from the args[0] input parameter to the main() method.
The input file will be a text file that has one sentence on each line. In general, the sentences will have letters, numbers, whitespace, punctuation, and special characters. The number of sentences will vary and you will not know in advance how many there will be, so you cannot hard code how many sentences to try to read into your program. If you use a scanner to read in the lines, use your scanner's "hasNextLine()" method to determine if there is a next sentence to read before attempting to read it.
For each sentence, you must create a "shingle set" for it. This will be a set, so no duplicates will be allowed. This set will contain all unique 2-character combinations that appear in the sentence, including whitespace, punctuation, etc. You can get these combinations by marching from the start of the sentence (represented as a string) to the next-to-last character in the sentence, using the String class "substring( i, i+2)" instance method to extract each 2-character substring. You will need to make certain that your shingle set does not contain any duplicates. Because these substrings overlap each other, they are called "shingles". For example, for the sentence "Time flies.", the shingles are: { "Ti", "im", "me", "e ", " f", "fl", "ie", "es", "s." }. Notice that capitalized letters, spaces, punctuation, etc., are included.
The Jaccard index measures the degree of overlap of two finite sets. For two shingle sets (representing two sentences), the Jaccard index is simply the number of unique shingles they have in common, divided by the total number of unique shingles across both shingle sets. Mathematically, this is expressed in the equation:
Now, the values of J( A, B ) will range from a low of 0.0 (the sets are entirely different) to 1.0 (they are the same sets).
For example, consider the sets A = { "ab", "bc", "cd", "de" } and B = { "ab", "bc", "cd", "fe", "ef", "fg" }. Then, the number of shingles in their intersection is 3, since they have only 3 shingles in common: { "ab", "bc", "cd" }. Also, the number of unique shingles in their union is 7, since these are the unique elements across both sets: { "ab", "bc", "cd", "de", "fe", "ef", "fg" }. As a result, the Jaccard index for these two sets is:
J( A, B ) = 3 / 7 = .4288 (rounded to 4 decimal places).
You can find out more about the Jaccard Index by reading the associated Wikipedia page and other online sources on the topic.
The program must present output as shown in the following sample. The sections of the output are discussed separately below. See image.
(1) Input sentences: The program should output the heading "Input Sentences:" as shown, and then output each sentence, one on each line. The sentences should be numbered, starting with 0 (zero). It does not matter how long an input sentence is -- it should be reported on a single line of output.
(2) Shingle sets: The program should output the heading "Sorted Shingle Arrays:" as shown, and then output each shingle set, one on each line. The shingle sets should be numbered, starting with 0 (zero). It does not matter how long a shingle set is -- it should be reported on a single line of output. The individual shingles should be separate from one another by one space character. PLEASE NOTE: space characters in the sentence (and hence in the shingles) will make some of the shingles appear to be unevenly spaced, but that is not a program error. Thus, in shingle set 0, the first few shingles are " c", " h", and " i" (note the spaces that are part of each of these shingles).
(3) Jaccard similarity matrix:
The program should report the similarity values in matrix form, as shown, for each possible combination of two input sentences. Although not shown, the rows and columns each run from 0 to the largest sentence number, so that the matrix element in the row i and column j represents the similarity of sentences i and j.
The numerical values must be shown as decimal real numbers (doubles) shown to 4 decimal places. The entries on each row should be separated from each other by a single space. PLEASE NOTE: the value of the matrix element in row j column i should be the same as in row i column j. Also, the matrix elements along the main diagonal should be 1.0000, as shown.
(4) Most similar sentences:
The program should report the two sentences whose Jaccard index has the highest value, using the format shown. The sentence numbers for these two sentences should be reported, but it does not matter which is reported first. The Jaccard index for the two sentences must also be reported, as shown.
The following is a suggested implementation outline. This program will require multiple steps, and if you are having difficulty in breaking down the problem into manageable smaller pieces that can be developed separately, you may wish to follow this outline. You do not need to follow this outline -- you are free to implement the program in any manner you wish.
The following outline is actually the set of in-line code comments in your instructor's implementation of this assignment. Along with each numbered step, you will see in brackets the number of source lines of code ("SLOC"). The SLOC values given are not requirements for your implementation. They are there merely to give you an idea of approximately how much programming effort may be required for each step. Your implementation may be longer or shorter, and that's fine.
Programming hints are also provided
Step 1 [ 3 SLOC ]: Open a scanner to the file that is specified by the first (and only) command-line input parameter
Step 2 [ 4 SLOC ]: Read into a list as many sentences as there are in the file (Hint: if you use a scanner, use the hasNextLine() method to control a while loop to read in all the sentences)
Step 3 [ 4 SLOC ]: Output the sentences in order, numbering them starting with zero, one sentence per line of output
Step 4: [ 1 SLOC ]: Make an array for sorted shingle arrays (Hint: use a 2-D array of Strings)
Step 5 [ 9 SLOC ]: Write a separate private static method to convert a sentence into a set of shingles, using a fixed shingle size of 2, then convert the set into an array, sort it in lexicographic order, and return the sorted array (Hint: use a HashSet to ensure uniqueness, but you will need to convert it into an array to sort it; do that, then return the sorted array).
Step 6 [ 3 SLOC ]: Use the method developed in Step 5 to convert each sentence into a sorted array of shingles and place it into the sorted shingle array from Step 4
Step 7 [ 7 SLOC ]: Write a separate private static method to output the contents of a sorted shingle array on a single line, with a single space between the elements
Step 8 [ 4 SLOC ]: Output the shingle arrays in order, numbering them starting with zero, using one row of output for each array (Hint: use the method developed in Step 7)
Step 9 [ 1 SLOC ]: Create a 2-dimensional array of doubles to store the Jaccard coefficients
Step 10 [ 10 SLOC ]: Write a separate private static method to compute the size of the INTERSECTION of two arrays of strings (Hint: use the asList() method to convert one of the arrays to a List, and then use the list’s contains() method to check if that list contains each of the elements of the other array, counting as you go)
Step 11 [ 12 SLOC ]: Write a separate private static method to compute the size of the UNION of two arrays of strings (Hint: use ArrayList() to create a new list and put every element from one of the input arrays into it (do not use toList()); then, just add the elements from the other array that are not already in the list; return the size of the list when finished)
Step 12 [9 SLOC]: For each unique pair of sentences, compute the Jaccard coefficient using the methods developed in Steps 12 and 13, and store it in the appropriate locations in the Jaccard array
Step 13 [ 7 SLOC ]: Output the Jaccard similarity matrix, showing all values to 4 decimal places and separating the entries on a row by a single space
Step 14: Now declare some static class variables [ 3 SLOC ] and add some lines in step 12 [ 5 SLOC ] to keep track of the best matching sentences and report [ 5 SLOC ] which sentences they are and their Jaccard similarity coefficient