As described in the handout for Part 1, the overall aim of the assignment is to develop a program to build a lexicon and to find the words that match certain patterns.
Whereas Part 1 was concerned only with the correctness of the program, Part 2 is concerned with its efficiency. More specifically, you are required to do the tasks described below. For any information not given in these tasks, please refer to Part 1 of the Assignment.
Write a Java program called WordMatch.java. This program takes four command-line arguments. For example:
java WordMatch in1.txt out1.txt in2.txt out2.txt
1. The first is the name of a text file that contains the names of the text files from which the words are to be read to build the lexicon. (This argument specifies the input files.)
2. The second is the name of the text file to which the words in the lexicon are to be written. (This argument specifies the file that will contain the words and their neighbors in the lexicon.)
3. The third is the name of a text file that contains a number of matching patterns, one per line. (This argument provides the matching patterns.)
4. The fourth is the name of the text file to which the results of matching the given patterns are to be written. (This argument specifies the file that will contain the results.)
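The four arguments above could be validated and captured along the following lines. This is only a minimal sketch: the class name WordMatchArgs and its field names are hypothetical and are not prescribed by the assignment.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of the assignment spec) that holds the
// four file names passed to WordMatch on the command line.
final class WordMatchArgs {
    final Path inputList;   // argument 1: file listing the input text files
    final Path lexiconOut;  // argument 2: file the lexicon is written to
    final Path patternsIn;  // argument 3: file of matching patterns, one per line
    final Path matchesOut;  // argument 4: file the matching results are written to

    WordMatchArgs(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "Usage: java WordMatch in1.txt out1.txt in2.txt out2.txt");
        }
        inputList = Paths.get(args[0]);
        lexiconOut = Paths.get(args[1]);
        patternsIn = Paths.get(args[2]);
        matchesOut = Paths.get(args[3]);
    }
}
```

Rejecting a wrong argument count up front keeps the rest of the program free of checks for missing file names.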
For this version, the efficiency with which the program performs various operations is a major concern.
For example, the files read in can be quite long and the lexicon of words can grow to be quite lengthy. Time to insert the words will be critical here and you will need to carefully consider which algorithms and data structures you use.
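One way to compare candidate data structures is to time the insertion of the same words into each. The sketch below is an illustration only: the class InsertTiming, its method names, and the synthetic word list are assumptions, and java.util.TreeSet appears purely as a point of comparison (check Part 1 for which library classes you are permitted to use).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical timing harness; none of these names come from the assignment.
final class InsertTiming {
    /** Runs the task once and returns the elapsed time in nanoseconds. */
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    /** Keeps an ArrayList sorted; each insert may shift O(n) elements. */
    static List<String> insertIntoSortedList(List<String> words) {
        List<String> sorted = new ArrayList<>();
        for (String w : words) {
            int pos = Collections.binarySearch(sorted, w);
            if (pos < 0) {               // not present yet: insert at the right place
                sorted.add(-pos - 1, w);
            }
        }
        return sorted;
    }

    /** Inserts into a balanced search tree; each insert costs O(log n). */
    static TreeSet<String> insertIntoTree(List<String> words) {
        return new TreeSet<>(words);
    }

    public static void main(String[] args) {
        // Synthetic, distinct words in scrambled order (an assumption for the demo).
        int n = 50_000;
        List<String> words = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            words.add("w" + (i * 7919L % n));
        }
        long listNs = timeNanos(() -> insertIntoSortedList(words));
        long treeNs = timeNanos(() -> insertIntoTree(words));
        System.out.println("sorted list: " + listNs / 1_000_000 + " ms");
        System.out.println("tree set:    " + treeNs / 1_000_000 + " ms");
    }
}
```

A single run like this is noisy because of JIT warm-up and garbage collection, so repeating each measurement several times and over several input sizes gives more reliable figures.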
You can use any text files for input to this program. A good source of long text files is Project Gutenberg (www.gutenberg.com), which aims to put into electronic form older literary works that are in the public domain. The extract from Jane Austen's book Pride and Prejudice used as the sample text file above was sourced from this web site. You should choose files of lengths suitable for providing good information about the efficiency of your program.
A selection of test files has been posted on LMS for your efficiency testing. You can consider additional test files if you wish.
As expected, the definition of a word, the content of a query's result, and the display of this result are exactly the same as described in Part 1.
Write a report about the WordMatch program and the classes that support it. The report has the following sections:
Consider B-trees of order M. Assume that we have the following result, which we will refer to as Lemma 1.
Lemma 1: The barest B-tree of height H contains $N = 2K^{H} - 1$ elements, where $K = \lceil M / 2 \rceil$.
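To get a feel for Lemma 1, the sketch below evaluates the formula for a small B-tree of order 5. It assumes that K is M / 2 rounded up and that a tree consisting of the root alone has height 0. The class and method names are hypothetical.

```java
// Hypothetical demo of Lemma 1; the names are illustrative only.
final class Lemma1Demo {
    /** Number of elements in the barest B-tree of the given order and height: 2 * K^H - 1. */
    static long barestElements(int order, int height) {
        long k = (order + 1) / 2;        // M / 2 rounded up, for positive M
        long power = 1;
        for (int i = 0; i < height; i++) {
            power *= k;                  // builds K^H by repeated multiplication
        }
        return 2 * power - 1;
    }

    public static void main(String[] args) {
        // Order 5 gives K = 3, so heights 0..3 yield 1, 5, 17, 53 elements.
        for (int h = 0; h <= 3; h++) {
            System.out.println("H=" + h + " N=" + barestElements(5, h));
        }
    }
}
```

For example, at height 1 the root holds one element and has two children, each holding K - 1 = 2 elements, which gives the 5 elements the formula predicts.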
Determine the upper bound on the height of a B-tree of order 21 that has 1,000,000 = $10^{6}$ elements.
You must give an integer value as the upper bound on the height of the B-tree.
You are not allowed to use the result given in the lecture regarding the upper bound for a B-tree's height. Instead, you must work out the answer using Lemma 1 above.