In many situations, we may need to nd a word based on one or more letters in it. For example, when making or completing a crossword you may want to nd a word that has 4 letters, starts with J and ends with A. In this assignment, the program you will write will create a lexicon of words from various sources (that is, from various text les) and will allow the user to search for words that t particular patterns.
The operational denition of what is taken to be a word will be given later in this handout. For each word, we also keep the following information:
The assignment is divided into two parts. For part 1, you are required to write a simple version, where execution efciency is not taken into account. For part 2, you are required to write an optimized version where efciency, in addition to correctness, is the major concern.
With the exceptions of ArrayList and LinkedList, you are not permitted to use any of the classes in the Java Collections Framework, e.g. TreeMap, HashMap.
For this task, you are to design and implement the program LexiconTester.java. This program will
Further information about the task is given below.
The rst text le to be read in is the le sample1-pp.txt shown below:
Pride and Prejudice by Jane Austen Chapter 1 It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters. "My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?" Mr. Bennet replied that he had not. "But it is," returned she; "for Mrs. Long has just been here, and she told me all about it." Mr. Bennet made no answer. "Do you not want to know who has taken it?" cried his wife impatiently. "YOU want to tell me, and I have no objection to hearing it." This was invitation enough.
The second text le to be read in (and its words added to the lexicon) is the le sample2-zoo.txt below:
You can see zebras and yaks at the zoo. Yes, zebras and yaks and wombats too. You will also meet a fat cat, wearing a hat, sitting on a mat, reading a map while drinking from a cup.
A word is a sequence of letters. Words will be treated in a case-insensitive manner. Numbers and punctuation must be removed. More specically,
The words from the created lexicon are to be written to le sample3-wordlist.txt. These words must be in sorted in alphabetical order.
The information on each word, which is also written to the text le, includes the frequency and the list of its neighbors. The list of neighbors must also be in alphabetical order.
The format must be as shown in the example below for a number of words.
a 11 [i]
about 1 []
acknowledged 1 []
all 1 []
also 1 []
and 6 []
answer 1 []
at 2 [it]
austen 1 []
be 2 [by, he, me]
been 1 []
bennet 3 []
but 1 []
by 1 [be, my]
can 1 [cat, man]
cat 1 [can, fat, hat, mat]
chapter 1 []
considered 1 []
cried 1 []
cup 1 []
For each word, rst the spelling of the word is displayed, followed by the frequency, and then followed by the words neighbors (in alphabetical order, inside a pair of square brackets).
For this task, you are to write the program WordMatchTester.java. This program will
Further information about the task is given below.
The patterns to match the words in the lexicon are to be included (hard-coded) as strings in the test program. A pattern consists of a sequence of letters, with no spaces within the sequence. In addition, there are two wildcard characters allowed. A ? symbol in the pattern can match with any one character in a word, while a * symbol can match with any number of characters in a word (zero or more).
For each pattern, the pattern and the words that match the pattern are to be displayed on the screen and write to the text le. Each matching word, together with its frequency, appears on a separate line. The words are in alphabetical order.
For example, the pattern:
ma?
may result in the words below being displayed to the screen
man 2
map 1
mat 1
may 1
The pattern:
?o?
may result in the following words being displayed. Note the words are displayed in lexicographic order.
for 1
not 2
too 1
you 5
zoo 1
The pattern:
Mr*
may result in the output:
mr 3
mrs 1
The pattern:
i*
may result in the output:
i 1
impatiently 1
in 3
invitation 1
is 5
it 5
The pattern
?ear*
may result in the output:
dear 1
heard 1
hearing 1
wearing 1
Any pattern that has no matches will result in the message
No words in the lexicon match the pattern
output to the screen and written to the le.
The patterns should be designed for a comprehensive testing of the correctness of what you have implemented. As a suggestion, you should include at least 10 test cases (i.e. patterns) and the reasons for including them must be documented in the program.