Web search engines are programs that search webpages for specified keywords turns a list of relevant webpages where the keywords were found. The tu in tasks for search engines are matching and ranking. The matching phase determines which webpages match a query and the ranking phase determines the relevance of each of the matches. Note that search queries can produce thousands or millions of hits (matches) and a search engine must be capable of picking the most relevant results (ranking).
The concept of an index is the most fundamental idea behind any search engine The index for a search engine works the same way as a book's index. The "pages" of the book are now webpages and search engines assign a different page number to every single page on the web. The search engine builds up an index by first making a list of all the words that appear in any page, making a note of the page number in which each word appears, and then sorts that list in alphabetical order. So if a query is made on the word "computer", the search engine can quickly jump to the entry for "computer in the word list and extract the pages for that entry.
Rather than build an index for webpages, this lab will build an index, or a crossreference listing, for text-based documents. The lab will display the number of times each word appears in the document, the line numbers in which each word appears and the position of each word in the line. The lab will also allow us to query the document for specific words and will display the number of times the word appears in the document, the line numbers in which the word appears and the position of the word in the line.
At the outset we do not know how many different words are in the text, nor do we know in how many different lines a word may appear. Our approach will be to use an object binary search tree whose nodes provide for the following fields:
Note that while the ObjectTreeNode class is built to hold a generic Object, you'll be creating a Word class whose objects will be placed in the ObjectTreeNodes of the object binary search tree.
Each Word object should provide for the following fields:
Here is a listing of the classes and interfaces that you will want to include in your project:
Xret ObjectBinarytree ObjectList
Word ObjectBinaryTreeInterface ObjectListInterface
LinePosition ObjectTreeNode ObjectListNode
Hash ObjectTreeNodeInterface ObjectListNodeInterface
Query TreeComparable Comparable
Specifically, your program should perform the following operations:
our 2 1-7 11-10
people 3 20-11 21-1 21-4
perish 1 21-7
dedicate men resolve vain
devotion not soldier war
gave people us
For each word searched, output the word, the number of times the word appears in the text, the line numbers in which the word appears and the position the word appears on each line.
To make this cross-reference table as simple and as useful as possible, we will not include all the words of the text in the index. The following words should be omitted:
a have that
after in the
all zis their
and it there
because it's to
every now was
for of were
from on which
had so with
To accomplish this task, we shall store each of the above words in a hash table of size 37 that resolves collisions by linear probing or chaining. After the hash table is created, we can select each word from the text, get its hash, and determine whether or not it appears in the hash table. If the word does appear in the hash table, we should simply ignore it: otherwise, we should store the word in ne binary search tree, incrementing the counter which keeps track of the number of times the word appears, and attaching a new node to the linear linked list which contains the line number in which the word appears along with its position within the line. Note that the line numbers stored in the linear linked list should be kept in increasing order for eventual output.
You may place the 27 unimportant words into a text file, read the words into your gram, and hash the words into their proper location within the hash table.
Note that these 27 unimportant words placed into the hash table are not the same as the Word objects that will be placed into the binary search tree!
Note that the hash table should be created before reading any data from the getty.txt file. The hash function signature should be as follows:
private int getHash (String s);
If you are using linear probing to resolve collisions, you should create a hash function that produces ten or fewer collisions and you should output the number of hash function collisions as well as the number of resolution collisions. The total of these two types of collisions should be ten or less. If you are using chaining to resolve collisions, your hash function should return five or fewer collisions and you should output the total number of collisions, the average chain size to two decimal places, as well as the maximum chain size.
Be sure to create your own Hash class rather than use the Hash class built into the Java library
For extra credit, develop a perfect hash function. For extra, extra credit, develop a minimal perfect hash function.
The following should be added to your program's output:
Be sure all output is sent to the terminal window as well as the csis.txt output file. Shown below are the contents of getty.txt.
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth.
I've provided a document on the use of Javadoc that can be found in this link on Blackboard:
Docs | Java Review | Javadoc Program Documentation
In particular, note the following guidelines:
Each data structure in your project must implement an interface for that data re. The interface files must be included in the zip archive for your code.