Have you ever wondered ... how might I compare two websites that purport to present the same information? Or, what does the twitter feed history of a user, reveal about that user's motives?
For this programming assignment, you will implement a binary search tree (BST), whose nodes' values are the words in a plain text file that youll pre-process. Your program will then provide simple analysis capabilities of that BST, and hence an indirect analysis of the plain text data. Hint: Refer to chapter 17 of the Building Java Programs textbook for pseudocode (and in some cases actual code!) that you can - and should use. Also refer to lab 4, in which you built a BST.
The program is named AnalyzeTextStream, in the file AnalyzeTextStream.java.
The program is invoked via the command line, and requires at minimum 2 arguments. There are primarily three ways that a user can interact with the program.
The first argument is the name of a file which contains plain text (in other words, the file is not a .docx, .doc, etc. file). The plain text file should be in the same directory as your java code. A sample file is shown in Figure 1 left.
The program reads the plain text file, and builds a BST based on the words in the text file. The "less than" and greater than condition is based on lexicographical ordering (dictionary). Capitalization does NOT count - meaning, that a and A are considered equivalent and non-letter characters such as punctuation and digits are ignored. Dashes are considered the same as empty spaces. The order that each word is/was encountered in the file is also retained in each node; thus duplicate entries are NOT allowed as unique nodes in the tree, but duplicate entries are tallied (see program specifics, for details about the Node class). The BST in Figure 1 right, shows the tree that would be built from ASample.txt, where each subscript for a word specifies when in the file that word is/was encountered. For example, the word the is the 1st and 8th word in the file, hence the tree shows 1,8 for the node with the value the.
(a) sample input file
ASample.txt
The dog door was left
ajar and the 2 dogs ran
left and up!
(b) BST built when processing sample file shown in (a). The key (we've called it the value) of each node is the word, and the subscript is the placements(s) of that word in the file.
Figure 1: Sample input text file (a) and corresponding BST (b) see image.
If the second argument is the word ANALYZE, your program should perform an in- order traversal of the BST, and identify the three most frequently found words, each of which must have been encountered at least twice.
If the second (and any subsequent) argument(s), is/are words other than ANA- LYZE, then your program should output details about the nodes in the BST that corre- spond to the words(s) supplied. For each node value specified, the statistics include the following:
Three sample invocations are shown in Figure 2.
Figure 2: Sample Invocations
Invocation | Output |
java AnalyzeTextStream ASample.txt ANALYZE | and:2:7,12 left:2:5,11 the:2:1,8 |
java AnalyzeTextStream ASample.txt doors | doors:0 |
java AnalyzeTextStream ASample.txt was ran the cat | was:1:4 ran:1:10 the:2:1,8 cat:0 |
The output of the first invocation specifies that the words the, and and left are the most frequently-found words in the file, and that each of them appeared twice, and the specified locations.
The output of the second invocation specifies that the word doors appears zero times. When a word appears zero times, the placement of that word in the file (the number(s) after the second colon, should be left blank.
The third invocations indicates that the word was appeared once at position 4, the word ran appeared once at position 10, the word the appearsed twice at positions 1 and 8, and the word cat appeared zero times.
Error catching is NOT required. You do not have to check if a user specifies an incorrect number of command line arguments. Assume that the program is always invoked correctly.
The program implements the Node data structure shown in Figure 3, where al specifies an array list of integers. The array list's integers designate the order that the val in the Node was encountered in the text file. The size of the array list designates how many instances of the value were processed for insertion into the BST, and the integer values of the array list specify the order of insertion. For example, for the first sample invocation in Figure 2, the array list in the node with value the in the BST should contain 2 items : {1, 8} because the word the was the 1st and 8th word in the file.
Figure 3: The Node data structure
string val ArrayList al | Node left |
Node right |
Your BST must built using Node objects. Refer to the lecture notes, and refer to Lab 4 to get started.
You cannot use any other data structure other than a BST (made up of Node objects) for holding and tallying the integers that you process on-the-fly.
Each time that your program is invoked, the plain text input file can be processed (read, line-by-line), only once. For example, when invoked via the following:
java AnalyzeTextStream someFile.txt dog cat hello
your program should read the file someFile.txt only once, build a single BST, and query it three times, once each for the word dog, cat, and hello.
The functions to add a node, and to traverse your BST, must be the recursive forms.
Do NOT use (cut-and-paste) external source code. You may refer to textbooks, online resources, and/or chat with your peers, etc., but follow the Simpson's rule (refer to the course website). Plagiarism software will be used to analyze your code.