The objective of this programming assignment is to practice implementing hash functions and hash tables.
Hash collisions are bad, right? Well, maybe not always. For this project you are going to use a hash table to quickly identify anagrams; to do this, you need a hash function that collides when the two inputs are anagrams (good collisions) but doesn't collide too often when the inputs are not anagrams (bad collisions). For example, if the hash function is H, then it must be that H("parsley") and H(players) are equal, but it would be good if H(earned) and H(canters) are not.
Anagramming is slightly more subtle than it might first appear. In particular, capitalization and spaces do not count when determining if two words or phrases are anagrams. For example, the following should all have the same value:
H("Sesame Street") H("Tesserae Stem") H("Teases Me Set")
Think of it this way: if you convert each of the phrases to upper case and remove the spaces, they would comprise exactly the same letters; thus they are anagrams.
Where would one even begin to find such a hash function? Start with the textbook. An additive hash code has the property that simple anagrams ("pots", stop, tops) hash to the same value.
The hash table is an array of sets of strings (set< string> *m_ht) with the table m_ht allocated dynamically by the constructor. A location in the hash table, say m_ht[h] contains strings which are all anagrams of each other and have hash value h.
What if a bad collision occurs: there is some string S with hash value h but S is not an anagram of the entries in mt_h? Then we choose a different location at which to save S using linear probing.
As a small example, suppose we insert the strings "canters", scanter, and replays, all of which hash to three, and that our hash table has size six. First we insert canters by computing it's hash (3) and, checking that the set at index three is empty, insert it into the set:
m_ht[0] (empty) m_ht[1] (empty) m_ht[2] (empty) m_ht[3] { "canters" } m_ht[4] (empty) m_ht[5] (empty)
Now, to insert "scanter", we compute it's hash (3), but we see that the set at index three is non-empty. We have to check that scanter is actually an anagram of an element of the set using a secondary test. A secondary test is an exact but more computationally expensive test that we use when hashing alone can't give us a definite answer: we can't be sure that the two strings are anagrams even though they have the same hash value since it could just be a bad collision. It is up to you how to implement your secondary test. With this example, we can see that canters and scanter are anagrams, so we just insert scanter into the list at index three:
m_ht[0] (empty) m_ht[1] (empty) m_ht[2] (empty) m_ht[3] { "canters", "scanter" } m_ht[4] (empty) m_ht[5] (empty)
Finally we need to insert "replays". It also has hash value three, but our secondary test will show that it is not an anagram of canters, and so does not belong in the list at index three. We use linear probing to determine where to save the string which, in this case, is at index four:
m_ht[0] (empty) m_ht[1] (empty) m_ht[2] (empty) m_ht[3] { "canters", "scanter" } m_ht[4] { "replays" } m_ht[5] (empty)
In general, linear probing makes removal from hash tables more difficult, but for this project, we never remove anything, so we don't have to worry about that.
The end result after inserting a set of strings is a hash table in which:
1.If an entry m_ht[h] is non-empty, then it contains a set of anagrams.
2.If two strings S1 and S2 are anagrams, then they are in the same set m_ht[h] for some index h.
That is, our hash table partitions the input strings into sets of anagrams.
Your assignment is to complete the anagram hashing class Anahash. To get you started, you are provided with the following files:
First, you must design and implement a hash function that always gives equal hash values for strings that are anagrams of each other but minimizes the number of "bad" collisions. It may be easier to design and test the hash function outside of the class. Once you have a hash function, you can implement the hash tables operations (described below).
Requirement: Private helper functions must be declared in Anahash.h. No other modifications to Anahash.h are permitted!
Requirement: You must use a hash table data structure to store and organize the anagrams. The entries of the hash table must be sets of strings. The size of the hash table can be no more than HT_SIZE_LIMIT, a constant defined in Anahash.h.
Requirement: Two strings S1 and S2 are anagrams if they consist of exactly the same letters after conversion to upper case and removal of spaces. However, strings must be stored in the hash table with their orginal capitalization and spaces.
Requirement: Linear probing must be used to resolve "bad" collisions.
Requirement: Each non-empty entry m_ht[h] must contain strings which are anagrams of each other. If two strings S1 and S2 are anagrams, then they must be stored in the same entry m_ht[h] for some h.
Requirement: You may use the STL set container and the sort method of the algorithm library. No other STL containers or algorithms may be used. However, you may use standard types and function libraries such as string and cstring.
These are the member functions of the Anahash class.
Anahash();
This is the constructor for the Anahash class. It must be provided with htSize, the size of the hash table. If htSize is greater than HT_SIZE_LIMIT or less than 1, throw a range_error exception; otherwise, allocate the hash table (m_ht) and save the table size (m_htSize).
~Anahash();
Destructor for the Anahash class. All dynamically-allocated data must be deallocated.
bool insert( string line );
Insert line into the hash table. Return true if a new entry is created, and false if the string is already in the table.
bool search( string line );
Search for line in the hash table. Return true if found, false otherwise.
set< string> getAnagrams( string line );
If line is in the the hash table, return its set of anagrams (including line). If line is not in the hash table, return an emtpy set of strings.
void dump();
Dump all non-empty lists from the hash table. For each non-empty entry, print the index (hash value) h in hexadecimal followed by the strings in the m_ht[h].
Anahash.h
#ifndef _ANAHASH_H
#define _ANAHASH_H
#include< set>
#include< string>
#include< algorithm>
#include< exception>
using namespace std;
class Grader;
class Anahash {
public:
friend Grader;
// Maximum hash table size (2^16)
const unsigned int HT_SIZE_LIMIT = (1 < < 16);
// Create a Anahash object with hash table of size 'htSize'. If
// 'htSize' is greater than HT_SIZE_LIMIT or less than 1, throw a
// range_error exception.
Anahash( int htSize );
// Destructor
~Anahash();
// Insert 'line' into hash table. Return true if a new entry is
// created, false if the line is already in the table.
bool insert( string line );
// Search for 'line' in the hash table. Return true if found, false
// otherwise.
bool search( string line );
// If 'line' is in the the hash table, return its set of anagrams,
// including 'line'. If 'line' is not in the hash table, return an
// emtpy set of strings.
set< string> getAnagrams( string line );
// Dump all non-empty lists from the hash table. Print the hash
// value in hexadecimal and the strings in the list.
void dump();
private:
set< string> *m_ht; // Array of string sets
int m_htSize; // Size of hash table
// Hash function. Should hash anagrams to the same value (good
// collision) while minimizing the number of non-anagrams that hash
// to the same value (bad collisions).
static unsigned int anaHash( string line );
//********************************************************
//
// Other private helper functions should be declared here.
//
//********************************************************
};
#endif
test.cpp
#include "Anahash.h"
#include < iostream>
#include < fstream>
#include < string>
using namespace std;
int main() {
Anahash aH( 1 << 16 ); // Table of size 2^16
// Read the sample data an insert into the hash table.
string fileName = "anagrams_single.txt";
ifstream inFp( fileName );
if ( inFp.is_open() ) {
string line;
while ( getline( inFp, line ) ) {
if ( line.size() > 0 ) {
aH.insert( line );
}
}
inFp.close();
} else {
cerr << "File " << fileName << " could not be openedn";
}
// Use getAnagrams to retrieve the anagrams of "wired".
cout << "wired:n";
auto ana = aH.getAnagrams("wired");
for (auto itr = ana.begin(); itr != ana.end(); itr++)
cout << " " << *itr << endl;
cout << endl;
// Dump the contents of the hash table.
cout << "Full dump of table:n";
aH.dump();
return 0;
}