If you ever used a word processor, you must have encountered spell checking. Spell checkers are, in fact, among the earliest computer applications ever written. The first spell checkers became widely available on mainframes already in 1970s. Today, spell checkers are everywhere. Sophisticated web browsers, such as Firefox and Chrome, have spell-check support for user-written content, while Mac OS X has systemwide spell-checking. So, did you ever wonder how spell checking works?
First, a spellchecker program needs a dictionaryfile containinga listofcorrectly spelled words. A typi- cal dictionary file for an English language spell checker contains about 90,000 words (but do not worry, we will use much smaller dictionaries in this problem). The simplest spell checkers attempt to correct spelling errors in a given piece of text word-by-word, ignoring the context. Specifically, each word in the input text file (let us call it the input word) is compared against all the words in the dictionary file (called dictionary words). If the input word appears in the dictionary, it is assumed to be spelled cor- rectly. But what does a spell checker do if it is not in the dictionary? In this case, the spell checker scores each word in the dictionary against the input word; the dictionary word with the highest score is then returned as the correct spelling of the input word. Ties in this process (that is, two or more dictio- nary words that end up with the same score with respect to an input word) may be broken arbitrarily.
Given a pair {input word, dictionary word}, how does the spell checker compute the corresponding score spell-score(input word,dictionary word)? Spell checkers in use today employ sophisticated al- gorithms. However, in this problem, we will use a simplified approach described in what follows.
Given a sequence of letters X, we let |X| denote the number of letters in this sequence and call it the length of X. Thus if X = industry then |X| = 8, while if X = university then |X| = 10. Given a sequence of letters X, we say that another sequence of letters Y is a descendant of X, if Y can be obtained by deleting letters from X. For notational convenience, we say that X is also a descendant of itself. Thus, for example, the word industry has one descendant of length 8 (itself), 8 descen- dants of length 7 (ndustry, idustry, inustry, indstry, indutry, indusry, industy, industr),28descendantsoflength6, andsoon. Withthis, spell-score(X,Y) isdefinedasfollows: See image.
where descendant-score(X,Y) = length of the longest common descendant of X and Y
Forexample,thelongestcommondescendantofthewords industry and university haslength4 (in fact, there are three such common descendants isty, usty, and nsty), and therefore: See image.
Observe that spell-score(X,Y) = 1 if and only if X = Y, whereas spell-score(X,Y) = 0 if and only if the words X and Y have no letters in common. Thus spell-score(X,Y) appears to be a reason- able (normalized) measure of spelling similarity between the words X and Y.
Write a C program, called spell check.c, that reads text from an input file and a list of correctly spelled words from a dictionary file, then corrects the spelling of the text using the approach discussed above. That is, for each word X in the input file, the program should compute spell-score(X,Y) for all words Y in the dictionary file, then replace X by the dictionary word with the highest score (ties may be broken arbitrarily). Details regarding the program input and output are described below.
Theprogramshouldread itsinputfrom thefile input.txt. Thisfile consistsofasinglelinethatcon- tains one or more words and is terminated by a period. All the words contain only lowercase ASCII characters in the range [a-z] and are separated by a single blank. Here is a sample input.txt file:
this wil compil bt its behaviour is undefned by the standard.
The program should also read a dictionary from the file dictionary.dat. The first line in this file contains a single integer, which represents the number of words in the dictionary. You may assume that this number is at least 1 and at most 1024. This remaining lines contain a single dictionary word each. You may assume that all the dictionary words consist of lowercase ASCII characters in the range [a-z]. A sample dictionary.dat file is shown below:
16
be
by
is
it
but
its
not
the
this
thus
will
should
compile
standard
behavior
undefined
You do not need to check the validity of the input file or the dictionary file: you can assume that they conform to the specifications detailed above. You can furthermore assume that every word in the input file and in the dictionary file consists of at most 32 characters.
The program should writeits output to the file output.txt. It should print to output.txta single line that corresponds to the line in the input file input.txt with the spelling errors corrected.
Here is a sample output file output.txt, which results upon processing the input file input.txt and the dictionary file dictionary.dat given above:
this will compile but its behavior is undefined by the standard.