In this assignment, you will practice programming in C. Much of your code will operate on C strings, although you will also work with pointers to structs.
Your task is to write a type and a set of functions, the equivalent of a Java class, that implement a tokenizer for C language numeric constants. The tokenizer should accept a string as a command-line argument. The string will contain one or more tokens, where each token is either a floating-point constant or an integer constant in hex, decimal, or octal. The token definitions given later define what begins each kind of token and how the tokens end. White space is defined as any sequence of blank (0x20), tab (0x09), vertical tab (0x0b), form-feed (0x0c), new-line (0x0a), and carriage return (0x0d) characters. The tokenizer should return the tokens in the string one token at a time, hence the name tokenizer.
The complexity of this assignment means that you will have to plan out the behavior of your program before you write any code. You can plan out the behavior by writing a Finite State Machine for your tokenizer. A finite state machine is essentially a transition diagram drawn as a graph. The nodes in the graph are states and the arcs are transitions from one state to another. Labelled arcs imply transitions associated with the characters used as labels. The machine is in only one state at a time. The state it is in at any given time is called the current state.
There may be characters in the input that are neither part of tokens nor white space. Many of these special characters are either unprintable or have undesirable effects on program output. These special characters are called escape characters. Characters that are not part of any token or white space should be reported in error messages from your program. In your output, all escape characters (printable and otherwise) must appear in bracketed hex of the form [0xhh]. So if the command-line input string contains a start-of-text character (0x02), your error message would represent it as [0x02].
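The bracketed-hex form can be produced with a single formatted write. The following is a minimal sketch, not part of the required interface: the helper name format_escape and its buffer-size convention are assumptions made for illustration, and the use of lowercase hex digits is also an assumption, since the handout only specifies the shape [0xhh].

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (name and signature are illustrative only).
 * Writes the bracketed-hex form of c, such as "[0x02]", into buf.
 * The form is 6 characters long, so buf needs room for 7 bytes
 * including the terminating '\0'.
 * Returns 0 on success, -1 if the buffer is too small. */
int format_escape(unsigned char c, char *buf, size_t size)
{
    assert(buf != NULL);
    if (size < 7) {
        return -1;
    }
    /* %02x pads to two hex digits; lowercase digits are an assumption. */
    snprintf(buf, size, "[0x%02x]", (unsigned int)c);
    return 0;
}
```

Taking the character as unsigned char keeps bytes above 0x7f from being sign-extended into more than two hex digits.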
Your implementation needs to export the interface given in the attached tokenizer.c file. In particular, you need to define the type needed to represent a tokenizer and three functions for creating and destroying tokenizer objects and getting the next token. Note that we have only defined the minimal interface needed for external code (e.g., our testing code) to use your tokenizer. You will likely need to design and implement additional types and functions.
Tokens may be separated by white space characters. Multiple white space characters may be next to each other (see second example above), and/or at the beginning and/or end of the token string. When this happens, your tokenizer should discard all white space characters.
There are different kinds of tokens. Your program must not only break the input string into tokens but also identify the kind of each token in its output.
A decimal integer constant token is a digit (0-9) followed by any number of digits.
An octal integer constant token is a 0 followed by any number of octal digits (i.e. 0-7).
A hexadecimal integer constant token is 0x (or 0X) followed by any number of hexadecimal digits (i.e. 0-9, a-f, A-F).
A floating-point constant token follows the rules for floating-point constants in Java or C.
Your implementation must not modify the original string in any way. Further, your implementation must return each token as a C string in a character array of exactly the right length. For example, the token usr should be returned in a character array with 4 elements (the last holds the null character '\0' that marks the end of a C string).
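Both requirements can be met by copying each token into freshly allocated memory of length plus one. The following is a minimal sketch; the helper name copy_token and its parameters are assumptions for illustration and are not part of the provided interface.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (name and signature are illustrative only).
 * Returns a newly allocated copy of the len characters starting at
 * start, with a terminating '\0' appended, so the array holds exactly
 * len + 1 elements. Returns NULL if allocation fails.
 * The source string is only read; the caller must free() the result. */
char *copy_token(const char *start, size_t len)
{
    char *tok;

    assert(start != NULL);
    tok = malloc(len + 1);
    if (tok == NULL) {
        return NULL;
    }
    memcpy(tok, start, len);
    tok[len] = '\0';
    return tok;
}
```

For the token usr, len is 3 and the allocation is 4 bytes. Because the caller owns the returned array, each token must eventually be released with free().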
You may use string functions from the standard C library accessible through string.h (e.g., strlen()).
You should also implement a main() function that takes a string argument, as defined above. The string contains zero or more tokens. Your main() function should print out all the tokens in the argument string in left-to-right order. Each token should be printed on a separate line. Here is an example invocation of the tokenizer and its output.
tokenizer " 0700 1234 3.14159e-10 "
octal "0700"
decimal "1234"
float "3.14159e-10"
Keep in mind that coding style will affect your grade. Your code should be well-organized, well-commented, and designed in a modular fashion. In particular, you should design reusable functions and structures, and minimize code duplication. You should always check for errors. For example, you should always check that your program was invoked with the minimal number of arguments needed.
Your code should compile correctly (no warnings and no errors) with the -Wall flag and either the -g or the -O flag. For example,
$ gcc -Wall -g -o tokenizer tokenizer.c
should compile your code to a debuggable executable named tokenizer without producing any warnings or error messages. (Note that -O and -o are different flags.)
Your code should also be efficient in both space and time. When there are tradeoffs to be made, you need to explain what you chose to do and why.