Consider the following EBNF defining 26 token categories < id > through < comma >:
< letter > -> a | b | ... | z | A | B | ... | Z
< digit > -> 0 | 1 | ... | 9
< id > -> < letter > {< letter > | < digit >}
< int > -> {< digit >}+
< float > -> {< digit >}+ "." {< digit >}+
< floatE > -> < float > (e|E) [+|−] {< digit >}+
< add > -> +
< sub > -> −
< mul > -> *
< div > -> /
< or > -> "||"
< and > -> "&&"
< inv > -> !
< lt > -> "<"
< le > -> "<="
< gt > -> ">"
< ge > -> ">="
< eq > -> "=="
< neq > -> "!="
< assign > -> =
< LParen > -> "("
< RParen > -> ")"
< LBrace > -> "{"
< RBrace > -> "}"
< LBracket > -> "["
< RBracket > -> "]"
< semicolon > -> ";"
< comma > -> ","
< letter > and < digit > are not token categories by themselves; rather, they are auxiliary categories to assist the definitions of the tokens < id >, < int >, < float >, < floatE >.
The following is a DFA to accept the 26 token categories. see image.
The objective of this project is to implement a lexical analyzer that accepts the 26 token categories plus the following keywords, all in lowercase letters only:
if, else, while, returnVal, new, print
These keywords cannot be used as identifiers, but can be parts of identifiers, like "iff" and "delse". we assume that the identifiers and keywords are case-sensitive. The implementation should be based on the above DFA. Your lexical analyzer program should clearly separate the driver and the state-transition function so that the driver will remain invariant and only state-transition functions will change from DFA to DFA. The enumerated or integer type is suggested for representation of states.
The following keyword recognition method is adequate for this project.
1.Create 6 additional DFA states for the keywords.
2.The DFA initially accepts the keywords as identifiers.
3.Each time the DFA accepts an identifier, check if it is one of the keywords, and if so, move the DFA to the corresponding keyword state.
The lexical analyzer program is to read an input text file, extract the tokens in it, and write them out one by one on separate lines. Each token should be flagged with its category. The output should be sent to an output text file. Whenever invalid tokens are found, error messages should be printed, and the reading process should continue. To make grading efficient and uniform, the program is to read the input/output file names as external arguments to the main function.