[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Gnumed-devel] weekly journal reviews

From: Syan Tan
Subject: [Gnumed-devel] weekly journal reviews
Date: Fri, 30 Jun 2006 15:50:24 +0800

in the style of journal review sessions, where people intermittently

present what they've been reading in order to facilitate the transmission of

knowledge, I'd like to add another episode to the "how to write a regular _expression_ recognizer program",

which fits in as a useful thing to do in the lexer part of compiler composition,

if compiler composition is divided into the tasks of lexer, parser, syntax translation , code generation, code optimization,

Interestingly , the lexer needs a mini-parser, which is the regular _expression_ parser,

and it converts a regular _expression_ into a binary tree representation of the regular _expression_.

The tree can then be traversed in pre-order, to build a state transition table representing a deterministic

finite automaton, or a non-deterministic finite automaton (NFA).

The  NFA  build  procedure is quite straightforward, and is called Thompson's algorithm.

It basically says what state transitions to do for a given terminal or non-terminal node in the parse tree.

Terminal nodes have literal characters that need to be matched, e.g. (a|b)*abb  the characters are a and b.

1. For these terminal symbols, the state transition diagram is

initial-state  --- symbol ---> final-state


2. For the operation union or alternate ( | ) ,

the state transition diagram is


initial-state --> e-transition-->  node1-state-diagram --> e-transition--> final-state

 |__________> e-transition ->  node2 -state-diagram--> e-transition ____|

this shows that the union state diagram has 2 possible transitions out of the initial state,

and for the algorithm to work, the symbol for transition is the e symbol ( or empty symbol),

this makes the diagram non-deterministic , because for the one symbol (e) , there is more than one transition.


3. for the concatenate operation,  which is one symbol followed by another e.g. abb   is a concat b concat b

 the algorithm suggests getting the   Si ---symbol --> Sf      diagrams of each symbol, but have only one state

between transitions , so that the final state of a preceding state is merged with the initial state of the next state.

Actually, this may not be necessary for the algorithm to work, and maybe just inserting an e transition between

symbol state transitions would work just as well, although I haven't tested it.


4. for the kleene-star operation,  which is zero or more repititions of the single child node, the

transition diagram is


  |                                                              |
Si --->S1---> child diagram --> S2 --->   Sf  

            |                                      |

              <--------- e --------------<                                   

 the outer e-transition is the zero repetition, and the inner e -transition which forms a loop, is the 1 or more repetition.


 5. the bracket operation is not in the algorithm, and is implemented by avoiding a pop off the stack, when

building a parse tree.

In order to build the state transition table, the initial and final states of a particular node in the parse tree

can be stored as attributes of the node ;  this way,  operator nodes, can access the initial and final states

of child nodes to do the thompson algorithm state transitions.

Once the tree has been traversed pre-order, the initial state of the tree's root should be

the initial state of the NFA recognizer for the particular regular _expression_.

In order to see whether the NFA "recognizes" a string,  

the nfa algorithm  depends on finding the e-closure of the current set of states.

The e-closure set is the set of states that can be reached by doing no transition, or 1 or more

e transitions from a state.  

The algorithm starts with the initial set as being the e-closure of the root nodes initial state.

Then the current character of the string being checked is used to move from the states

of the current set of states, to another set of states, using the state transition table from

computing thompson's algorithm.

Then from this set of states,

the e-closure set is found  ( using the transition table again),

and this set is used in the processing the next character.

At the end of the string, if the e-closure set includes the final state of the root node,

then the NFA accepts the string.  Because the e-closure operation includes a no transition

from the current set of states, this repeated e-closure only makes the set grow, and

it needs to grow to include the final accepting state.


 Attached is the regular _expression_ tree parsing program modified to accept the . wildcard character,

and a nfa building program using thompson's algorithm.


The book says that the nfa accepting algorithm can be implemented using two stacks and

a boolean vector indexed by states that have been seen.  One stack has the input set of states,

which are individually popped, and a symbol move lookup done on the popped state, and

if the next state isn't marked seen in the boolean vector, is added to the other stack.

The other stack is then popped to examine for the reachable states by e-transitions for each

state on the stack, and then put on the other stack. When the string processing has finished

the bit vector can be examined to see if the root node's final state has been seen.






Attachment: nfa.h
Description: Binary data

Attachment: parsenode.h
Description: Binary data

Attachment: regex_parse1.c
Description: Binary data

Attachment: nfa.c
Description: Binary data

reply via email to

[Prev in Thread] Current Thread [Next in Thread]