Parser

Parser applies grammar rules of a formal language to an input stream of tokens. If successful, the parsing process signifies that the input indeed represents correct sequence from the language; the output of parsing can be an AST (abstract syntax tree) of the grammatical structure of language elements or the result of structured evaluation described by the sequence.

Recursive Descent

Recursive descent is a top-down algorithm consisting of mutually recursive operations, such as traversal of a tree structure or parsing.

«recdesc.ijs»=
NB. recdesc - recursive descent parser
NB. 10/30/06 Oleg Kobchenko
NB. 04/05/07 Oleg Kobchenko literate
NB. 10/09/07 Oleg Kobchenko consistent with recdesc2

«grammar»

«lexer utilities»

NB. =========================================================
NB. common single class
coclass 'pcalc'

NB. tokens
'S_SPACE S_NUM S_ADD S_SUB S_MUL S_DIV S_LPAR S_RPAR S_ERR'=: i. 9
S_EOF=: _1

destroy=: codestroy

NB. =========================================================
NB. Regex Lexer, Token Pump and Accessors

«lexer»

«parser class»

«token pump»

«grammar rules»

«runner»

«public interface»

«test»

Simple Calculator

As a simple example, we will describe the language of an arithmetic calculator with traditional precedence rules:

«grammar»=
NB. The following grammar rules are used:
NB.
NB. factor ::= ('+'|'-')* ( num | '(' expr ')' )
NB. term   ::= factor ( ('*'|'%') factor )*
NB. expr   ::= term ( ('+'|'-') term )*
NB. line   ::= expr EOF

such that it will be able to evaluate the following expressions

«accept test»=
NB. accept
calc '11+22'
calc '5+2*10'
calc '(5+2)*10'
calc '(11+22)/-(3.0*2/2)'
calc '(11+22)*+(-1-2)'

and raise errors on such expressions

«reject test»=
NB. reject
calc '22+3/'
calc '22+3/(1+)'
calc '1+abc/2'

Lexer

Using Regex Lexer.

«lexer utilities»=
NB. =========================================================
NB. regex lexer utilitites
require 'regex'
coclass 'z'

lxdefine=: 1 : '''(?x)'' , m : 0'
lxresult=: ((1 i.~[:*@,(1 1,:_ 1)&(];.0)) , {.)"2
lxmatches=: lxresult@rxmatches
lxfrom=: >@(([:' ['&,,&']')&.>)@rxfrom
lxview=: lxmatches (":@[ ,. }."1@[ lxfrom ]) ]

Lexer will reside in the same class as parser, pcalc. We define the following tokens

«lexer»=
NB. =========================================================
NB. lexer part

LEX1=: noun lxdefine
( \s+     )|# 0 space
( [0-9.]+ )|# 1 number
( \+      )|# 2 add
( -       )|# 3 sub
( \*      )|# 4 mul
( /       )|# 5 div
( \(      )|# 6 lpar
( \)      )|# 7 rpar
( .+      ) # 8 error
)

The following test cases

«lexer test»=
LEX1_pcalc_ lxview '5+2*10'
LEX1_pcalc_ lxview '(3.0*2/2)'

will produce such scanning results

   LEX1_pcalc_ lxview '5+2*10'
1 0 1 [5] 
2 1 1 [+] 
1 2 1 [2] 
4 3 1 [*] 
1 4 2 [10]
   LEX1_pcalc_ lxview '(3.0*2/2)'
6 0 1 [(]  
1 1 3 [3.0]
4 4 1 [*]  
1 5 1 [2]  
5 6 1 [/]  
1 7 1 [2]  
7 8 1 [)]  

Parser Class

The class will also contain creation, destruction and maintenance routines. Each instance will accept an input string, stored in STR. The list of tokens from scanner will be in TOK.

«parser class»=
NB. =========================================================
NB. creation

create=: 3 : 0
  STR=: y
  TOK=: LEX1 lxmatches STR
  I=: _1
  if. (#TOK)>ie=. S_ERR i.~{."1 TOK do.
    'characters' unexpect ie
  end.
  next''
)

The static cover verb run, calls the top grammar rule line.

«runner»=
NB. =========================================================
NB. runner

run=: 3 : 0
  p=. ''
  try. 
    p=. y conew 'pcalc'
    r=. line__p'' 
  catch. 
    smoutput 13!:12 ''
    r=. i.0 0 
  end.
  if. #p do. destroy__p'' end.
  r
)

Verb calc is the public interface for the parser. It accepts a string and returns the result.

«public interface»=
NB. =========================================================
NB. public interface
calc_z_=: run_pcalc_

Token pump and accessor verbs will be defined as follows. Note, at the end of tokens, an implicit token EOF is inserted.

«token pump»=
NB. =========================================================
NB. token pump and accessors

next=: 3 : 0
  I=: I+1
  if. I<#TOK do. SYM=: {.I{TOK else. SYM=: S_EOF end.
  1
)

str=: 3 : 0
  if. y>:#TOK do. 'EOF' return. end.
  's b n'=. y{TOK
  (b,:n)];.0 STR
)
pos=: 3 : 0
  if. y>:#TOK do. _1 return. end.
  1{y{TOK
)
val=: 3 : '0".str y'

unexpect=: 'symbol'&$: : (4 : 0)
  y=. {.y,I
  error=. 'Unexpected ',x,' "',(str y),'" at ',":pos y
  error assert 0
)

Grammar Rules

First pump interface verbs

«grammar rules»=
NB. =========================================================
NB. parser part

accept=: 3 : '0:`next@.(SYM&=) y'
expect=: unexpect`1:@.accept

Finally the grammar rules

«grammar rules»=
NB. =========================================================
NB. grammar rules

NB. factor ::= ('+'|'-')* ( num | '(' expr ')' )
factor=: 3 : 0
  s=. 1
  while. 1 do.
    if.     accept S_SUB do. s=. s*_1
    elseif. accept S_ADD do. 
    elseif.              do. break. end.
  end.

  if.     accept S_NUM  do. r=. val I-1
  elseif. accept S_LPAR do. r=. expr ''
          expect S_RPAR
  elseif.               do. 'factor' unexpect '' end.

  s*r
)

NB. term ::= factor ( ('*'|'%') factor )*
term=: 3 : 0
  r=. factor ''
  while. 1 do.
    if.     accept S_MUL do. r=. r * factor''
    elseif. accept S_DIV do. r=. r % factor''
    elseif.              do. break. end.
  end.
  r
)

NB. expr ::= term ( ('+'|'-') term )*
expr=: 3 : 0
  r=. term ''
  while. 1 do.
    if.     accept S_ADD do. r=. r + term''
    elseif. accept S_SUB do. r=. r - term''
    elseif.              do. break. end.
  end.
  r
)

NB. line ::= expr EOF
line=: 3 : 0
  r=. expr ''
  expect S_EOF
  r
)

Testing

The following are some test results,

«test»=
NB. =========================================================
Note 'Tests'
«lexer test»

«accept test»

«reject test»
)

which is what we expect.

   calc '11+22'
33
   calc '5+2*10'
25
   calc '(5+2)*10'
70
   calc '(11+22)/-(3.0*2/2)'
_11
   calc '(11+22)*+(-1-2)'
_99
   calc '22+3/'
Unexpected factor "EOF" at _1
   calc '22+3/(1+)'
Unexpected factor ")" at 8
   calc '1+abc/2'
Unexpected characters "abc/2" at 2

Further Enhancements

The resulting parser gets the job done, but it's not very flexible in terms of extensibility for different token sourses and output targets. So the following enhancements can be considered. (See recdesc2.ijs in Downloads above.)

accept=: 3 : '0:`next__lex@.sym__lex y'
expect=: 3 : 'unexpect__lex`1:@.accept y'

term=: 3 : 0
  r=. factor ''
  while. 1 do.
    if.     accept S_MUL do. r=. r '*' binop__cons factor''
    elseif. accept S_DIV do. r=. r '%' binop__cons factor''
    elseif.              do. break. end.
  end.
  r
)

NB. =========================================================
NB. Tree Consumer, static verbs
coclass 'pCalcConsTree'

binop =: 1 : (':';'<x,m;y')
unop  =: 1 : '<m;y'
val   =: <

eval_z_=: 'pCalcConsEval'&run_pCalcParse_
tree_z_=: 'pCalcConsTree'&run_pCalcParse_

Here are some results of the tree consumer and new error messages.

   tree '5 + 2*10'
+-+-+--------+
|5|+|+-+-+--+|
| | ||2|*|10||
| | |+-+-+--+|
+-+-+--------+
   
   tree '22+3/(1+)'
|Unexpected factor ")" at 8: assert
|   error     assert 0

JSON Parser Example

JSON (JavaScript Object Notation) is a lightweight data-interchange format based on declarative syntax for JavaScript structural constant definitions. Besides primitive string and numeric constants it has lists [elm1, elm2, ...] and hash tables {key1 : val1, key2 : val2, ...}, whose elements are recursively defined.

For lexer here we use J ;: Words primitive, which almost perfectly suits our needs, with a few limitations: strings must be enclosed in apostrophes '', colon must be preceded by space, etc.

Both examples have the following grammar. A notable feature is that the bracketed structures are split between two entries.

NB. Val  ::= '[' List
NB.        | '{' Hash
NB.        | token
NB. List ::= ']'
NB.        | Val (',' Val)* ']'
NB. Hash ::= '}'
NB.        | Pair (',' Pair)* '}'
NB. Pair ::= Val ':' Val

json.ijs uses a simpler token pump based only on the verb token.

json1.ijs uses the accept/expect pattern as in pcalc above.

Lists are stored as boxed lists and hash tables as boxed two-row tables of keys and values. For example, for input

T1=: 0 : 0
  [123, 'ab c', {qq : zz, asb :[12,34]},
    [3, 5, [], {a :[]}, zz]]
)

the result is

   p=. conew'ppar1'
   >parse__p T1
+---+------+------------+-------------+
|123|'ab c'|+--+-------+|+-+-++---+--+|
|   |      ||qq|asb    |||3|5||+-+|zz||
|   |      |+--+-------+|| | |||a||  ||
|   |      ||zz|+--+--+||| | ||+-+|  ||
|   |      ||  ||12|34|||| | ||| ||  ||
|   |      ||  |+--+--+||| | ||+-+|  ||
|   |      |+--+-------+|+-+-++---+--+|
+---+------+------------+-------------+
   codestroy__p''

See Also


CategoryLiterate

Essays/Recursive Descent Parser (last edited 2009-09-08 04:02:04 by RicSherlock)