xml/sax - XML parser based on Expat library
SAX (Simple API for XML) parser addon. It provides both a flat API and an object-oriented, SAX-like interface. Binaries for Windows, Linux x86 and Darwin PPC are included.
Based on Expat 2.0.0, see http://expat.sourceforge.net/
See also: examples in test folder in SVN; change history.
Installation
Use JAL/Package Manager or download the xml_sax archive from j601/addons and extract it into the ~addons/xml/sax folder (or ~addons/xml for j504).
Usage
SAX (Simple API for XML) was originally a Java framework by David Megginson, derived from the Expat processing model. Because the document is delivered as a stream of events rather than built into a tree, SAX has a tiny memory footprint and is typically faster than DOM processing. See http://www.saxproject.org/.
SAX parsing follows the push model, i.e. the API calls you. You provide the callback functions by overriding those of the base class (see the saxclass definition). The parser calls these functions as it encounters the corresponding XML node events.
A higher-level visitor design pattern can be obtained by defining verbs named with a prefix followed by the names of the elements of interest, and calling them from startElement and endElement. This is similar to the way wd calls event verbs.
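A minimal sketch of such a dispatch is given here. The on_ prefix, the class name pvisit and the noun LOG are made up for illustration; only saxclass, getAttribute and process are taken from the addon as used in the examples on this page.

```j
require 'xml/sax'

saxclass 'pvisit'               NB. hypothetical class name

startDocument=: 3 : 0
LOG=: 0$a:                      NB. collected messages
)

NB. hypothetical handlers: prefix 'on_' followed by the element name
NB. x and y are passed through unchanged from startElement
on_root=: 4 : 0
LOG=: LOG,<'document root'
)
on_test=: 4 : 0
LOG=: LOG,<'test a=',":x getAttribute 'a'
)

NB. call on_<name> if such a verb is defined, otherwise ignore the element
startElement=: 4 : 0
h=. 'on_',y
if. 3=4!:0 <h do. x h~ y end.
)

endDocument=: 3 : 0
LOG
)

cocurrent 'base'

VISIT=: process_pvisit_ '<root><test a="11"/><other/></root>'
```

Only root and test have handlers in this sketch, so other is silently ignored. Element names that are not valid J names, such as dc:date, cannot have a matching verb and should be skipped by the name class check in the same way.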
In your class you maintain the state and selectively process the events. The event for text between tags is called characters. It is demoed in the table and rss examples.
In the rss example, a simple stack of nested elements is maintained in the list S. The characters handler then processes the text according to the current context.
The result of process is the output of endDocument, which is the last event called, so that is where you return whatever the parser has collected.
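As a small illustration of keeping state and returning it, the following sketch collects the element names in document order. The class name pnames and the nouns NAMES and R are hypothetical; the structure follows the table example further down.

```j
require 'xml/sax'

saxclass 'pnames'               NB. hypothetical class name

startDocument=: 3 : 0
NAMES=: 0$a:                    NB. parser state: boxed list of names
)

startElement=: 4 : 0
NAMES=: NAMES,<y                NB. y is the element name
)

endDocument=: 3 : 0
NAMES                           NB. becomes the result of process
)

cocurrent 'base'

R=: process_pnames_ '<root><test a="11"/><test b="12"/></root>'
```

Under these assumptions R should hold the boxed names root, test and test.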
Examples
These are listings and results of some examples found in the test folder.
sax_test2.ijs
NB. object oriented sax parser specialization
NB. extended to use attributes and levels
require 'xml/sax'
saxclass 'psax2'
showattrs=: (''"_)`(' ' , ;:^:_1@:(([ , '='"_ , ])&.>/"1))@.(*@#)
startDocument=: 3 : 0
L=: 0
)
startElement=: 4 : 0
smoutput (L#' '),'[',y,(showattrs attributes x),']'
L=: L+1
)
endElement=: 3 : 0
L=: L-1
smoutput (L#' '),'[/',y,']'
)
NB. =========================================================
cocurrent 'base'
TEST1=: 0 : 0
<root><test a="11"/><test b="12"/></root>
)
0 : 0 NB. Test
process_psax2_ TEST1
process_psax2_ fread jpath '~addons/xml/sax/test/chess.xml'
)
   process_psax2_ TEST1
[root]
 [test a=11]
 [/test]
 [test b=12]
 [/test]
[/root]
sax_test3.ijs
NB. object oriented sax parser specialization
NB. extended to use text characters
require 'xml/sax'
saxclass 'psax3'
showattrs=: (''"_)`(}.@;@:((',' , [ , '='"_ , ])&.>/"1))@.(*@#)
startDocument=: 3 : 0
L=: 0
IGNOREWS=: 1
)
startElement=: 4 : 0
smoutput (L#' '),'',y,'(',(showattrs attributes x),') {'
L=: L+1
)
endElement=: 3 : 0
L=: L-1
smoutput (L#' '),'}'
)
characters=: 3 : 0
smoutput (L#' '),y
)
NB. =========================================================
cocurrent 'base'
TEST3=: 0 : 0
<body><p a="11">s123</p>Between<q b="12" c="3">z456</q></body>
)
TEST5=: 0 : 0
<body><p>Case &amp; Co<q c="3&amp;4">z "num"</q></p>5&amp;6</body>
)
0 : 0 NB. Test
process_psax3_ TEST3
process_psax3_ TEST5
process_psax3_ fread jpath '~addons/xml/sax/test/table.xml'
)
   process_psax3_ TEST3
body() {
 p(a=11) {
  s123
 }
 Between
 q(b=12,c=3) {
  z456
 }
}
table.ijs
NB. using element character content
NB. inter-tag and surrounding whitespace is ignored
require 'xml/sax format'
saxclass 'ptable'
endElement=: 3 : 0
if. y-:'tr' do. TD=: '' [ TR=: TR,TD end.
)
characters=: 3 : 'TD=: TD,<y'
startDocument=: 3 : 'TR=: empty TD=: i.0 [ IGNOREWS=: 1'
endDocument=: 3 : 'TR'
NB. =========================================================
cocurrent 'base'
TEST4=: 0 : 0
<table><tr> <td>0 0 </td> <td> 0 1</td> </tr>
<tr> <td>1 0 </td> <td> 1 1</td> </tr></table>
)
0 : 0 NB. Test
process_ptable_ TEST4
process_ptable_ fread jpath '~addons/xml/sax/test/table.xml'
)
   process_ptable_ TEST4
+---+---+
|0 0|0 1|
+---+---+
|1 0|1 1|
+---+---+
rss.ijs
NB. using element character content
NB. selective processing based on element hierarchy position
NB. 06/06/06 Oleg Kobchenko - added jwiki rss
require 'xml/sax format'
saxclass 'prss'
cl=: <;._2
startDocument=: 3 : 0
S=: ''
HOST=: ''
)
startElement=: 4 : 0
S=: S,<y
if. y-:'item' do. smoutput '' end.
s2=. _2{.S
if. s2-:cl'dc:contributor rdf:Description ' do.
HOST=: x getAttribute 'wiki:host' end.
)
endElement=: 3 : 0
S=: }:S
)
characters=: 3 : 0
s2=. _2{.S
if. s2-:;:'channel title' do. smoutput 'Channel: ',y elseif.
s2-:;:'channel description' do. smoutput fold y elseif.
s2-:;:'channel pubDate' do. smoutput 'Date: ',y elseif.
s2-:;:'item title' do. smoutput 'Topic: ',y elseif.
s2-:;:'item description' do. smoutput fold y elseif.
s2-:;:'item link' do. smoutput 'URL: ',y elseif.
s2-:cl'item dc:date ' do. smoutput 'Date: ',y end.
s3=. _3{.S
if. s3-:cl'dc:contributor rdf:Description rdf:value ' do.
smoutput 'Contributor: ',y,' at ',HOST end.
)
NB. =========================================================
cocurrent 'base'
TEST3=: 0 : 0
<channel><title>qq</title><pubDate>1/1/2006</pubDate></channel>
)
0 : 0 NB. Test
process_prss_ TEST3
process_prss_ fread jpath '~addons/xml/sax/test/cnn.rss'
process_prss_ fread jpath '~addons/xml/sax/test/jwiki1.rss'
)
   process_prss_ TEST3
Channel: qq
Date: 1/1/2006
chess.ijs
NB. chess -- a more complete example of custom parser
NB. transforms XML chess board into a J character matrix
require 'xml/sax viewmat'
saxclass 'pchess'
COLORS=: ;:'whitepieces blackpieces'
PIECES=: ;:'pawn rook night bishop queen king'
SYMBOLS=: 'PRNBQKprnbqk'
startElement=: 4 : 0
e=. <y
if. 2>C=. COLORS i.e do. COLOR=: C*6 return. end.
if. 6>P=. PIECES i.e do. PIECE=: SYMBOLS{~COLOR+P return. end.
if. -.'position'-:y do. return. end.
r=. <:0". x getAttribute 'row'
c=. 'abcdefgh'i.x getAttribute 'column'
empty BOARD=: PIECE (<r,c) } BOARD
)
startDocument=: 3 : 0
BOARD=: '. '{~ ~:/~2|i.8
)
endDocument=: 3 : 0
|.BOARD
)
NB. =========================================================
cocurrent 'base'
0 : 0 NB. Test
process_pchess_ fread jpath '~addons/xml/sax/test/chess.xml'
viewbmp jpath'~addons/xml/sax/test/chess.bmp'
)
   process_pchess_ fread jpath '~addons/xml/sax/test/chess.xml'
. . . . q . . . k B . . p . . .P P. p . . .P. . . .P. PP. . . R K
stop.ijs
NB. interrupt on found data or error
NB. sax_test2 extended to stop parsing.
NB. Note: end element event is still handled
require 'xml/sax'
saxclass 'pstop'
showattrs=: (''"_)`(' ' , ;:^:_1@:(([ , '='"_ , ])&.>/"1))@.(*@#)
startDocument=: 3 : 0
L=: 0
V=: 'not found'
)
startElement=: 4 : 0
smoutput (L#' '),'[',y,(showattrs attributes x),']'
if. y-:,'p' do.
select. x getAttribute 'n'
case. ,'b' do. stop '' [ V=: x getAttribute 'v'
case. _1 do. stop 1001;'Attribute "n" missing'
end.
end.
L=: L+1
)
endElement=: 3 : 0
L=: L-1
smoutput (L#' '),'[/',y,']'
)
endDocument=: 3 : 0
smoutput 'Value of n=b is ',":V
)
NB. =========================================================
cocurrent 'base'
TEST4=: 0 : 0
<body><p n="a" v="11"/><p n="b" v="22"/><p n="c" v="33"/></body>
)
TEST4a=: 0 : 0
<body><p n="a" v="11"/><p n="c" v="33"/></body>
)
TEST4b=: 0 : 0
<body><p n="a" v="11"/><p v="22"/><p n="c" v="33"/></body>
)
0 : 0 NB. Test
process_pstop_ TEST4
process_pstop_ TEST4a
process_pstop_ TEST4b
)
   process_pstop_ TEST4
[body]
 [p n=a v=11]
 [/p]
 [p n=b v=22]
 [/p]
Value of n=b is 22
   process_pstop_ TEST4a
[body]
 [p n=a v=11]
 [/p]
 [p n=c v=33]
 [/p]
[/body]
Value of n=b is not found
   process_pstop_ TEST4b
[body]
 [p n=a v=11]
 [/p]
 [p v=22]
 [/p]
|xml error 1001 at (1 23): Attribute "n" missing: assert
| (assert~error)0
prajg.ijs
I would like to add to Oleg's excellent examples a bit of code I recently used to process large XML namespace documents generated by a Cognos namespace utility. The following script blows through large namespace documents and builds a parent-child symbol table. The simplicity of this code is in stark contrast to the ugly industrial XML it processes. Don't be deceived by Oleg's terse examples: this is a very powerful and useful utility. JohnBaker
NB. Finds all user superclasses to root in Cognos namespace report XML.
NB. John Baker J6.01 2007/06/07 uses Oleg's SAX addon
require 'xml/sax format'
saxclass 'prajg'
startDocument=: 3 : 0
S=: '' NB. element path
PCTAB=: 0 2$'' NB. parent child table
P=: '' NB. parents
CHILDUC=: ;: 'ChildrenUserClasses Userclass'
NSUC=: ;: 'NamespaceReport Userclass'
MBRU=: ;: 'Members User'
)
startElement=: 4 : 0
S=: S,<y
s2=. _2{.S
if. s2 -: CHILDUC do.
class=. x getAttribute 'name'
PCTAB=: PCTAB,({:P),<class
P=: P,<class
elseif. s2 -: MBRU do.
user=. '**user: ',x getAttribute 'name'
PCTAB=: PCTAB,({:P),<user
elseif. s2 -: NSUC do.
class=. x getAttribute 'name'
P=: P,<class
end.
)
endElement=: 3 : 0
S=: }:S
NB. pop parent when ChildrenUserClasses ends
if. y-:'ChildrenUserClasses' do. P=: }:P end.
)
NB.return parent child table as symbols
endDocument=: 3 : 0
s: PCTAB
)
NB.===================================
cocurrent 'base'
Authors
OlegKobchenko, see also http://olegykj.sourceforge.net/

