Differences between revisions 17 and 18
 ⇤ ← Revision 17 as of 2013-11-12 21:11:23 → Size: 22915 Editor: DevonMcCormick Comment: ← Revision 18 as of 2013-11-12 22:04:11 → ⇥ Size: 22973 Editor: DevonMcCormick Comment: Deletions are marked like this. Additions are marked like this. Line 8: Line 8: {{{ Line 24: Line 25: }}} Line 186: Line 187: The first part of the assigment of newpt: newpt=. ((]{~[:?#)y)+x*_1 1{~?D\$2  uses a tacit function (]{~[:?#) which generates a random number ( ? ) based on the length ( # ) of its right argument and uses this to select ( {~ ) an element (row). This is applied to the right argument y which is the list of points in the cluster, so we randomly choose one of our existing points.This random point is added to the latter part of this expression - x*_1 1{~?D\$2 in which we apply an offset determined by the left argument x multiplied by a random combination of ones and negative ones set by _1 1{~?D\$2. This multiplication essentially moves x points in a random direction from the original point. The D is in there to support generalization to more than two dimensions.So, to return to our long-awaited explanation for the magic number 3 in our invocation of the main function ( (3;PTS) aggStruc 20\$10 ), this is the random distance we travel to find an empty neighborhood in which to release our new point. So, if we pick a larger number for this argument, we'll pick release points, on average, further away from the cluster. This means the random walk will have to go on longer to have good chance of hitting the cluster. The first part of the assigment of newpt: {{{ newpt=. ((]{~[:?#)y)+x*_1 1{~?D\$2 }}} uses a tacit function {{{ (]{~[:?#) }}} which generates a random number ({{{ ? }}}) based on the length ({{{ # }}}) of its right argument and uses this to select ({{{ {~ }}}) an element (row). This is applied to the right argument y which is the list of points in the cluster, so we randomly choose one of our existing points.This random point is added to the latter part of this expression - {{{ x*_1 1{~?D\$2 }}} in which we apply an offset determined by the left argument x multiplied by a random combination of ones and negative ones set by {{{ _1 1{~?D\$2 }}}. This multiplication essentially moves x points in a random direction from the original point. The D is in there to support generalization to more than two dimensions.So, to return to our long-awaited explanation for the magic number 3 in our invocation of the main function ({{{ (3;PTS) aggStruc 20\$10 }}}), this is the random distance we travel to find an empty neighborhood in which to release our new point. So, if we pick a larger number for this argument, we'll pick release points, on average, further away from the cluster. This means the random walk will have to go on longer to have good chance of hitting the cluster.

tab-delimited table, scientific visualization, diffusion-limited aggregation, 3D interactive scatterplot, finite-state machine, finite-state automata, Raspberry Pi, finding J code

Location
Heartland

## Meeting Agenda for NYCJUG 20130611

```0. Beginner's regatta: see "numericMatToTabDelimitedCharacterVector.doc" and
"Learn Visualization, Young Coder.doc".

1. Show-and-tell: diffusion-limited aggregation: see "Recap of Work on DLA.doc".

2. Beginner's regatta re-visited: see "Primitive 3D Interactive Scatterplot
Using d3.doc".

3. Show-and-tell re-visited: see "Parsing CSV Files with a Finite State
Machine.doc".

4. Learning, teaching and promoting J, et al.: see "Newbie Roadblocks.doc"```

## Proceedings

We looked at a simple example of turning a table of numbers into a tab-delimited character array suitable for import into something like Excel, and considered the pros and cons of scientists as programmers. After an introduction to an approach to coding diffusion-limited aggregation, we looked at how we might extend this into higher than two dimensions and considered the question of how to look at a three-dimensional set of points interactively in a browser. Then we had a tutorial on using J's finite state automata routines, then finished up with a discussion of what roadblocks J newbies face.

## Show-and-tell re-visited

### Recap of Work on Diffusion-limited Aggregation

The follow excerpts from our exploration of how the "release policy" shapes the resulting shape of a diffusion-limited aggregation (DLA).

#### Basic Code

We start with this basic code, all of which can be found here. First, load some sub-routines, define our own namespace dla, and create an initialization routine which defines a few globals:

```   load 'coutil'
cocurrent 'dla'

init=: 3 : 0
D=: y            NB. How many dimensions?
RNI=: D dimnbr 1 NB. Relative neighbor indexes: 1-away in D-space
NB.EG  init 2       NB. Initialize 2-dimensional space.
)```

Here's the main routine, which aggregates points to the global PTS:

```NB.* aggStruc: randomly walk a point y times to aggregate it to cluster:
NB. (>0{x) away from random point in >1{x.
aggStruc=: 4 : 0"(_ 0)
point=. (>0{x) release >1{x [ ctr=. _1
while. y>ctr=. >:ctr do. point=. walk point
if. PTS check_collision point do. y=. ctr [ addPt point end. end.
ctr
NB.EG (5;PTS) aggStruc 1e2\$1e2
)```

The basic sub-routines:

```NB.* check_collision: 1 iff x in neighborhood of points y.
check_collision=: 4 : '1 e. x e. RNI+"1 y'

NB.* dimnbr: x-dimensional y-nearest neighbor offsets.
dimnbr=: 13 : '(x\$0)-.~,/>{x\$<y-~i.>:+:y'
NB. RNI=: (D\$0)-.~(_1 0 1){~(D\$3)#:i.3^D     NB. Relative neighbor offsets

NB.* walk: walk one random step.
walk=: 3 : 'y+RNI{~?#RNI'
Finally, an early version of the nub of the problem: the function that "releases" a new particle.
NB.* releaseIfOpen: release new particle x away to find open neighborhood.
releaseIfOpen=: 4 : 0
while. 1 e. PTS e. RNI+"1 newpt=. ((]{~[:?#)y)+x*_1 1{~?D\$2 do. end.
newpt
)```

Make this our release policy (for now):

`release=: releaseIfOpen`

The main routine randomly moves the newly-released point until it either encounters the existing cluster and sticks, or finishes a pre-set number of moves without encountering the cluster, at which point it disappears.

We'll explain in more detail how the release policy works shortly but first let's see how to run the code. If we run the above lines of J, then move back into the base namespace, and make the dla namespace transparently available here, we can initialize our cluster.

```   coclass 'base'
coinsert 'dla'
init 2
0 0
\$PTS
1 2```

So, we start with one point at the origin; its shape is 1x2 because it's one point in two dimensions: we plan to concatenate new points to the bottom of this matrix (as can be seen by the definition of addPt above).

The arguments to the main routine are a parameter of 3 - to be explained shortly - the existing list of points and a right argument of how long our random walks should be before we give up. We'll arbitrarily try 20 walks of length 10 each:

```   (3;PTS) aggStruc 20\$10
4 10 10 10 10 10 10 1 10 10 10 10 3 10 10 7 10 9 10 10```

The result is how long each walk went on. We see that most of them went the maximum - which means they didn't encounter the cluster (or encountered it only on the last step) - but a few ended early, hence became part of the cluster.

So, how many points do we now have and what are they?

```   \$PTS
7 2
|:PTS
0 _1 _2 _2  0 _1 _2
0 _1 _2  0 _2 _3  1```

We have 7 points which we show transposed simply because it's a more efficient use of display space.

Let's try the same thing again:

```   (3;PTS) aggStruc 20\$10
10 10 1 10 6 10 10 10 3 10 10 10 2 3 10 1 10 4 4 10
(\$PTS);|:PTS
+----+--------------------------------------------+
|16 2|0 _1 _2 _2  0 _1 _2 _3  1 _4 _1 0 _2 1  2 _3|
|    |0 _1 _2  0 _2 _3  1  2 _2  1  2 3 _4 1 _1 _3|
+----+--------------------------------------------+```

This time we see that more points stuck. Again, we display the shape and the values as a vector of boxed items just to make more efficient use of our display.

How about a picture of these points? To do this, re-enter the namespace and define a new utility function bordFill, then return to the base namespace. This new function takes a left argument of how many empty cells we want to border our points and a right argument of our point co-ordinates.

```   coclass 'dla'

NB.* bordFill: fill 0-mat w/1 according to y, having border of x cells.
bordFill=: 4 : '(1)(<"1 y-"1 x-~<./y)}0\$~2\$(>:+:x)+(>./y)-<./y'
NB.EG viewmat 1 bordFill PTS
coclass 'base'```

We use this in conjunction with the general "viewmat" utility to display the points:

`   viewmat 1 bordFill PTS`

Which looks like this: ...

#### Explaining First Release Policy

Our first release function is very crude and its limitations will become apparent as the cluster grows but it's fine for a start. Here's most of the code:

`while. 1 e. PTS e. RNI+"1 newpt=. ((]{~[:?#)y)+x*_1 1{~?D\$2 do. end.`

This is a while loop where all the work gets done in the conditional part. Here, we check if any of the immediate neighbors of the new, randomly chosen point newpt, are in the existing set of points. If so, we do nothing and try again until this is false. Once we exit the loop, we return newpt: a neighborless point near the cluster.

Let's examine this line in detail. The first part after while. is

`1 e. PTS e. RNI+"1 newpt`

which evaluates true if one or more points are in the set formed by RNI+"1 newpt. RNI is one of the three globals from our initialization: it looks like this (transposed for more efficient display):

```   |:RNI
_1 _1 _1  0 0  1 1 1
_1  0  1 _1 1 _1 0 1```

Arranging these eight points around an origin,

```   3 3\$<"1]1 1 1 1 0 1 1 1 1#^:_1]RNI
+-----+----+----+
|_1 _1|_1 0|_1 1|
+-----+----+----+
|0 _1 |0 0 |0 1 |
+-----+----+----+
|1 _1 |1 0 |1 1 |
+-----+----+----+```

we can see that they are the two-dimensional offsets around a given point. So, if we add each of these (one-dimensional) elements of RNI to a single point like newpt, we get the co-ordinates of all of the neighbors of newpt, e.g. for (5,5) shown here arranged two-dimensionally with the point itself inserted into the middle:

```   (13 : '(<y)(<1 1)}3 3\$<"1]1 1 1 1 0 1 1 1 1#^:_1]RNI+"1 y') 5 5
+---+---+---+
|4 4|4 5|4 6|
+---+---+---+
|5 4|5 5|5 6|
+---+---+---+
|6 4|6 5|6 6|
+---+---+---+```

The first part of the assigment of newpt:  newpt=. ((]{~[:?#)y)+x*_1 1{~?D\$2  uses a tacit function  (]{~[:?#)  which generates a random number ( ? ) based on the length ( # ) of its right argument and uses this to select ( {~ ) an element (row). This is applied to the right argument y which is the list of points in the cluster, so we randomly choose one of our existing points.

This random point is added to the latter part of this expression -  x*_1 1{~?D\$2  in which we apply an offset determined by the left argument x multiplied by a random combination of ones and negative ones set by  _1 1{~?D\$2 . This multiplication essentially moves x points in a random direction from the original point. The D is in there to support generalization to more than two dimensions.

So, to return to our long-awaited explanation for the magic number 3 in our invocation of the main function ( (3;PTS) aggStruc 20\$10 ), this is the random distance we travel to find an empty neighborhood in which to release our new point. So, if we pick a larger number for this argument, we'll pick release points, on average, further away from the cluster. This means the random walk will have to go on longer to have good chance of hitting the cluster.

So, if we choose a much bigger number than 3, we might see something like this:

```   (20;PTS) aggStruc 20\$10
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
\$PTS
16 2```

We added no new points because a random walk of 10 steps is not long enough to travel (more or less) 20 points to where the cluster is. If we are willing to walk a lot more, say on the order of 20^2 steps, it's a different story:

```   (20;PTS) aggStruc 20\$400
400 288 400 400 400 400 309 400 400 400 400 400 400 247 400 400 400 400 400 400
\$PTS
19 2```

However, as you might guess, it takes more time to walk further, i.e. starting with the original 16 points and timing 20 walks of 10 versus 20 walks of 400:

```   6!:2 '(3;PTS) aggStruc 20\$10'
0.006241296
\$PTS
30 2
\$PTS_dla_=: 16{.PTS_dla_
16 2
4!:55 <'PTS'   NB. Ensure local copy doesn't shadow one in "dla" namespace.
1
6!:2 '(20;PTS) aggStruc 20\$400'
0.22534233
\$PTS
18 2```

So, we see that the longer walks take many times as long and add fewer points. In fact, this set of parameters will often add no points.

#### Concentrating on the Perimeter

Fortunately, it's easy to see how to address the problem of the growing number of points in the cluster reducing the efficiency of our process: instead of giving all the points as an argument to aggStruc, why don't we give only the perimeter points? This will reduce the number of false hits. How do we know which points are on the perimeter? One simple way is to look at only the most recently added points since we are, by the nature of our aggregation, adding to the perimeter.

Guessing that the most recent 10% of points is a reasonable bunch to try, let's extract those:

```   \$PERIMPTS_dla_=: PTS{.~->.10%~#PTS
78 2```

Looking at how these compare to the whole set:

`   viewmat 2 (<"1]1+PERIMPTS-"1 <./PTS)}1 bordFill PTS`

The manipulation we do to PERIMPTS is necessary to map the origin-centered co-ordinates to positive only co-ordinates. This is a replication of what bordFill does internally as well.

This looks good as the perimeter is indicated by the pink points around the edge. Let's try using this as our argument to aggStruc and see how it affects our timings.

```   do1&>3\$<20\$10
1005.5137 995.09202 564.62669
do1&>3\$<20\$10
|attention interrupt: releaseIfOpen
|   point=.(>0{x)    release>1{x[ctr=._1
\$PTS
831 2```

The points/sec initially goes back up but quickly declines, then the code freezes up and has to be interrupted. A look at the points and their perimeter makes the problem clear:

`   viewmat 2 (<"1]1+PERIMPTS-"1 <./PTS)}1 bordFill PTS`

We see that we quickly lose our original perimeter. Let's see how costly it is to re-calculate the perimeter with each call to the main routine:

```   do1_dla_=: 3 : '(sv-~#PTS)%tm=. 6!:2 ''(2;PTS{.~->.10%~#PTS) aggStruc y'' [ sv=. #PTS'
NB.  Inserted the perimeter calculation   ^^^^^ here ^^^^^^
do1&>5\$<20\$10
1009.626 1297.3126 1809.8097 1466.0253 1293.6868
Increasing the number of attempts per call:
do1&>4\$<30\$10
1578.9208 857.49015 1561.9932 1310.9005```

… This seems to be working fine though there does seem to be a gradual decrease in efficiency.

```   \$PTS
1358 2
do1&>(50\$10);4\$<100\$10
892.7828 859.53069 827.92105 809.63688 852.42751```

This looks pretty good but there is an aesthetic problem: This DLA looks a bit "chunky" compared to other examples we've seen.

Our policy of releasing very near the existing cluster has a tendency to fill in the gaps between branches of the cluster. Also, since we pick release points randomly in relation to the perimeter, sometimes we release within the existing cluster as indicated by the interior pink points on the lower part of this picture. What we'd really like to do is to pick release points only outside the perimeter. One way to do this is shown here as well in the following section.

A simple measure of how densely packed the above-pictured DLA is this:

```   (13 : '(#y)%~*/\$0 bordFill y') PTS
2.5402844```

This shows us that 1 in every 2.5 points in the bounding rectangle is part of the cluster. If we add more points, getting up to almost 62,000 using these parameters, this sparsity ratio goes up to 4.07.

#### New Release Policy

(The following method is also elaborated here in the context of an earlier version of the DLA code which works on a fixed-size, boolean matrix rather than an explicit set of points.)

First, we define a verb to generate all the immediate neighbors of a point set not in that set.

`neigh2=: 13 : 'x-.~~.,/y+"1/RNI'   NB. Empty neighbors of these`

We use this to get the nearest neighbors:

```   \$NP=: PTS neigh2 PTS
26386 2```

Then we expand the neighborhood by finding the next closest set of neighbors to these, then their neighbors, and so on five times:

```   6!:2 'NP=: PTS neigh2^:5]NP'
0.27077715```

This gives us an envelope six deep around our original points. We find the edge of this envelope by isolating those outermost neighbors (those with neighbors not in the set of neighbors):

```   \$edges=. ~.PTS-.~NP -.~ ,/NP+"1/RNI
1029 2
viewmat 2 (<"1]1+edges-"1 <./PTS,edges)}1 bordFill PTS,edges```

Here's what the edge of the envelope looks like in relation to our existing cluster:

This looks like a good perimeter set on which to release points but it's a bit thin. Let's thicken it by finding its neighbors two levels deep:

```   \$edges=. ~.PTS neigh2^:2]edges
5138 2```

Giving us this fatter release area:

Adjust our top-level calling routine to use this perimeter points global:

```   do1_dla_=: 4 : '(sv-~#PTS)%tm=. 6!:2 ''(x;PPTS) aggStruc y'' [ sv=. #PTS'
PPTS_dla_=: edges
1 do1&>10\$<10\$25
12.281323 21.910126 23.042773 79.635403 23.813149 22.408149 34.180425 48.340221 21.374501 50.795483```

Our initial points/second are a bit low but seem to be improving. Let's try some more:

```   \$PTS
12590 2
viewmat 2 (<"1]1+PPTS-"1 <./PTS,PPTS)}1 bordFill PTS,PPTS
1 do1&>20\$<20\$25
24.342477 61.743025 41.15663 27.630435 34.954674 34.740216 60.580126 52.822625 57.748258 22.060313 22.321549 52.2986 21.649241 40.913895 67.019374 34.597901 40.404442 52.186518 39.882113 42.753015```

One problem we'll have if we try to generate too many points using one perimeter is that the cluster will eventually "invade" the perimeter. If we don't do something to adjust it, this will lead to unsightly clumps of aggregation within the perimeter band.

After running this enough times to add a significant number of points, we look at the cluster with its perimeter:

This is interesting because we can plainly see the effect of the new release policy on the appearance of the cluster. The inner, denser portion was formed using our flawed "releaseIfOpen" method. The outer, sparser part of the cluster was formed using our new perimeter-based release policy.

If we generate a cluster solely based on this newer release policy, we get something like the following with a sparsity measure of 6.7, which makes this less dense than the clusters generated by our earlier policies.

Here’s an example of a DLA generated with this modified code.

#### Multi-Dimensional Change

To extend this code to multiple dimensions, it turns out we need make only one change to our working code. Instead of this initialization:

```init=: 3 : 0
D=: y            NB. D-dimensions
RNI=: D dimnbr 1 NB. Relative neighbor indexes: 1 cell around center.
PTS=: ,:y\$0      NB. Start point list w/Origin.
NB.EG  init 2       NB. Initialize 2-dimensional space.
)```

We do this (changes bolded):

```NB. Re-define initialization routine to allow multi-D:
init_dla_=: 3 : 0
D=: y                 NB. D-dimensions
RNI=: D dimnbr 1      NB. Relative neighbor indexes: 1 cell around center.```

`   neigh2=: 13 : 'x-.~~.,/y+"1/RNI'   NB. Empty neighbors of these`

```   PTS=: ,:y\$0           NB. Start point list w/Origin.
NB.EG  init 3            NB. Initialize 3-dimensional space.
)```

Here’s an example of running this code.

```   load 'DiffLimAgg.ijs'
init_dla_ 3           NB. Initialize in three dimensions
0 0 0
do1_dla_              NB. Grow the cluster and show stats
3 : 'tm,(#PTS),(sv-~#PTS),(tm%~sv-~#PTS),(usus nn),(#nn)%~nn+/ . =>./nn [ tm=. 6!:2 ''nn=. growMany y''[sv=. #PTS'
6!:2 'smoutput do1_dla_ 125;100;5 3'
0.149511 22 21 140.458 3 125 110.06 33.8833 0.79
0.193546
NB. Grew to 22 points in about 1/5 second...
ptsDensity
# % [: */ >./ - <./
%ptsDensity PTS_dla_  NB. About 1 filled cell per 13 empty ones
13.3636
6!:2 'smoutput do1_dla_ 125;1000;5 3'
0.284263 780 758 2666.54 1 125 54.096 45.3134 0.242
0.284411
(%ptsDensity PTS_dla_),#PTS_dla_
35.5808 780```

These results look plausible but how can we display them? This is what we do in the next section.

### Primitive 3D Interactive Scatterplot Using d3.js

First, we generate three-dimensional points in J using Diffusion-limited Aggregation.

```   load '~Code/DiffLimAgg.ijs'
init_dla_ 3
0 0 0
do1_dla_ 20;100;5 3
0.0334947 5 4 119.422 4 20 19.66 1.89214 0.96
\$PTS_dla_
5 3```

After generating more points for this cluster, we end up with 5,255 of them. Looking into the Javascript module “ScatterPlot3D.js” gives us an idea of the format into which we need to put these points. We got this code from this "Data Explorer" site

```function ScatterPlot3D(userConfig)
{ DexComponent.call(this, userConfig,
{ // The parent container of this chart.
'parent'           : null,
// Set these when you need to CSS style components independently.
'id'               : 'ScatterPlot3D',
'class'            : 'ScatterPlot3D',
// Our data...
'csv'              :
{ // Give folks without data something to look at anyhow.
'header'         : [ "X", "Y", "Z" ],
'data'           : [[0,0,0],[1,1,1],[2,4,8],[3,9,27]]
},
'width'            : 400,
'height'           : 400,
'xoffset'          : 20,
'yoffset'          : 0
});
this.chart = this;
}```

```   3{.j2n&.> pts=. PTS_dla_   NB. “j2n” converts J numbers to common character
+-+--+-+                      NB. representation.
|0|0 |0|
+-+--+-+
|0|-1|1|
+-+--+-+
|0|-2|2|
+-+--+-+```

Since we’re only looking at the first three points for now anyway, concentrate only on these.

```   ]pp=. 3{.pts
0  0 0
0 _1 1
0 _2 2
;}.&.>;&.><"1 (<'],['),.~',',&.>j2n&.>pp
0,0,0],[0,-1,1],[0,-2,2],[

}:_1|.;}.&.>;&.><"1 (<'],['),.~',',&.>j2n&.>pp
[0,0,0],[0,-1,1],[0,-2,2]```

Now that we have a piece of code to properly bracket the numbers in our J table, name it and use it to write the bracketed points to a file.

```   cvt2d3Array=: 3 : '}:_1|.;}.&.>;&.><"1 (<''],[''),.~'','',&.>j2n&.>y'

(cvt2d3Array pts) fwrite 'egDLA3DPts2.txt'
5546```

Right now, I just cut-and-paste the contents of this file into an HTML shell I adapted from an example I found illustrating the "Data Explorer" Dex (which uses d3.js).

Here’s a sample screenshot of what we end up with:

We can rotate these points in real-time in any modern browser. I'd like to make the whole experience more seamless, perhaps by using JHS to generate the code directly from a J table or by incorporating JavaScript to read in data that J outputs.

There's still a lot of work to do in this world of multi-dimensional data rendering as there appears to be little understanding of good ways of dealing with data in more than two dimensions. For instance, much of my work of adapting the "Dex" HTML was removing an ugly and useless alternate rendering of these 3D points in what’s called a “parallel coordinate view”, an example of which is shown here on the left.

The chart on the left renders the points on the right by drawing sets of lines linking (x,y,z) sets. I don't see the point of this muddy and confusing representation but I've seen it elsewhere as well but only one case where it seemed to illuminate anything.

## Attachments

NYCJUG/2013-06-11 (last edited 2013-11-12 22:04:11 by DevonMcCormick)