35. Performance: Measurement & Tips

J lets you express your ideas tersely, but it is up to you to make sure they are good ideas. Since each keystroke in a J sentence can summon up billions of machine cycles, you must make sure that you don't force the interpreter into a bad choice of algorithm. This will be hard at first, when you are struggling to figure out what the interpreter is doing, never mind how it is doing it; fortunately the interpreter gives you tools to measure the performance of your code.

Timing Individual Sentences

If you run the JforC script with

load 'system\packages\misc\jforc.ijs'

it will define the verb Ts. Ts stands for 'time and space' and it tells you how long it takes to run a given J sentence, and how much space the interpreter used during the execution. For example:

a3 =. i. 1000

Ts '+/\ a3'

4.3581e_5 5248

We define a noun a3, and we calculate the running total of its items. It took 0.00004 seconds to create the 1000-item running total, and used 5248 bytes. We could have done the whole operation in one line with Ts '+/\ i. 1000', but the monad i. uses time and space too, so if we want to find out only what is used by +/\, we make sure that's all we measure.

We can use Ts to start to understand what can make J programs slow. Let's define a verb to do the addition operation:

sum =: dyad : 'x + y'"0

sum is an exact replacement for dyad +, having the same rank and function. Replacing + with sum does not change the result of a sentence:

+/\ i. 7

0 1 3 6 10 15 21

sum/\ i. 7

0 1 3 6 10 15 21

But the performance is quite different, as we can measure:

a10 =. i. 10

1000 Ts '+/\ a10'

2.68191e_5 1280

Because +/\ is so fast, we give Ts a left argument to report the average time over 1000 runs. If we just ran the sentence once, the result would be corrupted by small timing variations introduced by the operating system. sum/\ is not so fast so we run it only once:

Ts 'sum/\ a10'

0.00181867 3648

Quite a difference: in this running total sum seems to be about 50 times slower than + . Let's just try adding a list to itself (remember that u~ y is equivalent to y u y):

1000 Ts '+~ a10'

2.68191e_5 896

100 Ts 'sum~ a10'

0.00033021 2560

Yes, sum is definitely slower than +, though only by a factor of 10 or so this time. Why should it be slower? The answer is, Because it deals with atoms. Since J verb-definitions are not compiled, but interpreted line-by-line on each execution, every single time we add two numbers with sum, the interpreter has to parse 'x + y' and perform the addition. Why, it's a miracle that it only slows down by a factor of 10! The lesson is that if you define verbs with small rank, the interpretive overhead will be significant.

Still, that doesn't fully explain why sum/\ is so much slower than +/\ . Let's investigate further by increasing the size of the operand:

a20 =. i. 20

1000 Ts '+/\ a20'

2.68191e_5 1280

Ts 'sum/\ a20'

0.00728641 3648

+/\ is unchanged when we move to a list of 20 items--the operation is so fast that time is being spent starting the verb rather than running it--but sum/\ slows down noticeably. Interesting; let's try bigger and bigger operands:

a40 =. i. 40

1000 Ts '+/\ a40'

2.76572e_5 1408

Ts 'sum/\ a40'

0.0299561 4160

a100 =. i. 100

1000 Ts '+/\ a100'

2.76572e_5 1664

Ts 'sum/\ a100'

0.185741 5184

a400 =. i. 400

1000 Ts '+/\ a400'

3.77143e_5 3200

Ts 'sum/\ a400'

3.00367 11328

Holy cow! On a 400-item list, sum/\ is 80000 times slower than +/\! What happened?

Recall what monad sum/\ is really doing. It applies monad sum/ to the first item of the list; then to the list made of the first 2 items; then the list made of the first 3 items; and so on. At each evaluation of monad sum/, the dyad sum verb is interleaved between the items and the result is evaluated right-to-left. The problem is, the interpreter doesn't analyze sum to know that it is associative--that x sum (y sum z) is the same as (x sum y) sum z--so it doesn't know that it can use the result from one subset as an input to the operation for the next subset, and it winds up performing every single addition: for the 400^th item it adds all 400 numbers together. That's why its time increases as the square of the length of the list.

Monad +/\ is fast because the interpreter knows that dyad + is associative, and therefore it reuses the result from one subset as input to the next, producing each item of the result with a single addition.

Well then, can we give a hint to the interpreter that sum is associative? Alas, no, but we have another trick up our sleeve. Consider monad sum/\., which applies monad sum/ to successive suffixes. If the interpreter is clever, it will notice that if it starts with the smallest suffix--the one made up of just the last item--and processes the suffixes in order of increasing size, it will always be evaluating x sum (previous suffix result), and right-to-left evaluation implies that the result of the previous suffix can always be used as the right operand to each application of monad sum, without needing any knowledge of associativity. Let me tell you, this interpreter is nothing if not clever, and that's just what it does. All we have to do is to convert our sum/\ into a variant of sum/\. . The way to do that is simple: we reverse the order of the items, apply sum/\., and reverse the order again:

sum/\.&.|. i. 7

0 1 3 6 10 15 21

This arises enough to be a standard J idiom: use it whenever you need to apply an associative verb on prefixes. It's much faster:

Ts 'sum/\.&.|. a400'

0.014805 59264

Still not as fast as +/\, but the suffix version uses time proportional to the number of items rather than the square of the number of items.

Use Large Verb-Ranks! and Integrated Rank Support

'Think big' is a watchword not just for program design, but for coding as well. Starting a primitive has a small cost, but if you start a primitive for each atom of a large array, the cost will add up. To reduce the time spent starting primitives, apply them to the biggest operands possible. This means, Use as large a verb-rank as you can. See what a difference a tiny change can make:

a =. i. 100000 10

Ts 'a -@+ a'

3.96384 4.19552e6

Ts 'a -@:+ a'

0.12801 8.3895e6

These two verbs produce identical results, but -@+ is 30-fold slower than -@:+ on this large operand. The reason is that -@+ has rank 0 (taken from the rank of +), while -@:+ has infinite rank. Rank 0 means that each pair of atoms is fed individually through the verb. So, when -@+ is executed, two primitives are started for each pair of atoms, one to add and the other to change the sign. Execution of -@:+ requires only two primitive-starts for the entire array.

You do not need to worry much about the ranks at which individual primitives are applied, because of an important feature of J called integrated rank support. When a verb with integrated rank support is used as the u in u"n, the resulting verb runs with a single primitive-start and the application of the verb on the proper cells is handled within the primitive. So,

100 Ts 'a + a'

0.0623949 4.19501e6

100 Ts 'a +"0 a'

0.248846 4.19526e6

100 Ts 'a +"1 a'

0.0681035 4.19526e6

100 Ts 'a +"2 a'

0.0626361 4.1952e6

All these forms produce identical results. The weak dependence of the speed on the rank is typical of a verb with integrated rank support. Fastest execution is achieved when the verb is used alone, but the form u"n still runs fast, and the higher the rank, the less the loop-control overhead. The Special Code page referred to in the previous section includes the long list of the primitives with integrated rank support. You will see there that u/, u/\, and the like are also taken care of.

The practical effect of integrated rank support is that you don't need to worry much about using the largest possible rank for primitives. In compounds and verbs that you write, you do need to keep the rank high:

Ts '(<a:;1) { a'

0.00939758 525568

Ts '1 {"1 a'

0.00952329 525184

Integrated rank support in dyad { gives the two forms equal performance. Look what happens when we replace the { by a user-defined verb with the same function:

from =. {

Ts '(<a:;1) from a'

0.00953335 525760

Ts '1 from"1 a'

0.365966 525696

from lacks integrated rank support, even though it is defined to have the same function as {, and it suffers when it is applied to each 1-cell. This is a good reason for you to learn the J primitives and not replace them with mnemonic equivalents.

Tips For Coding

Here is a list of things to keep in mind, starting with the most important. Some of these tips will not be understandable until you have mastered tacit programming.

Avoid Boxing, Especially Small Boxes

An unboxed array is stored packed into sequential memory locations. An array of boxes is stored as sequential headers, where each header points to the contents of the box; the contents of each box is stored in its own memory area.

Operations on the contents of an array of boxes require much more complicated addressing than operations on an unboxed array, and are therefore slower. In addition, the storage required for the headers can dwarf that for the data if the contents of each box are small.

For these reasons, you should try to use unboxed arrays when speed is paramount, and if you need boxing, make the contents of each box as large as possible. In C terms, you should use a structure of arrays rather than an array of structures. Most C programmers find the array of structures more pleasing. In C there is no penalty for using it, but J is different.

Use the Dyad i. Family: Dyad i. i: e. -., Monad ~. ~: u/.

Dyad i. is the most highly polished code in the J interpreter. It has more algorithms, based on operand size and type, than you'd ever code yourself. Use it whenever you can. Other primitives also use this fast code: these are dyads e. i: -. and monad ~. ~: u/ .

Use u&.> (u each) To Operate Inside Boxes

If you need to open boxes, do something to the contents, and box them up again, the way to do it is with the form u&.> which bypasses most of the overhead of creating a list of boxes. This form is recommended for either monadic or dyadic u . It is even worth using if one of the operands is an unboxed scalar, as the unboxing will have no effect and the reboxing will be fast.

&.> is given the name each in the J startup scripts.

Use In-Place Assignment

Special forms of dyad m} and dyad , modify or extend an array without copying the array. Use these whenever possible. The forms are:

name =. x m} name

name =: x m} name

name =. name , x

name =: name , x

Use Compounds Recognized by the Interpreter

The interpreter recognizes a great many compounds and has special code to perform the compound functions. For example, we have learned that u@:v y gives the same result as u v y, but it does not follow that the two forms are identical: +/@:, y is faster than +/ , y . How do you know what forms are handled with special code?

An appendix to the Dictionary gives a list of special code in the interpreter (press F1 to bring up help; then click on 'Dic' at the top of the page to bring up the Contents page; the appendices are listed at the end of the contents). There we see that there is special code for f/@:, so we know to use that form. Similarly, farther along we see that x i.&1@:< y has special coding, so we know to prefer that form over (x < y) i. 1 . This list changes from release to release, so you should review it occasionally. A few of the more important ones are:

Combine enfile with another operation: ($,) and u/@,

Use x ($,) y rather than x $ , y and u/@, y rather than u/ , y . In each case the interpreter can avoid making a copy of the data for the , operation.

Use unboxed x in x { y

If the x operand of x { y is a rank-2 array (in which each item specifies one cell of y), you should code x (<"1@[ { ]) y rather than (<"1 x) { y . The result is the same, but the first form creates the result directly without computing the boxed lists of x .

Search a constant list using preinterpreted m&i.

If you are going to do many searches through the same list of data m, you will do well to create a search verb

name =: m&i. NB. m&e. and -.&n work too, and some others

The compound m&i. is interpreted when name is assigned, and the interpreter builds an index for m and saves it as part of the verb, so that it doesn't have to be calculated each time the verb is used.

Note that it doesn't help you to use m&i. in a line in an explicit verb, because each line in an explicit verb is interpreted afresh each time it is executed.

Use i.&1@:u to find the first item where x u y is true

The natural way to find the first item where x u y is true is (x u y) i. 1 . The inefficiency in this code is that the entire x u y is calculated before the search is made for the first 1, which is needless computation for all the items after the first 1.

The form x i.&1@:u y produces the same result, but the interpreter can see what you are up to, and it stops computing u when the first 1 is found.

x i.&1@:u y always gives the correct answer, but it performs the fast abbreviated search only when the operands have rank less than 2 and u is one of = ~: < <: > >: E. e.

Variations of this form produce related results. i: rather than i. searches for the last occurrence; 0 rather than 1 searches for an item that is false. In place of i.&1, +/ counts the occurrences, +./ tests whether any of the results is true, *./ tests whether all the results are true, and I. gives all the indices at which the result is true.

The version with E. is the verb to use to find one string in another: x i.&1@:E. y finds the starting position of the first occurrence of the string x in the string y .

(u i. 1:) is an alternative way to write i.&1@:u .

Shining a Light: The J Performance Monitor

A magnet makes it easy to pick up a needle, but it won't much help you find a needle in a haystack. Likewise, being able to time and tune individual sentences will not suffice to let you improve the performance of a large J program. A large program spends most of its time executing a small subset of its code, and any improvements you make to other areas are simply wasted effort. I remember a case where a 20,000-line assembler-language program was spending 30% of its time executing a single machine instruction--and that instruction turned out to be unnecessary! What you need is a tool that will direct your attention to the areas where a speedup will really matter.

The J Performance Monitor will show you how much time is spent executing each line of your application. You can run the Lab on the Performance Monitor to see all the facilities available, or you can jump right into timing your code with the simple sequence

load 'jpm'

Do this once to load the tool. Then, for each timing run, execute

start_jpm_ 1e7

357142

The operand of start_jpm_ is the size in bytes of the trace buffer, and the result is the number of trace entries that can fit in the buffer. A trace entry is added for each line executed, and for entry and exit of explicit definitions (i. e. verbs defined with verb define).

run the code you want to time

viewtotal_jpm_ ''

J will display a popup window with information about the time spent in each verb. An example display is

+---------+------+--------+--------+-----+----+---+

|name |locale|all |here |here%|cum%|rep|

+---------+------+--------+--------+-----+----+---+

|accpay |base |0.001435|0.000829| 57.8| 58 |1 |

|intrep |base |0.000213|0.000213| 14.8| 73 |1 |

|accint |base |0.000393|0.000147| 10.2| 83 |1 |

|stretch |base |0.000142|0.000142| 9.9| 93 |1 |

|intexpand|base |0.000105|0.000105| 7.3|100 |1 |

|[total] | | |0.001435|100.0|100 | |

+---------+------+--------+--------+-----+----+---+

The columns contain the following information:

name the name of the verb

locale the locale the verb was running in

all the amount of time spent in this verb including time spent in verbs called by this verb

here the amount of time spent in this verb but not including time spent in verbs called by this verb

here% the here time as a percentage of total time

cum% cumulative total of here%

rep the number of times the verb was executed

You should focus your attention on the here column. If you see a verb that is taking longer than you think it should, double-click on its name to look at the details of its execution. Double-clicking on accpay will pop up another window showing

+--------+--------+---+----------------------------------+

|all |here |rep|accpay |

+--------+--------+---+----------------------------------+

|0.000041|0.000041|1 |monad |

|0.000040|0.000040|1 |[8] if. 4~:#y do. |

|0.000000|0.000000|0 |[9] 'imm frq int pay' return. end.|

|0.000054|0.000054|1 |[10] 'm f i p'=.y |

|0.000116|0.000116|1 |[11] len=.$p=.f#p%f |

|0.000724|0.000131|1 |[12] j=.}.len accint f intrep i |

|0.000322|0.000322|1 |[13] r=.j*+/\p%m}.1,(m-1)}.j |

|0.000137|0.000137|1 |[14] (len$(-f){.1)#r |

|0.001435|0.000841|1 |total monad |

+--------+--------+---+----------------------------------+

We see that line 13 takes the most time. Clicking on the column heading will sort the lines using that column as a key, making it easy for you to concentrate on the individual lines that are taking the most time.

You should be aware of one quirk. The Performance Monitor will account for the time spent in every line of a named verb that is explicitly defined (i. e. defined using verb define or 3 : or 4 :). Other verbs are accounted for only as a whole, not line-by-line. You may be surprised to find that a verb defined by

opaqueverb =: verb define"0

)

will not be treated line-by-line. Yes, there is an explicit definition, but opaqueverb is not that definition: opaqueverb is a compound produced by the rank conjunction " . If you want to look inside opaqueverb, you need to define it like so:

opaqueverb =: limpidverb"0

limpidverb =: verb define

)

The J Performance Monitor makes it easy to give your code a good finish by pounding down the nails that are sticking up. As of J6.01 there are a few quirks you need to work around: you cannot have a verb with the same name as a locale; you must close a detail window before you create a new one; and time spent in explicit modifiers is not correctly accounted for.