Pyramigram; Sum of the Bits

Recreational APL
Eugene McDonnell
September 1980

Puzzle

Pyramigram

The following puzzle originated with Linda Alvord, of Scotch Plains Fanwood High School in Scotch Plains, New Jersey. Ms. Alvord conducted APL contests for her students for several years and permitted nonstudents to enter the contests as well. An engaging thing about her contests was that she awarded an official-looking medal to the winner. I can remember how proud I was to receive my first medal. (To appreciate this story you should realize that at the time I had been using APL for a dozen years and was working in IBM’s prestigious APL design group.) Well, the day after I received the medal in the mail at home, my colleague Paul Berry was on the same commuter train to our office in Philadelphia. With great pride I showed my medal to Paul, who looked at it and said, “Oh yes, my son Michael has two of those.” Talk about put-downs!: Michael Berry (now in the Boston office of I.P. Sharp Associates) was then a high-school junior! In any event, here is the puzzle:

Write a function pg which takes a scalar integer argument from 1 to 26 and produce a character matrix containing a pattern like this:

         q
        w q
       q e w
      r q w e
     q e r t w

This example could have been generated by pg 5 , since it has 5 rows. In each row r appear r randomly selected and randomly ordered letters, once each. The letters are separated by single spaces, with equal numbers of spaces at either end of the row. The most interesting feature, however, is the sets of letters in adjacent rows differ only in one randomly selected letter.

Entries should be submitted, if possible, in direct-definition form. They will be judged on the basis of clarify, wit, power, and brevity. Send your entries to the Recreational APL Editor at the address shown inside the front cover, or to EEM on the I.P. Sharp or STSC mailbox systems.

Sum of the Bits

Charles Brenner gave a talk on APL at the IBM Research Center in Yorktown Heights, New York in 1968, when APL succeeded in taking that institution by storm, making the first several hundred converts to the power and beauty of APL. Brenner, who is now a consultant in Los Angeles, was then working in the APL development group and, among other things, was the original implementer of dyadic transpose. The talk he gave was on the internal workings of APL. Many in the audience had come to learn the secrets behind APL’s astounding rapidity of response, and its ability to support such a large number of users. You must remember that time-sharing, which had begun in 1958, had in 1968 still not reached maturity; with few exceptions, systems were slow, prone to breakdown, and couldn’t support many users. APL was one of the notable exceptions and, as a home-grown product, was enormously popular at IBM’s Research Center.

Brenner surprised some in the audience by making the remark that what impressed him about APL was not how well it worked, but how poorly. From his knowledge of the internal workings of the system, he knew tha this original APL had been developed as a series of general routines which handled all cases; almost no attempt had been made to optimize special cases. This was especially true in the case of Boolean data. Although the storage of Boolean data had been optimized, so that one Boolean element took just one bit of memory, the accessing of this data was quite inefficient.

One frequent problem is to count the number of objects having a certain property. Expressions for doing this are quite simple to write in APL. For example, to detemine how many numbers in a list l are negative, one writes +/l<0 . The expression l<0 produces a list of the same size as l , with a 1 where the corresponding element of l is negative, and 0’s elsewhere; +/ gives the sum of this list, the count of the number of negative elements in l .

All of this is straightforward, and illustrative of the notational power of APL. It is easy to write, but in this early implementation of APL, it was quite inefficent to execute. Each Boolean element was fetched one at a time, converted to an integer, and accumulated toward the total. The general fetch routine was given the address of the element to be fetched, the type of this element, and the type of conversion desired. When the list to be summed was fairly long, this could be painfully slow. Despite this, I should point out that one of the reasons for the great early success of APL was that the implementers had strenously resisted the temptation to show how smart they could be. They chose to get a solid working system going, rather than to tackle the problem of putting together the most highly optimized system, and in doing this they showed good instincts.

By 1969, however, this product was being employed commercially by the first APL time-sharing service organizations, who were under strong incentive to show their customers that their services were commercially viable. Two of these services were those of STSC and I.P. Sharp Associates. In 1969 these two companies shared a common computer in Toronto, and though subsequently STSC installed its own computer in Bethesda, Maryland they continued for several years maintained identical systems, sharing development efforts. In that era Al Rose, a vice president of STSC, called on Roger Moore, a vice president of I.P. Sharp, to improve the speed of Boolean summation. Moore had reasons of his own for wanting to do this, so he agreed. His analysis of the problem was faily subtle, and he managed to gain more efficiency in more areas than the original problem called for, but we will concentrate here on the essentials of his solution of the basic problem.

The heart of the technique was exploitation of the Translate instruction, peculiar to the architecture of the IBM System/360 line of computers. The Translate instruction is able to replace as many as 256 eight-bit bytes of data in storage by the same number of other bytes. Each byte of the first argument is used to select a byte from the second argument, which then replaces the byte of the first argument. Considering the elements of the two arguments as integers in the range 0-255, one could define the Translate instruction by the expression a←b[a] , where the length of a is 256 elements or less, and the length of b is at least ⌈/a . The reason Translate is in the System/360 set of instructions is to convert data from one transmission code to another, but that is not what Moore used it for. Instead, he put into b elements which gave the count of the number of bits in the corresponding elements of a . For example, if the entry in a were 01110110 (which, considered as a base-2 number is 118), entry 118 in b would be 00000101 , since there are five 1-bits in the element of a , and 00000101 , considered as a base-2 number, has the value 5.

With this translate table in hand, it was a relatively simple matter to take a Boolean array, translate the bytes of its rows to the corresponding bit count, and then sum the translated bytes. Suppose, for example, that a 32-element Boolean vector v is represented in the computer memory by the four bytes:

   11010110 00100011 01111101 00101101

These would be translated to the four bytes:

   00000101 00000011 00000110 00000100

which are the binary representations of the numbers 5, 3, 6, and 4, respectively. These four numbers could be summed by a Load and three Adds, to give the desired sum 18. Thus with one Translate, one Load, and three Adds it is possible to do what otherwise took 32 Translations, one Load, and 31 Adds. Actually Moore’s strategy was more sophisticated than this, so that he was able to handle ^/ , ∨/ , ≠/ , and =/ with the same code, and his code is considerably faster than even this analysis would indicate, but that isn’t part of the present story.

So successful was this technique that when IBM’s new high-performance VS APL system was announced in 1975, the Plus-Reduction on the I.P. Sharp and STSC systems was still twice as fast as that of VS APL. This is despite the fact that the VS APL Boolean +/ is done by a rather clever algorithm devised by Leonard Lyon of IBM’s Palo Alto Scientific Center; it is, however, data dependant, requiring as many Adds as there are 1’s in the data. A curious fact is that the VS APL algorithm also uses the Translate instruction, but for an expected reason: the elements of a Boolean array are stored by VS APL in reverse order within a byte! For example, if a Boolean vector had the elements 11100110 , these would be stored as 01100111 . This way of storing Boolean data hand its origin, I understand, in an attempt to optimize certain aspects of the APL microcode for the IBM Model-145 System/370 computer, and although that machine and its microcode are obsolescent, the reversed bits still plague VS APL.

Moore was not the first one to use the table-lookup technique for bit counting, however. Arthur Samuel in the 1950’s had been one of the early workers in the field of artificial intelligence. In the course of his work he had devised a computer program that, beginning with a rudimentary ability to play the game of checkers and some learning strategies, eventually developed into a better-than-average checkers player. Samuel used 32 of the 36 bit positions in the word of the IBM 704 computer to represent the playing squares of a checkerboard, and used 1’s in the appropriate bit positions to represent the pieces. Various words were used to represent attacking pieces, kings, and so forth. Samuel writes that “Bit counting is done by a table-lookup procedure in a closed subroutine of 16 executed instructions (408 microseconds). This requires a 256-word table which is generated at the start by a 13-word program.” [1].

This technique of fast adding of bits is thus fairly well-known. Not so well known is the fact that it is possible to speed up Boolean -/ in a similar way.

The phrase “alternating sum”, which appear to have originated with K.E. Iverson [2], describes a process in which successive terms are alternately added or subtracted from the developing sum. In APL the alternating sum of a vector v can be obtained by -/v . A simple way to compute the alternating sum is to sum every other term beginning with the first, and then subtract from this the sum of every other term beginning with the second. For example:

   -/3 1 4 1 5 9 2 6 5 3 5 8 9 7 9
   (+/ 3 4 5 2 5 5 9 9)-+/1 1 9 6 3 8 7
   42-35
   7

A fast algorithm for alternating sum over Boolean arrays for machines like the IBM System/360 may also be devised by using the Translate-instruction strategy. The difference is that, instead of storing corresponding sums of argument bytes, one stores corresponding alternating sums. For negative alternating sums, one stores the 2’s-complement form of the number. For example, the alternating sum of 00000001 is ¯1 , and the 2’s complement form of this number is 11111111 . It is a relatively straightforward task to write an assembly-language routine to translate argument bytes to their equivalent alternating-sum bytes, to load a byte at a time into a register, extend the byte to the left with the leading bit (thus forming a positive or negative machine integer), and accumulate the resulting value toward the final alternating sum.

It is difficult but not impossible to imagine a situation where one could want to accomplish an alternating sum over a Boolean vector. For example, suppose we had a vector v , and two propositions p and q . We wish to know how many elements of v satisfy p v , less the number of elements that satisfy q y . We could write (+/p v)-+/q v . However, with the insight you have just gained from reading this column, you can see immediately that a single reduction would suffice, namely -/,(p v),[1.5]q v .

References

[1]		Samuel, A. “Some studies in machine learning, using the game of checkers”, IBM journal of research and development 3 3 (July 1959).
[2]		Iverson, K.E. Personal communication.

First appeared in APL Quote-Quad, Volume 11, Number 1, 1980-09.

created:	2009-10-03 22:10
updated:	2018-03-29 20:10