IndexOf, A 30Year Quest I will talk about indexof, both the original APL\360 definition which looks for scalars, and the extended definition which looks for major cells. Other related functions such as membership will also be discussed. This ground has been wellcovered [0,1,2]; even so, I hope to present material which is new or significant, perhaps even new and significant. Indexorigin is 0 throughout.
Unless otherwise stated, benchmarks are on random arguments
with a million items.
The best general method is hashing. A C program for x indexof y is shown. It is for x and y from a primitive datatype but isn’t too far from a program for general datatypes. Essentially, the hash table contains indices into x and collision resolution is by linear probing [0, page 518]. When looking for a key k (x[i] or y[i]), start at location h computed by the hash function HF . Inspect successive hash table entries until either finding the key (x[*h]==k) or finding a free entry (*h<0). The hash table wraps around at the end. At the end of the inspection (when the while loop terminates),
For our purposes, it’s important that the code be compact
and that homologs are aligned.
Staring at the two lines of code, one realizes that there are two “selfies”. The first selfie is when x and y are the same array pointer, which happens in finding the nub or in the key operator. In that case the two DO loops are almost identical and can be combined, with a significant speedup. Timings indicate that no APL other than J currently exploits this selfie.
Another selfie is that, for 4byte integers for some functions in the indexof family, under some conditions, the hash table can store the x values themselves instead of indices to the values. Under what conditions? When you know there is a value M not in x , you can use it to indicate “free entry”. A small speedup obtains. Timings indicate that no APL, not even J, currently exploits this selfie.
Members of the indexof family have succinct and efficient implementations.
Defining some macros makes the code more compact and more
readable: HI for hash table with indices; HS
for hash table with the values themselves.
The hash table size should be at least 2×n . With such a hash table size the average number of comparisons is 3 for a key in x and 1.5 for a key not in x . See Figure 44 in Section 6.4 of Knuth [0]. With a hashing implementation,
it’s essential that the hash table is available for analysis,
regarding both the hash table size (as here) and the hash function.
The ideal hash function depends on every argument bit,
produces a random integer, and is fast.
On 4byte integers, p times x where p is a prime
is pretty good. (I learned the idea from Arthur Whitney.)
The computation ignores integer overflow.
8. Hash Function — Cyclic Redundancy Check What about hashing arguments with an arbitrary number of bits? In the early early 1980’s, I worked with something called the Timeplex terminal multiplexer. It used packet transmission with a CRC checksum [3] on every packet. CRC makes a good hash function for indexof, because the probability of CRCs being equal on unequal arguments is small. Moreover, CRC can be implemented by a shift register with feedbacks, so that it is computable in real time — better than linear time because the result is available a constant delay after the last argument bit is fed in. At the J Conference in 2012, during an “evening seminar” at The Spotted Dick pub on July 24, the subject of indexof came up. I mentioned that I have been waiting since the 1980s for the hardware people to give us an instruction to do CRC. (You’d think that with a billion transistors on the chip they can afford to spare a few to implement it.) Joey Tuttle had an iPad; the pub had WiFi; I Googled “CRC instruction” and discovered that my 30year wait was over [4]. CRC32 is available in SSE4.2 [5,6,7], circa 2008. Unfortunately, only CRC32 is available but not CRC64. 32 bits are sufficient as a checksum on a transmission packet but not as a hash function result. (They shoulda asked me. ☺) Cryptographic hash functions such as SHA1
[8],
if they become fast enough and widely available enough,
would be excellent for indexof hashing.
When the arguments are smallrange integers, you can do better than hashing. How small is “small”? Here, the range is small if it’s less than the size of the hash table, about 2×n . That is, the same memory that would have been used as a hash table, is used as pigeonholes [9]. A smallrange algorithm for indexof is readily coded in C,
and can be modelled succinctly in J or APL.
The table shows how much faster is the smallrange algorithm compared to hashing, a factor of 2 to 7.5. The second column, the rangefinding v hashing column, indicates that rangefinding from scratch takes 1 to 2% of the time to do hashing. That is, for an investment of 1 to 2% to find the range there is a potential gain of a factor of 2 to 7.5. In our experience, on modern systems timing differences of under 5% are difficult to detect. The last column refers to the earlier discussion about putting indices
(HI) into the hash table v putting the integers
themselves (HS) into the hash table.
For HS to work, we need a sentinel value M not
in the data to indicate that the hash table entry is free;
if the range is known, M
can be _2147483648 or 2147483647 .
So even if the arguments are not smallrange
we can still benefit from knowing the range.
The benefit is smaller (a factor of 1.2) but still worthwhile.
11. SmallRange Table Initialization When the lengths of x and y are small relative to the range, the time to initialize the table becomes a relatively significant cost. This is a common problem in array programming: small arguments are trickier because there is less wiggle room, a smaller number of elements over which to amortize the costs. To avoid the initialization cost, exercise 2.12 of Aho, Hopcroft, and Ullman’s classic The Design and Analysis of Computer Algorithms [10] is applicable. The hint is part of the statement of the exercise and gives away the game.
The technique is sufficiently wellknown that the first ten results of
Googling
“Exercise 2.12 sparse” talk about it (4,950,000 results on 20140724).
With all this talk about the implementation let’s not lose sight of the fact that indexof is a superior toolofthought. I’ll give you two examples. APL systems typically optimize the inner product and dot equal (*./ . = in J and ∧.= in APL). For example, Peter Teeson did it in SHARP APL [11] in the 1970s and more recently Dyalog APL also has special code for it. But the inner product is equivalent to two applications of the extended indexof (one a selfie), followed by an outer product, and is faster when computed that way. The ratio of 14.6 indicates that the inner product has not been optimized much in J. The (5.11) is what the ratio would be if indexof were sped up in Dyalog APL with CRC32. The last row indicates that the comparison remains much the same if the times for the transpose are excluded. To my knowledge, in over 45 years of APL systems no one thought
to implement and dot equal using indexof.
They might have with the extended definition.
Inverted tables are a common efficient way to store data. Last year, we investigated a 28line user function that (as it turned out) computed indexof on inverted tables. The extended definition of ⍳ makes it easier to see a terse and faster equivalent function [12] . We eventually decided to implement inverted table indexof
as an “ibeam” (⌶ AKA foreign conjunction).
It uses CRC32 hashing and is 10 to 40 times faster
than the user’s function.
In order to use the faster smallrange algorithms, we need to know the range. To find the range, there is standard code that requires 2×n comparisons. In school we learnt that there is a way to find the range using 1.5×n comparisons. Which do you think is faster on modern systems? Put that way, you’ve got to know that the standard
coding is faster. But how much faster?
Answer: Factor of 2.
15. Range Finding with Vector Instructions Intel machines with SSE4.1 [7,13] (2008 or later) has max and min vector instructions. It works with 16 byte registers and so processes 4byte integers four at a time. Thanks to Jay Foad of Dyalog Ltd., the Dyalog APL interpreter now has a fast rangefinding utility. The compiled code is a tight loop of 5 machine instructions
and the factors are greater than one expects from working 4
or 8 or 16atatime.
We close with a programming puzzle. Given a vector of 1byte integers, can you find its maximum faster than the standard C program? No use of multicore, vector instructions, loop unrolling, or other similar tricks. The puzzle originated as follows.
In the 2013 Dyalog Conference, I had a slide
giving relative timings on the new key operator
[14].
It shows that key with tally
takes less than a factor of 2 over finding
the maximum.
This is fast; unbelievably fast even.
The puzzle arose from investigations into
this timing.
References
Presented at the Jsoftware Conference 2014, 20140725.
