Index-Of, A 30-Year Quest

Roger Hui



0. The Problem

I will talk about index-of, both the original APL\360 definition which looks for scalars, and the extended definition which looks for major cells. Other related functions such as membership will also be discussed. This ground has been well-covered [0,1,2]; even so, I hope to present material which is new or significant, perhaps even new and significant.

Index-origin is 0 throughout. Unless otherwise stated, benchmarks are on random arguments with a million items.
 

1. Hashing

The best general method is hashing. A C program for x index-of y is shown. It is for x and y from a primitive datatype but isn’t too far from a program for general datatypes.

Essentially, the hash table contains indices into x and collision resolution is by linear probing [0, page 518]. When looking for a key k (x[i] or y[i]), start at location h computed by the hash function HF . Inspect successive hash table entries until either finding the key (x[*h]==k) or finding a free entry (*h<0). The hash table wraps around at the end.

At the end of the inspection (when the while loop terminates),

•   If looking for x[i] and a free hash table entry is found, put the index i into the hash table.
If looking for y[i] and a free hash table entry is found, the index is xn , otherwise it is the content of the hash table entry where the while loop terminated.

For our purposes, it’s important that the code be compact and that homologs are aligned.
 

2. Hashing — Selfie I

Staring at the two lines of code, one realizes that there are two “selfies”.

The first selfie is when x and y are the same array pointer, which happens in finding the nub or in the key operator. In that case the two DO loops are almost identical and can be combined, with a significant speed-up.

Timings indicate that no APL other than J currently exploits this selfie.
 

3. Hashing — Selfie II

Another selfie is that, for 4-byte integers for some functions in the index-of family, under some conditions, the hash table can store the x values themselves instead of indices to the values. Under what conditions? When you know there is a value M not in x , you can use it to indicate “free entry”. A small speed-up obtains.

Timings indicate that no APL, not even J, currently exploits this selfie.
 

4. Hashing Code I

Members of the index-of family have succinct and efficient implementations.
 

5. Hashing Code II

Defining some macros makes the code more compact and more readable: HI for hash table with indices; HS for hash table with the values themselves.
 

6. Hash Table Size

The hash table size should be at least 2×n . With such a hash table size the average number of comparisons is 3 for a key in x and 1.5 for a key not in x . See Figure 44 in Section 6.4 of Knuth [0].

With a hashing implementation, it’s essential that the hash table is available for analysis, regarding both the hash table size (as here) and the hash function.
 

7. Hash Function

The ideal hash function depends on every argument bit, produces a random integer, and is fast. On 4-byte integers, p times x where p is a prime is pretty good. (I learned the idea from Arthur Whitney.) The computation ignores integer overflow.
 

8. Hash Function — Cyclic Redundancy Check

What about hashing arguments with an arbitrary number of bits?

In the early early 1980’s, I worked with something called the Timeplex terminal multiplexer. It used packet transmission with a CRC checksum [3] on every packet. CRC makes a good hash function for index-of, because the probability of CRCs being equal on unequal arguments is small. Moreover, CRC can be implemented by a shift register with feedbacks, so that it is computable in real time — better than linear time because the result is available a constant delay after the last argument bit is fed in.

At the J Conference in 2012, during an “evening seminar” at The Spotted Dick pub on July 24, the subject of index-of came up. I mentioned that I have been waiting since the 1980s for the hardware people to give us an instruction to do CRC. (You’d think that with a billion transistors on the chip they can afford to spare a few to implement it.) Joey Tuttle had an iPad; the pub had WiFi; I Googled “CRC instruction” and discovered that my 30-year wait was over [4].

CRC32 is available in SSE4.2 [5,6,7], circa 2008. Unfortunately, only CRC32 is available but not CRC64. 32 bits are sufficient as a checksum on a transmission packet but not as a hash function result. (They shoulda asked me. ☺)

Cryptographic hash functions such as SHA-1 [8], if they become fast enough and widely available enough, would be excellent for index-of hashing.
 

9. Small-Range

When the arguments are small-range integers, you can do better than hashing. How small is “small”? Here, the range is small if it’s less than the size of the hash table, about 2×n . That is, the same memory that would have been used as a hash table, is used as pigeonholes [9].

A small-range algorithm for index-of is readily coded in C, and can be modelled succinctly in J or APL.
 

10. Small-Range Timings

The table shows how much faster is the small-range algorithm compared to hashing, a factor of 2 to 7.5. The second column, the range-finding v hashing column, indicates that range-finding from scratch takes 1 to 2% of the time to do hashing. That is, for an investment of 1 to 2% to find the range there is a potential gain of a factor of 2 to 7.5.

In our experience, on modern systems timing differences of under 5% are difficult to detect.

The last column refers to the earlier discussion about putting indices (HI) into the hash table v putting the integers themselves (HS) into the hash table. For HS to work, we need a sentinel value M not in the data to indicate that the hash table entry is free; if the range is known, M can be _2147483648 or 2147483647 . So even if the arguments are not small-range we can still benefit from knowing the range. The benefit is smaller (a factor of 1.2) but still worthwhile.
 

11. Small-Range Table Initialization

When the lengths of x and y are small relative to the range, the time to initialize the table becomes a relatively significant cost. This is a common problem in array programming: small arguments are trickier because there is less wiggle room, a smaller number of elements over which to amortize the costs.

To avoid the initialization cost, exercise 2.12 of Aho, Hopcroft, and Ullman’s classic The Design and Analysis of Computer Algorithms [10] is applicable.

The hint is part of the statement of the exercise and gives away the game. The technique is sufficiently well-known that the first ten results of Googling “Exercise 2.12 sparse” talk about it (4,950,000 results on 2014-07-24).
 

12. Tool of Thought I

With all this talk about the implementation let’s not lose sight of the fact that index-of is a superior tool-of-thought. I’ll give you two examples.

APL systems typically optimize the inner product and dot equal (*./ . = in J and ∧.= in APL). For example, Peter Teeson did it in SHARP APL [11] in the 1970s and more recently Dyalog APL also has special code for it. But the inner product is equivalent to two applications of the extended index-of (one a selfie), followed by an outer product, and is faster when computed that way.

The ratio of 14.6 indicates that the inner product has not been optimized much in J. The (5.11) is what the ratio would be if index-of were sped up in Dyalog APL with CRC32. The last row indicates that the comparison remains much the same if the times for the transpose are excluded.

To my knowledge, in over 45 years of APL systems no one thought to implement and dot equal using index-of. They might have with the extended definition.
 

13. Tool of Thought II

Inverted tables are a common efficient way to store data. Last year, we investigated a 28-line user function that (as it turned out) computed index-of on inverted tables. The extended definition ofmakes it easier to see a terse and faster equivalent function [12] .

We eventually decided to implement inverted table index-of as an “i-beam” ( AKA foreign conjunction). It uses CRC32 hashing and is 10 to 40 times faster than the user’s function.
 

14. Range Finding

In order to use the faster small-range algorithms, we need to know the range. To find the range, there is standard code that requires 2×n comparisons. In school we learnt that there is a way to find the range using 1.5×n comparisons. Which do you think is faster on modern systems?

Put that way, you’ve got to know that the standard coding is faster. But how much faster? Answer: Factor of 2.
 

15. Range Finding with Vector Instructions

Intel machines with SSE4.1 [7,13] (2008 or later) has max and min vector instructions. It works with 16 byte registers and so processes 4-byte integers four at a time. Thanks to Jay Foad of Dyalog Ltd., the Dyalog APL interpreter now has a fast range-finding utility.

The compiled code is a tight loop of 5 machine instructions and the factors are greater than one expects from working 4- or 8- or 16-at-a-time.
 

16. Range Finding Puzzle

We close with a programming puzzle. Given a vector of 1-byte integers, can you find its maximum faster than the standard C program? No use of multicore, vector instructions, loop unrolling, or other similar tricks.

The puzzle originated as follows. In the 2013 Dyalog Conference, I had a slide giving relative timings on the new key operator [14]. It shows that key with tally takes less than a factor of 2 over finding the maximum. This is fast; unbelievably fast even. The puzzle arose from investigations into this timing.
 

References

[0] Knuth, D.E., The Art of Computer Programming, Volume 3, Sorting and Searching, Addison-Wesley, 1973.
[1]   Bernecky, R., Speeding Up Dyadic Iota and Dyadic Epsilon, Proceedings of the APL Congress 73, 1973-08-22 to -24, P. Gjerløv, H.J. Helms, and Johs Nielsen, Editors, North-Holland Publishing Company, 1973.
[2]   Hui, R.K.W., Hashing for Tolerant Index-Of, APL 2010 Conference Proceedings, 2010-09-16.
[3] Wikipedia, Computation of CRC, 2014.
[4]   Kankowski, P., Benchmarking CRC32 and PopCnt Instructions, 2010.
[5] Intel, Intel 64 and IA-32 Architectures Developer’ Manual: Volume 2A, 2014-06; CRC32 — Accumulate CRC32 Value, page 3-191.
[6] Microsoft, SSE 4 Instructions, Visual Studios 2008, Micrsoft Developer Network, 2008; Compiler Instrinsics _mm_crc32_u8, _mm_crc32_u16, _mm_crc32_u32, and _mm_crc32_u64.
[7] GNU, 6.55.14 X86 Build-in Functions, GNU Online Docs, 2014.
[8] Wikipedia, Cryptographic hash function, 2014.
[9] Wikipedia, Pigeonhole principle, 2014.
[10] Aho, A.V., Hopcroft, J.E., and Ullman, J.D., The Design and Analysis of Computer Algorithms, Addison-Wesley, 1974, Exercise 2.12.
[11] Teeson, Peter H., ∧.= , SHARP APL, 197x.
[12] Hui, R.K.W., Rank & Friends, Dyalog User Conference, 2013-10-22.
[13] Microsoft, SSE 4 Instructions, Visual Studios 2008, Micrsoft Developer Network, 2008; Compiler Instrinsics _mm_min_epi32 and _mm_max_epi32.
[14]   Hui, R.K.W., Primitive Performance, New Constructs: Tally with Key I, Dyalog User Conference, 2013-10-22.


Presented at the Jsoftware Conference 2014, 2014-07-25.

created:  2014-02-09 11:20
updated: