A Perfect Square Root Routine

Abstract

To characterize a much larger effort,
the design and implementation of a square root routine
for IPSA APL is described.
The routine is perfect in the sense that,
if the result can be represented exactly,
the exact result is given.
If the result cannot be represented exactly,
it is rounded to the nearest
representable floating point number.
The use of APL in designing and testing
this routine is emphasized.
Introduction

Over the past several years I have been involved in making extensive additions and changes to those portions of the IPSA APL system that deal with the floating point arithmetic data format and instructions of the IBM System/370 architecture. These are itemized in the following paragraphs.

I redesigned the floating point to character conversion routine to make it more efficient and more accurate. Efficiency was gained by using fixed point instructions in place of floating point instructions. The improvement in accuracy now assures that, in IPSA APL, with maximum precision set, if the digits of a displayed number are reentered, the resulting internal value will be identical to that which gave rise to the displayed number.

I redesigned the routine for converting numbers from character form to use fixed point instructions in place of floating point instructions. The incentive to do this arose from the microcomputer. A major dissatisfaction with the implementation of IPSA APL on the IBM PC was the slow conversion to and from floating point numbers. I had already speeded up the output conversion process, and now took the opportunity to speed up the input conversion process as well.

I revised the matrix divide routine to allow higher rank arguments, so that one can now invert all the matrices along the last two dimensions of a rank three or higher rank array. It was speeded up by from five to thirty per cent, depending on the size of the array and on the hardware.

Doug Forkes of our Toronto development group and I jointly added complex arithmetic to IPSA APL to make it more useful for those engineering, scientific, and mathematical applications that require it. My part in this effort was to design and write all of the complex scalar function routines.

Lastly, I redesigned and rewrote all of the real elementary function routines. The collection of new routines is smaller than the one it replaced, and most routines are more accurate; many are faster.
When I first thought of preparing a paper for this conference, I intended it to be a complete survey of all the above. In working on an outline, however, I realized that if I were to do justice to it, the paper would far exceed the boundaries permitted. I decided that the best way to give a sense of the work would be to concentrate on a part that was small enough to fit in a conference paper, and yet significant enough that it would properly represent the whole. I selected the redesign of the square root routine. The square root function is not a primitive part of APL [1], but nonetheless many APL systems, including IPSA APL, include a square root routine. This is for several reasons. First, the square root occurs with great frequency in mathematics. Next, if one implements the power function, ⍺*⍵ , which is part of APL, solely in terms of its general definition, *⍵×⍟⍺ , one does not usually get accurate enough square roots. Lastly, the square root is used by many APL primitives, for example matrix divide, arcsine, and arccosine. Several aspects of the redesign of the square root
routine may be of interest to the general reader.
Those involved in implementations of APL may find it
useful to read of the modelling and testing
techniques that were used in this work.
They are similar enough to the modelling
and testing that was done for all the other work
mentioned to convey the nature of the entire process.
Theory

Heron, the Alexandrian scientist (possibly as early as the second century B.C. or as late as the third century A.D.) devised a method for approximating square roots which is still in use today. The method is an iterative one, and exploits the fact that any particular estimate y of the value of the square root r of a positive number x , if it is not equal to r , may be improved by computing the average of y and x÷y , that is,

      y←(y+x÷y)÷2

If the estimate y is positive, this method will eventually converge to the square root. For example, if the number x whose square root we desire is 9 , and if our initial estimate y of its square root is 2 , a better y is given by:

      (2+9÷2)÷2
3.25

The successive values obtained in this process (with maximum precision set) are:

2 3.25 3.0096153846153846 3.0000153600393216 3.0000000000393214 3

Typically, the number of accurate digits in y about doubles at each step. The closer the initial estimate y is to the eventual result, the fewer will be the number of iterations required. Much has been written on the subject of choosing best estimates for starting values, and, as you will see, a great deal of thought went into this aspect of the new square root routine being described here. Instead of having the central part of the square root routine deal with numbers across the entire range of allowable arguments, we use a device which transforms all arguments into the interval from 0.0625 to 16 . For System/370 floating point numbers this corresponds to numbers having hexadecimal exponents 0 and 1 . Those arguments with even hexadecimal exponents are transformed into the interval from 0.0625 to 1 , and arguments with odd hexadecimal exponents are transformed into the interval from 1 to 16 . The square root of the transformed argument is computed, and the final result is obtained by multiplying this by a certain power of 16 .
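Heron's iteration is easy to reproduce. Here is a short Python sketch (Python is used for illustration; the paper's own models are in APL) that repeats the y←(y+x÷y)÷2 step for x equal to 9 , starting from the estimate 2 :

```python
def heron_sqrt(x, y, steps):
    """Iterate Heron's improvement y <- (y + x/y) / 2, keeping the
    successive estimates; each step roughly doubles the accurate digits."""
    history = [y]
    for _ in range(steps):
        y = (y + x / y) / 2
        history.append(y)
    return history

print(heron_sqrt(9, 2.0, 5))   # 2.0, 3.25, 3.0096..., converging to 3
```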
The entire process of computing the square root r of a vector of positive numbers x having the value 144 2304 36864 may be summarized as follows:
1. Find e , the 16's exponent of x .            e is 2 3 4
2. Compute s←⌊e÷2 .                             s is 1 1 2
3. Compute the reduced argument m←x÷16*2×s .    m is 0.5625 9 0.5625
4. Compute r , the square root of m .           r is 0.75 3 0.75
5. Scale the result: z←r×16*s .                 z is 12 48 192
Transforming the arguments in this manner simplifies the development of the initial approximation and has no effect on the accuracy of the result, since the transformation deals only with the exponent of the argument, and leaves the fraction part unaltered. The primary concern of the rest of this paper
is with the details of step 4,
the calculation of the square root
of the reduced argument.
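The transformation and its inverse can be sketched in Python (an illustration only: frexp-style binary arithmetic stands in for the S/370 hexadecimal format, and math.sqrt stands in for step 4, the subject of the rest of the paper):

```python
import math

def reduce_arg(x):
    """Transform positive x into (m, s) with x = m * 16**(2*s) and
    m in [0.0625, 16): even hexadecimal exponents give m in [0.0625, 1),
    odd ones give m in [1, 16)."""
    _, b = math.frexp(x)            # x = f * 2**b with f in [0.5, 1)
    e = -((-b) // 4)                # 16's exponent: x = g * 16**e, g in [1/16, 1)
    s = e // 2                      # floor of e divided by 2
    m = x / 16.0 ** (2 * s)
    return m, s

def sqrt_via_reduction(x):
    """Steps 1 through 5 of the outline; math.sqrt stands in for step 4."""
    m, s = reduce_arg(x)
    return math.sqrt(m) * 16.0 ** s

for x in (144.0, 2304.0, 36864.0):
    print(x, sqrt_via_reduction(x))   # -> 12.0, 48.0, 192.0
```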
Design

My first APL model, though guided by highly reputable sources [COD, HAR], did not succeed in producing a function with great accuracy. To test this version, I wrote an APL function which generated random positive numbers, found the square root of these using the APL model, and then compared the argument with the square of the result. In generating thousands of samples, I found hundreds of cases of disagreement, with the difference between argument and the square of the result being as large as seven units in the last place (a unit in the last place is called an ulp). I discussed my dissatisfaction with my colleague Doug Forkes, of our Toronto development group, who suggested a different approach. The key to his idea was to provide two different choices for a starting approximation, based on the exponent of the transformed argument. He introduced me to the theory of least maximum approximations, or minmax approximations. This theory can be used to develop approximations that have the property that, for a given range of arguments, the maximum error is minimized [KEL]. With this technique, Forkes determined the coefficients for two straight-line, or linear, polynomials, which minimized the maximum relative error. One covered the range from 0.0625 to 1 ; the other covered the range from 1 to 16 . He noticed that the coefficients for the two polynomials were related: those for the lower range were 8 2÷9 , and those for the upper range were 2 8÷9 , where the elements are given in the order: linear term, constant term. The function he designed using this choice of two linear polynomials gave much improved results over my earlier attempts, but still lacked perfection. However, it suggested two ways in which it might be improved. First, I thought that, if two linear polynomials gave better results than one, perhaps two quadratic (or even higher order) polynomials might give even more precise results.
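Forkes's two seed polynomials can be checked numerically. Pairing the steeper line (slope 8÷9 , intercept 2÷9 ) with the lower range and the shallower one with the upper range, the maximum relative error of each seed comes out at about 1÷9 , equioscillating at the endpoints and at one interior point. A Python check (illustrative only; the sampling grid is mine):

```python
import math

def seed(m):
    """Forkes's two minimax linear seeds for sqrt(m), m in [1/16, 16):
    the steeper line covers [1/16, 1), the shallower one [1, 16)."""
    if m < 1:
        return (8 / 9) * m + 2 / 9
    return (2 / 9) * m + 8 / 9

def max_rel_err(lo, hi, n=20000):
    """Largest relative error of seed() over [lo, hi], sampled on a grid."""
    worst = 0.0
    for k in range(n + 1):
        m = lo + (hi - lo) * k / n
        worst = max(worst, abs(seed(m) - math.sqrt(m)) / math.sqrt(m))
    return worst

print(max_rel_err(1 / 16, 1), max_rel_err(1, 16))   # both close to 1/9
```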
I developed the vectors of coefficients for two quadratic polynomials covering the same ranges as did Forkes’s linear polynomials, and found that, indeed, the rate of convergence of the algorithm using two quadratic initial estimates was better than the one using two linear initial estimates. However, the additional instructions required at each step to develop the next quadratic approximation actually slowed the process down. Discarding this approach, I next asked the question, whether partitioning the argument into a larger number of subranges covering the interval from 0.0625 to 16 might not require fewer iterations, without increasing the instruction count inordinately. To help you understand what follows, let me show you what Forkes had done in terms of S/370 hexadecimal floating point numbers. First, the argument had been reduced so that it was in the range 0.0625 through 16 , or, in hexadecimal, to a range which had 0.1 as its lower boundary, and f.fff… as its upper boundary. Forkes separated the arguments into the two ranges:
0.1000… through 0.ffff… (even hexadecimal exponent)
1.0000… through f.ffff… (odd hexadecimal exponent)
I planned to carry this several steps further, and use not only the exponent, but also part of the leading fraction digit, to establish the subranges. Thus I set up:
0.1 through 0.1fff…    1 through 1.fff…
0.2 through 0.3fff…    2 through 3.fff…
0.4 through 0.7fff…    4 through 7.fff…
0.8 through 0.ffff…    8 through f.fff…
I computed eight linear minmax polynomials, one for each of the above ranges, and found that these, like those that Forkes had computed, were also simply related. The coefficients I found were:
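Following the expression used later in the model ( c←dfh 8 8 ↑xfd⌽0.14753807×(2*0.5)*⍳8 ), the eight linear terms are 0.14753807 times successive powers of the square root of two, and the constant terms are the same vector reversed. A Python sketch that generates them (the decimal roundings printed here are reconstructed from that expression, not copied from the paper's table):

```python
import math

BASE = 0.14753807    # the smallest coefficient, taken from the model's expression

# Linear terms, steepest seed (lowest subrange) first; the constant
# terms are the same vector reversed, so only one vector need be stored.
linear = [BASE * math.sqrt(2) ** (7 - k) for k in range(8)]
constant = linear[::-1]

for a, b in zip(linear, constant):
    print(f"{a:.8f}  {b:.8f}")
```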
A glance at these will show that the vector of constant terms is the reversal of the vector of linear terms. (Furthermore, successive pairs of terms have the quotient 2*.5 , but I found no easy way to make use of this fact.) Because the two vectors have the same elements,
it is only necessary to store the vector
of linear coefficients,
since the constant coefficients can be
derived from it by simple means.
Furthermore, the coefficients could be stored
as four-byte floating point constants,
because of considerations of the precision
of the initial estimates obtained
with these approximations.
Error analysis and precision

The maximum relative error (mre) of a given approximation is determined as a by-product of the minmax process. When the maximum relative error is minimized, one knows what this maximum is. For each of the eight linear polynomials found as described above, the mre was 0.007453 . The minimum precision of an approximation in base-⍺ digits corresponding to a given mre ⍵ can be computed with the formula minprec: -⍺⍟⍵ . Given an initial estimate with mre 0.007453 , we can see that we are assured an initial approximation having 2.13 decimal, or 1.77 hexadecimal digits. For example, suppose we are trying to find the square root of 1 . We determine an initial estimate by evaluating the linear polynomial 1⊥ 0.41730069 0.59015230 , getting the result 1.00745299 , which is, as predicted, within 0.007453 of the exact answer. The mre at each stage of the iterative process for square root can be determined by using the formula nextmre: 0.5×(⍵*2)÷1+⍵ , where ⍵ is the mre from the preceding stage. The nextmre function can be derived from Heron's formula by replacing the argument and result by expressions involving the error terms, and solving for the new error term. If we look at the nextmre function, we see that the next mre tends to be quadratically smaller than the current mre, since the numerator term is less than one and quadratic, while the denominator is greater than one. This implies that the number of digits of precision about doubles at each step. The precision function minprec can be used with the nextmre function to give us the mre and the number of decimal and hexadecimal digits of precision at each step of the square root process:
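The propagation of the maximum relative error can be tabulated mechanically. This Python sketch (illustrative; the guaranteed digit count is the negative logarithm of the mre in the given base) reproduces the figures quoted in the surrounding text, 1.77, 3.79, 7.82, and 15.9 hexadecimal digits:

```python
import math

def minprec(base, mre):
    """Guaranteed digits of precision in the given base for a
    maximum relative error mre (the APL minprec)."""
    return -math.log(mre, base)

def nextmre(mre):
    """Error after one Heron step (the APL nextmre)."""
    return 0.5 * mre * mre / (1 + mre)

mre = 0.007453                      # error of the linear seed
for step in range(4):
    print(f"step {step}: mre {mre:.3e}  "
          f"decimal {minprec(10, mre):5.2f}  hex {minprec(16, mre):5.2f}")
    mre = nextmre(mre)
```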
Since a S/370 long floating point number has fourteen hexadecimal digits, and a S/370 short floating point word has six hexadecimal digits, this table tells us that a) only three iterations will be needed to guarantee the required precision of fourteen hexadecimal digits and b) we can use short (four-byte) floating point data to store the initial- and first-approximation coefficients, since neither exceeds the six hexadecimal digits of precision given by this S/370 data format. This theoretical understructure was used to
create an APL model for a new square root routine.
The model showed quite good properties.
In particular, it promised to lead
to a faster square root routine
because it used only three iterations,
as opposed to four.
It was also more accurate than the
first version of the Forkes algorithm.
Obtaining perfect accuracy

My method of testing this particular design was a bit unusual. Instead of, as before, comparing the square of the result with the argument, I used the fact that my test machine, an IBM 4381, had a built-in square root instruction which, according to the manual [IBM1] gives exact results if possible, and otherwise gives the nearest representable floating point number. I decided to use my random argument generator to provide arguments for my APL model to evaluate, and then compare the results the model gave with the result of using the built-in square root instruction, keeping track of disagreements between the two. I tested more than 20,000 arguments, and found that there were three cases of disagreement. The problem was the same for all of the arguments which gave disagreement, so I'll focus on just one of them, hexadecimal 0.4a78b9af084bba , or decimal 0.29090462229969108 . The results of my investigation astonished me.
The answer given by the 4381 square root instruction, when squared, was one ulp less than the argument; the answer given by my algorithm, when squared, was exactly equal to the argument. It looked as if the 4381 instruction was wrong. I explored more deeply. I looked into the available IBM documentation for the 4381 square root instruction [IBM2], and saw that a key part of its algorithm, used at the very end, was the application of something called the Tuckerman rounding test. This test asserts that r is the correctly rounded square root of x if and only if:

      (r×(r-ulp)) < x  and  x ≤ (r×(r+ulp))

I puzzled over this assertion for a while, and conjectured, in the face of the evidence I had, that the relation in the first part of the test should be ≤ instead of < . If this change were made, then the results I got, which were obviously right, would pass the test, and the 4381 results, which were obviously wrong, would fail it. In order to build my confidence in the accuracy of this speculation, I decided to investigate the Tuckerman rounding test by going to the man who I was fairly certain was its progenitor. I knew Bryant Tuckerman to be a mathematician at IBM's Watson Research Center in Yorktown Heights, New York. He had been an early associate of John von Neumann at the Institute for Advanced Study in Princeton, New Jersey, and in fact had left his mark on APL. It was Tuckerman who was responsible for determining that 7*5 , or 16807 , was the best seed for the random number generator in APL\360 [MCD]. I telephoned him and asked my question about the Tuckerman rounding test (which he acknowledged as his). He assured me, however, that there was no doubt about the statement of the test in the manual: it was correct. This gave me pause. I decided to go back to my model and to my testing procedure, and explore it for possible misbehavior.
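The test itself is easy to state for IEEE doubles. In this Python sketch (a substitution of mine: the paper's setting is S/370 hexadecimal floating point with truncating multiplies, whereas here the two products are evaluated exactly with Fraction) the correctly rounded root passes, and its neighbor one ulp up fails:

```python
import math
from fractions import Fraction

def tuckerman_ok(x, r):
    """Tuckerman's rounding test: r is the correctly rounded square
    root of x iff  r*(r-ulp) < x <= r*(r+ulp).  The products are
    taken exactly here, via Fraction."""
    r_down = math.nextafter(r, 0.0)       # r minus one ulp
    r_up = math.nextafter(r, math.inf)    # r plus one ulp
    xf, rf = Fraction(x), Fraction(r)
    return rf * Fraction(r_down) < xf <= rf * Fraction(r_up)

for x in (2.0, 9.0):
    r = math.sqrt(x)
    print(x, tuckerman_ok(x, r), tuckerman_ok(x, math.nextafter(r, math.inf)))
```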
I decided to stop my reliance on comparing the square of the result with the argument, and instead to use the same method of testing that I would later employ in all the other mathematical routines I would be designing. This exploited a workspace that permitted extended precision computations in any number base. It also included high precision routines for most of the elementary functions, and high precision values for such constants as e and pi. Using this workspace, with the number base set to 16 , I went through the following exercise: The number whose square root I wanted was (in hexadecimal):
(a)  0.4a78b9af084bba
Using the extended precision square root function in my high precision workspace, I computed the square root of this number to be:
So the most accurate square root of (a) is the number given by the part of (b) outside of brackets. This is because the 15th fraction digit is 7 , so rounding doesn't change the preceding digit 1 to 2 . (A hexadecimal number should be rounded to the next higher number only if the first omitted digit is 8 or higher.) The number (b) is the result given by the 4381 square root instruction. Moreover, squaring the number given by the part of (b) outside of brackets gives us:
and this is less than (a) by 0. …89fe d970… . If we take the number one ulp higher than (b), namely
(the result given by my algorithm) and square it, we get
which is further from (a) than (c) is, since the part in brackets, 8a27 9650 , is greater than 89fe d970 . The difference is small, but in the quest for perfection it is significant. IBM's S/370 floating point arithmetic truncates its results, so that the part shown in brackets in (b), (c), and (e) is discarded. This gives the anomalous situation that the square of the most accurate square root (b) is further from the original argument (a) than is the square of a less satisfactory value (d), whose square (e) is exactly (a). After staring at this for a while,
I finally understood what was happening.
I saw that what was obvious was nonetheless wrong,
and I apologized (mentally) to Bryant Tuckerman
and to IBM’s 4381 square root instruction.
I then revised my model to include the Tuckerman
rounding test at the end.
With this done, and with my testing changed to
incorporate a comparison with a high precision,
best possible square root result,
I now had the happy situation that my model
gave the same results as did the 4381.
APL model

The APL model which I used to produce the assembly language version may be of some interest. The thing I'd like to call the reader's attention to in the model is the way in which I am able to duplicate exactly the facilities of the underlying machine architecture (in this case the IBM S/370), typically by providing an APL function, appropriately named, for a single S/370 instruction. Thus, the APL function ⍺ srl ⍵ gives as result the effect of shifting the integer argument ⍵ right by ⍺ bits. For example, 3 srl 65 ←→ 8 . This has real importance for the system programmer who desires to be able to test at the APL level and thus have high confidence in the design before ever writing a line of assembly code. The prototype for this technique is, of course, the classical description of S/360 by Falkoff, Sussenguth, and Iverson [FAL]. I don't give the details of these functions in this paper, since many of them tend to be highly complex in their machine specificity, and not very pretty to look at. I'd be happy to communicate their details to anyone interested, by private communication. A line-by-line explication of the APL model is given immediately below. Customarily, in this kind of work, I set the index origin and the comparison tolerance to 0 , in order to reflect the conditions that pertain when the assembled code is executing on the machine.

      z←sqrt ⍵;⎕io;⎕ct;e;s;m;c;i;j;y0;y1;y2;y3
      ⎕io←⎕ct←0

Two exceptional cases have to be handled. After setting the result variable z to the argument, anticipating that it might be zero, I test the sign of the argument. If it is in fact zero, I exit immediately, with zero as the result. If it is negative, the actual code will signal that the complex square root routine is needed. In the model, I simply force a domain error, by an illegal branch to ¯0.5 .

      z←⍵
      ⍎(1≠×⍵)/'→0.5××⍵'

The first step in the argument transformation process is to obtain the 16's exponent of the number.
Since at this point I know the argument is positive, I merely extract the first two hexadecimal digits of the number, in character form, convert these to numeric form, and subtract the bias of 64 which is applied to the exponents of S/370 numbers. This gives me e . The utility function xfd applies to any array of numbers and gives as result a character array of one higher dimension, at the end, of length 16 , representing the hexadecimal form of the number. For example, the result of xfd 2304 is, in hexadecimal, 4330000000000000 . The utility function htd applies to a character vector of length 8 or less, and which must only include characters in the set '0123456789abcdef' . It gives as result the equivalent integer. In effect, it translates such vectors into S/370 4-byte integer values.

      e←¯64+htd 2↑xfd ⍵

The exponent s is developed, which will be used immediately to transform the argument into the desired range, and will also be used at the very end to produce the final result. Instead of what is here, I could have written s←1 srl e , but the formulation given is completely equivalent.

      s←⌊e÷2

Dividing the argument by the appropriate power of 16 produces a result in the range from 0.0625 to 16 . In fact, since I know that the exponent of the transformed argument will be zero if the argument's exponent is even, and one if it is odd, in the actual assembly language version I produce the exponent by masking bits zero through six to zeros, and then OR-ing in a hexadecimal 40 , to produce the excess-64 bias. This takes much less time than would a floating point divide.

      m←⍵÷16*2×s

In the assembly code, the linear term/constant term polynomial coefficients are precomputed. In the model I explicitly form c to show how it is derived.
Since these are kept as short (four-byte) floating point words in the assembly code, the model mimics this by taking only the first eight hexadecimal characters of the result of xfd , and using these as the argument to the function dfh , which converts hexadecimal character arguments to floating point numbers. For example, xfd .1 is 401999999999999a ; converting the first eight characters of this to a number with dfh gives 0.09999996423721314 , which has the hexadecimal representation 4019999900000000 , as desired.

      c←dfh 8 8 ↑xfd⌽0.14753807×(2*0.5)*⍳8

The key to the efficiency of the algorithm is the provision of eight different linear polynomials, depending on the range in which the transformed argument lies. The eight ranges can be discriminated by bits seven through ten of the number. This is the last bit of the exponent and the first three bits of the first hexadecimal digit. These four bits, considered as a number, can take on the values zero through fifteen. The expression htd 8↑xfd ⍵ produces, as before, a S/370 four-byte integer. The utility functions sll and srl model the S/370 instructions having these names for their operation codes. They stand for shift left logical and shift right logical. They are used to shift all the bits in a S/370 general register left and right, respectively. The functions are dyadic, and take as left argument the amount of shift desired. Shifting a number seven bits to the left shifts out, and loses, bits zero through six. Shifting this twenty-eight bits to the right shifts out the rightmost twenty-one bits of the original number, and right justifies the original bits seven through ten, so that i has one of the values zero through fifteen, as desired.

      i←28 srl 7 sll htd 8↑xfd ⍵

The numbers zero through fifteen are mapped into the numbers zero through seven, according to the following table:
i:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
j:  0  1  2  2  3  3  3  3  4  5  6  6  7  7  7  7
In the assembly code, the result values are four times these, since they are to be used as indices to four-byte values. j is the result of selecting the appropriate value, using i as an index.

      j←((8⍴ 1 1 2 4)/⍳8)[i]

In turn, j is used as an index to select the appropriate linear and constant terms from c , the vector of coefficients: directly to obtain the linear term, and in sevens-complement form to obtain the constant term. The linear term c[j] is multiplied by the transformed argument m , using the me function, which mimics the behavior of the short floating point multiply instruction, me , of the S/370 instruction set. This product is added to the constant coefficient c[7-j] , using the ae function, which mimics the behavior of the short floating point add instruction, ae , of the S/370 instruction set. The result is the initial approximation y0 , which has mre of 0.007453 , and thus guarantees at least 1.77 accurate hexadecimal digits.

      y0←c[7-j] ae c[j] me m

The next approximation, y1 , is formed, using Heron's formula, and using the four-byte versions of the divide (de) , add (ae) , and halve (he) instructions, since the mre of this approximation is such that only 3.79 hexadecimal digits are guaranteed accurate.

      y1←he y0 ae m de y0

The next approximation uses a Heron step with the full eight-byte form of the instructions (which correspond to the APL primitives exactly). At this point we are guaranteed a minimum of 7.82 hexadecimal digits.

      y2←0.5×y1+m÷y1

The last Heron step is slightly transformed in order to minimize rounding errors, at the expense of one extra operation. This last approximation is theoretically guaranteed to have 15.9 accurate hexadecimal digits.

      y3←y2+0.5×(m÷y2)-y2

The result at this point may be off by an ulp, even though theory guarantees us 15.9 accurate hexadecimal digits, because S/370 floating point words contain only fourteen hexadecimal digits, and we may have suffered from machine truncation despite our efforts to avoid it.
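The whole chain of approximations, eight-way seed selection followed by three Heron steps with the last in its rearranged form, can be condensed into a Python sketch. This is illustrative only: double precision stands in for both the short and long S/370 formats, and the subrange index is derived with log2 rather than with the sll and srl bit manipulation:

```python
import math

# Linear coefficients as in the model's c ; C[j] is the slope for
# subrange j and C[7-j] the intercept.
C = [0.14753807 * math.sqrt(2) ** (7 - k) for k in range(8)]

def sqrt_reduced(m):
    """Seed plus three Heron steps for m in [1/16, 16)."""
    j = int(math.floor(math.log2(m))) + 4   # subrange index 0..7 (log2 replaces sll/srl)
    y = C[7 - j] + C[j] * m                 # y0: mre 0.007453 (1.77 hex digits)
    y = 0.5 * (y + m / y)                   # y1: 3.79 hex digits
    y = 0.5 * (y + m / y)                   # y2: 7.82 hex digits
    y = y + 0.5 * (m / y - y)               # y3: rearranged final step
    return y

for m in (0.0625, 0.5625, 1.0, 2.0, 9.0, 15.9):
    assert abs(sqrt_reduced(m) - math.sqrt(m)) <= 4 * math.ulp(math.sqrt(m))
```

In double precision the final step's theoretical 15.9 hexadecimal digits exceed what the format can hold, so the result lands within an ulp or two of the correctly rounded root, which is exactly the situation the Tuckerman test is then used to repair.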
To overcome this, and make our routine perfect, we apply the Tuckerman rounding test. I found that it sufficed to apply only one part of this test. This saves a multiply instruction and a comparison. The nv function used below has as its result the number one ulp higher than its argument. In the assembly code the addition of an ulp is done by forming a number having the same exponent as y3 but with the fraction 00000000000001 , and adding this to y3 .

      ⍎(m>y3×nv y3)/'y3←nv y3'

Finally, we multiply the last approximation, guaranteed to have the best possible fraction part, by the appropriate power of sixteen, yielding the final result.

      z←y3×16*s

Results

I translated my APL program into assembly language, assembled it, and was now able to time it and measure its size. The assembly language version took 168 bytes, as opposed to the 116 bytes of its (less accurate and slower) predecessor. There was some saving in space because of the reduced number of Heron steps (three instead of four) but the inclusion of eight polynomials instead of one, and the Tuckerman rounding test, increased its overall size. However, this 168 bytes is far less than the 604 bytes taken by the software version of the IBM square root routine made available by IBM to those who want the new accuracy of this routine but who don't have 4381 computers. The time measurements showed that the new routine was about 7% faster than the old one, and only about 30% slower than the time taken by the hardware instruction. This new square root code has been running on the IPSA internal system for more than a year, and no errors have been reported in it. The entire new mathematical routine package,
containing 4086 statements, is 3716 bytes in size,
124 bytes smaller than the package it replaced.
To date, two errors have been reported in it.
In both cases the same errors
were found in the APL models.
They were easily found and fixed.
Acknowledgments are due
to Arlene Azzarello, Bob Bernecky, and Doug Forkes,
of I.P. Sharp Associates,
for their careful readings of the text.
Note

[1] There is no reason why the square root function
couldn’t be part of APL.
In fact, when I designed the APL printer trains
for the IBM 1403 printer,
I included the radical symbol √
on the train,
in the expectation that one day
this function would be added to APL.
It would be a scalar function.
In its monadic use √⍵ would give
the square root of ⍵ .
In its dyadic use ⍺√⍵
would give the ⍺th root of ⍵ .
For example, 3√⍵
would give the cube root of ⍵ .
Anyone using APL informally
(for example, in writing a mathematics text)
could use this symbol.
References
First appeared in the APL86 Conference Proceedings, APL Quote Quad, Volume 16, Number 4, 1986-07-10.
