Inner Product — An Old/New Problem
Roger Hui
Abstract
Exploitation of an old algorithm for inner product
leads to a greater than 10fold improvement
in speed to +.× in Dyalog v12.1.
In the process dfns and dops were found to be
excellent tools of thought for thinking about
the problem.
We also demonstrate the product of some large sparse
matrices in J.
1. RowbyColumn
The conventional method for inner product
produces its result one element at a time,
operating on one row of the left argument
against one column of the right argument.
That is, if z ← x f.g y , then
z[i;j] = f/ x[i;] g y[;j]
Question: why is inner product defined
in this computationally inconvenient way?
A is a linear transformation
from n (dimensional) space
to m space, where m n←⍴A ,
and B is a linear
transformation from p space
to n space ( n p←⍴B ).
The transformations are effected by A+.×y
and B+.×y1 .
For example:
A←¯3+?3 3⍴10
B←¯3+?3 3⍴10
y←¯3+?3⍴10
B+.×y
17 12 ¯12
A+.×(B+.×y)
2 58 15
The last transformation is the result of transforming y
first by B and then by A .
The inner product A+.×B computes
the matrix for this transformation,
A composed with B .
(A+.×B) +.× y
2 58 15
A+.×B
¯9 ¯19 ¯2
¯6 16 8
9 ¯21 21
So the answer to the question is, inner product
is defined this way to effect the composition
of two linear transformations.
2. RowataTime
It turns out that there is an algorithm more efficient
than “rowbycolumn”, namely “rowatatime”.
For z ← x f.g y :
z[i;] = f⌿ x[i;] g[0] y
The phrase x[i;] g[0] y applies g
to an element of x[i;] against the corresponding
row of y. For example:
10 100 1000 ×[0] 3 4⍴⍳12
0 10 20 30
400 500 600 700
8000 9000 10000 11000
In practice, the cumulation f⌿ is
composed and interleaved with the dyadic
function g[0] so that no additional
temporary space is needed.
The result for rowatatime is mathematically
and numerically identical to that for the conventional
rowbycolumn method.
The above description is stated in terms of
matrices, but the arguments are applicable
to arguments of any rank.
3. Advantages
Rowatatime has some efficiency advantages over rowbycolumn:
APL arrays are stored in row major order,
so that in a column y[;j] successive
elements are separated from each other
and y[0;j] can be quite far from
y[n1;j]. On modern CPUs with
onchip caches, on large matrices,
it is possible for rowbycolumn to miss the
cache on every column, that is, on every element
of the inner product result.
In contrast, rowatatime operates on adjacent
memory locations and is “cache friendly”.
On 600by600 matrices, rowatatime
is faster than rowbycolumn by a factor of 10.
I believe that there is an additional advantage
if the right argument is boolean (bits), as indexing
a column y[;j] of boolean y is not trivial.
Rowattime makes it practical to exploit
zeros in the left argument.
(“Zero” in the general sense of
the zero in a field with
operations f and g .) In
rowatatime,
in x[i;] g[0] y , if x[i;j] is zero,
there is no need to do x[i;j] g y[j;] . In
rowbycolumn,
in x[i;] g y[;j] , if x[i;k] is zero,
only x[i;k] g y[k;j] can be elided.
What it means is that you can not afford to spend
the time to check x[i;k] for zero,
because it would slow down the general case.
Checking x[i;j] for zero in rowatatime
also slows down the general case, but the
cost is amortized over the entire row y[j;] .
On 600by600 matrices with a density of 0.5 (i.e. half zeros),
rowatatime with zero exploitation is faster than rowbycolumn
by a factor of 24.
Item j of the phrase x[i;] g[0] y
is x[i;j] g y[j;] . If the
left argument is boolean, this can have only two possible
results: 0 g y[j;] or 1 g y[j;] .
Implementation of this idea in J
provides a factor of 5 improvement for boolean inner products.
4. Models
The current rowbycolumn implementation
in Dyalog v12.0 starts by transposing the right
argument to move the leading axis to the back,
followed by an outer product of the rows.
The process can be modelled as follows:
dotc ←{↑ (↓⍺) ∘.(⍺⍺{⍺⍺/⍺ ⍵⍵ ⍵}⍵⍵) ↓(¯1⌽⍳⍴⍴⍵)⍉⍵}
x←¯500+?13 19 11⍴1000
y←¯500+?11 23⍴1000
(x+.×y) ≡ x +dotc× y
1
(x.≥y) ≡ x dotc≥ y
1
The expression can be simplified and shortened
with judicious use of components:
dotc ← {↑ (↓⍺) ∘.(⍺⍺{⍺⍺/⍺ ⍵⍵ ⍵}⍵⍵) ↓(¯1⌽⍳⍴⍴⍵)⍉⍵}
dotc1 ← {↑ (↓⍺) ∘.(⍺⍺{⍺⍺/⍺ ⍵⍵ ⍵}⍵⍵) ↓flip ⍵}
dotc2 ← {↑ (↓⍺) ∘.(⍺⍺/at ⍵⍵) ↓flip ⍵}
dotc3 ← {↑ ⍺ ∘.(⍺⍺/at ⍵⍵)compose↓ flip ⍵}
dotc4 ← {⍺ ∘.(⍺⍺/at ⍵⍵)under↓ flip ⍵} ⍝ nonce error
dotc5 ← {⍺ ∘.(⍺⍺/at ⍵⍵)under↓∘flip ⍵} ⍝ nonce error
dotc6 ← {∘.(⍺⍺/at ⍵⍵)under↓∘flip} ⍝ nonce error
flip ← {(¯1⌽⍳⍴⍴⍵)⍉⍵}
at ← {⍺⍺(⍺ ⍵⍵ ⍵)}
compose← {(⍵⍵ ⍺)⍺⍺(⍵⍵ ⍵)}
under ← {(⍵⍵⍣¯1)(⍵⍵ ⍺)⍺⍺(⍵⍵ ⍵)}
The transformation from dotc3 to dotc4
and from dotc5 to dotc6 do not work, but should.
The later definitions are shorter than
the earlier ones if the metric is word count;
alternatively, they are shorter if the compositions
were denoted by symbols such as ⍥ and ⍤
instead of at and compose .
Rowatatime can be modelled as follows:
dotr ← {↑ (↓⍺) ⍺⍺{⍺⍺⌿⍺ ⍵⍵[0] ⍵}⍵⍵¨ ⊂⍵}
dotr1 ← {↑ (↓⍺) ⍺⍺⌿at(⍵⍵[0])¨ ⊂⍵}
In a C implementation of the algorithm,
the functions ↑ ↓ ⊂ and the operator ¨
would not be performed explicitly.
Rather, they would be embodied as for loops
and other control structures to mediate access to the
arguments x and y .
In J, the dotc (rowbycolumn)
and dotr (rowatatime) operators
can be modelled as follows:
flip =: 0&:
dotcj =: 2 : 'u/@:v"1/ flip'
lr =: 1 : '1{u b. 0' NB. left rank
dotrj =: 2 : 'u/@(v"(1+(v lr),_))'
In the important special case of rank0 (scalar) functions,
where the left rank is 0, dotrj simplifies to:
dotrj0 =: 2 : 'u/@(v"1 _)'
dotr1 ← {↑ (↓⍺) ⍺⍺⌿at(⍵⍵[0])¨ ⊂⍵}
dotr1 is repeated from above.
The expression (↓⍺) f¨ ⊂⍵ means that rows
of ⍺ are applied against ⍵ in toto,
and that is what rank 1 _ does in J.
The leading mix ↑ is not needed in J because,
not having done the explicit split ↓ ,
there is no need to undo it.
5. Inner Products on Sparse Arrays
The sparse datatypes in J represent an array
by the indices and values of non“zero” entries.
For example:
] x=: (?.3 4$10) * ?.3 4$2
0 5 9 0
0 9 0 7
0 0 0 0
] y=: (?.4 5$10) * ?.4 5$2
0 5 9 0 0
9 0 7 0 0
0 0 3 0 1
0 0 0 0 0
] xs=: $. x NB. convert to sparse
0 1  5
0 2  9
1 1  9
1 3  7
] ys=: $. y
0 1  5
0 2  9
1 0  9
1 2  7
2 2  3
2 4  1
x : xs NB. match
1
y : ys
1
Functions apply to sparse arrays, including inner product.
x +/ .* y
45 0 62 0 9
81 0 63 0 0
0 0 0 0 0
xs +/ .* ys
0 0  45
0 2  62
0 4  9
1 0  81
1 2  63
(x +/ .* y) : xs +/ .* ys
1
Sparse arrays make possible large inner products
which are impractical in the dense domain.
In J7.01 beta:
load '\ijs\sparsemm.ijs'
NB. d sa s  random sparse array with density d and shape s
x=: 0.0001 sa 2$1e5
y=: 0.0001 sa 2$1e5
$ x NB. shape of x
100000 100000
*/ $ x NB. # of elements in x
1e10
z=: x +/ .*y NB. inner product of x and y
$ z NB. shape of z
100000 100000
+/ +./"1 x~:0 NB. # of nonzero rows in x
99995
+/ +./ y~:0 NB. # of nonzero columns in y
99995
+/@, z~:0 NB. # of nonzero entries in z
9989822
The preceding expressions show that the conventional
rowbycolumn approach would be problematic.
There are 99995 nonzero rows in x
and 99995 nonzero columns in y .
Consequently, there are roughly 1e10 rowbycolumn combinations
(of which only 1e7 end up being nonzero).
No matter how fast each combination is,
there are still a lot of them.
Sparse inner product in J uses a rowatatime algorithm.
The 1e5 by 1e5 inner product above took 3.6 seconds.
Presented at the BAPLA Conference in Wokefield Park, Reading, 89 June 2009.
created:  20090512 00:00 
updated:  20140616 14:20 
