
Chapter 18: Sets, Classes and Relations

In this chapter we look at more of the built-in functions of J. The connecting theme is, somewhat loosely, working with sets, classes and relations.

Suppose that, for some list, for the purpose at hand, the order of the items is irrelevant and the presence of duplicate items is irrelevant. Then we can regard the list as (representing) a finite set. In the abstract, the set 3 1 2 1 is considered to be the same set as 1 2 3.

The word "class" we will use in the sense in which, for example, each integer in a list belongs either to the odd class or to the even class.

By "relation" is meant a table of two or more columns, expressing a relationship between a value in one column and the corresponding value in another. A relation with two columns, for example, is a set of pairs.

18.1 Sets

18.1.1 Membership

There is a built-in verb e. (lowercase e dot, called "Member"). The expression x e. y tests whether x matches any item of y, that is, whether x is a member of the list y. For example:
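A minimal illustration of Member, assuming a sample list y of letters chosen here only for the purpose. Expected results are shown as NB. comments.

```j
y =: 'abcde'      NB. illustrative list, regarded as a set of letters
'a' e. y          NB. 1 : 'a' is a member of y
'w' e. y          NB. 0 : 'w' is not a member
'ef' e. y         NB. 1 0 : one result for each item of the left argument
'a' e. 'eadcba'   NB. 1 : reordering and duplicating items changes nothing
```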
Evidently the order of items in y is irrelevant and so is the presence of duplicates in y.
We can test whether a table contains a particular row:
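A small sketch of row membership, assuming an illustrative character table t (the name and contents are not from the original text). The items of a table are its rows, so the left argument is compared with each whole row.

```j
t =: 3 2 $ 'abcdef'   NB. illustrative table with rows ab, cd, ef
'cd' e. t             NB. 1 : 'cd' matches a row of t
'dc' e. t             NB. 0 : no row matches 'dc'
```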
18.1.2 Less

There is a built-in verb -. (minus dot, called "Less"). The expression x -. y produces a list of the items of x except those which are members of y.
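A minimal illustration of Less, with x and y chosen here only as sample values:

```j
x =: 'consonant'
y =: 'aeiou'
x -. y            NB. cnsnnt : the items of x which are not members of y
x -. 'uoieaaa'    NB. cnsnnt : order and duplicates in y make no difference
```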
Evidently the order of items in y is irrelevant and so is the presence of duplicates in y.

18.1.3 Nub

There is a built-in verb ~. (tilde dot, called "Nub"). The expression ~. y produces a list of the items of y without duplicates.
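A short illustration of Nub on sample arguments chosen for the purpose. Items are kept in order of first appearance.

```j
~. 'hook'       NB. hok
~. 3 1 2 1      NB. 3 1 2
```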
We can apply nub to the rows of a table:
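A small sketch, assuming an illustrative table t in which one row is repeated:

```j
t =: 3 2 $ 'abcdab'   NB. illustrative table with rows ab, cd, ab
~. t                  NB. a table with rows ab, cd
$ ~. t                NB. 2 2
```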
18.1.4 Nub Sieve

The verb "Nub Sieve" (~:) gives a boolean vector which is true only at the first occurrence of each distinct item, that is, at the items which make up the nub.
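A minimal illustration, assuming a sample string y. Using the sieve to select from y gives the same result as the nub itself.

```j
y =: 'hook'
~: y                    NB. 1 1 0 1 : the second 'o' is a duplicate
(~: y) # y              NB. hok
(~. y) -: (~: y) # y    NB. 1 : selecting with the sieve reproduces the nub
```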
18.1.5 Functions for Sets

The customary functions on sets, such as set-union, set-intersection or set-equality, are easily defined using the built-in functions available. For example, two sets are equal if all members of one are members of the other, and vice versa.

   seteq =: *./ @: (e. , e.~)
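Here is seteq in use, together with one possible way of writing union and intersection. The names setunion and setinter and their definitions are assumptions made for this sketch, not definitions from the text.

```j
seteq    =: *./ @: (e. , e.~)
setunion =: ~. @: ,             NB. hypothetical: nub of x joined to y
setinter =: ~. @: ([ -. -.)     NB. hypothetical: x less (x less y), without duplicates

1 2 3 seteq 3 1 2 1      NB. 1
1 2 3 seteq 1 2          NB. 0
1 2 3 setunion 3 4       NB. 1 2 3 4
1 2 3 setinter 2 3 4     NB. 2 3
```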
18.2 The Table Adverb

Recall that the adverb / generates a verb; for example +/ is a verb which sums lists. More precisely, it is the monadic case of +/ which sums lists. The dyadic case of +/ generates a table:
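A minimal illustration of the two cases of +/, with arguments chosen here only as samples:

```j
+/ 0 1 2 3          NB. 6 : the monadic case sums the list
0 1 2 +/ 0 1 2 3    NB. the dyadic case builds an addition table:
NB. 0 1 2 3
NB. 1 2 3 4
NB. 2 3 4 5
```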
The general scheme is that if we have z =: x f/ y then z is a table such that the value at row i column j is given by applying f dyadically to the pair of arguments i{x and j{y. That is, z contains all possible pairings of an item of x with an item of y. Here is another example:
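A sketch of such a table built with = as the verb, assuming the left argument is 'abcdef' and the right argument is 'face':

```j
'abcdef' =/ 'face'
NB. 0 1 0 0
NB. 0 0 0 0
NB. 0 0 1 0
NB. 0 0 0 0
NB. 0 0 0 1
NB. 1 0 0 0
```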
The result shows, in the first row, the value of 'a' = 'face', in the second row the value of 'b' = 'face', and so on.

18.3 Classes

18.3.1 Self-Classify

Consider the problem of finding the counts of letters occurring in a string (the frequency-distribution of letters). Here is one approach. We form a table testing each letter for equality with the nub.
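A sketch of this table, assuming a sample string y and taking nub to be a name for the verb ~. (both are assumptions made for the illustration):

```j
y   =: 'hook'
nub =: ~.
nub y             NB. hok
(nub y) =/ y      NB. one row for each letter of the nub:
NB. 1 0 0 0
NB. 0 1 1 0
NB. 0 0 0 1
```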
The expression ((nub y) = / y), where nub is the verb ~. , can be abbreviated as (= y). The monadic case of the built-in verb = is called "Self-Classify".
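The equivalence can be checked directly, again assuming the sample string 'hook' and nub defined as ~.

```j
y   =: 'hook'
nub =: ~.
= y                        NB. the same table as (nub y) =/ y
(= y) -: (nub y) =/ y      NB. 1
```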
If we sum each row of = y we obtain the counts, in the order of the letters in the nub.
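For the sample string 'hook' (an assumption of this sketch), the row sums are:

```j
y =: 'hook'
+/"1 = y      NB. 1 2 1 : counts for h, o and k respectively
```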
The counts can be paired with the letters of the nub:
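One way to do the pairing, offered as a sketch rather than as the original expression, is to link each letter of the nub to its count with ;"0 (the sample string is again an assumption):

```j
y   =: 'hook'
nub =: ~.
(nub y) ;"0 (+/"1 = y)    NB. a 3 by 2 table of boxes: h 1, o 2, k 1
```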
18.3.2 Classification Schemes

Gardeners classify soil-types as acid, neutral or alkaline, depending on the pH value. Suppose that a pH less than 6 is classed as acid, 6 to 7 as neutral, and more than 7 as alkaline. Here now is a verb to classify a pH value, returning A for acid, N for neutral and L for alkaline (or limy).

   classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))
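A few sample pH values, chosen only for illustration, show how the two tests add up to an index 0, 1 or 2 into 'ANL':

```j
classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))
((>: & 6) + (> & 7)) 5 6 7 8    NB. 0 1 1 2 : the index computed for each value
classify 5 6 7 8                NB. ANNL
classify 6.5                    NB. N
```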
The classify function we can regard as defining a classification scheme. The letters ANL, which are in effect names of classes, are called the keys of the scheme.

18.3.3 The Key Adverb

Given some data (a list, say), we can classify each item to produce a list of corresponding keys.
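A sketch, assuming an illustrative list of pH values named data and a list of keys named k (both names and values are assumptions):

```j
classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))
data =: 7 5 6 4 8       NB. illustrative pH values
k    =: classify data   NB. NANAL : one key for each item of data
```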
We can select and group together all the data in, say, class A (all the data with key A):
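With the same illustrative data and keys (assumed here, not taken from the text), the selection can be sketched as:

```j
classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))
data =: 7 5 6 4 8
k    =: classify data
k = 'A'             NB. 0 1 0 1 0 : marks the items whose key is A
(k = 'A') # data    NB. 5 4
```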
Now suppose we wish to count the items in each class. That is, we aim to apply the monadic verb # separately to each group of items all of the same key. To do this we can use the built-in adverb /. (slash dot, called "Key").
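A sketch of counting with Key, using the same assumed data and keys. The results appear in the order in which the keys first occur.

```j
classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))
data =: 7 5 6 4 8
k    =: classify data   NB. NANAL
k #/. data              NB. 2 2 1 : counts for classes N, A and L
```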
For another example, instead of counting the members we could exhibit the members, by applying the box verb <.
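With the same assumed data and keys, boxing each group gives:

```j
classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))
data =: 7 5 6 4 8
k    =: classify data   NB. NANAL
k </. data              NB. three boxes, containing 7 6 and 5 4 and 8
```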
The verb we apply can discover for itself the class of each separate argument, by classifying the first member. Here the verb u produces a boxed list: the key and count.

   u =: (classify @: {.) ; #
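A sketch of this verb applied with Key, again assuming the illustrative data and keys:

```j
classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))
u    =: (classify @: {.) ; #
data =: 7 5 6 4 8
k    =: classify data   NB. NANAL
k u /. data             NB. a 3 by 2 table of boxes: N 2, A 2, L 1
```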
The general scheme for the "Key" adverb is as follows. In the expression x u /. y, we take y to be a list, x to be a list of keys of corresponding items of y according to some classification scheme, and u to be the verb applied separately to each class. The scheme is:

   x u /. y    means    (= x) (u @ #) y

To illustrate:

   y =: 4 5 6 7 8
   x =: classify y
   u =: <
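The two sides of the scheme can be compared directly. This is a sketch using the definitions just given; the expected results in the comments follow from classify as defined earlier.

```j
classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))
y =: 4 5 6 7 8
x =: classify y                  NB. AANNL
u =: <
= x                              NB. one row for each distinct key:
NB. 1 1 0 0 0
NB. 0 0 1 1 0
NB. 0 0 0 0 1
(= x) (u @ #) y                  NB. three boxes, containing 4 5 and 6 7 and 8
(x u /. y) -: (= x) (u @ #) y    NB. 1
```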
We see that each row of = x selects items from y, and u is applied to this selection.

18.3.4 Letter-Counts Revisited

Recall the example of finding the counts of letters in a string.
Here is a variation. We note that we have in effect a classification scheme where we have as many different classes as different letters: each letter is (the key of) its own class. Thus we can write an expression of the form y u /. y. The applied verb u will see, each time, a list of letters, all the same. It counts them, with #, and takes the first, with {., to be a label for the class.

   u =: {. ; #
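A sketch of this variation, assuming a sample string y chosen for illustration:

```j
y =: 'LETTUCE'
u =: {. ; #
y u /. y      NB. a 5 by 2 table of boxes: L 1, E 2, T 2, U 1, C 1
```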
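18.4 Relations

Suppose there are a number of publications, such as:

   "Pigs" by Smith, on the subject of pigs
   "Pets" by Brown, on the subjects of cats and dogs
   "Dogs" by Smith and James, on the subject of dogs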
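and we aim to catalog such publications. A suitable data structure for such a catalog might be a table relating authors to titles and another table relating titles to subjects. For example:

   Author  Title          Title  Subject
   Smith   Pigs           Pigs   pigs
   Brown   Pets           Pets   cats
   Smith   Dogs           Pets   dogs
   James   Dogs           Dogs   dogs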
Such tables we may call "relations". The order of the rows is not significant. Here, for the sake of simplicity, we will stick to relations with two columns.

Now we choose a representation for our relations. For a first approach, we choose tables of boxed strings. The authors-titles relation is:

   ] AT =: (". ;. _2) 0 : 0
'Smith' ; 'Pigs'
'Brown' ; 'Pets'
'Smith' ; 'Dogs'
'James' ; 'Dogs'
)
+-----+----+
|Smith|Pigs|
+-----+----+
|Brown|Pets|
+-----+----+
|Smith|Dogs|
+-----+----+
|James|Dogs|
+-----+----+

and the titles-subjects relation is:

   ] TS =: (". ;. _2) 0 : 0
'Pigs' ; 'pigs'
'Pets' ; 'cats'
'Pets' ; 'dogs'
'Dogs' ; 'dogs'
)
+----+----+
|Pigs|pigs|
+----+----+
|Pets|cats|
+----+----+
|Pets|dogs|
+----+----+
|Dogs|dogs|
+----+----+

18.4.1 Join of Relations

From the authors-titles relation AT and the titles-subjects relation TS we can compute an authors-subjects relation showing which author has written a title on which subject. We say that AT and TS are to be joined with respect to titles, and we would expect the join to look like this:

+-----+----+
|Smith|pigs|
+-----+----+
|Brown|cats|
+-----+----+
|Brown|dogs|
+-----+----+
|Smith|dogs|
+-----+----+
|James|dogs|
+-----+----+

The plan for this section is to look at a function for computing joins, then at an improved version, and then at the advantage of representing relations as tables of symbols rather than boxed strings. Finally we look at some performance comparisons.

A method is as follows. We consider all possible pairs consisting of a row at from table AT and a row ts from table TS. Each pair at,ts is of the form:

   author; title; title; subject

If title matches title, that is, item 1 matches item 2, then we extract author and subject, that is, items 0 and 3. Verbs for testing and extracting from at,ts pairs can be written as:

   test =: 1&{ = 2&{
   extr =: 0 3 & {

and these verbs can be plugged into a suitable conjunction to do the pairing. In writing this conjunction, we aim to avoid requiring the whole set of possible pairs to be present at the same time, since this set may be large. We also aim to avoid any duplicates in the result. Here is a first attempt.

   PAIR =: 2 : 0
:
z =. 0 0 $ ''
for_at. x do.
   for_ts. y do.
      if. u at,ts do. z =. z, v at,ts end.
   end.
end.
~. z
)

The join verb can now be written as:

   join =: test PAIR extr

and we see:
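A self-contained sketch of the join in use. The relations are built here with 4 2 $ rather than with 0 : 0, an assumption made only to keep the block compact; the resulting tables are the same.

```j
AT =: 4 2 $ 'Smith';'Pigs';'Brown';'Pets';'Smith';'Dogs';'James';'Dogs'
TS =: 4 2 $ 'Pigs';'pigs';'Pets';'cats';'Pets';'dogs';'Dogs';'dogs'

test =: 1&{ = 2&{     NB. does the title in at match the title in ts?
extr =: 0 3 & {       NB. pick out author and subject

PAIR =: 2 : 0
:
z =. 0 0 $ ''
for_at. x do.
   for_ts. y do.
      if. u at,ts do. z =. z, v at,ts end.
   end.
end.
~. z
)

join =: test PAIR extr
AT join TS    NB. rows: Smith pigs, Brown cats, Brown dogs, Smith dogs, James dogs
```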
The join verb as defined above is slow, because the test and extr verbs are applied to a single x,y pair at a time: they are scalar computations. Performance will be better if we can give these verbs as much data as possible to work on at one time. (This is a universal rule in J.) Vector or array arguments are better. Here is a revised vector-oriented version of PAIR and join, which still avoids building the entire set of pairs.

   VPAIR =: 2 : 0
:
z =. 0 0 $ ''
for_at. x do. z =. z , |: v (#~"1 u) |: at ,"1 y end.
~. z
)

   vjoin =: test VPAIR extr

giving the same result as before:
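A self-contained sketch of the vector-oriented join. As an assumption for compactness, the relations are built with 4 2 $ and the expected result is written out explicitly for comparison.

```j
AT =: 4 2 $ 'Smith';'Pigs';'Brown';'Pets';'Smith';'Dogs';'James';'Dogs'
TS =: 4 2 $ 'Pigs';'pigs';'Pets';'cats';'Pets';'dogs';'Dogs';'dogs'

test =: 1&{ = 2&{
extr =: 0 3 & {

NB. For each row at of x: pair it with every row of y, transpose so that
NB. each of the four fields is a row, keep the columns where u holds,
NB. apply v, and transpose back to rows of results.
VPAIR =: 2 : 0
:
z =. 0 0 $ ''
for_at. x do. z =. z , |: v (#~"1 u) |: at ,"1 y end.
~. z
)

vjoin =: test VPAIR extr
expected =: 5 2 $ 'Smith';'pigs';'Brown';'cats';'Brown';'dogs';'Smith';'dogs';'James';'dogs'
expected -: AT vjoin TS    NB. 1
```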
Representing relations as tables of boxed strings, as above, is less than efficient. For a repeated value, the entire string is repeated. Values are compared by comparing entire strings. Now we look at another possibility. Rather than boxed strings, a relation can be represented by a table of symbols.

18.4.2 What are Symbols?

Symbols are for efficient computation with string data. Symbols are a distinct datatype, in the same way that characters, boxes and numbers are distinct datatypes. A symbol is a scalar which identifies, or refers to, a string. A symbol can be created by applying the built-in verb s: (lowercase s colon) to a boxed string.

   a =: s: <'hello'

Now the variable a has a value of type symbol. We inspect this value in the usual way:

   a
`hello

and see that the value is displayed as the original string preceded by a left-quote. Even though a looks like a string when displayed, it is a scalar.
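That a is a scalar can be checked by looking at its shape and rank. A minimal sketch:

```j
a =: s: <'hello'
$ a        NB. an empty list: a scalar has no dimensions
# $ a      NB. 0 : the rank of a
# $ 'hello'    NB. 1 : by contrast, the string is a list
```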
The original string is stored in a data structure, maintained automatically by the J system, called the symbol table. Strings are not duplicated within the symbol table. Hence if another symbol b is created from the same string as a, then b is equal to a.
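A minimal illustration, in which the name c and the string 'goodbye' are assumptions added for contrast:

```j
a =: s: <'hello'
b =: s: <'hello'
c =: s: <'goodbye'
a = b      NB. 1 : both refer to the same string
a = c      NB. 0
```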
Notice that the comparison is simple scalar equality, with no need to compare the original strings. Our relations above can be converted to arrays of symbols, say sAT and sTS, by applying s: to them, and joined as before.
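A self-contained sketch of the conversion and the join. The names sAT and sTS follow the later sections; building the relations with 4 2 $ is an assumption made for compactness.

```j
AT  =: 4 2 $ 'Smith';'Pigs';'Brown';'Pets';'Smith';'Dogs';'James';'Dogs'
TS  =: 4 2 $ 'Pigs';'pigs';'Pets';'cats';'Pets';'dogs';'Dogs';'dogs'
sAT =: s: AT      NB. table of symbols, same shape as AT
sTS =: s: TS

test =: 1&{ = 2&{
extr =: 0 3 & {

VPAIR =: 2 : 0
:
z =. 0 0 $ ''
for_at. x do. z =. z , |: v (#~"1 u) |: at ,"1 y end.
~. z
)

vjoin =: test VPAIR extr
sAT vjoin sTS                         NB. a table of symbols
(s: AT vjoin TS) -: sAT vjoin sTS     NB. 1 : same join as with boxed strings
```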
Symbols are lexicographically ordered to reflect the ordering of the original strings. Hence tables of symbols can be sorted:
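A sketch of sorting, assuming the authors-titles relation converted to symbols under the name sAT. Sorting the symbols gives the same order as sorting the boxed strings.

```j
AT  =: 4 2 $ 'Smith';'Pigs';'Brown';'Pets';'Smith';'Dogs';'James';'Dogs'
sAT =: s: AT
/:~ sAT                       NB. rows: Brown Pets, James Dogs, Smith Dogs, Smith Pigs
(/:~ sAT) -: s: /:~ AT        NB. 1
```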
18.4.3 Measurements Compared

Here is a utility verb giving time in seconds to evaluate an expression, averaged over, say, 4 executions.

   time =: (8j5 & ":) @: (4 & (6!:2))

The examples of relations above are too small for meaningful performance measurements, so we make larger relations, of 100 rows each, by cyclically repeating the rows.

   AT  =: 100 $ AT
   TS  =: 100 $ TS
   sAT =: 100 $ sAT
   sTS =: 100 $ sTS

There are 4 cases to compare:

   t1 =: time 'AT join TS'      NB. scalar method, boxed strings
   t2 =: time 'sAT join sTS'    NB. scalar method, symbols
   t3 =: time 'AT vjoin TS'     NB. vector method, boxed strings
   t4 =: time 'sAT vjoin sTS'   NB. vector method, symbols

and we see:

   3 3 $ ' '; 'strings'; 'symbols';'scalar';t1;t2; 'vector';t3;t4
+------+--------+--------+
|      |strings |symbols |
+------+--------+--------+
|scalar| 2.70137| 0.06671|
+------+--------+--------+
|vector| 0.03806| 0.00287|
+------+--------+--------+

18.4.4 Saving and Restoring the Symbol Table

Suppose that data is an array of symbols.

   ] data =: s: 2 2 $ 'hello'; 'blah';'blah';'goodbye'
`hello `blah
`blah  `goodbye

For a symbol in data its original string ('hello' for example) is stored only in the symbol table, not in data itself. The original string is needed to display the value of the symbol.

Suppose that we write data to a file, aiming to read it back in a new session. At the beginning of a new session, the symbol table is empty. Thus we must save the symbol table from the earlier session, and reinstate it at the beginning of the new session.

First, here are two utility functions to save a value to a file and retrieve it. (See Chapter 27 and Chapter 28.)

   save =: 4 : '(3!:1 x) 1!:2 < y'
   retr =: 3 : '3!:2 (1!:1 < y)'

Save the data to a file named, say, data.xyz

   data save 'data.xyz'

The symbol table is not itself a variable, but the expression 0 s: 10 gives a value for it. We save this value to a file named, say, symtab.xyz

   (0 s: 10) save 'symtab.xyz'

Start a new J session.
The symbol table is initially empty, so begin by reinstating it from the file saved in the earlier session:

   10 s: (retr 'symtab.xyz')
1

Now, with the correct symbol table in place, we can retrieve the array of symbols data from its file:

   DATA =: retr 'data.xyz'

and see that the symbols are correctly interpreted:

   DATA
`hello `blah
`blah  `goodbye

This is the end of Chapter 18.
The examples in this chapter
were executed using J version 601obeta.
This chapter last updated 9 Jul 2006.
Copyright © Roger Stokes 2006.
This material may be freely reproduced,
provided that this copyright notice is also reproduced.