Getting Started with Unicode in J

This is a very quick page of notes, mainly as a reference for answering questions. Maybe it will help someone with the general task of looking up, choosing and using an arbitrary Unicode glyph, which I don't think is easy for a complete beginner to do with the present documentation.

Turn your IQ philco down to approx. 60 and let's go...

Page under reconstruction.

It needs to avoid duplicating material in: Guides/Unicode.
For example, wide as defined here is functionally equivalent to uucp defined in stdlib.ijs, which is the recommended word to use. Other recommended words, always available in locale 'z', are: ucp (Unicode code point, inverse of: utf8) and ucpcount (count code points).

Here's an arbitrary Unicode character: ⌹

Yes, it's the APL Domino. But it could be anything from the Unicode collection.

If you have the font APL385 Unicode installed, you'll see it; otherwise you may not. Refer to the last section of the APL to J Phrasebook and troubleshoot from there.

For even greater detail, see: Typesetting/APL Fonts

Copy-paste Domino into the J session (j602) and embed it in the expression:

   z=: 'abc⌹e'
   $z
7
NB. ...not 5, but 7!!
   z i. 'ce'
2 6
   z i. '⌹'
3 4 5

This shows you that Domino is encoded in the J session in UTF-8, a standard for embedding a general Unicode character as several bytes in a string of 1-byte chars.

Inside char vector z, Domino takes up not 1 but 3 consecutive positions: 3 4 5, so the chars either side of it occupy positions 2 and 6.
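To see the three bytes themselves, you can look up each char of z in the J alphabet a. (a minimal sketch, assuming z as defined above; the values 226 140 185 are simply the UTF-8 encoding of Domino):

   a. i. '⌹'
226 140 185
   a. i. z
97 98 99 226 140 185 101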

If you need to tabulate or index unicode characters in an orderly array, you can convert the whole string z to "wide-chars" (wchars):

   ]w=: 7 u: z
abc⌹e
   $w
5
   w i. 'ce'
2 4
   ]Domino=: 3{w
⌹
   $$Domino
0
   w i. Domino
3

You now have an orderly vector w of 5 wchars, which behave themselves under both $ and i. (shape and index-of).
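Going the other way, 8 u: turns wchars back into a UTF-8 byte string, which is the form you started with in z. Here is a short sketch, assuming z and w as defined above. The words utf8 and ucpcount are the stdlib words named near the top of this page, and I am assuming they behave like 8&u: and #@(7&u:) respectively:

   $ 8 u: w
7
   z -: 8 u: w
1
   z -: utf8 w
1
   ucpcount z
5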

Choosing a sister character of Domino

Now consider this task: you've pasted a given Unicode char from some Unicode-compliant software or document. You don't know that it's APL, but you like the set and you want to use another character from the same set. You've heard of another character, called Quote Quad, looking something like this: ⍞.

You can look up a character code (called a [unicode] code point) on the unicode.org website and download tables of characters in PDF form.

A good place to start if you haven't the foggiest idea where to find your character is here: http://www.unicode.org/standard/where/

The main code charts are here: http://www.unicode.org/charts/index.html

The page is titled: Unicode 6.0 Character Code Charts.

If you know it's an APL char you want, then simply search the page for "APL". (You'll find it here: http://www.unicode.org/charts/PDF/U2300.pdf)

Perhaps though you don't know it is an APL char. Then you must find its code point and look it up in the most generalized way.

Finding the code point of a pasted character

   3 u: 7 u: '⌹'
9017
   require'convert'
   hfd 9017   NB. hex from dec: 9017 to look up in unicode.org
2339
   NB. Let's just confirm that hex numeral is correct...
   u: 16b2339
⌹
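The same phrase works on a whole string, giving one code point per character rather than per byte, and monadic u: takes you back again. A sketch, reusing the sample string from earlier:

   3 u: 7 u: 'abc⌹e'
97 98 99 9017 101
   u: 97 98 99 9017 101
abc⌹e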

Looking up a character by its code point

At the top of the Code Charts page, http://www.unicode.org/charts/index.html there's a search box: Look up by character code:

You need to type a hex numeral in that box (...the code point). Viz the one you've just computed: 2339.

This reports to you:

Search Results for U+2339

    The most current code chart containing U+2339 is:

        http://www.unicode.org/charts/PDF/U2300.pdf (0.3 MB)

...and the link allows you to download the relevant table (U2300.pdf)

From this document you can look up Quote Quad, say, and find that its code point is 235e.

   u: 16b235e
⍞
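Once you have code points from the chart, you can build a string of several characters in one go. A sketch only: the two hex numerals are the ones found above, and sisters is a made-up name for illustration.

   ]sisters=: u: 16b2339 16b235e
⌹⍞
   $sisters
2
   # 8 u: sisters     NB. as utf-8, each of the two takes 3 bytes
6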

A script to play with

NB. unicode snooper, by Ian Clark, Nov 2010.

BASE=: 4$$ HX=: '0123456789ABCDEF'  NB. BASE is 16 16 16 16
h4=:   HX {~ BASE #: ]  NB. 4-digit hex numeral
wide=: 2 u: 7 u: ]      NB. force chars: y into wchars
copt=: 3 u: 7 u: ]      NB. code-point (decimal)
PY=:   copt '?'         NB. init: latest code-point

cu=: monad define       NB. see [unicode] char: y
 if. 0=$,y do. y=. u: PY+1 end. NB. -->next code-pt
 HY=: ,h4 PY=: copt Y=: y
 smoutput Y,' U+',HY,' ',": PY
)

nx=: monad define       NB. see next [y] char[s]
 z=.|{.y,1
 cu^:z ''
)

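px=: monad define       NB. see previous [y] char[s]
 PY=: PY-z=.|{.y,1      NB. step back y code-points (default 1)
 cu^:z ''               NB. show y chars, ending at the one last shown
 empty PY=: PY-z        NB. leave PY y code-points below where it started
)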

0 : 0           NB. try these using Ctrl+R ...
cu '⌹'
cu wide '⌹'
cu ''           NB. next char
nx ''           NB. next char
px ''           NB. prev char
nx 16           NB. next 16 chars
px 16           NB. prev 16 chars
)

The above script displays the code-point of a given Unicode character (copy/pasted from some string, e.g. in the J wiki) in both hex and decimal, the hex form being suitable for looking up at unicode.org.

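Enter the sample lines from the 0 : 0 block at the end of the script, one at a time: cu '⌹', then nx '', px '', nx 16 and so on.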
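A short session might look like the following. This is a sketch that assumes the script above has just been loaded and that your session font has the glyphs; the output lines are what the definition of cu should print, namely the character, its hex code point and its decimal code point.

   cu '⌹'
⌹ U+2339 9017
   nx ''
⌺ U+233A 9018
   nx 2
⌻ U+233B 9019
⌼ U+233C 9020

Each call leaves the latest code point in PY, which is why nx can carry on from wherever cu left off.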


-- IanClark 2010-11-27 19:39:49

CategoryWorkInProgress

Guides/UnicodeGettingStarted (last edited 2011-11-09 16:59:43 by IanClark)