
Unicode - UTF-8, UTF-16 and UTF-32

UTF-8, UTF-16 and UTF-32 are supported. UTF-8 is a multibyte encoding built on 8-bit units that covers all Unicode code points and leaves 7-bit ASCII unchanged. J continues to use the 8-bit literal data type. When literal data is displayed, it is treated as UTF-8.
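As a small illustration of literal data holding UTF-8, the following sketch can be tried in a session. The name s is ours, chosen only for this example. ASCII characters occupy one byte each with their usual codes, while the two bytes 195 161 together encode the single code point U+00E1.

```j
   s=: 195 161{a.     NB. two bytes forming the UTF-8 encoding of U+00E1
   #s                 NB. counts bytes: 2
   a. i. s            NB. byte values: 195 161
   a. i. 'abc'        NB. ASCII is stored unchanged: 97 98 99
```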

In addition to UTF-8, there is support for UTF-16, which is based on literal2 (the 16-bit literal data type). See u: for details. UTF-16 is a variable length encoding in which each Unicode code point in the range U+0000 to U+FFFF is represented by 1 literal2, while each code point in the range U+10000 to U+10FFFF is represented by 2 literal2s. UTF-16 is used in interfacing with some external libraries such as the Windows API. Also, when all code points are below U+10000, UTF-16 is more convenient for manipulating Unicode data than multibyte UTF-8, because each code point is then a single literal2.
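The two UTF-16 cases can be sketched with 7 u:, which converts to UTF-16. This assumes J805 or later; the names e and g are hypothetical, and the byte lists are the UTF-8 encodings of U+00E1 and U+1F600.

```j
   e=: 195 161{a.            NB. UTF-8 bytes for U+00E1 (below U+10000)
   g=: 240 159 152 128{a.    NB. UTF-8 bytes for U+1F600 (above U+FFFF)
   # 7 u: e                  NB. 1 literal2
   # 7 u: g                  NB. 2 literal2s (a surrogate pair)
   3 u: 7 u: g               NB. integer values of the pair: 55357 56832
```

The second count shows why # on literal2 data is a count of 16-bit units and not necessarily of code points.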

UTF-32, based on literal4 (the 4-byte literal data type), is also supported. See u: for details. Unlike UTF-8 and UTF-16, UTF-32 is a fixed length encoding in which each Unicode code point is represented by one literal4 character. UTF-32 is the most convenient encoding for manipulating Unicode data.
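A minimal sketch of the fixed length property, assuming J805 or later where 9 u: converts to UTF-32 and 8 u: converts to UTF-8. The names g and w are ours.

```j
   g=: 240 159 152 128{a.    NB. UTF-8 bytes for U+1F600
   w=: 9 u: g                NB. convert to UTF-32 (literal4)
   #g                        NB. 4 bytes in UTF-8
   #w                        NB. 1: one literal4 per code point
   3 u: w                    NB. 128512, which is hex 1F600
   g -: 8 u: w               NB. 1: converting back to UTF-8 round-trips
```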

The standard library defines ucp, uucp, ucpcount, and utf8. The script ~addons/convert/misc/unicode.ijs defines additional utilities.
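The standard library names can be tried as follows. This is a sketch that assumes the usual stdlib definitions, in which ucp converts UTF-8 to literal2 (leaving pure ASCII as literal), uucp always returns literal2, ucpcount counts the items of the ucp result, and utf8 converts to UTF-8. The name s is hypothetical.

```j
   s=: 'a',195 161{a.     NB. 3 bytes of UTF-8: 'a' followed by U+00E1
   #s                     NB. 3 (bytes)
   ucpcount s             NB. 2 (code points)
   # ucp s                NB. 2 literal2s
   # uucp 'abc'           NB. 3 literal2s, even for pure ASCII input
   s -: utf8 ucp s        NB. 1: utf8 restores the original bytes
```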

There are good Unicode and UTF-8 references on the web, for example the FAQ "UTF-8, UTF-16, UTF-32 & BOM" and www.utf-8.com.

There are several related wiki pages, for example: UnicodeGettingStarted.

Operations on UTF-8 and UTF-16 data

Except for the u: verb, the J engine does not interpret bytes as UTF-8 or literal2s as UTF-16 code points. For example, applying # to UTF-8 encoded literal data gives the number of bytes, not the number of Unicode code points. Similarly, applying # to UTF-16 encoded literal2 data gives the number of literal2s, not the number of code points. Other verbs such as {. and dyadic $ can produce invalid Unicode sequences. Converting UTF-8 and UTF-16 to UTF-32 before processing avoids these pitfalls.
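These pitfalls, and the UTF-32 workaround, can be sketched as follows. This assumes J805 or later for 9 u:, and the names s, t and w are ours.

```j
   s=: 'a',195 161{a.     NB. UTF-8: 'a' followed by the 2-byte sequence for U+00E1
   #s                     NB. 3 bytes, although there are only 2 code points
   t=: 2 {. s             NB. cuts the 2-byte sequence in half: invalid UTF-8
   w=: 9 u: s             NB. UTF-32: one literal4 per code point
   #w                     NB. 2
   8 u: 1 }. w            NB. drop the first code point, then convert back to UTF-8
```

Working on w, verbs such as {. and }. operate on whole code points, so the result converted back with 8 u: is again a valid UTF-8 sequence.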

Boxed display

J805 has improved boxed display of UTF-8 with European or CJK (Chinese/Japanese/Korean) data.

An ASCII character takes 1 data byte and 1 display space. A European (accented) character typically takes 2 data bytes and 1 display space. A CJK character typically takes 3 data bytes and 2 display spaces (with a suitable font).
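The byte counts can be checked directly. In this sketch the name b is hypothetical, and the byte lists are the UTF-8 encodings of U+00E1 and U+6C92. How many display spaces each character occupies depends on the font and cannot be measured from J this way.

```j
   b=: 'a' ; (195 161{a.) ; 230 178 146{a.   NB. ASCII, European and CJK, one character each
   #&> b                                     NB. data bytes per box: 1 2 3
```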

The mismatch in data bytes vs display space caused boxed display in previous releases to have misaligned vertical bars.

J805 does a better job aligning vertical bars as it takes into account this mismatch.

European is a simpler problem and is supported by most modern fixed pitch fonts.

CJK is more complicated: many fixed pitch fonts do not display CJK characters in exactly 2 display spaces, so boxed display will not align the vertical bars.

unifont is a fixed pitch font that works with boxed CJK display. For Windows and Mac, install the standard unifont ttf from http://unifoundry.com/unifont.html. For Linux, use apt-get (or similar) to install unifont from the distro repos.

In J805 try:

   <195 161{a.        NB. 2-byte UTF-8 sequence (U+00E1, European)
   <230 178 146{a.    NB. 3-byte UTF-8 sequence (U+6C92, CJK)

Well-aligned boxed display depends on valid UTF-8 sequences in the Unicode ranges that are handled. For example, Unicode characters that require 3 display spaces or are not supported in unifont will not have aligned vertical bars.

The Windows console and PowerShell do not support UTF-8 (they use code pages) and will have problems displaying UTF-8 data.

updated - August 2016 - J805

