[Oberon] Oberon with strings, Eberon
jwr at robrts.net
Thu Feb 25 07:04:47 CET 2016
Regarding Unicode, it is important to distinguish between a
character's number (its code point) and the way that number is
encoded.
See especially http://www.unicode.org/standard/principles.html#Encoding_Forms
and http://unicode.org/standard/WhatIsUnicode.html
and https://en.wikipedia.org/wiki/Unicode
and http://www.unicode.org/standard/where/
"To keep character coding simple and efficient, the Unicode Standard
assigns each character a unique numeric value and name." The way that
number is encoded can differ. Three of the most common ways to encode
it are UTF-8, UTF-16, and UTF-32. UTF-8 and UTF-16 are both
variable-width encodings. UTF-32 is fixed width.
"UTF-8 is popular for HTML and similar protocols. ... It has the
advantages that the Unicode characters corresponding to the familiar
ASCII set have the same byte values as ASCII, and that Unicode
characters transformed into UTF-8 can be used with much existing
software without extensive software rewrites."
"[In UTF-16] all the heavily used characters fit into a single 16-bit
code unit, while all other characters are accessible via pairs of
16-bit code units."
"UTF-32 is useful where memory space is no concern, but fixed width,
single code unit access to characters is desired. Each Unicode
character is encoded in a single 32-bit code unit when using UTF-32."
"All three encoding forms need at most 4 bytes (or 32-bits) of data
for each character."
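To make the byte counts concrete, here is a minimal Python sketch that
reports how many bytes a single character takes in each encoding form.
The function name encoded_sizes is made up for illustration, and the
three sample characters are my own choice, not taken from the Unicode
pages quoted above.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character needs in each encoding form."""
    # The "-le" codecs are used so that Python does not prepend a
    # byte-order mark, which would inflate the counts.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# An ASCII letter, a typographic quote, and a character outside the
# 16-bit range (illustrative samples only).
for ch in ["A", "\u2019", "\U0001F600"]:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

UTF-32 always reports 4 bytes, while the UTF-8 and UTF-16 counts vary
with the character, which is what "variable-width" means here.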
Chris Glur said:
"<single quote> which is *3 bytes*"
That would be the case in a UTF-8 encoding of one of the typographic
single-quote characters. UTF-8 can produce a 3-byte encoding; UTF-16
(2 or 4 bytes) and UTF-32 (always 4 bytes) cannot. According to
http://www.unicode.org/charts/PDF/U2000.pdf the character number could
be 2018, 2019, 201A, or 201B. Or it could be 0027 or 02BC or 275C.
According to http://www.unicode.org/charts/PDF/U0000.pdf an ASCII
apostrophe is 0027. That chart also says that 2019 is preferred for
apostrophe. So PART of what we're dealing with may be that the
apostrophe/single-quote is being encoded as one of the "single-quote"
characters, probably the preferred 2019 character, rather than as an
0027 "apostrophe".
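One way to check which of these candidates would show up as 3 bytes is
to encode each one as UTF-8 and look at the result. This is only a
sketch; the names CANDIDATES and utf8_bytes are hypothetical, and the
list is just the character numbers mentioned above.

```python
# Character numbers mentioned above as possible "single quote" characters.
CANDIDATES = [0x0027, 0x02BC, 0x2018, 0x2019, 0x201A, 0x201B, 0x275C]


def utf8_bytes(code_point):
    """Encode one character number as UTF-8 and return the raw bytes."""
    return chr(code_point).encode("utf-8")


for cp in CANDIDATES:
    encoded = utf8_bytes(cp)
    print("U+%04X  %d byte(s)  %s" % (cp, len(encoded), encoded.hex()))
```

Only 0027 comes out as a single byte. 02BC takes two bytes, and the
others, including the preferred 2019, take three.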
I have never had to write code to deal with Unicode, but co-workers
have. A brute-force technique they used when memory and time allowed
was to detect and convert whatever input encoding was provided to
UTF-32, then deal with the text as fixed-width 32-bit "characters",
then convert back to whatever encoding was desired. That approach
could work reasonably well when implementing a character-by-character
parser.
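A rough Python sketch of that brute-force approach follows. It is not
my co-workers' code. All three function names are invented for
illustration, the input encoding is passed in by the caller rather
than detected, and the "processing" step (turning 2019 into 0027) is
just a stand-in for whatever a real parser would do.

```python
def to_utf32_units(data, encoding):
    """Decode bytes in a known encoding into a list of 32-bit code units."""
    raw = data.decode(encoding).encode("utf-32-le")  # "-le": no byte-order mark
    return [int.from_bytes(raw[i:i + 4], "little") for i in range(0, len(raw), 4)]


def from_utf32_units(units, encoding):
    """Convert a list of 32-bit code units back to bytes in the given encoding."""
    raw = b"".join(u.to_bytes(4, "little") for u in units)
    return raw.decode("utf-32-le").encode(encoding)


def straighten_apostrophes(units):
    """Stand-in processing step: one fixed-width unit per character."""
    return [0x0027 if u == 0x2019 else u for u in units]


utf8_input = "don\u2019t".encode("utf-8")       # 7 bytes, variable width
units = to_utf32_units(utf8_input, "utf-8")     # 5 units, one per character
output = from_utf32_units(straighten_apostrophes(units), "utf-8")
print(len(utf8_input), len(units), output)
```

Once the text is a list of 32-bit units, indexing and
character-by-character scanning no longer need to know anything about
multi-byte sequences, at the cost of 4 bytes per character.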