[Oberon] Oberon with strings, Eberon

jwr at robrts.net
Thu Feb 25 07:04:47 CET 2016


Regarding Unicode, it is important to distinguish between a
character's number (its code point) and the encoding used to store
that number.

See especially http://www.unicode.org/standard/principles.html#Encoding_Forms
and http://unicode.org/standard/WhatIsUnicode.html
and https://en.wikipedia.org/wiki/Unicode
and http://www.unicode.org/standard/where/

"To keep character coding simple and efficient, the Unicode Standard  
assigns each character a unique numeric value and name."  The way that  
number is encoded can differ.  Three of the most common ways to encode  
it are UTF-8, UTF-16, and UTF-32.  UTF-8 and UTF-16 are both  
variable-width encodings. UTF-32 is fixed width.

"UTF-8 is popular for HTML and similar protocols. ... It has the  
advantages that the Unicode characters corresponding to the familiar  
ASCII set have the same byte values as ASCII, and that Unicode  
characters transformed into UTF-8 can be used with much existing  
software without extensive software rewrites."
"[In UTF-16] all the heavily used characters fit into a single 16-bit  
code unit, while all other characters are accessible via pairs of  
16-bit code units."
"UTF-32 is useful where memory space is no concern, but fixed width,  
single code unit access to characters is desired. Each Unicode  
character is  encoded in a single 32-bit code unit when using UTF-32."

"All three encoding forms need at most 4 bytes (or 32-bits) of data  
for each character."
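
As a rough illustration of the difference between the character number
and its encoding, here is a minimal Python sketch that reports how many
bytes one character needs in each of the three encoding forms. The
sample characters and the helper name encoded_sizes are my own choices
for illustration, not anything from the Unicode documents quoted above.

```python
# Encoded size of a few code points in each encoding form.
# SAMPLES and encoded_sizes are illustrative assumptions, not from the post.
SAMPLES = [0x0041, 0x00E9, 0x2019, 0x1F600]


def encoded_sizes(code_point):
    """Return byte lengths of one code point in UTF-8, UTF-16, UTF-32."""
    ch = chr(code_point)
    # The "-le" codec variants are used so that no byte-order mark is added.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(enc)) for enc in encodings)


for cp in SAMPLES:
    print(f"U+{cp:04X}", encoded_sizes(cp))
# U+0041 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+2019 (3, 2, 4)
# U+1F600 (4, 4, 4)
```

The UTF-8 and UTF-16 columns vary from character to character, while
the UTF-32 column is always 4, which is what "variable width" and
"fixed width" mean here.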

Chris Glur said:
    "<single quote> which is *3 bytes*"
That would be the case for a UTF-8 encoding of one of the single-quote
characters. UTF-8 can produce a 3-byte encoding; UTF-16 and UTF-32
cannot, since UTF-16 uses 2 or 4 bytes per character and UTF-32 always
uses 4. According to http://www.unicode.org/charts/PDF/U2000.pdf
the character number could be 2018, 2019, 201A, or 201B. Other
candidates are 0027, 02BC, and 275C, though of those three only 275C
takes 3 bytes in UTF-8 (0027 takes 1 byte and 02BC takes 2).
According to http://www.unicode.org/charts/PDF/U0000.pdf
the ASCII apostrophe is 0027. That chart also says that 2019 is
preferred for apostrophe. So PART of what we're dealing with may be
that the apostrophe/single-quote is being encoded as one of the
"single-quote" characters, probably the preferred 2019 character,
rather than as the 0027 "apostrophe".
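
To check which of these candidates would actually show up as 3 bytes,
the following Python sketch encodes each one in UTF-8 and prints its
length along with the name from Python's copy of the Unicode character
database. The list CANDIDATES and the helper utf8_length are names I
made up for this example.

```python
import unicodedata

# Candidate "single quote" code points mentioned in this message.
CANDIDATES = [0x0027, 0x02BC, 0x2018, 0x2019, 0x201A, 0x201B, 0x275C]


def utf8_length(code_point):
    """Number of bytes needed to encode one code point in UTF-8."""
    return len(chr(code_point).encode("utf-8"))


for cp in CANDIDATES:
    name = unicodedata.name(chr(cp))
    print(f"U+{cp:04X}  {utf8_length(cp)} byte(s)  {name}")

# Keep only the candidates that occupy exactly 3 bytes in UTF-8.
three_byte = [cp for cp in CANDIDATES if utf8_length(cp) == 3]
print([f"U+{cp:04X}" for cp in three_byte])
# ['U+2018', 'U+2019', 'U+201A', 'U+201B', 'U+275C']
```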

I have never had to write code to deal with Unicode, but co-workers  
have. A brute-force technique they used when memory and time allowed  
was to detect and convert whatever input encoding was provided to  
UTF-32, then deal with the text as fixed-width 32-bit "characters",  
then convert back to whatever encoding was desired. That approach  
could work reasonably well when implementing a character-by-character  
parser.
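
A minimal Python sketch of that approach, under some assumptions: the
input encoding is already known (detection is skipped entirely), and a
list of integer code points stands in for the array of 32-bit
"characters". The function names are hypothetical, and replacing 2019
with 0027 is just a stand-in for whatever per-character processing a
real parser would do. This is not my co-workers' actual code.

```python
def to_code_points(data, encoding):
    """Decode bytes and return one integer code point per character."""
    return [ord(ch) for ch in data.decode(encoding)]


def from_code_points(code_points, encoding):
    """Re-encode a list of code points into the desired encoding."""
    return "".join(chr(cp) for cp in code_points).encode(encoding)


def normalize_apostrophes(code_points):
    """Example per-character step: map U+2019 to the ASCII apostrophe."""
    return [0x0027 if cp == 0x2019 else cp for cp in code_points]


# "it's" written with U+2019, arriving as UTF-8 (6 bytes, 4 characters).
raw = "it\u2019s".encode("utf-8")
points = to_code_points(raw, "utf-8")
print(len(raw), len(points))  # 6 4

# Process character by character, then write out as UTF-16.
result = from_code_points(normalize_apostrophes(points), "utf-16-le")
print(result.decode("utf-16-le"))  # it's
```

Once the text is a flat sequence of code points, the parser can index
and compare characters without caring how many bytes each one took in
the original encoding.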


