[Oberon] Files.Write and 2-byte CHARs
Jörg
joerg.straube at iaeth.ch
Wed Jul 21 19:34:17 CEST 2021
Arthur
One additional remark: I recommended representing CHAR internally as 4 bytes, to be able to hold all possible Unicode characters.
If you want to cut memory usage in half, you could represent a CHAR internally as 2 bytes and restrict yourself to the Unicode BMP. The BMP covers quite a lot of languages, but unfortunately almost no emojis 😊
In FPGA Oberon (with only 1 MB of RAM) you have to design your font data structure carefully, as Unicode fonts easily fill up your whole 1 MB of memory.
My recommendation: break the font up into blocks of 128 CHARs and load them dynamically when needed. Most programmers write programs (and use text) in ONE language only:
Latin OR Cyrillic OR Greek. It is rare to have to render text in Latin AND Cyrillic AND Greek AND …
In Unicode those scripts are often grouped in blocks of 128 or 256 characters (Greek: 03xx, Cyrillic: 04xx, Hebrew: 05xx, Arabic: 06xx and so on).
Most emojis did not fit in the BMP anymore; they can be found in the SMP, roughly between 1F3xx and 1FAxx (the 😊 above is 1F60A).
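
The block idea above can be sketched as follows. This is only an illustration in Python, not Oberon code: the class BlockFont, its method names and the stand-in loader fake_loader are hypothetical, and a real Fonts.Mod would read glyph data from a font file instead.

```python
BLOCK_SIZE = 128  # characters per font block, as suggested above


class BlockFont:
    """Loads glyph blocks lazily; `load_block` is a hypothetical loader."""

    def __init__(self, load_block):
        self._load_block = load_block  # callable: block number -> list of glyphs
        self._blocks = {}              # cache: block number -> loaded glyphs

    def glyph(self, code_point):
        block_no, offset = divmod(code_point, BLOCK_SIZE)
        if block_no not in self._blocks:  # first use of this block: load it
            self._blocks[block_no] = self._load_block(block_no)
        return self._blocks[block_no][offset]

    def loaded_blocks(self):
        return sorted(self._blocks)


def fake_loader(block_no):
    # Stand-in for reading one block from a font file: placeholder glyphs only.
    base = block_no * BLOCK_SIZE
    return [f"glyph-{base + i:04X}" for i in range(BLOCK_SIZE)]


font = BlockFont(fake_loader)
for ch in "Hello, \u041f\u0440\u0438\u0432\u0435\u0442":  # Latin plus Cyrillic text
    font.glyph(ord(ch))
print(font.loaded_blocks())  # only the blocks actually used are in memory
```

Rendering mixed Latin and Cyrillic text here touches just two blocks (0 and 8), which is the memory saving the recommendation is after. An eviction policy for a 1 MB machine is left out of the sketch.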
br
Jörg
From: Jörg <joerg.straube at iaeth.ch>
Sent: Wednesday, July 21, 2021 6:46 PM
To: 'ETH Oberon and related systems' <oberon at lists.inf.ethz.ch>
Subject: RE: [Oberon] Files.Write and 2-byte CHARs
Arthur
The implementation of CHAR is not defined in the Oberon report: its length is not defined, the charset is not defined, and the encoding is not defined either.
It’s all up to the implementation.
The charset can be ASCII or EBCDIC or Unicode or any other. If it’s Unicode it can be coded as UTF-8, UCS-2, UTF-16 or others.
As implementor, you can decide to represent CHAR differently internally and externally.
In Project Oberon the implementation is as follows:
* CHAR takes one byte
* CHAR charset is 7-bit ASCII
* internal = external
If you want to extend CHAR to more than 7 bits, I recommend doing it as follows:
* Adapt the compiler to store CHAR internally as INTEGER (4 bytes). Don't use UTF-8 internally: with a variable-length encoding, ARRAY OF CHAR would no longer be an array of characters, as s[i] would be a byte rather than the i-th character.
* Reading/writing a 4-byte CHAR: encode/decode it on the external medium as UTF-8
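
To illustrate why a fixed-size internal CHAR keeps array indexing meaningful while UTF-8 does not, here is a small sketch in Python (not Oberon). The sample string is an arbitrary choice; the list of integers stands in for an ARRAY OF 4-byte CHAR.

```python
text = "na\u00efve \U0001F60E"  # "naïve 😎": 7 characters

# Internal form recommended above: one fixed-size code point per element,
# comparable to ARRAY OF CHAR with a 4-byte CHAR.
code_points = [ord(c) for c in text]

# External form: variable-length UTF-8 bytes.
utf8 = text.encode("utf-8")

print(len(code_points), len(utf8))     # element count vs. byte count
print(hex(code_points[2]))             # the third character (ï)
print(hex(utf8[2]))                    # only the first byte of ï

# Converting back and forth at the I/O boundary loses nothing.
assert utf8.decode("utf-8") == "".join(chr(cp) for cp in code_points)
```

With code points, index 2 is the third character; with UTF-8 bytes, index 2 is merely a fragment of it, so every string operation would have to scan from the start.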
You have to make quite a few adaptations to the system:
* Rewrite Fonts.Mod
* Decide how to store filenames (32 CHARs or 32 bytes) --> adapt FileDir.Mod
(recommendation: stick to 32 bytes and treat them as UTF-8. Consequence: filenames can get shorter than 32 CHARs if Unicode characters are used)
* Decide how to store module names --> adapt Modules.Mod
(recommendation: stick to 32 bytes and treat them as UTF-8. Consequence: module names can get shorter than 32 CHARs if Unicode characters are used)
* Decide how to store string constants in the module --> adapt the compiler
(recommendation: store them as 4-byte CHARs so that string assignments like s := "Hi 😎"; do not get too complex)
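
The consequence for names mentioned above (a 32-byte field holds fewer than 32 characters once non-ASCII characters appear) can be sketched like this. It is a Python illustration, not Oberon; the helper fit_name is hypothetical, and it ignores whether a terminating 0X must also fit into the 32 bytes.

```python
NAME_BYTES = 32  # size of the name field, as in the recommendation above


def fit_name(name, limit=NAME_BYTES):
    """Return the longest prefix of `name` whose UTF-8 form fits in `limit` bytes."""
    encoded = name.encode("utf-8")
    if len(encoded) <= limit:
        return name
    cut = limit
    # encoded[cut] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut would split a character, so
    # step back until the cut falls on a character boundary.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut].decode("utf-8")


ascii_name = fit_name("A" * 40)            # 1 byte per character
cyrillic_name = fit_name("\u0416" * 40)    # 2 bytes per character
emoji_name = fit_name("\U0001F60E" * 10)   # 4 bytes per character
print(len(ascii_name), len(cyrillic_name), len(emoji_name))
```

The same 32 bytes hold 32 ASCII characters, 16 Cyrillic characters or 8 emojis.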
br
Jörg
From: Oberon <oberon-bounces at lists.inf.ethz.ch> On Behalf Of Arthur Yefimov
Sent: Wednesday, July 21, 2021 5:31 PM
To: oberon at lists.inf.ethz.ch
Subject: [Oberon] Files.Write and 2-byte CHARs
In some Oberon versions, type CHAR has a size of 2 or 4 bytes.
(e.g. BlackBox has a 2-byte CHAR, Active Oberon has a 4-byte CHAR.)
In the latest Oberon language report (2016), the size of type CHAR
is not defined, instead CHAR is said to hold "the characters of a
standard character set". A new type BYTE is added that is said to
hold "the integers between 0 and 255". Now BYTE is used instead of
CHAR where it is necessary to work with binary data (or files).
Thus, in Project Oberon, module Files now has the following procedures:
PROCEDURE ReadByte*(VAR r: Rider; VAR x: BYTE);
PROCEDURE ReadBytes*(VAR r: Rider; VAR x: ARRAY OF BYTE; n: INTEGER);
PROCEDURE Read*(VAR r: Rider; VAR ch: CHAR);
PROCEDURE ReadString*(VAR R: Rider; VAR x: ARRAY OF CHAR);
PROCEDURE WriteByte*(VAR r: Rider; x: BYTE);
PROCEDURE WriteBytes*(VAR r: Rider; x: ARRAY OF BYTE; n: INTEGER);
PROCEDURE Write*(VAR r: Rider; ch: CHAR);
PROCEDURE WriteString*(VAR R: Rider; x: ARRAY OF CHAR);
The procedure Write is internally the same as WriteByte, and likewise
procedure Read is the same as ReadByte, but with different signatures.
The only difference is that, for example,
r.buf.data[r.bpos] := ORD(ch)
is used in Write instead of
r.buf.data[r.bpos] := x
(as in WriteByte).
In Project Oberon, the size of CHAR is 1 byte.
But, if CHAR were 2 bytes, module Files should provide a way to read and
write the characters in the way that is convenient for the further usage of
the file. If CHARs are written to a file raw, as 2-byte integers, then
the file would have an encoding of UTF-16, UCS-2 or similar (without BOM),
and thus it will probably not display properly in most modern text viewers.
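
To make the difference concrete, the following Python sketch (not Oberon) compares the bytes a raw 2-byte dump would produce with the UTF-8 bytes for the same text. Little-endian byte order is an assumption here, and the sample strings are arbitrary.

```python
latin = "Hi"
cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет"

# Raw dump: every CHAR written as a 2-byte integer, no BOM
# (little-endian byte order assumed).
raw_latin = latin.encode("utf-16-le")
raw_cyrillic = cyrillic.encode("utf-16-le")

# The same text as UTF-8.
utf8_latin = latin.encode("utf-8")
utf8_cyrillic = cyrillic.encode("utf-8")

print(raw_latin.hex(" "), "|", utf8_latin.hex(" "))
print(raw_cyrillic.hex(" "), "|", utf8_cyrillic.hex(" "))
```

Even plain ASCII text gets a zero byte after every character in the raw form, which is what typically confuses tools that expect ASCII or UTF-8.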
My proposal (in case of 2-byte or 4-byte CHARs) is to make Files.Read
and Files.Write work with CHARs in the following manner:
1. The file is assumed to be UTF-8 encoded.
2. Files.Read gets one or more bytes from a file and constructs
a value of CHAR.
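3. Files.Write converts the given CHAR to UTF-8 and writes one
or more bytes to the file.
The number of bytes read or written for a 2-byte CHAR can be 1, 2 or 3,
because UTF-8 uses some bits of every byte as markers: values below 80H take
1 byte, values below 800H take 2 bytes, and all other 2-byte values take 3 bytes.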
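
The three steps above can be sketched as follows for a 2-byte CHAR. This is a Python illustration rather than Oberon code; write_char and read_char are hypothetical stand-ins for Files.Write and Files.Read, an in-memory buffer plays the role of the file, and there is no validation of malformed input or surrogate values.

```python
import io


def write_char(f, cp):
    """Write one 2-byte CHAR (code point 0..0xFFFF) as 1 to 3 UTF-8 bytes."""
    if cp < 0x80:                       # 0xxxxxxx
        f.write(bytes([cp]))
    elif cp < 0x800:                    # 110xxxxx 10xxxxxx
        f.write(bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]))
    else:                               # 1110xxxx 10xxxxxx 10xxxxxx
        f.write(bytes([0xE0 | (cp >> 12),
                       0x80 | ((cp >> 6) & 0x3F),
                       0x80 | (cp & 0x3F)]))


def read_char(f):
    """Read 1 to 3 bytes and rebuild the CHAR; returns None at end of file."""
    first = f.read(1)
    if not first:
        return None
    b = first[0]
    if b < 0x80:
        return b
    if b < 0xE0:                        # lead byte of a 2-byte sequence
        n, cp = 1, b & 0x1F
    else:                               # lead byte of a 3-byte sequence
        n, cp = 2, b & 0x0F
    for c in f.read(n):                 # fold in the continuation bytes
        cp = (cp << 6) | (c & 0x3F)
    return cp


buf = io.BytesIO()
for ch in "A\u0416\u20ac":              # A, Ж, €: 1, 2 and 3 bytes
    write_char(buf, ord(ch))
data = buf.getvalue()

buf.seek(0)
decoded = [read_char(buf) for _ in range(3)]
print(data.hex(" "), [hex(cp) for cp in decoded])
```

A 4-byte CHAR would need one more branch on each side for 4-byte sequences (values from 10000H upwards).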
For your information:
The 2-byte subset of Unicode (the BMP) covers practically all modern languages of the world, including
Chinese, Japanese, Korean and Thai. The code points beyond it are used to
encode ancient writings, emoji, and some strange things like playing card icons,
tiles of the game Mah Jongg, and even dominoes.
Additionally, two procedures WriteChar and ReadChar can be added that
write and read the values of CHARs directly (for fast local non-portable data storage).
Kind regards,
Arthur Yefimov
https://free.oberon.org/en