[Oberon] Files.Write and 2-byte CHARs

Wed Jul 21 19:34:17 CEST 2021

Arthur

One additional remark: I recommended representing CHAR internally as 4 bytes, so that it can hold every possible Unicode character.

If you want to cut memory usage in half, you could represent a CHAR internally as 2 bytes and restrict yourself to the Unicode BMP (Basic Multilingual Plane). The BMP covers quite a lot of languages, but unfortunately hardly any emojis 😊
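A minimal sketch of the BMP limit, written in Python rather than Oberon so that it can be run anywhere. The helper name `fits_in_bmp` is made up for this illustration and is not part of any Oberon system.

```python
BMP_MAX = 0xFFFF  # highest code point a 2-byte CHAR can hold


def fits_in_bmp(ch: str) -> bool:
    """Return True if the single character ch fits into a 2-byte CHAR."""
    return ord(ch) <= BMP_MAX


if __name__ == "__main__":
    # Latin A, Cyrillic Zhe, a CJK ideograph, and the emoji U+1F60A
    for ch in ("A", "\u0416", "\u4e2d", "\U0001F60A"):
        print(f"U+{ord(ch):04X} fits in 2 bytes: {fits_in_bmp(ch)}")
```

The first three characters fit; the emoji at U+1F60A does not, which is the price of the 2-byte representation.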

In FPGA Oberon (with only 1 MB of RAM) you have to design your font data structure carefully, as Unicode fonts can easily fill up the whole 1 MB of memory.

My recommendation: break the font up into blocks of 128 CHARs and load them dynamically when needed. Most programmers write programs (and use text) in ONE language only.

Latin OR Cyrillic OR Greek. It is very rare to have to render text in Latin AND Cyrillic AND Greek AND …

In Unicode the scripts are grouped in contiguous ranges, so each script maps onto just a few 128-character blocks (Greek: 03xx, Cyrillic: 04xx, Hebrew: 05xx, Arabic: 06xx and so on).
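The block idea can be sketched as a small cache keyed by block number. This is Python used as executable pseudocode, not Oberon; `FontCache`, `fake_loader` and the glyph representation are assumptions for illustration only and say nothing about how Fonts.Mod is actually organised.

```python
BLOCK_SIZE = 128  # number of CHARs per font block, as suggested above


class FontCache:
    """Loads 128-glyph blocks on first use and keeps them in memory."""

    def __init__(self, load_block):
        # load_block is a hypothetical callback: block number -> list of 128 glyphs
        self._load_block = load_block
        self._blocks = {}

    def glyph(self, code: int):
        block_no, offset = divmod(code, BLOCK_SIZE)
        if block_no not in self._blocks:          # load only when first needed
            self._blocks[block_no] = self._load_block(block_no)
        return self._blocks[block_no][offset]

    def loaded_blocks(self):
        return sorted(self._blocks)


def fake_loader(block_no: int):
    # Stand-in for reading glyph data from a font file.
    return [f"glyph U+{block_no * BLOCK_SIZE + i:04X}" for i in range(BLOCK_SIZE)]


if __name__ == "__main__":
    cache = FontCache(fake_loader)
    print(cache.glyph(0x0041), cache.glyph(0x0416))
    print(cache.loaded_blocks())
```

Rendering a Latin and a Cyrillic letter loads only blocks 0 and 8 (0000-007F and 0400-047F); all other blocks stay on disk.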

Emojis no longer fit in the BMP; most of them are found in the SMP (Supplementary Multilingual Plane), for example at 1F6xx and 1F9xx.

br

Jörg

From: Jörg <joerg.straube at iaeth.ch> 
Sent: Wednesday, July 21, 2021 6:46 PM
To: 'ETH Oberon and related systems' <oberon at lists.inf.ethz.ch>
Subject: RE: [Oberon] Files.Write and 2-byte CHARs

Arthur

The implementation of CHAR is not defined in the Oberon report: its length is not defined, the charset is not defined, and the coding is not defined either.

It’s all up to the implementation.

The charset can be ASCII or EBCDIC or Unicode or any other. If it’s Unicode it can be coded as UTF-8, UCS-2, UTF-16 or others.

As an implementor, you can decide to represent CHAR differently internally and externally.

In Project Oberon the implementation is as follows:

*	CHAR takes one byte
*	CHAR charset is 7-bit ASCII
*	internal = external

If you want to extend CHAR to more than 7 bits, I recommend doing it as follows:

*	Adapt the compiler to store CHAR internally as INTEGER (4 bytes). Don't use UTF-8 internally, as an ARRAY OF CHAR would then no longer be a true array: characters would have variable width, so s[i] could no longer address the i-th character directly.
*	Reading/writing a 4-byte CHAR: encode/decode it on the external medium as UTF-8
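To illustrate why UTF-8 is awkward as an internal representation, here is a minimal sketch in Python (not Oberon). It compares a list of code points, which behaves like an array with one INTEGER per CHAR, with the UTF-8 bytes of the same text.

```python
text = "Gr\u00fc\u00dfe"                 # "Grüße": 5 characters

code_points = [ord(c) for c in text]    # fixed width: one integer per CHAR
utf8_bytes = text.encode("utf-8")       # variable width: 1 or 2 bytes per CHAR here

print(len(code_points))                 # 5
print(len(utf8_bytes))                  # 7

# Index 3 is the 4th character in the fixed-width form ...
print(hex(code_points[3]))              # 0xdf, i.e. 'ß'
# ... but only the second byte of 'ü' (C3 BC) in the UTF-8 form.
print(hex(utf8_bytes[3]))               # 0xbc
```

With the fixed-width form, indexing and length work as usual; with UTF-8 bytes, every index operation would first have to scan the string.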

You will have to make quite a few adaptations to the system:

*	Rewrite Fonts.Mod
*	Decide how to store file names (32 CHARs or 32 bytes) --> adapt FileDir.Mod
(recommendation: stick to 32 bytes and treat them as UTF-8. Consequence: file names can get shorter than 32 CHARs if Unicode characters are used)
*	You have to decide how to store module names --> adapt Modules.Mod
(recommendation: stick to 32 bytes and treat them as UTF-8. Consequence: module names can get shorter than 32 CHARs if Unicode characters are used)
*	You have to decide how to store string constants in the module --> adapt compiler
(recommendation: store them as 4-byte CHARs so that string assignments such as s := "Hi 😎"; do not get too complex)
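The consequence for name lengths can be checked with a minimal Python sketch. `NAME_BYTES` and `fits_in_name_field` are hypothetical names for this illustration, and whether a terminating 0X also has to fit into the 32 bytes is deliberately left out.

```python
NAME_BYTES = 32  # size of the name field in bytes, as discussed above


def fits_in_name_field(name: str) -> bool:
    """True if the UTF-8 encoding of name fits into the fixed-size field."""
    return len(name.encode("utf-8")) <= NAME_BYTES


if __name__ == "__main__":
    for name in ("Texts.Mod", "\u0416" * 16, "\u0416" * 20):
        size = len(name.encode("utf-8"))
        print(len(name), "CHARs,", size, "bytes, fits:", fits_in_name_field(name))
```

A name of 20 Cyrillic letters needs 40 bytes and no longer fits, although it is well below 32 CHARs.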

br

Jörg

From: Oberon <oberon-bounces at lists.inf.ethz.ch> On Behalf Of Arthur Yefimov
Sent: Wednesday, July 21, 2021 5:31 PM
To: oberon at lists.inf.ethz.ch
Subject: [Oberon] Files.Write and 2-byte CHARs

In some Oberon versions, type CHAR has a size of 2 or 4 bytes
(e.g. BlackBox has a 2-byte CHAR, Active Oberon has a 4-byte CHAR).
In the latest Oberon language report (2016), the size of type CHAR
is not defined; instead CHAR is said to hold "the characters of a
standard character set". A new type BYTE is added that is said to
hold "the integers between 0 and 255". Now BYTE is used instead of
CHAR where it is necessary to work with binary data (or files).

Thus, in Project Oberon, module Files now has the following procedures:

PROCEDURE ReadByte*(VAR r: Rider; VAR x: BYTE);
PROCEDURE ReadBytes*(VAR r: Rider; VAR x: ARRAY OF BYTE; n: INTEGER);
PROCEDURE Read*(VAR r: Rider; VAR ch: CHAR);
PROCEDURE ReadString*(VAR R: Rider; VAR x: ARRAY OF CHAR);

PROCEDURE WriteByte*(VAR r: Rider; x: BYTE);
PROCEDURE WriteBytes*(VAR r: Rider; x: ARRAY OF BYTE; n: INTEGER);
PROCEDURE Write*(VAR r: Rider; ch: CHAR);
PROCEDURE WriteString*(VAR R: Rider; x: ARRAY OF CHAR);

The procedure Write is internally the same as WriteByte, and likewise
procedure Read is the same as ReadByte, but with different signatures.
The only difference is that

    r.buf.data[r.bpos] := ORD(ch)

is used in Write instead of

    r.buf.data[r.bpos] := x

(as in WriteByte).

In Project Oberon, the size of CHAR is 1 byte.

But if CHAR were 2 bytes, module Files should provide a way to read and
write the characters that is convenient for further use of the file.
If CHARs were written to a file raw, as 2-byte integers, the file would
be encoded as UTF-16, UCS-2 or similar (without a BOM), and thus would
probably not display properly in a typical modern text viewer.
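What raw storage produces can be seen in a minimal Python sketch (not Oberon). The function name `write_raw_2byte` is hypothetical, and little-endian byte order is an assumption made here, not something stated in the thread.

```python
import struct


def write_raw_2byte(text: str) -> bytes:
    """Store each CHAR as a raw little-endian 2-byte integer (BMP only)."""
    return b"".join(struct.pack("<H", ord(c)) for c in text)


if __name__ == "__main__":
    raw = write_raw_2byte("Hi")
    print(raw)                                  # b'H\x00i\x00'
    print(raw == "Hi".encode("utf-16-le"))      # True: UTF-16 without a BOM
    print(raw == "Hi".encode("utf-8"))          # False: not what a UTF-8 viewer expects
```

Even plain ASCII text picks up a zero byte after every letter, which is why such a file does not read as UTF-8.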

My proposal (in case of 2-byte or 4-byte CHARs) is to make Files.Read
and Files.Write work with CHARs in the following manner:

1. The file is assumed to be UTF-8 encoded.
2. Files.Read gets one or more bytes from a file and constructs
    a value of CHAR.
3. Files.Write converts the given CHAR to UTF-8 and puts one
    or more bytes in a file.

The number of bytes read or written for a 2-byte CHAR can be 1, 2 or 3,
as UTF-8 uses some bits of every byte for its own framing.
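The three steps above can be sketched as follows for a 2-byte CHAR. This is Python standing in for Oberon: `write_char` and `read_char` are hypothetical stand-ins for the proposed Files.Write and Files.Read, an `io.BytesIO` object plays the role of the rider, and error handling (end of file, malformed or overlong sequences, surrogates) is omitted.

```python
import io


def write_char(f, code: int) -> int:
    """Encode one BMP code point as UTF-8; return the number of bytes written."""
    if code < 0x80:                       # 7 bits  -> 1 byte
        data = bytes([code])
    elif code < 0x800:                    # 11 bits -> 2 bytes
        data = bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    elif code <= 0xFFFF:                  # 16 bits -> 3 bytes
        data = bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    else:
        raise ValueError("code point outside the 2-byte range")
    f.write(data)
    return len(data)


def read_char(f) -> int:
    """Decode one UTF-8 sequence of 1 to 3 bytes; return the code point."""
    first = f.read(1)[0]
    if first < 0x80:
        return first
    if (first & 0xE0) == 0xC0:            # lead byte 110xxxxx: 1 more byte
        extra, code = 1, first & 0x1F
    elif (first & 0xF0) == 0xE0:          # lead byte 1110xxxx: 2 more bytes
        extra, code = 2, first & 0x0F
    else:
        raise ValueError("unsupported lead byte")
    for b in f.read(extra):               # each continuation byte adds 6 bits
        code = (code << 6) | (b & 0x3F)
    return code


if __name__ == "__main__":
    buf = io.BytesIO()
    print([write_char(buf, c) for c in (0x41, 0x416, 0x4E2D)])   # [1, 2, 3]
    buf.seek(0)
    print([hex(read_char(buf)) for _ in range(3)])
```

For a 4-byte CHAR the same scheme would need one more branch on each side for the 4-byte UTF-8 form.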

For your information:

The 2-byte version of Unicode covers practically all modern languages of
the world, including Chinese, Japanese, Korean and Thai. The code points
beyond the 2-byte range are used to encode ancient writings, emoji, and
some strange things like playing card icons, tiles of the game Mah Jongg,
and even dominoes.

Additionally, two procedures WriteChar and ReadChar can be added that
write and read the values of CHARs directly (for fast, local,
non-portable data storage).

Kind regards,

Arthur Yefimov

https://free.oberon.org/en
