[Oberon] Oberon with strings, Eberon
jwr at robrts.net
Sat Feb 27 04:42:40 CET 2016
Short version: There are existing tools which will properly translate
text files in Unicode UTF-8 to ASCII, assuming that most or all of the
characters in those files can be represented properly in ASCII.
Thinking of a text file as being infected by UTF-8 character viruses
does not facilitate development of an effective solution: how does one
inoculate a text file against viruses? What kind of anti-virus is
best to apply? How can an infected file be cured?
Thinking of it as a corrupted file could help, but isn't optimal. How
does one detect and remove the corruption? How can you tell the
difference between corrupted characters and ones that are correct but
unusual?
A useful approach might be to assume an existing file properly
contains Unicode characters encoded in UTF-8, but what is desired is a
text file containing only ASCII characters. Then the problem is how to
translate a Unicode UTF-8 file into an ASCII file, with a minimum of
problems. A further useful assumption is that the file is already
composed primarily of Latin characters which have ASCII codes. (This
approach is inappropriate for translating Chinese text to ASCII, for
instance.)
Given these assumptions, perhaps a tool is needed which reads in a
UTF-8 file, properly decodes each UTF-8 sequence to its Unicode
character number (rather than treating it byte by byte), and then, for
each character, writes an appropriate ASCII character to a separate
output file. It could map multiple kinds of single-quote characters to
the ASCII apostrophe, multiple kinds of double-quote characters to the
ASCII double-quote, etc. as desired.
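A minimal sketch of that mapping in Python (the table contents and the
function name are illustrative, not taken from any existing tool):

```python
# Replacement table: several kinds of Unicode punctuation mapped to
# their closest ASCII equivalents.  Extend as desired.
ASCII_MAP = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00A0": " ",   # no-break space
}

def to_ascii(text):
    """Translate decoded Unicode text to ASCII, character by character.

    Characters below 128 pass through unchanged; mapped characters are
    replaced; anything else becomes '?'.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append(ASCII_MAP.get(ch, "?"))
    return "".join(out)
```

Note that Python's file reading handles the UTF-8 decoding itself
(open the file with encoding="utf-8"), so the function only ever sees
whole characters, never partial byte sequences.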
If I were writing this, I would have the tool keep track of the
character value and text position of every Unicode character which
failed to map properly to ASCII, and dump diagnostic information to
the output at the end, following some kind of end-of-original-text
marker (such as a line consisting of a dozen equals signs). For
diagnostics during testing I might also dump the character number and
number of occurrences for each character which DID translate properly.
The tool might terminate with a "success" result [here I'm thinking
like a Unix programmer...] if all characters were translated properly,
and with "failure" if some number (especially a majority) of
characters failed to translate.
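The bookkeeping described above might be sketched like this (a
hypothetical implementation; the function name, report format, and the
small built-in map are my own invention):

```python
from collections import Counter

# A small illustrative replacement table; a real tool would use a
# much larger one.
DEFAULT_MAP = {"\u2018": "'", "\u2019": "'", "\u201C": '"', "\u201D": '"'}

def translate(text, ascii_map=DEFAULT_MAP):
    """Translate decoded Unicode text to ASCII, appending diagnostics
    after an end-of-original-text marker.

    Returns (output, status) where status is 0 if every character
    translated, 1 otherwise -- Unix exit-code style.
    """
    failed = []               # (character, offset) of unmapped characters
    translated = Counter()    # occurrence counts for mapped characters
    out = []
    for pos, ch in enumerate(text):
        if ord(ch) < 128:
            out.append(ch)
        elif ch in ascii_map:
            out.append(ascii_map[ch])
            translated[ch] += 1
        else:
            out.append("?")
            failed.append((ch, pos))
    report = ["=" * 12]       # marker: a dozen equals signs
    for ch, pos in failed:
        report.append("U+%04X unmapped at offset %d" % (ord(ch), pos))
    for ch, n in sorted(translated.items()):
        report.append("U+%04X mapped, %d occurrence(s)" % (ord(ch), n))
    output = "".join(out) + "\n" + "\n".join(report) + "\n"
    return output, (0 if not failed else 1)
```

A command-line wrapper would read the input file with
encoding="utf-8", write the returned output, and pass the status to
sys.exit().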
The next step is to consider: why wouldn't someone else already have
done this? A Google search for "Unicode to ASCII" quickly turns up
several useful hits. The page
https://docs.python.org/2/howto/unicode.html indicates Python already
has a bunch of support for Unicode, so such a translator should be
easy to implement, especially in a Linux environment. The pages
http://stackoverflow.com/questions/2365411/python-convert-unicode-to-ascii-without-errors
and
http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database
lead to https://pypi.python.org/pypi/Unidecode which includes an
already-developed mapping of Unicode to ASCII. These are not the only
solutions, nor are they likely the best. With further searches I would
expect to find even better solutions, programmed in a variety of
languages.
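The standard-library route discussed on those pages can be sketched as
follows (this uses only Python's built-in unicodedata module, not the
Unidecode package):

```python
import unicodedata

def strip_to_ascii(text):
    """Decompose accented characters (NFKD normalization), then drop
    anything that still has no ASCII representation.

    Cruder than Unidecode's hand-built mapping, but it needs only the
    standard library.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

This handles accented Latin letters well, but characters with no
decomposition (typographic quotes, for instance) are silently dropped
rather than replaced, which is one reason an explicit replacement map
may still be wanted.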
This is an Oberon forum, so a non-Oberon solution may not be optimal.
Perhaps there is an existing Oberon solution. If not, we could write
one, drawing on the ideas and examples above. But before going to that
effort, what is the actual problem or true requirement? Maybe using a
Python tool in a Linux environment, within a sequence of other
processing (PDF to UTF-8 to ASCII to speech in MP3?), is a useful and
acceptable approach for the problem at hand.
-- John Roberts