[Oberon] Oberon with strings, Eberon

jwr at robrts.net
Sat Feb 27 04:42:40 CET 2016


Short version: There are existing tools which will properly translate  
text files in Unicode UTF-8 to ASCII, assuming that most or all of the  
characters in those files can be represented properly in ASCII.


Thinking of a text file as being infected by UTF-8 character viruses  
does not facilitate development of an effective solution: how does one  
inoculate a text file against viruses?  What kind of anti-virus is  
best to apply? How can an infected file be cured?

Thinking of it as a corrupted file could help, but isn't optimal. How  
does one detect and remove the corruption?  How can you tell the  
difference between corrupted characters and ones that are correct but  
unusual?

A useful approach might be to assume an existing file properly  
contains Unicode characters encoded in UTF-8, but what is desired is a  
text file containing only ASCII characters. Then the problem is how to  
translate a Unicode UTF-8 file into an ASCII file, with a minimum of  
problems.  A further useful assumption is that the file is already  
composed primarily of Latin characters which have ASCII codes. (This  
approach is inappropriate for translating Chinese text to ASCII, for  
instance.)

Given these assumptions, perhaps a tool is needed which reads in a  
UTF-8 file, properly translates each encoded Unicode character (not  
byte) to its character number, then writes an appropriate ASCII  
character for each to a separate output file.  It could map multiple  
kinds of single-quote characters to the ASCII apostrophe, multiple  
kinds of double-quote characters to the ASCII double-quote, etc. as  
desired.
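A minimal sketch of that translation step in Python (the function names and the particular mapping table are illustrative, not an existing tool):

```python
# Illustrative sketch of the translator described above; the names and
# the mapping table are my own invention, not part of any existing tool.

# Map common non-ASCII punctuation to ASCII stand-ins: curly single
# quotes to the apostrophe, curly double quotes to the double quote,
# en/em dashes to the hyphen, no-break space to a plain space.
ASCII_MAP = {
    "\u2018": "'", "\u2019": "'",
    "\u201C": '"', "\u201D": '"',
    "\u2013": "-", "\u2014": "-",
    "\u00A0": " ",
}

def to_ascii(text):
    out = []
    for ch in text:                # iterate over characters, not bytes
        if ord(ch) < 128:
            out.append(ch)         # already ASCII; pass through
        else:
            out.append(ASCII_MAP.get(ch, "?"))  # '?' marks unmapped chars
    return "".join(out)

def translate_file(src, dst):
    with open(src, encoding="utf-8") as f:     # decode UTF-8 on input
        text = f.read()
    with open(dst, "w", encoding="ascii") as f:
        f.write(to_ascii(text))
```

The key point is that the loop runs over decoded characters, so a multi-byte UTF-8 sequence is seen as one character, never as stray bytes.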

If I were writing this, I would have the tool keep track of the  
character value and text position of all Unicode characters which  
failed to map properly to ASCII, and dump diagnostic information to  
the output at the end following some kind of end-of-original-text  
marker (such as a line consisting of a dozen equals signs).  For  
diagnostics during testing I might also dump the character number and  
number of occurrences for each character which DID translate properly  
as well. The tool might terminate with a "success" result [here I'm  
thinking like a Unix programmer...] if all characters were translated  
properly, and with "failure" if some number (especially a majority) of  
characters failed to translate.
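In Python, that bookkeeping might look like the following sketch (the names are my own; the marker and the exit-status convention come from the paragraph above):

```python
from collections import defaultdict

MARKER = "=" * 12   # end-of-original-text marker: a dozen equals signs

def translate_with_report(text, mapping):
    """Translate to ASCII, recording the character value and text
    position of every Unicode character that fails to map."""
    out = []
    failed = defaultdict(list)        # char -> positions where it failed
    for pos, ch in enumerate(text):
        if ord(ch) < 128:
            out.append(ch)
        elif ch in mapping:
            out.append(mapping[ch])
        else:
            out.append("?")
            failed[ch].append(pos)
    # Diagnostic dump after the end-of-original-text marker.
    lines = [MARKER]
    for ch, positions in sorted(failed.items()):
        lines.append("unmapped U+%04X at %s" % (ord(ch), positions))
    status = 0 if not failed else 1   # Unix-style success/failure
    return "".join(out) + "\n" + "\n".join(lines), status
```

A command-line wrapper would pass `status` to `sys.exit` so that shell pipelines can test whether everything translated cleanly.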

The next step is to consider: why wouldn't someone else already have  
done this?  A Google search for "Unicode to ASCII" quickly turns up  
several useful hits. The page  
https://docs.python.org/2/howto/unicode.html indicates Python already  
has a bunch of support for Unicode, so such a translator should be  
easy to implement, especially in a Linux environment. The pages  
http://stackoverflow.com/questions/2365411/python-convert-unicode-to-ascii-without-errors  
and  
http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database  
lead to https://pypi.python.org/pypi/Unidecode which includes an  
already-developed mapping of Unicode to ASCII. These are not the only  
solutions, nor are they likely the best. With further searches I would  
expect to find even better solutions, programmed in a variety of  
languages.
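Unidecode is a third-party package; for files that are mostly accented Latin text, Python's standard-library unicodedata module gives a cruder but dependency-free version of the same idea:

```python
import unicodedata

def strip_accents(text):
    # NFKD decomposition splits accented letters into a base letter
    # plus combining marks; encoding to ASCII with errors="ignore"
    # then discards the marks (and any other unmappable characters).
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("caf\u00E9 na\u00EFve"))   # prints: cafe naive
```

Unlike Unidecode, this silently drops characters with no Latin decomposition (Chinese text would vanish entirely), which is consistent with the mostly-Latin-text assumption made earlier.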

This is an Oberon forum, so a non-Oberon solution may not be optimal.  
Perhaps there is an existing Oberon solution. If not, we could write  
one, facilitated using the ideas and the examples above. But before  
going to that effort, what is the actual problem or true requirement?   
Maybe using a Python tool in a Linux environment, within a sequence of  
other processing (PDF to UTF-8 to ASCII to speech in MP3?), is a  
useful and acceptable approach for the problem at hand.

-- John Roberts



