<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
On 1/22/2016 3:38 AM, eas lab wrote:<br>
<blockquote
cite="mid:CAN3-DLEemRQ0mNaJrBFWf6Z6t2f4P5=2O7mc3q6tb5ixe8RZ-A@mail.gmail.com"
type="cite">
<pre wrap="">Who started this absurdity of replacing single-quote/apostrophe by 3 bytes.</pre>
</blockquote>
UTF-8 is a character encoding capable of encoding all possible
characters, or code points, in Unicode.<br>
<br>
The encoding is variable-length and uses 8-bit code units. It was
designed for backward compatibility with ASCII, and to avoid the
complications of endianness and byte order marks in the alternative
UTF-16 and UTF-32 encodings. The name is derived from: Universal
Coded Character Set + Transformation Format—8-bit.[1]<br>
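<br>
To make the byte-order point concrete, here is a minimal Python sketch
using only the built-in <tt>str.encode</tt>. The sample character
(U+20AC, the euro sign) is my own arbitrary choice, not something from
the original message. It shows that UTF-16 yields two different byte
sequences depending on endianness, while UTF-8 has only one.<br>
<pre wrap="">
```python
# U+20AC EURO SIGN is an arbitrary example character (an assumption for this demo).
ch = "\u20ac"

# UTF-8 works on single bytes, so there is only one possible byte order.
utf8 = ch.encode("utf-8")

# UTF-16 works on 16-bit units, so the two bytes of each unit can be swapped.
utf16_le = ch.encode("utf-16-le")  # little-endian
utf16_be = ch.encode("utf-16-be")  # big-endian

print(utf8.hex())      # e282ac
print(utf16_le.hex())  # ac20
print(utf16_be.hex())  # 20ac

# Plain ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_same = "it's".encode("ascii") == "it's".encode("utf-8")
print(ascii_same)      # True
```
</pre>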
A chart of the main text encodings on the Web shows UTF-8 overtaking
the others and nearing 50% prevalence by 2010. Encodings were
detected by examining the text, not from
the encoding tag in the header,[2] and were sorted to the least
inclusive set;[3] thus, ASCII text tagged as UTF-8 or ISO-8859-1 is
identified as ASCII. By January 2016 the declared usage was up to
86%.[4]<br>
<br>
UTF-8 is the dominant character encoding for the World Wide Web,
accounting for 86.1% of all Web pages in January 2016 (with the most
popular East Asian encoding, GB 2312, at 0.9%).[4][2][5] The
Internet Mail Consortium (IMC) recommends that all e-mail programs
be able to display and create mail using UTF-8,[6] and the W3C
recommends UTF-8 as the default encoding in XML and HTML.[7]<br>
<br>
UTF-8 encodes each of the 1,112,064 valid code points in the Unicode
code space (1,114,112 code points minus 2,048 surrogate code points)
using <b>one to four</b> 8-bit bytes (a group of 8 bits is known
as an octet in the Unicode Standard). Code points with lower
numerical values (i.e., earlier code positions in the Unicode
character set, which tend to occur more frequently) are encoded
using fewer bytes. The first 128 characters of Unicode, which
correspond one-to-one with ASCII, are encoded using a single octet
with the same binary value as ASCII, making valid ASCII text valid
UTF-8-encoded Unicode as well. Conversely, ASCII byte values never
occur in the encoding of non-ASCII code points, which makes UTF-8 safe to use
within most programming and document languages that interpret
certain ASCII characters in a special way, e.g. as end of string.<br>
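<br>
As a rough illustration of the one-to-four-byte rule, the Python sketch
below encodes a few sample characters (my own picks, not taken from the
original message) and prints their UTF-8 lengths. I am assuming the
"3 bytes" in question are E2 80 99, which is the UTF-8 form of U+2019,
the typographic right single quotation mark that many editors
substitute for the ASCII apostrophe U+0027.<br>
<pre wrap="">
```python
# Sample characters are illustrative choices, not taken from the original message.
samples = {
    "U+0027 ASCII apostrophe": "\u0027",
    "U+00E9 e with acute": "\u00e9",
    "U+2019 right single quotation mark": "\u2019",
    "U+1F600 grinning face": "\U0001F600",
}

# Encode each character to UTF-8 and keep the resulting bytes.
encoded = {name: ch.encode("utf-8") for name, ch in samples.items()}

# Print the name, the number of bytes, and the bytes in hex.
for name, data in encoded.items():
    print(name, len(data), data.hex())

# Multi-byte sequences contain no byte in the ASCII range 0x00-0x7F.
multibyte = [data for data in encoded.values() if len(data) != 1]
no_ascii_inside = not any(b in range(0x80) for data in multibyte for b in data)
print(no_ascii_inside)

# Size of the code space minus the surrogate range U+D800..U+DFFF.
valid_code_points = 0x110000 - (0xDFFF - 0xD800 + 1)
print(valid_code_points)
```
</pre>
The ASCII apostrophe comes out as the single byte 27, while U+2019
comes out as the three bytes e2 80 99. So the apostrophe itself is not
being expanded; a different character has been put in its place.<br>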
</body>
</html>