Notes:Utf8
Multi-byte characters
Character representation | Codepoint encoded | Data payload | Real payload | Char payload | Caveats |
---|---|---|---|---|---|
0xxx xxxx | 0xxx xxxx | 7 bits ~ 128 characters | 128 characters | 127 characters | Excluding the string-termination character, 0000 0000 |
110x xxxx 10yy yyyy | 0000 0xxx xxyy yyyy | 11 bits ~ 2048 characters | 1920 characters | 1920 characters | 128 values are lost (those fitting in the lower 7 bits, i.e. the upper 4 of the 11 bits all 0), as they'd be mapping into the 1-byte code region |
1110 xxxx 10yy yyyy 10zz zzzz | xxxx yyyy yyzz zzzz | 16 bits ~ 65536 characters | 63488 characters | 63488 characters | Removing the 2048 characters (upper 5 of the 16 bits all 0) that would map into the 2-byte or 1-byte ranges |
1111 0www 10xx xxxx 10yy yyyy 10zz zzzz | 000w wwxx xxxx yyyy yyzz zzzz | 21 bits ~ 2097152 characters | 2031616 characters | 2031616 characters | Removing the 65536 characters for which everything above the lower 16 bits of the codepoint is 0, as these would map into the shorter ranges |
[ilmath]\sum[/ilmath] | | 2164864 characters (2.16m) | 2097152 characters (2.1m) | 2097151 characters (2.1m) | Null byte removed from the character payload |
- Question: there is a lot of overlap here. This suggests that a first byte of 1100 0000 is invalid, because its encodings would overlap with the 1-byte characters (as opposed to, say, starting the 2-byte mappings where the 1-byte ones end). It seems that code points are mapped directly to byte representations, using the shortest one, rather than maximising the byte-representation space. Confirm this (the sketch below checks the behaviour of a real decoder).
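A quick, non-authoritative check with Python's built-in UTF-8 codec (not part of the original notes) is consistent with this: the encoder always emits the shortest form, and the decoder rejects the "overlong" two-byte sequence 1100 0000 1000 0000 outright:

```python
print(chr(0x7F).encode("utf-8"))   # b'\x7f'      - 1 byte: shortest form is used
print(chr(0x80).encode("utf-8"))   # b'\xc2\x80'  - 2 bytes, lead byte 0xC2 (never 0xC0)

# The "overlong" 2-byte encoding of U+0000 (0xC0 0x80) is rejected on decode:
try:
    bytes([0xC0, 0x80]).decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 0xC0 is reported as an invalid start byte
```

So a lead byte of 1100 0000 (or 1100 0001) can never appear in valid UTF-8.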
Codepoint notation
A "code point" denotes a Utf8 symbol. We use U+n[1] where n is at least 4 characters long, n is the hexadecimal notation of the number of the character. n is at most 6 hex digits long. Examples:
- U+0001, U+1234, U+12345, U+102345 (U+123456 was presumably not used because it is out of range; code points only go up to U+10FFFF)
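As a small sketch (the helper name u_plus is made up for illustration), the notation is just the hexadecimal value zero-padded to at least 4 digits:

```python
def u_plus(cp: int) -> str:
    # Zero-pad to at least 4 hex digits; larger code points naturally use 5 or 6.
    return f"U+{cp:04X}"

print(u_plus(0x1), u_plus(0x1234), u_plus(0x12345), u_plus(0x102345))
# U+0001 U+1234 U+12345 U+102345
```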
Codepoint mapping
U+0080 (1000 0000 in binary) is the FIRST code point to require a 2-byte encoding.
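A minimal sketch (again leaning on Python's built-in encoder rather than anything in the original notes) showing where the 1-, 2-, 3- and 4-byte regions start and end:

```python
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
# U+007F -> 1 byte(s): 7f
# U+0080 -> 2 byte(s): c2 80
# U+07FF -> 2 byte(s): df bf
# U+0800 -> 3 byte(s): e0 a0 80
# U+FFFF -> 3 byte(s): ef bf bf
# U+10000 -> 4 byte(s): f0 90 80 80
```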
Encoding/Decoding
See Table 3.6 in the Unicode 8.0 specification for details. Code points are encoded using the smallest possible number of bytes[1]: Table 3.7
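As a sketch of the encoding side only (assuming the bit layouts in the table above; the function name utf8_encode is made up, and surrogates and the U+10FFFF upper bound are not checked):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the shortest form per the bit layouts above."""
    if cp < 0x80:          # 0xxx xxxx
        return bytes([cp])
    if cp < 0x800:         # 110x xxxx 10yy yyyy
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 1110 xxxx 10yy yyyy 10zz zzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 1111 0www 10xx xxxx 10yy yyyy 10zz zzzz
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Sanity check against Python's built-in codec for a few code points
for cp in (0x24, 0xA2, 0x20AC, 0x10348):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```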
Special characters and ranges
The code points here are given in 6-digit hex form for text-alignment reasons.
Codepoint | Purpose |
---|---|
U+000000 | Null-terminator |
U+00001F | (legacy character - see [1]: 23.1) |
U+00007F | (legacy character - see [1]: 23.1) |
U+000080 | (legacy character - see [1]: 23.1) |
... | ... |
U+00009F | (legacy character - see [1]: 23.1) |
There are loads of these....
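An aside, not from the original notes: the ranges above look like Unicode's "Cc" (control) general category, which can be enumerated with Python's unicodedata module:

```python
import unicodedata

# Every code point whose Unicode general category is "Cc" (control)
controls = [cp for cp in range(0x110000)
            if unicodedata.category(chr(cp)) == "Cc"]
print(len(controls))                                     # 65
print(f"U+{controls[0]:06X} ... U+{controls[-1]:06X}")   # U+000000 ... U+00009F
```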