Difference between revisions of "Notes:Utf8"

From Maths

Latest revision as of 22:12, 29 April 2016

Multi-byte characters

Character representation | Codepoint encoded | Data payload | Real payload | Char payload | Caveats
0xxx xxxx | 0xxx xxxx | 7 bits ~ 128 characters | 128 characters | 127 characters | Excluding the string-termination character, 0000 0000
110x xxxx 10yy yyyy | 0000 0xxx xxyy yyyy | 11 bits ~ 2048 characters | 1920 characters | 1920 characters | 128 values (upper 4 payload bits being 0, i.e. values below 1000 0000) are lost, as they would map into the 1-byte code region
1110 xxxx 10yy yyyy 10zz zzzz | xxxx yyyy yyzz zzzz | 16 bits ~ 65536 characters | 63488 characters | 63488 characters | Removing the 2048 characters (upper 5 bits all being 0) that would map into the 2-byte or 1-byte ranges
1111 0www 10xx xxxx 10yy yyyy 10zz zzzz | 000w wwxx xxxx yyyy yyzz zzzz | 21 bits ~ 2097152 characters | 2031616 characters | 2031616 characters | Removing the 65536 characters that correspond to the upper 8 bits of the (24-bit) codepoint being 0
Sum | | 2164864 characters (2.16m) | 2097152 characters (2.1m) | 2097151 characters (2.1m) | Null byte removed from the character payload
Question: there's a lot of overlap here. This suggests that a first byte of 1100 0000 is invalid, because its 2-byte sequences would overlap with the 1-byte characters (as opposed to, say, starting the 2-byte range where the 1-byte range ends). It seems that codepoints are mapped directly onto byte representations (using the shortest one) rather than maximising the byte representation space. Answer: yes. Only the shortest form is well-formed, so the first bytes 1100 0000 and 1100 0001 never occur in valid UTF-8.
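The counts in the table can be re-derived with a little arithmetic. The following is a minimal Python sketch, not part of the original notes; the names PAYLOAD_BITS, payload_counts and is_rejected are made up for illustration. It also checks that Python's strict UTF-8 decoder refuses a 2-byte sequence starting with 1100 0000.

```python
# Payload bits carried by each sequence length, as listed in the table.
PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21}


def payload_counts():
    """Return {length: (data_payload, real_payload)} for the 1..4 byte forms."""
    counts = {}
    previous = 0  # values already covered by the shorter forms
    for length, bits in PAYLOAD_BITS.items():
        data = 2 ** bits
        counts[length] = (data, data - previous)
        previous = data
    return counts


def is_rejected(raw):
    """True if Python's strict UTF-8 decoder refuses the byte string."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


counts = payload_counts()
for length, (data, real) in counts.items():
    print(length, "byte(s):", data, "data payload,", real, "real payload")

print("sum of data payloads:", sum(d for d, _ in counts.values()))
print("sum of real payloads:", sum(r for _, r in counts.values()))

# 1100 0000 1000 0000 would be a 2-byte spelling of codepoint 0.
print("overlong form rejected:", is_rejected(bytes([0b11000000, 0b10000000])))
```

The "real payload" here only subtracts the overlap with shorter forms, exactly as the table does; it does not account for any other restrictions the standard places on codepoints.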

Codepoint notation

A "code point" denotes a Unicode character; UTF-8 is one way of encoding code points as bytes. We write U+n[1], where n is the number of the character in hexadecimal notation. n is at least 4 and at most 6 hex digits long. Examples:

  • U+0001, U+1234, U+12345, U+102345 (U+123456 is not used because it is invalid: code points stop at U+10FFFF)
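As an illustration of the notation, here is a small Python sketch (the helper name format_code_point is hypothetical). It pads to at least 4 hex digits and refuses anything above U+10FFFF, the last code point in the Unicode codespace.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point in the Unicode codespace


def format_code_point(n):
    """Format an integer as U+n with at least 4 uppercase hex digits."""
    if not 0 <= n <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point: %r" % (n,))
    return "U+%04X" % n


for value in (0x1, 0x1234, 0x12345, 0x102345):
    print(format_code_point(value))

try:
    format_code_point(0x123456)
except ValueError as error:
    print("rejected:", error)
```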

Codepoint mapping

U+0080 (1000 0000 in binary) is the first code point that needs 2 bytes.
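The boundary can be checked against Python's built-in encoder. This is only a quick sketch; utf8_length is a hypothetical helper name, and the other boundaries tested are the ones implied by the 7, 11 and 16 bit payloads.

```python
def utf8_length(code_point):
    """Number of bytes Python's UTF-8 encoder uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


# Last code point of each length, followed by the first of the next length.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
    print("U+%04X" % cp, "->", utf8_length(cp), "byte(s)")
```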

Encoding/Decoding

See table 3.6 in Unicode Specification 8.0 for details. The code points are encoded using the smallest number of bytes[1]:Table 3.7
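To make the bit distribution concrete, here is a hand-written encoder sketched in Python. It follows the bit layouts listed under "Multi-byte characters" and always picks the shortest form. The function name encode_utf8 is an assumption for illustration; in practice one would simply call str.encode("utf-8").

```python
def encode_utf8(cp):
    """Encode one code point as UTF-8, using the smallest number of bytes."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate code points have no well-formed UTF-8 encoding.
        raise ValueError("surrogate code point")
    if cp <= 0x7F:  # 0xxx xxxx
        return bytes([cp])
    if cp <= 0x7FF:  # 110x xxxx 10yy yyyy
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:  # 1110 xxxx 10yy yyyy 10zz zzzz
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 1111 0www 10xx xxxx 10yy yyyy 10zz zzzz
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


# Compare against the built-in encoder on one code point per length.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    print("U+%04X" % cp, encode_utf8(cp).hex(" "))
```

Checking the ranges from smallest to largest is what guarantees the shortest form: a code point never reaches a longer layout if a shorter one can hold it.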

Special characters and ranges

The points here are given in 6 digit hex form for text-alignment reasons.

Codepoint Purpose
U+000000 null-terminator
U+00001F (legacy character - see[1]:23.1)
U+00007F (legacy character - see[1]:23.1)
U+000080 (legacy character - see[1]:23.1)
...
U+00009F

There are loads of these....
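Rather than listing them by hand, the points can be enumerated. The sketch below assumes that the "legacy characters" meant here are the control codes, i.e. code points whose general category is Cc according to Python's unicodedata module; that reading is an assumption, not something the notes state. control_code_points is a hypothetical helper name.

```python
import unicodedata


def control_code_points(limit=0x100):
    """Code points below `limit` whose general category is Cc (control)."""
    return [cp for cp in range(limit)
            if unicodedata.category(chr(cp)) == "Cc"]


controls = control_code_points()
print(len(controls), "control code points below U+000100")
# Same 6 digit hex form as the table, for text alignment.
print("first:", "U+%06X" % controls[0], "last:", "U+%06X" % controls[-1])
```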

References

  1. Unicode 8.0 standard