Difference between revisions of "Notes:Utf8"

Latest revision as of 22:12, 29 April 2016

Multi-byte characters

Character representation	Codepoint encoded	Data payload	Real payload	Char payload	Caveats
0xxx xxxx	0xxx xxxx	7 bits ~ 128 characters	128 characters	127 characters	Excluding the string-termination character, 0000 0000
110x xxxx 10yy yyyy	0000 0xxx xxyy yyyy	11 bits ~ 2048 characters	1920 characters	1920 characters	128 values (upper 4 payload bits all being 0) are lost, as they would map into the 1-byte code region
1110 xxxx 10yy yyyy 10zz zzzz	xxxx yyyy yyzz zzzz	16 bits ~ 65536 characters	63488 characters	63488 characters	Removing the 2048 values (upper 5 payload bits all being 0) that would map into the 2-byte or 1-byte ranges.
1111 0www 10xx xxxx 10yy yyyy 10zz zzzz	000w wwxx xxxx yyyy yyzz zzzz	21 bits ~ 2097152 characters	2031616 characters	2031616 characters	Removing the 65536 values (upper 5 payload bits all being 0) that would map into the 3-byte, 2-byte or 1-byte ranges.
Total		2164864 characters (2.16m)	2097152 characters (2.1m)	2097151 characters (2.1m)	Null byte removed from the character payload.
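
The arithmetic in the table can be re-derived with a short sketch. The names `PAYLOAD_BITS` and `table_counts` are made up for this illustration. Note that it counts bit patterns only; the real Unicode code space is smaller still, since it stops at U+10FFFF and excludes the surrogate range.

```python
# Payload bits carried by each UTF-8 sequence length (as in the table).
PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21}


def table_counts():
    """Return {length: (raw, usable)}.

    raw    = number of bit patterns the payload can hold
    usable = raw minus the values a shorter sequence already covers
    """
    counts = {}
    shorter = 0  # values already representable with fewer bytes
    for length, bits in sorted(PAYLOAD_BITS.items()):
        raw = 2 ** bits
        counts[length] = (raw, raw - shorter)
        shorter = raw
    return counts


counts = table_counts()
total_raw = sum(raw for raw, _ in counts.values())
total_usable = sum(usable for _, usable in counts.values())

for length, (raw, usable) in counts.items():
    print(f"{length} byte(s): raw={raw} usable={usable}")
print(f"total: raw={total_raw} usable={total_usable}")
```

The usable total equals 2^21 because every value below 2^21 ends up with exactly one (shortest) representation.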

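Question: there is a lot of overlap here. This suggests that a first byte of 1100 0000 is invalid, because the resulting 2-byte sequence would overlap with the 1-byte characters (as opposed to, say, starting the 2-byte range where the 1-byte range ends). It seems that code points map directly onto byte representations, always using the shortest one, rather than maximising the byte representation space. This agrees with the shortest-form rule noted under Encoding/Decoding: such "overlong" sequences are ill-formed, so the bytes 1100 0000 and 1100 0001 never appear in valid UTF-8.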
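
One way to check the overlap question empirically is to hand overlong sequences to a strict decoder. The sketch below uses Python's built-in UTF-8 codec; the helper name `decodes` is made up for this example, and it shows only how that one decoder behaves, not a proof about the standard.

```python
def decodes(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 1100 0000 1000 0000 would be a 2-byte spelling of U+0000 (overlong).
print(decodes(b"\xc0\x80"))
# 1100 0001 1011 1111 would be a 2-byte spelling of U+007F (overlong).
print(decodes(b"\xc1\xbf"))
# 1100 0010 1000 0000 is U+0080, the first genuine 2-byte code point.
print(decodes(b"\xc2\x80"))
# 1110 0000 1000 0000 1000 0000 would be a 3-byte spelling of U+0000.
print(decodes(b"\xe0\x80\x80"))
```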

Codepoint notation

A "code point" denotes a Unicode character; UTF-8 is one way of encoding code points as bytes. We write U+n^[1], where n is the number of the character in hexadecimal notation, padded to at least 4 hex digits and at most 6 hex digits long. Examples:

U+0001, U+1234, U+12345, U+102345 (U+123456 is not used as an example because it is invalid: the largest code point is U+10FFFF)
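
The notation rule can be sketched as a small formatter. The function name `u_notation` and the constant `MAX_CODE_POINT` are hypothetical, and the sketch assumes the upper limit of the code space is U+10FFFF.

```python
MAX_CODE_POINT = 0x10FFFF  # assumed upper limit of the Unicode code space


def u_notation(cp: int) -> str:
    """Format a code point as U+n, zero-padded to at least 4 hex digits."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"not a code point: {cp:#x}")
    return f"U+{cp:04X}"


for cp in (0x1, 0x1234, 0x12345, 0x102345):
    print(u_notation(cp))
```

Padding with `04X` gives the minimum of 4 digits, and the range check keeps the result to at most 6 digits.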

Codepoint mapping

U+0080 (1000 0000 in binary) is the first code point that needs 2 bytes; everything from U+0000 to U+007F fits in 1 byte.
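
The boundary can be checked against Python's built-in encoder. The helper `utf8_length` is a made-up name, and the other boundaries listed (U+0800 and U+10000) follow from the payload sizes of 11 and 16 bits rather than from anything stated in this section.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes Python's UTF-8 encoder uses for a code point."""
    return len(chr(cp).encode("utf-8"))


# Last code point of each length, followed by the first of the next one.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")

# U+0080 itself is encoded as 1100 0010 1000 0000.
first_two_byte = chr(0x80).encode("utf-8")
print(first_two_byte.hex())
```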

Encoding/Decoding

See table 3.6 in the Unicode Specification 8.0 for details. Code points are encoded using the smallest possible number of bytes^[1] (table 3.7).
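
As an illustration of the bit layout from the multi-byte table combined with the shortest-form rule, here is a minimal encoder sketch. The name `encode_utf8` is made up, the surrogate and range checks are assumptions about what a well-formed encoder should reject, and the result is compared against Python's own codec rather than taken as authoritative.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point using the shortest UTF-8 form (sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"cannot encode {cp:#x}")
    if cp < 0x80:  # 0xxx xxxx
        return bytes([cp])
    if cp < 0x800:  # 110x xxxx 10yy yyyy
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # 1110 xxxx 10yy yyyy 10zz zzzz
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 1111 0www 10xx xxxx 10yy yyyy 10zz zzzz
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


samples = (0x41, 0x80, 0x7FF, 0x800, 0x20AC, 0xFFFF, 0x10000, 0x10FFFF)
for cp in samples:
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ')}")
```

Choosing the branch by the size of the code point is what enforces the shortest form: a value below 0x80 can never reach the 2-byte branch.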

Special characters and ranges

The points here are given in 6 digit hex form for text-alignment reasons.

Codepoint	Purpose
U+000000	null-terminator
U+00001F	(legacy character - see^[1], section 23.1)
U+00007F	(legacy character - see^[1], section 23.1)
U+000080	(legacy character - see^[1], section 23.1)
...
U+00009F

There are loads of these....
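
Rather than listing the legacy control codes by hand, they can be enumerated from the Unicode character database. This sketch uses Python's `unicodedata` module and treats general category `Cc` as the definition of a control code; the helper name `is_control` is made up for this example.

```python
import unicodedata


def is_control(cp: int) -> bool:
    """True if the code point has general category Cc (control)."""
    return unicodedata.category(chr(cp)) == "Cc"


# Scan the first 256 code points, which is where these codes live.
controls = [cp for cp in range(0x100) if is_control(cp)]

print(len(controls))
print(" ".join(f"U+{cp:06X}" for cp in controls))
```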

References

[1] Unicode 8.0 standard
