All About UTF-8

UTF-8 is the most common text encoding, but it wasn’t always this way. Before UTF-8 became the standard, multiple encodings existed for different character sets, and they were generally simplistic. This simplicity made them easy to use but also led to their downfall, as they were limited in the number of characters they could represent. UTF-8, however, has enough room to encode every character currently needed and is still expanding, encompassing the majority of characters and symbols (including emojis).

What Is UTF-8?

First, let’s look at the ASCII table in binary form. Take the character ‘A’: in the ASCII table its value is 01000001 in binary (65 in decimal). The leading zero is the important part for its encoding in UTF-8. This leading bit is what separates the ASCII range from the extended ASCII range.

For example, we can compare two values, one inside the ASCII range and one inside the extended ASCII range: the character ‘A’ again and the euro symbol ‘€’ (shown here at its position in the Windows-1252 code page).

A 0100 0001
€ 1000 0000

Whether the leading bit is 0 or 1 indicates whether the character is part of the ASCII range or the extended ASCII range.
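The leading-bit check can be sketched in a few lines of Python. The function name `is_ascii_byte` is made up for this illustration, and the euro value assumes the Windows-1252 flavour of extended ASCII.

```python
def is_ascii_byte(byte: int) -> bool:
    """Return True when the leading (most significant) bit of the byte is 0."""
    return (byte & 0b1000_0000) == 0


print(is_ascii_byte(0b0100_0001))  # 'A' -> True
print(is_ascii_byte(0b1000_0000))  # euro sign in Windows-1252 -> False
```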

Why Is This Important to UTF-8?

This is important because of the way UTF-8 is encoded. UTF-8 uses between one and four bytes to encode each character. It was designed to be backwards compatible with ASCII, which is why its single-byte range overlaps with ASCII. However, UTF-8 uses the first bit to indicate whether the character spans multiple bytes, which means that the characters of the extended ASCII range had to be moved to another part of the UTF-8 table.

If the leading bit is 0, the byte belongs to the ASCII range and the character is a single byte. A leading bit of 1, however, signifies that the byte is part of a multi-byte character. The number of bytes in the character is given by the number of 1 bits found before the first 0 bit in the leading byte.

To explain this concept further, let’s consider the euro symbol again, this time in UTF-8. As the extended ASCII table does not exist in UTF-8, the euro symbol is a multi-byte character. Its code point is 8,364 in decimal (U+20AC), which UTF-8 encodes as 1110 0010 1000 0010 1010 1100, a three-byte character. Looking at the first byte, 1110 0010, we can see three 1 bits before the first 0 bit, meaning that the character must be three bytes long (including the leading byte). The following bytes are always of the form 10xx xxxx (where x can be either 1 or 0). This form is different from every leading byte, meaning that a following byte can never be confused with the start of a character.
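To make the euro example concrete, here is a minimal Python sketch. The helper `sequence_length` is a name invented for this illustration; it counts the leading 1 bits as described above, and the last lines strip the marker bits to recover the code point.

```python
def sequence_length(lead_byte: int) -> int:
    """Count the 1 bits before the first 0 bit in a leading byte."""
    ones = 0
    mask = 0b1000_0000
    while mask and (lead_byte & mask):
        ones += 1
        mask >>= 1
    if ones == 0:
        return 1  # 0xxx xxxx: single-byte ASCII character
    if ones == 1 or ones > 4:
        # 10xx xxxx is a following byte; more than four 1 bits is not allowed
        raise ValueError("not a valid UTF-8 leading byte")
    return ones


encoded = "€".encode("utf-8")
print([f"{b:08b}" for b in encoded])  # ['11100010', '10000010', '10101100']
print(sequence_length(encoded[0]))    # 3

# Drop the 1110 and 10 markers and join the remaining payload bits.
b0, b1, b2 = encoded
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
print(code_point)  # 8364
```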

Byte Formations

The table below shows all possible formations of bytes in UTF-8, with the number of bytes in the character on the left. Again, x signifies that the bit can be either 1 or 0.

1 0xxx xxxx
2 110x xxxx 10xx xxxx
3 1110 xxxx 10xx xxxx 10xx xxxx
4 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
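
As a rough sketch of how these patterns are filled in, the Python function below encodes a single code point by hand. The name `encode_code_point` is invented for this example, and the range boundaries (0x80, 0x800, 0x10000) are not stated in the table; they follow from how many x bits each row provides. In practice Python’s built-in `str.encode` does this work, and the loop at the end compares against it.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point using the 1- to 4-byte patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:      # 1 byte:  0xxx xxxx
        return bytes([cp])
    if cp < 0x800:     # 2 bytes: 110x xxxx 10xx xxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:   # 3 bytes: 1110 xxxx 10xx xxxx 10xx xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# One character from each row of the table, checked against the built-in encoder.
for ch in "A£€😀":
    assert encode_code_point(ord(ch)) == ch.encode("utf-8")
```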

Although it would be possible to encode longer sequences, UTF-8 is restricted to 32 bits (4 bytes) per character, as this has proved to be more than enough to encode every required character. UTF-8 was originally proposed to extend to 6-byte characters, but it has since been restricted to 4 bytes. This has caused some confusion within the development community, as some developers believe that 5- and 6-byte characters are valid.
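
One way to see the 4-byte limit in practice is to hand a 5-byte sequence to a strict decoder. The sketch below uses Python’s built-in UTF-8 codec; the byte values are a made-up sequence in the old 5-byte shape (1111 10xx followed by four 10xx xxxx bytes), chosen only for illustration.

```python
# A sequence shaped like the old 5-byte form: 1111 10xx + four following bytes.
five_byte = bytes([0b1111_1000, 0b1000_1000, 0b1000_0000,
                   0b1000_0000, 0b1000_0000])

try:
    five_byte.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False

print(accepted)  # False: the decoder rejects the 5-byte leading byte
```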

UTF-8 Auto-Detection

Auto-detection of UTF-8 can be performed using the formation of its multi-byte characters. A sequence of bytes can be checked against these formations to determine whether it is likely to be UTF-8. If the bytes do not follow one of the formations above, we can determine that the sequence is not UTF-8.

However, there is an issue if the string appears to be UTF-8. There is no direct way to determine whether the string truly is UTF-8 or whether it’s a sequence of extended ASCII characters. In this case, where the bytes form valid UTF-8 characters, the sequence is likely UTF-8, as chains of extended ASCII characters that happen to form correct UTF-8 sequences are rare.
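
A simple version of this detection can be sketched by attempting a strict decode. The function names `looks_like_utf8` and `decode_with_fallback` are invented for this example, and Latin-1 is assumed as the extended ASCII fallback because it maps every byte to a character; a real application might prefer a different code page.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte fits a valid UTF-8 formation."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_with_fallback(data: bytes, fallback: str = "latin-1") -> str:
    """Decode as UTF-8 when the bytes are valid, else use the fallback encoding."""
    if looks_like_utf8(data):
        return data.decode("utf-8")
    return data.decode(fallback)


print(looks_like_utf8("€100".encode("utf-8")))  # True
print(looks_like_utf8(b"\xa3100"))              # False: a lone following byte
print(decode_with_fallback(b"\xa3100"))         # £100

# The ambiguity described above: Latin-1 text that happens to be valid UTF-8.
print(looks_like_utf8("Â£".encode("latin-1")))  # True
```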

ASCII Fallback Consequences

As previously mentioned, the ASCII range is included in UTF-8, so incorrect processing of UTF-8 still preserves the ASCII range, keeping common letters legible. The consequences of an ASCII fallback, in general, aren’t bad. Normally you’ll see this on a display or in a message where multiple characters appear where there should only be one, which generally doesn’t affect the legibility of the text. For example, have you ever been sent a message containing a currency symbol where what should have been £ has become Â£? Here the UTF-8 version of the pound sign has been used but has been processed as extended ASCII. Here is the binary for the extended ASCII and the UTF-8 versions of the pound sign.

ASCII 1010 0011
UTF-8 1100 0010 1010 0011

In this case, the fallback means that both bytes are shown. However, notice that the following byte in the UTF-8 sequence is the same as the extended ASCII byte, meaning that the pound symbol is still displayed and the text remains legible. Elegant fallbacks like this ensure that UTF-8 remains useful even when the text is incorrectly processed, and are most likely what caused the widespread use and popularity of UTF-8.
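
The pound sign fallback can be reproduced in Python by encoding as UTF-8 and then decoding with the wrong table. Latin-1 is assumed here as the extended ASCII encoding, and the variable names are only for illustration.

```python
utf8_bytes = "£".encode("utf-8")
print(utf8_bytes.hex())              # c2a3
print("£".encode("latin-1").hex())   # a3

# Reading the two UTF-8 bytes as two Latin-1 characters shows both bytes.
garbled = utf8_bytes.decode("latin-1")
print(garbled)                       # Â£

# When the bytes were not altered, reversing the mistake restores the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)                      # £
```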

