Thursday, November 15, 2012

UTF and encodings


So there are a number of ways (ha!) to go from bits/bytes of data to human-readable glyphs (like 'a' or 'A' or Chinese characters, etc.). Each is called an 'encoding'. One way to go from bytes to (English) characters is ASCII. ASCII is usually stored one character per 8-bit byte, but it only uses 7 of those bits; the extra bit doesn't matter too terribly much to me. This means ASCII can encode 2^7 = 128 different glyphs (or characters; I'm not sure of the technical distinction). It also includes things like 'beep', space, escape, etc., so for my purposes here I will call all the things we represent 'characters'. Naturally ASCII includes the English letters (uppercase at 41 through 5A and lowercase at 61 through 7A, in hex), some non-printable control characters (00 through 1F, plus 7F, in hex; 20 is the space), and the digits (30 through 39 in hex). You can look up the rest.
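To make those ranges concrete, here is a small Python sketch that sorts ASCII characters into rough categories by their code. The helper name `describe_ascii` and the category labels are made up for illustration; only `ord` is built in.

```python
# ASCII maps each of its 128 characters to a number from 0 to 127 (0x00 to 0x7F).
def describe_ascii(ch: str) -> str:
    """Return a rough category for a single ASCII character (hypothetical helper)."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError(f"{ch!r} is not an ASCII character")
    if code < 0x20 or code == 0x7F:
        return "control"            # beep (0x07), escape (0x1B), delete (0x7F), ...
    if 0x30 <= code <= 0x39:
        return "digit"
    if 0x41 <= code <= 0x5A:
        return "uppercase letter"
    if 0x61 <= code <= 0x7A:
        return "lowercase letter"
    return "space or punctuation"   # everything else, including 0x20 (space)


for ch in ["A", "z", "7", " ", "\x07"]:
    print(f"{ch!r}: 0x{ord(ch):02X} -> {describe_ascii(ch)}")
```

Note that the codes between the uppercase and lowercase letters (5B through 60 in hex) are punctuation such as '[' and '_', not letters.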

Unicode is a little different. It starts by taking all the writing systems in the world and assigning every character a code point, which is just a number, usually written in hex. A UTF encoding then says how that number is stored in memory as bytes.

Every platonic letter(but also space, beep, escape, etc) in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639.  This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041.
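Python exposes code points directly, so the two examples above are easy to check. This is only a sketch: `code_point_label` is a hypothetical helper name, while `ord`, `chr`, and `unicodedata.name` are standard.

```python
import unicodedata


def code_point_label(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


ain = "\u0639"  # the Arabic letter Ain, written as an escape to keep this file ASCII

print(code_point_label("A"))   # U+0041
print(code_point_label(ain))   # U+0639
print(unicodedata.name(ain))   # ARABIC LETTER AIN
print(chr(0x0041))             # A
```

Nothing here says anything about bytes yet. A code point is just a number attached to a character; how it gets stored is the encoding's job.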

So with Unicode we go from characters to code points, and from code points to bits/bytes (usually written in hex, though decimal works too). Now how do code points map to the actual bytes in memory, or equivalently, how do we go from a couple of bytes back to a code point (and how many bytes do we get? ASCII took 7 bits)? Well, UTF-8, UTF-16, etc. are all different ways of going from a code point to a sequence of bytes and back.

UTF-8 makes some sense to me, as does UTF-16, though UTF-16 takes up more space for English text. UTF-8 uses between 1 and 4 bytes to store a code point (the original design allowed up to 6, but the current standard stops at 4), whereas UTF-16 uses either 2 bytes or 4 bytes.
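A quick way to see the sizes is to encode a few characters and count the bytes. In this sketch the sample characters are my own picks and `byte_lengths` is a hypothetical helper; for code points in today's Unicode range (up to U+10FFFF) UTF-8 never needs more than 4 bytes.

```python
def byte_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    # "utf-16-be" is used so Python does not prepend a 2-byte byte order mark.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


samples = ["A", "\u00e9", "\u0639", "\u20ac", "\U0001F600"]
for ch in samples:
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")
```

The last sample, U+1F600, lies outside the first 65,536 code points, which is the case where UTF-16 has to use 4 bytes (a surrogate pair).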

In UTF-8, U+0041 maps to just 41 (in hex) in memory, which is the character 'A'. Characters above ASCII, meaning above 127 in decimal or above U+007F, take more than one byte: 2 bytes up to U+07FF, 3 bytes up to U+FFFF, and 4 bytes beyond that. So UTF-8 doesn't use extra bytes until it 'needs' to. For the ASCII range the high bit of the single byte is 0; in a multi-byte sequence every byte has its high bit set to 1.
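To see where the bits actually go, here is a hand-rolled UTF-8 encoder. It is a teaching sketch, not how anyone should encode text in practice (Python's built-in `str.encode("utf-8")` does that), and the function name `utf8_encode` is my own.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:
        # 1 byte:  0xxxxxxx  (identical to ASCII)
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x0041).hex())  # 41
print(utf8_encode(0x0639).hex())  # d8b9
```

The leading bits of the first byte tell a decoder how many bytes belong to the character, and every following byte starts with the bits 10, so a decoder can always tell a continuation byte from the start of a new character.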

UTF-16 uses at least 2 bytes, so U+0041 goes to 0041 (in hex) in memory (in big-endian byte order), which is the character 'A'. For English characters this wastes a bit of memory: 'Hello' would look like 00 48 00 65 00 6C 00 6C 00 6F, whereas in UTF-8 it would be just 48 65 6C 6C 6F. So UTF-16 is a little memory heavy for English text because every other byte is 00. Read as UTF-8, each of those 00 bytes would be a null character, since that's what 00 is in ASCII. UTF-8 and ASCII are the same up to 7F (in hex) or 127 (in decimal).
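The 'Hello' comparison can be reproduced in a few lines of Python. One assumption to flag: I use the "utf-16-be" codec (big-endian, no byte order mark) so the bytes come out in the order written above; the plain "utf-16" codec prepends a byte order mark and may use the opposite byte order.

```python
text = "Hello"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no byte order mark

# bytes.hex(" ") needs Python 3.8 or newer
print(utf8_bytes.hex(" ").upper())    # 48 65 6C 6C 6F
print(utf16_bytes.hex(" ").upper())   # 00 48 00 65 00 6C 00 6C 00 6F

# For pure ASCII text, the UTF-8 bytes and the ASCII bytes are identical.
print(text.encode("ascii") == utf8_bytes)  # True
```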

There are other UTF encodings (UTF-32, for example), but 8 and 16 seem like the big ones to me. Of course the big thing to remember is that you have to know the encoding before you can turn bytes into anything readable on a screen. This is why it's so important.