So you can USE the Unicode character set efficiently: for mostly-ASCII text, UTF-16 comes out about twice the size of UTF-8 and UTF-32 about four times, depending on how many non-ASCII characters you use.
Unicode is a 21-bit code space: 17 planes of 65,536 code points each, for 1,114,112 possible code points (U+0000 through U+10FFFF). UTF-8 sends the 7-bit ASCII range (U+0000 to U+007F) as one byte each, while everything else is variable width at two to four bytes. That makes UTF-8 the most efficient in terms of size for text that's mostly ASCII (markup included), and for Western European languages you get a degree of backwards compatibility with classic ASCII -- non-ASCII characters just render as a few bytes of gibberish in software that doesn't know UTF-8, while the rest stays legible.
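If you want to see the size difference for yourself, here's a rough Python sketch. The sample strings and the `encoded_sizes` helper are made up for illustration; the only real API used is `str.encode` / `bytes.decode` with codec names from the standard library.

```python
# Sample strings are written with escapes so the exact code points are unambiguous.
samples = {
    "ascii": "Hello, world",
    "latin": "d\u00e9j\u00e0 vu",        # "deja vu" with e-acute and a-grave
    "cjk": "\u65e5\u672c\u8a9e",         # three CJK characters
}

def encoded_sizes(text):
    """Return the byte length of text in each encoding (hypothetical helper)."""
    # The "-le" variants are used so no byte order mark inflates the count.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

for name, text in samples.items():
    print(name, encoded_sizes(text))

# What UTF-8 looks like to old software that assumes Latin-1:
# the ASCII letters survive, each accented letter becomes two junk characters.
legacy_view = samples["latin"].encode("utf-8").decode("latin-1")
print(repr(legacy_view))
```

The ASCII sample is 12 bytes in UTF-8, 24 in UTF-16 and 48 in UTF-32. The CJK sample goes the other way for UTF-16 (6 bytes versus 9 in UTF-8), so "most efficient" really depends on what you're encoding. That's the usual variable-width versus fixed-width trade-off.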
It's a bit like the CISC vs. RISC architecture argument, where RISC opcodes are usually a fixed width, letting the processor be more efficient internally using fewer pathways -- CISC tends to be variable length, making the bus more efficient at the cost of increasing the transistor count; which is why most modern CISC CPUs set aside an extra bit of the die to translate CISC instructions into RISC-like micro-ops, so you can have the internal efficiency of RISC with the bus efficiency of CISC.
That said, I'd ignore that HTML 5 charset meta bull; it's a pathetic attempt to remove a few characters from the HEAD to make up for all the extra code bloat the ALLEGEDLY semantic new tags inside BODY create. If you want to set UTF-8 with a META (as a fallback for when it's not set by the HTTP header, which in production REALLY should be set if you're using UTF-8), use HTML 4's
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Since that's what it is, the equivalent of the HTTP header... http-equiv lets you stand in for the headers a page would normally arrive with, which matters most in local testing, where there's no server sending them and where I've seen CHARSET on META not do a blasted thing!
Just another of the reasons I think 90%+ of HTML 5 is utter garbage.
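To show what I mean by "equivalent of the HTTP header": the `content` value of that META is the same string a server would send as `Content-Type`, so it parses the same way. A minimal Python sketch, assuming the standard library's `email.message.Message` as the header parser; `charset_from_content_type` is just a name I made up for the example.

```python
from email.message import Message

def charset_from_content_type(value):
    """Pull the charset parameter out of a Content-Type style value (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset()

# The value a server would send in the real HTTP header...
http_header_value = "text/html; charset=utf-8"
# ...and the content attribute of the HTML 4 META shown above.
meta_content_value = "text/html; charset=utf-8"

print(charset_from_content_type(http_header_value))
print(charset_from_content_type(meta_content_value))
print(charset_from_content_type("text/html"))  # no charset parameter at all
```

Both values yield `utf-8`; a bare `text/html` yields `None`, which is the case where the browser is left guessing.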
Last edited by deathshadow; 06-02-2014 at 10:30 PM.
As always, a very thorough answer, deathshadow.
I think you may have missed my point:
I was merely addressing the fact that the tag labels "UTF-8" as a "charset", which is incorrect. "UTF-8" is not a character set; it is an encoding. The tag should have been something like:
Oh, that's different then. Nevermind.
Mostly that's just legacy, because all the OTHER values you declare there really are just character sets... like windows-1252, iso-8859-1, iso-8859-2, iso-8859-16, CP932, ANSI (aka IBM extended), ASCII (bottom 7 bits only).
Unicode is... well... newer, and comes with a whole bunch of ways of being encoded (UTF-8, UTF-16, UTF-32) -- and it was easier to just reuse the existing "charset" name already used for all those other character sets than to make up some new way of stating it.
... a lesson I wish the folks at the W3C would remember from when 4 Strict was created; instead of this current "people can't use anything properly, so let's just throw more code at it" attitude that resulted in HTML 5.
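For anyone following along, the character set vs. encoding distinction is easy to poke at in Python. This is only a sketch: the byte values are ones I picked for illustration, and the codec names are the aliases Python's standard library accepts, not necessarily the labels a browser wants.

```python
# Legacy "charsets": one byte is one character, and the label picks the table.
raw = bytes([0x80, 0xA3])
for label in ("cp1252", "latin-1", "iso-8859-2"):
    # Same two bytes, different characters depending on the declared label.
    print(label, ascii(raw.decode(label)))

# Unicode: the character set assigns a number (code point) to the character...
euro = "\u20ac"  # the euro sign
print(hex(ord(euro)))

# ...and the encoding decides which bytes carry that number.
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, euro.encode(encoding).hex())
```

With the legacy labels the table and the byte layout are the same thing, which is why one name covered both. With Unicode the code point stays U+20AC while the bytes change per encoding, and that's the distinction being argued over.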