
Thread: Utf-8/16

  1. #1
    Join Date
    Nov 2002
    Location
    England
    Posts
    275

    Utf-8/16

    Can anyone tell me the difference between these two charsets? I assume one (UTF-16?) has more characters?

  2. #2
    Join Date
    Nov 2002
    Location
    Baltimore, Maryland
    Posts
    12,279
    This gets a little complicated, but utf-8 and utf-16 are encodings and not character sets, though it's usually assumed that they both use the Unicode character set (http://www.unicode.org/). A character set is the way numbers are assigned to particular glyphs. An encoding is the way the ones and zeros are assigned to those numbers. The 8 in utf-8 means that the smallest data chunk is eight bits long and the 16 in utf-16 means that the smallest chunk is sixteen bits long. And while utf-8 encodes the characters \x00 to \x7f exactly as ASCII does, utf-16 does not.

    As an aside, those two bytes in utf-16 can come in either order. A marker at the very start of the file, the byte order mark, tells you which is the high order byte. If you ever see a file that starts with a funny looking glyph and then the rest of the characters are separated by \x00s, you are using iso-8859-1 to view utf-16.
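    As an illustration (a minimal Python 3 sketch, not from the original post), here is the same pair of characters run through each scheme:

    # 'M' is plain ASCII, 'ü' is not
    text = "Mü"
    print(text.encode("ascii", "replace"))  # b'M?' - ASCII simply has no ü
    print(text.encode("utf-8"))             # b'M\xc3\xbc' - M stays one byte, ü becomes two
    print(text.encode("utf-16-be"))         # b'\x00M\x00\xfc' - every character takes two bytes
    print(text.encode("utf-16"))            # byte order mark (\xff\xfe or \xfe\xff) first, then two bytes per character

    The \x00 bytes in the utf-16 output are the high order halves Charles mentions, and the marker at the front of the last result is what tells a reader which byte order the file uses.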
    “The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect.”
    —Tim Berners-Lee, W3C Director and inventor of the World Wide Web

  3. #3
    Join Date
    Nov 2002
    Location
    XYZZY - UK
    Posts
    1,760
    To add to Charles's explanation: if you're mostly using standard English characters with XHTML 1.0, a common suggestion is to declare the UTF-8 encoding in the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>.

    ASCII: a code for representing English text as numbers, with each character assigned a number from 0 to 127. For example, the ASCII code for uppercase M is 77. Most computers use ASCII codes to represent text, which makes it possible to transfer data from one computer to another.

    Single byte: usually used in reference to a character set that supports a maximum of 256 characters. Consisting of 8 bits, one byte (or octet) can represent numbers ranging from 0 to 255, i.e. 256 unique values.

    Unicode: the standard for representing characters as integers (code points). There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE. UTF-16, for example, uses 16-bit code units, so a single unit can represent more than 65,000 unique characters (pairs of units cover the rest). A repertoire that large becomes necessary once you want one character set to cover languages such as Greek, Chinese and Japanese.

    Many analysts believe that as the software industry becomes increasingly global, Unicode may eventually replace ASCII (which covers only 128 characters) as the standard character coding format.

    So you see, it's to do with the numbers, not the letters themselves.
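
    To make that point concrete, a minimal Python 3 sketch (an illustration, not from the original post):

    print(ord("M"))   # 77    - the code assigned to uppercase M, as in the ASCII example above
    print(ord("é"))   # 233   - fits in one byte, but is outside ASCII's 0-127 range
    print(ord("日"))  # 26085 - far beyond 255, so a single byte can never hold it
    print(chr(77))    # 'M'   - and back from number to character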

  4. #4
    Join Date
    Nov 2002
    Location
    England
    Posts
    275
    Ah OK thanks guys, that's cleared some things up. I figured it probably had to do with bitage. One other thing I wanted to know: if I use a simple charset like US-ASCII, can I then use entities to get at characters that are not in that character set? That's the impression I got of what entities are for when I read the HTML 4.01 spec.

  5. #5
    Join Date
    Nov 2002
    Location
    Baltimore, Maryland
    Posts
    12,279
    In SGML and XML, entities are simply constants that are declared in the Document Type Definition (DTD). If you look at the actual HTML 4.01 Strict DTD (http://www.w3.org/TR/html4/strict.dtd) you will find the following at line 140:

    <!--================ Character mnemonic entities =========================-->

    <!ENTITY % HTMLlat1 PUBLIC
    "-//W3C//ENTITIES Latin1//EN//HTML"
    "HTMLlat1.ent">
    %HTMLlat1;

    <!ENTITY % HTMLsymbol PUBLIC
    "-//W3C//ENTITIES Symbols//EN//HTML"
    "HTMLsymbol.ent">
    %HTMLsymbol;

    <!ENTITY % HTMLspecial PUBLIC
    "-//W3C//ENTITIES Special//EN//HTML"
    "HTMLspecial.ent">
    %HTMLspecial;

    Those 'include' external files into the DTD. And if you look in turn at one of those files (http://www.w3.org/TR/html4/HTMLlat1.ent) you will see things like:

    <!ENTITY uuml CDATA "&#252;" -- latin small letter u with diaeresis, U+00FC ISOlat1 -->

    That declares &uuml; to stand for &#252;, which in turn refers to a character outside the range of ASCII. The reference itself is written using nothing but ASCII characters, so you can use the entity even in a document declared as US-ASCII; you only need to declare iso-8859-1 or a UTF encoding if you want to type the ü character directly. All of the entities in HTMLlat1.ent map to characters that also exist in iso-8859-1.
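
    As a quick check (a minimal Python 3 sketch, my illustration rather than part of the thread), the named entity and the numeric reference resolve to the same character:

    from html import unescape
    print(unescape("&uuml;"))        # ü
    print(unescape("&#252;"))        # ü
    print(ord(unescape("&uuml;")))   # 252 - outside ASCII, which is why you write the reference instead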

    As an aside you can, in theory, make your own entities and save yourself a whole lot of typing. Browser support is spotty to say the least but it works if you are going to run your page as XML through a processor.
    “The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect.”
    —Tim Berners-Lee, W3C Director and inventor of the World Wide Web

  6. #6
    Join Date
    Nov 2002
    Location
    England
    Posts
    275
    Ah OK thanks. That's what I originally thought, but the section on entities confused me a bit.
