www.webdeveloper.com

Thread: He�p: my text �eeps ��ing thi�

  1. #1
    Join Date
    Jul 2010
    Location
    /ramdisk/
    Posts
    865

    Question He�p: my text �eeps ��ing thi�

    The problem is that people are copying and pasting from text editors that replace normal characters with special ones that are not ASCII. That is fine while you’re viewing the text in the editor, where it looks nicer, but I need to display it in an HTML document.

    �oN±��§À��, know what I mean?

    Here is what I currently use, and it seems to fix the majority of the characters. From previous experience working with UTF-8 I know that my browser does not have a glyph for every character on the map, which is why I wrote U+FFFD so you can see.


    Edit: Here is how I am handling the text from start to finish:

    1) A user copy/pastes from Word into a text field on a webpage with:
    HTML Code:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    2) I put it into a MySQL table... the field is either varchar(N) or mediumtext. The collation is latin1_swedish_ci.

    3) I take the text out using:

    PHP Code:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    </head>
    <body>
    <?php
    System::mysql_con();
    // Fetch one row; "field" and "table" are placeholder names.
    $string = mysql_fetch_array(mysql_query("select field from table limit 1"));
    echo System::MS_COMPAT($string['field'], true, true);
    ?>
    </body>
    </html>
    PHP Code:
    class System { // this is just a snippet
      /**
       * When people copy/paste from MS Word, the encoding causes
       * display problems... Use this to convert it to something better ^_^
       *
       * @param string $string  The string to convert
       * @param bool   $fromSQL applies stripslashes
       * @param bool   $nl2br   applies nl2br
       */
      public final static function MS_COMPAT($string, $fromSQL = false, $nl2br = false) {
        $string = $fromSQL ? stripslashes($string) : $string;
        $string = htmlentities($string, ENT_COMPAT, "UTF-8");
        // Strings are joined with "." in PHP; "+" would do numeric addition.
        if ($nl2br) $string = self::myNL2BR("\n" . $string);
        return $string;
      }
      // (rest of the class not shown)
    }


    What am I doing wrong?

    Edit: I added my table's collation: latin1_swedish_ci (maybe this is why?)
    Last edited by eval(BadCode); 02-15-2011 at 08:25 PM.

  2. #2
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,633
    Assuming you want to work with UTF-8 throughout, make sure that:

    a. The database is storing it as UTF-8, e.g.:
    Code:
    CREATE TABLE `test`.`example_table` (
    `sample_field` VARCHAR(255) NOT NULL ) 
    ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_unicode_ci
    b. When you connect to mysql in your script, before doing any other queries, do the following to ensure you are "talking" to MySQL in UTF-8:
    PHP Code:
    mysql_query("SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'"); 
    c. If/when applying htmlentities() to your output, be sure to use the 3rd optional parameter to specify UTF-8:
    PHP Code:
    echo htmlentities($text, ENT_QUOTES, 'UTF-8');
    d. In addition to the content-type meta tag, also output an HTTP header in case your web server sends something else and the browser uses that instead of the meta tag:
    PHP Code:
    <?php
    // The charset value goes in unquoted; single quotes are not valid here.
    header("Content-Type: text/html; charset=utf-8");
    That should take care of all UTF-8 issues on the PHP side. You can also add an accept-charset="UTF-8" attribute to your <form> tags to "ask" the browser to submit the form data as UTF-8, though of course you cannot depend on it.
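
    Since you cannot depend on what the browser submits, one option is to normalise incoming text to UTF-8 before it is stored. A minimal sketch, not a definitive implementation: `to_utf8()` is a made-up helper name, and it assumes that anything which is not valid UTF-8 was pasted as Windows-1252, which is a guess PHP cannot verify for you.

```php
<?php
// Hypothetical helper: return $text as UTF-8.
// Assumption: input that is not valid UTF-8 is Windows-1252.
function to_utf8($text)
{
    // With the /u modifier PCRE refuses invalid UTF-8, so an empty
    // pattern works as a validity check.
    if (preg_match('//u', $text) === 1) {
        return $text; // already valid UTF-8, leave untouched
    }
    // iconv() returns false when a byte has no Windows-1252 mapping.
    $converted = iconv('CP1252', 'UTF-8', $text);
    return $converted === false ? '' : $converted;
}
```

    You would call it on each form value before the INSERT, for example `$clean = to_utf8($_POST['field']);` (the field name is a placeholder).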

    Lastly, I use this function to filter inputs before saving them in the DB if I'm worried about people cutting and pasting text from M$ Word documents with their proprietary character set for punctuation: http://www.charles-reace.com/blog/20...-ms-word-text/.
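
    For illustration only, here is the general idea of such a filter. This is not the function from the linked post: `word_punctuation_to_ascii()` is a made-up name, and the sketch assumes the input is single-byte Windows-1252 (or plain ASCII). Do not run it on UTF-8 text, where the same byte values occur inside multi-byte characters.

```php
<?php
// Hypothetical helper, not the function from the linked post.
// Assumption: $text is Windows-1252 (or ASCII), one byte per character.
function word_punctuation_to_ascii($text)
{
    $map = array(
        "\x85" => '...', // horizontal ellipsis
        "\x91" => "'",   // left single quote
        "\x92" => "'",   // right single quote / apostrophe
        "\x93" => '"',   // left double quote
        "\x94" => '"',   // right double quote
        "\x96" => '-',   // en dash
        "\x97" => '-',   // em dash
    );
    // strtr() with an array replaces every key with its value in one pass.
    return strtr($text, $map);
}

echo word_punctuation_to_ascii("\x93It\x92s here\x94"), "\n";
```

    The map only covers the common "smart" punctuation; anything else above 0x7F is passed through unchanged, so extend it if you need a strictly ASCII result.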
    "Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
    ~ Terry Pratchett in Nation

    eBookworm.us

  3. #3
    Join Date
    Jul 2010
    Location
    /ramdisk/
    Posts
    865
    I already checked the headers sent by the server using Wireshark. It's not that.

    I did not intend to work with UTF-8 at all.

    I will do my best to implement your solution. I trust your experience.

    Just a thought however:
    I find it perverse that I'm changing my table collation because of a text editor that has nothing to do with my application ... I feel bullied by M$... had this application been for myself I wouldn't give this issue the time of day.


    Edit:
    Is there an ASCII only version of your filter_text function?
    Last edited by eval(BadCode); 02-15-2011 at 09:33 PM.

  4. #4
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,633
    Quote Originally Posted by eval(BadCode):
    > I already checked the headers sent by the server using Wireshark. It's not that.
    >
    > I did not intend to work with UTF-8 at all.

    I assumed you did since you had it in your content-type meta tag. Also, it makes your page more internationally accessible, since ASCII only supports the base "western" latin character set.

    > I will do my best to implement your solution. I trust your experience.
    >
    > Just a thought however:
    > I find it perverse that I'm changing my table collation because of a text editor that has nothing to do with my application ... I feel bullied by M$... had this application been for myself I wouldn't give this issue the time of day.

    UTF-8 has nothing to do with M$, which uses its own character set (I don't recall off the top of my head what its designation is). If you don't want to use UTF-8 and instead limit yourself to only western latin characters, you can replace UTF-8 throughout with ISO-8859-1 for PHP functions, "latin1" for the MySQL character set (normally the default), and latin1_general_ci for the MySQL collation type.

    > Edit:
    > Is there an ASCII only version of your filter_text function?

    Nothing I have lying around.

    EDIT: If you go that route for your content-type and filter your output with htmlentities($text, ENT_QUOTES, "ISO-8859-1"), you should be OK even if you use that function (I'm 90% sure).
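
    To make the difference concrete, here is a small sketch of how the charset argument changes what htmlentities() does with pasted Word punctuation. It assumes the stored bytes are Windows-1252 curly quotes (0x93 and 0x94); the variable names are made up for the example, and this may or may not be the cause of the garbling in the original post.

```php
<?php
// "Hi" wrapped in Windows-1252 curly double quotes (bytes 0x93 and 0x94).
$pasted = "\x93Hi\x94";

// Declared as UTF-8, the lone bytes 0x93/0x94 are invalid sequences, and
// htmlentities() returns an empty string for invalid input.
$asUtf8 = htmlentities($pasted, ENT_QUOTES, 'UTF-8');

// Declared as ISO-8859-1, every byte is a valid character, so the text
// comes back instead of being dropped.
$asLatin1 = htmlentities($pasted, ENT_QUOTES, 'ISO-8859-1');

var_dump($asUtf8 === '');                     // was the UTF-8 result emptied?
var_dump(strpos($asLatin1, 'Hi') !== false);  // did the Latin-1 result keep the text?
```

    Both checks should come out true. Whether the quote bytes then display as curly quotes still depends on how the browser decodes the page, so treat this as a way to see the behaviour rather than a guarantee.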
    Last edited by NogDog; 02-15-2011 at 09:58 PM.
