The problem is that people copy and paste from text editors that replace normal characters with special, non-ASCII ones. That is fine while you're viewing the text in the editor, where it looks nicer, but I need to display the text in an HTML document.
�oN±��§À��, know what I mean?
Here is what I currently use, and it seems to fix the majority of the characters. From previous experience working with UTF-8, I know that my browser cannot display every character on the map, which is why I wrote U+FFFD so you can see.
Edit: Here is how I am handling the text from start to finish:
1) A user copies and pastes from Word into a text field on a webpage with:
HTML Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
2) I put it into a MySQL table. The field is either varchar(N) or mediumtext. The collation is latin1_swedish_ci.
3) I take the text out using:
PHP Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<?php
System::mysql_con();
// "field" and "table" are placeholders for the real column and table names
$string = mysql_fetch_array(mysql_query("select field from table limit 1"));
echo System::MS_COMPAT($string['field'], true, true);
?>
</body>
</html>
PHP Code:
class System {
    # this is just a snippet

    /**
     * When people copy/paste from MS Word, the encoding causes
     * display problems... Use this to convert it to something better ^_^
     *
     * @param string $string  The string to convert
     * @param bool   $fromSQL applies stripslashes
     * @param bool   $nl2br   applies nl2br
     */
    public final static function MS_COMPAT($string, $fromSQL = false, $nl2br = false) {
        $string = $fromSQL ? stripslashes($string) : $string;
        $string = htmlentities($string, ENT_COMPAT, "UTF-8");
        // "." concatenates strings in PHP; "+" would attempt numeric addition
        if ($nl2br) $string = self::myNL2BR("\n" . $string);
        return $string;
    }
}
What am I doing wrong?
Edit: I added my table's collation: latin1_swedish_ci (maybe this is why?)
Last edited by eval(BadCode); 02-15-2011 at 07:25 PM.
c. If/when applying htmlentities() to your output, be sure to use the 3rd optional parameter to specify UTF-8:
PHP Code:
echo htmlentities($text, ENT_QUOTES, 'UTF-8');
d. In addition to the content-type meta tag, also output an HTTP header in case your web server sends something else and the browser uses that instead of the meta tag:
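For illustration, one way to send such a header from PHP might look like the sketch below. The helper name sendContentType() is made up for this example; the only real requirement is that header() is called before any output is sent.

```php
<?php
// Hypothetical helper (not from the thread): sends the Content-Type header
// and returns the header line so it can be logged or checked.
function sendContentType(string $charset = 'UTF-8'): string
{
    $line = 'Content-Type: text/html; charset=' . $charset;
    // header() only has an effect before any output has been sent.
    if (!headers_sent()) {
        header($line);
    }
    return $line;
}

sendContentType(); // call at the very top of the script, before any HTML
```

The charset named here should match the one in the meta tag, otherwise the browser may go with the header and ignore the tag.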
That should take care of all UTF-8 issues on the PHP side. You can also add an accept-charset attribute to your <form> tags to "ask" the browser to submit only UTF-8 input, though of course you cannot depend on it.
Lastly, I use this function to filter inputs before saving them in the DB if I'm worried about people cutting and pasting text from M$ Word documents with their proprietary character set for punctuation: http://www.charles-reace.com/blog/20...-ms-word-text/.
"Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
~ Terry Pratchett in Nation
I already checked the headers sent by the server using Wireshark. It's not that.
I did not intend to work with UTF-8 at all.
I will do my best to implement your solution. I trust your experience.
Just a thought however:
I find it perverse that I'm changing my table collation because of a text editor that has nothing to do with my application ... I feel bullied by M$... had this application been for myself I wouldn't give this issue the time of day.
Edit:
Is there an ASCII only version of your filter_text function?
Last edited by eval(BadCode); 02-15-2011 at 08:33 PM.
Quote:
"I already checked the headers sent by the server using Wireshark. It's not that.
I did not intend to work with UTF-8 at all."
I assumed you did since you had it in your content-type meta tag. Also, it makes your page more internationally accessible, since ASCII only supports the base "western" latin character set.
Quote:
"I will do my best to implement your solution. I trust your experience.
Just a thought however:
I find it perverse that I'm changing my table collation because of a text editor that has nothing to do with my application ... I feel bullied by M$... had this application been for myself I wouldn't give this issue the time of day."
UTF-8 has nothing to do with M$, which uses its own character set (I don't recall off the top of my head what its designation is). If you don't want to use UTF-8 and would rather limit yourself to western Latin characters only, you can replace UTF-8 throughout with ISO-8859-1 for PHP functions, "latin1" for the MySQL character set (normally the default), and latin1_general_ci for the MySQL collation type.
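As a rough sketch of what the ISO-8859-1 route could look like on the output side: escapeLatin1() is a made-up name, and the example assumes the text really is ISO-8859-1 (one byte per character) by the time it reaches PHP.

```php
<?php
// Hypothetical wrapper for the ISO-8859-1 route (name invented for this sketch).
// Assumes $text is ISO-8859-1 encoded, i.e. one byte per character.
function escapeLatin1(string $text): string
{
    return htmlentities($text, ENT_QUOTES, 'ISO-8859-1');
}

// "\xE9" is the single ISO-8859-1 byte for "é"
echo escapeLatin1("caf\xE9 & bar"); // caf&eacute; &amp; bar
```

With this route the meta tag and any Content-Type header would also have to name ISO-8859-1 instead of UTF-8, so that the page, PHP, and the database all agree.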
Quote:
"Edit:
Is there an ASCII only version of your filter_text function?"
Nothing I have lying around.
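For what it's worth, an ASCII-only filter could be sketched along these lines. This is not the filter_text function from the blog post; asciiPunctuation() and its replacement table are invented for illustration, and it assumes the input arrives as UTF-8 (PHP 7+ for the "\u{...}" escapes).

```php
<?php
// Hypothetical sketch, NOT the filter_text function linked earlier.
// Assumes $text is UTF-8; maps common Word punctuation to ASCII.
function asciiPunctuation(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
        "\u{2026}" => '...', // ellipsis
        "\u{00A0}" => ' ',   // non-breaking space
    ];
    $text = strtr($text, $map);

    // Replace every remaining non-ASCII character with "?".
    $clean = preg_replace('/[^\x00-\x7F]/u', '?', $text);

    // preg_replace() returns null on invalid UTF-8; fall back to a byte-wise pass.
    return $clean ?? (string) preg_replace('/[^\x00-\x7F]/', '?', $text);
}

echo asciiPunctuation("\u{201C}It\u{2019}s fine\u{201D} \u{2013} really\u{2026}");
// "It's fine" - really...
```

Anything outside the table is replaced with "?", which is lossy; extending the map is probably the first thing to adjust for real input.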
EDIT: If you go that route for your content-type and filter your output with htmlentities($text, ENT_QUOTES, "ISO-8859-1"), you should be OK even if you use that function (I'm 90% sure).
Last edited by NogDog; 02-15-2011 at 08:58 PM.