www.webdeveloper.com

Thread: Non-English RegExp for removing non-alphanumeric

  1. #1
    Join Date
    Mar 2006
    Posts
    73

    Non-English RegExp for removing non-alphanumeric

    I have used this script:

    Code:
    someString.replace(/[^A-Za-z0-9 .]/g, '')
    ...many times to remove every character that is not alphanumeric, "." or " ", but I am having to rethink its use as I start working on string replacement for languages other than American English. The reason is that this RegExp also strips out special characters such as "ó" and "ñ". I'm not certain, but I think it would also remove all double-byte characters, such as words in various Asian languages.
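    To make the problem concrete, here is a minimal sketch of what the pattern above does to accented and Asian text. The sample string is made up for illustration.

```javascript
// The pattern from the post: keep only ASCII letters, digits, space and period.
const asciiOnly = /[^A-Za-z0-9 .]/g;

// Hypothetical sample containing accented Latin letters and two kanji.
const sample = 'José Muñoz 東京5.';
const stripped = sample.replace(asciiOnly, '');

console.log(stripped); // Jos Muoz 5.
```

    The accented letters and the kanji are all dropped, because none of them fall inside A-Z, a-z or 0-9.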

    Has anyone run into this problem and have they found a simple coding solution to catch all non-English special characters?

    Yours,
    Dave

  2. #2
    Join Date
    Dec 2003
    Location
    Bucharest, ROMANIA
    Posts
    15,428
    Yes, it is possible, but you need to specify which special characters you want to allow. You should also serve the page in a suitable character encoding, probably UTF-8 in your case.
    Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head>
    <title>Untitled Document</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta http-equiv="Content-Style-Type" content="text/css">
    <meta http-equiv="Content-Script-Type" content="text/javascript">
    <script type="text/JavaScript">
    function valid(f) {
    // Strip everything except ASCII letters and the listed accented capitals;
    // the i flag also admits their lowercase forms.
    f.value=f.value.replace(/[^A-ZÇÑÀÁÈÉÍÌÏÓÒÚÙÜ]/gi,'');
    }
    </script>
    </head>
    <body><br>
    <form id="myform" action="">
    <input name="mytext" type="text" onkeyup="valid(this)" onblur="valid(this)">
    </form>
    </body>
    </html>
    Last edited by Kor; 02-24-2011 at 10:48 AM.

  3. #3
    Join Date
    Mar 2006
    Posts
    73
    So what you're suggesting is to build a library of uppercase and lowercase special characters and use that in the RegExp?

    Makes sense...

    Is there a single or simpler command, however, that would include double-byte or all non-English-but-still-alphanumeric characters that you know of?

    Yours,
    Dave

  4. #4
    Join Date
    Dec 2003
    Location
    Bucharest, ROMANIA
    Posts
    15,428
    Quote Originally Posted by Sylvan012 View Post
    Is there a single or simpler command, however, that would include double-byte or all non-English-but-still-alphanumeric characters that you know of?
    Well, in a way yes, but a RegExp range has nothing to do with whether the characters are single-byte or double-byte; it is based on their character codes. A range like [a-z] covers the codes from 97 (a) to 122 (z), which for these letters are the same as ASCII. In JavaScript the codes are Unicode values. If the special characters you want to allow sit in a contiguous range, you only need the first and the last character of that range.
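    A minimal sketch of the point about ranges and character codes. The helper name matchesUpTo255 is made up for this example. It also shows why a range such as [A-z] is risky: codes 91 to 96 sit between Z and a.

```javascript
// Character classes compare code values, which charCodeAt exposes.
console.log('a'.charCodeAt(0), 'z'.charCodeAt(0)); // 97 122

// Hypothetical helper: list every character with code 0..255 that a class matches.
function matchesUpTo255(re) {
  const hits = [];
  for (let code = 0; code <= 255; code++) {
    const ch = String.fromCharCode(code);
    if (re.test(ch)) hits.push(ch);
  }
  return hits.join('');
}

// [A-z] spans codes 65..122, so it also admits the six characters between Z and a.
const extras = matchesUpTo255(/[A-z]/).replace(/[A-Za-z]/g, '');
console.log(extras); // [\]^_`
```

    Writing A-Za-z, or A-Z with the i flag, avoids letting those punctuation characters through.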
    Last edited by Kor; 02-24-2011 at 11:11 AM.

  5. #5
    Join Date
    Mar 2006
    Posts
    73
    Quote Originally Posted by Kor View Post
    Well, in a way yes, but a RegExp range has nothing to do with whether the characters are single-byte or double-byte; it is based on their character codes. A range like [a-z] covers the codes from 97 (a) to 122 (z). If the special characters you want to allow sit in a contiguous range, you only need the first and the last character of that range.
    Ooohhh... Ok, that's an excellent observation I'd not thought of.

    Off-hand, do you know of a good online resource that could provide an incremental list of all such characters?

    I'll search for my own, certainly, but if you've got experience with a good resource like that, I thought I'd ask.

    Yours,
    Dave

  6. #6
    Join Date
    Dec 2003
    Location
    Bucharest, ROMANIA
    Posts
    15,428
    http://www.cdrummond.qc.ca/cegep/inf...iles/ascii.htm

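    For instance, [Ç-Ñ] covers every character from Ç to Ñ. Note that JavaScript orders the range by Unicode value (Ç = 199, Ñ = 209), not by the extended ASCII (code page 437) values in the table linked above, where Ç is 128 and Ñ is 165.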
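    A quick way to check what such a range really covers in JavaScript is to enumerate it. This is only a sketch; the endpoints are written as \u escapes (U+00C7 is Ç, U+00D1 is Ñ) so the result does not depend on how the file is saved.

```javascript
// The range Ç-Ñ, written with Unicode escapes.
const range = /[\u00C7-\u00D1]/;

// Collect every character in 0x80..0xFF that the range matches.
const covered = [];
for (let code = 0x80; code <= 0xFF; code++) {
  const ch = String.fromCharCode(code);
  if (range.test(ch)) covered.push(ch);
}

console.log(covered.join('')); // ÇÈÉÊËÌÍÎÏÐÑ
console.log(range.test('\u00DC')); // false: Ü (U+00DC) lies outside the range
```

    The range holds eleven characters, so letters such as Ó, Ú and Ü still have to be listed separately or covered by a wider range.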

  7. #7
    Join Date
    Mar 2006
    Posts
    73
    Quote Originally Posted by Kor View Post
    http://www.cdrummond.qc.ca/cegep/inf...iles/ascii.htm

    For instance [Ç-Ñ] should cover all the characters from ASCII extended 128 = Ç to 165 = Ñ.
    Whoa... Ok, that's pretty useful and awesome! Thank you!

  8. #8
    Join Date
    Dec 2003
    Location
    Bucharest, ROMANIA
    Posts
    15,428
    If you are interested in the matter, here is some additional information about handling special characters, ASCII codes, foreign alphabets, Unicode and regular expressions.

    You said something about Asian characters and double-byte characters. In fact there are several ways of encoding (an encoding and a character set are different animals): 1, 2, 3, or 4 bytes per code point. Code points are a sort of mapping between characters and numbers. The Unicode encodings have "mysterious" names: UTF-32 always uses 4 bytes per code point, while UTF-8 uses 1, 2, 3, or 4 bytes per code point.
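    As a rough illustration of the variable width of UTF-8, the sketch below counts encoded bytes with the standard TextEncoder API, which always produces UTF-8. The function name utf8Bytes is made up for the example, and the snippet assumes a browser or a recent Node.js where TextEncoder is a global.

```javascript
const encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

// Hypothetical helper: number of bytes a string takes in UTF-8.
function utf8Bytes(text) {
  return encoder.encode(text).length;
}

console.log(utf8Bytes('a'));  // 1 byte  (ASCII)
console.log(utf8Bytes('ñ'));  // 2 bytes (U+00F1)
console.log(utf8Bytes('東')); // 3 bytes (U+6771)
console.log(utf8Bytes('😀')); // 4 bytes (U+1F600)

// JavaScript strings themselves are sequences of UTF-16 code units,
// so a character beyond U+FFFF has length 2 but is one code point.
console.log('😀'.length);      // 2
console.log([...'😀'].length); // 1
```

    None of this changes how a character class behaves: the RegExp works on the string's characters, not on the bytes of whatever encoding the page was served in.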

    Also mind the difference between a bit and a byte.

    When it comes to regular expressions, a range of characters can be built not only from their ASCII decimal values, but also from their Unicode values written in hexadecimal. A detailed explanation/standard:
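    Two hedged sketches of what this can look like in JavaScript. The first writes a range with hexadecimal \u escapes. The second uses Unicode property escapes (\p{L} for letters, \p{N} for numbers), which need the u flag and an ES2018-capable engine, so older browsers will reject it. The sample strings are made up.

```javascript
// 1. Range written with hexadecimal escapes: U+00C0..U+00FF holds the accented
//    Latin-1 letters (it also contains the × and ÷ signs, U+00D7 and U+00F7).
const latin1 = /[^A-Za-z0-9 .\u00C0-\u00FF]/g;
console.log('José Muñoz 東京5.'.replace(latin1, '')); // José Muñoz 5.

// 2. Unicode property escapes: keep letters and digits from any script.
const anyScript = /[^\p{L}\p{N} .]/gu;
console.log('José Muñoz 東京5.'.replace(anyScript, '')); // José Muñoz 東京5.
console.log('Hi! ¿Qué tal? 東京'.replace(anyScript, '')); // Hi Qué tal 東京
```

    One caveat: text stored in decomposed form keeps accents as separate combining marks, which \p{L} does not match, so adding \p{M} to the class or normalizing the input first may be necessary.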

    http://unicode.org/reports/tr18/

    And an additional note about regular expressions in JavaScript. People usually think that RegExp notation is universal. Well, that is only more or less true. There are some small differences (in fact, a few incomplete implementations) from one language to another. In JavaScript, regular expressions are similar to those in Perl.
