...many times to remove every character that is not alphanumeric, "." or " ", but I'm having to rethink its use as I start working on string replacement for languages other than American English. The reason is that this RegExp also strips special characters such as "ó" and "ñ". I'm not certain, but I think it would also remove all double-byte characters, such as words in various Asian languages.
Has anyone run into this problem and have they found a simple coding solution to catch all non-English special characters?
Do you know of a single or simpler command, however, that would include double-byte characters, or all characters that are non-English but still alphanumeric?
Well, in a way, yes, but a RegExp range has nothing to do with whether characters are single-byte or double-byte; it is defined by their character codes. A range like [a-z] covers the ASCII codes from 97 (a) to 122 (z). If the special characters you want to allow form a continuous range, you only need to give the first and last characters of that range, as in extended ASCII.
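To make the point about ranges concrete, here is a minimal sketch in Python (the thread does not say which language the original RegExp was written in, so the language, the patterns, and the helper name strip_chars are all assumptions for illustration). It builds a range from character codes and also shows a Unicode-aware class that may be closer to the "single command" asked about.

```python
import re

# A range in a character class is defined by character codes, not byte widths.
assert ord("a") == 97 and ord("z") == 122

# Assumed stand-in for the original filter: keep ASCII letters, digits, "." and " ".
ASCII_ONLY = re.compile(r"[^a-zA-Z0-9. ]")

# Same filter plus the contiguous extended range from code 192 to code 255.
# Caveat: that range also contains the signs at codes 215 and 247.
WITH_LATIN1 = re.compile("[^a-zA-Z0-9. " + chr(192) + "-" + chr(255) + "]")

# In Python 3, \w on text strings matches letters and digits of any script,
# and also the underscore, so nothing has to be listed by hand.
UNICODE_AWARE = re.compile(r"[^\w. ]")


def strip_chars(pattern, text):
    """Remove every character matched by the (negated) pattern."""
    return pattern.sub("", text)


sample = "El ni\u00f1o comi\u00f3. 100%!"
print(strip_chars(ASCII_ONLY, sample))     # accented letters are lost
print(strip_chars(WITH_LATIN1, sample))    # accented letters survive
print(strip_chars(UNICODE_AWARE, "\u6f22\u5b57 ni\u00f1o!"))  # CJK survives too
```

The hand-built range only helps when the wanted characters really are contiguous; the \w form avoids that, at the cost of also keeping "_".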
Ooohhh... Ok, that's an excellent observation I'd not thought of.
Off-hand, do you know of a good online resource that could provide an incremental list of all such characters?
I'll search for my own, certainly, but if you've got experience with a good resource like that, I thought I'd ask.
If you are interested in the matter, here's some additional information about handling special characters, ASCII codes, foreign alphabets, Unicode, and regular expressions.
You mentioned Asian characters and double-byte characters. In fact there are several ways of encoding them (an encoding and a character set are different animals): 1, 2, 3, or 4 bytes per code point (or a combination of these). Code points are a sort of mapping between characters and numbers. The Unicode encodings have "mysterious" names: UTF-32 always uses 4 bytes per code point, while UTF-8 uses 1, 2, 3, or 4 bytes per code point.
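A quick way to see those byte counts is to encode a few characters and measure the result. This is only a sketch in Python; the helper name byte_widths and the sample characters are my own choices, not something from the thread.

```python
def byte_widths(ch):
    """Return the size in bytes of one character in UTF-8, UTF-16 and UTF-32.

    The "-le" codec variants are used so that no byte order mark is
    prepended, which would otherwise inflate the counts.
    """
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


# One character from each UTF-8 width class:
# U+0061 (a), U+00F1 (n with tilde), U+6F22 (a CJK ideograph), U+1F600 (an emoji).
for ch in ["a", "\u00f1", "\u6f22", "\U0001F600"]:
    utf8, utf16, utf32 = byte_widths(ch)
    print(f"U+{ord(ch):04X}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32} bytes")
```

So "double-byte" is a property of a particular encoding, not of the character itself: the same code point takes a different number of bytes depending on how it is stored.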
Also take care with the difference between a bit and a byte.
When it comes to regular expressions, a range of characters can be built not only from their ASCII decimal values, but also from their Unicode or hexadecimal values. A detailed explanation/standard:
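Separately from that reference, here is a small sketch of ranges written with hexadecimal and Unicode escapes. It uses Python's re module, which is an assumption on my part; escape syntax differs between regex flavours, so check the one you actually use.

```python
import re

# The range from code 192 to code 255, written with two-digit hex escapes.
by_hex = re.compile(r"[\xc0-\xff]")

# The same range written with four-digit Unicode escapes.
by_unicode = re.compile(r"[\u00c0-\u00ff]")

# The CJK Unified Ideographs block, U+4E00 to U+9FFF.
cjk = re.compile(r"[\u4e00-\u9fff]+")

text = "ni\u00f1o \u6f22\u5b57 abc"
print(by_hex.findall(text))      # the accented letter only
print(by_unicode.findall(text))  # same result as the hex form
print(cjk.findall(text))         # the run of two ideographs
```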