www.webdeveloper.com

Thread: [RESOLVED] regexp trouble: \S won't match special characters - WHY??

  1. #1
    Join Date
    Jan 2006
    Location
    South Africa
    Posts
    62

    Hi everyone,

    I'm sure this is embarrassingly simple but I've been at it for too long and I just don't see it. What I'm trying to do is to remove single characters from a string. I.e. in the string 'B comes before c but after a' the single characters must be removed, so that the remaining result is 'comes before but after'.

    Now all literature I've seen so far states that \S in a regexp should match a non-whitespace character. However I find that it does not match any special characters in the target string. For example,

    Code:
    echo preg_replace('/\b\S\b/', '', "B comes before c but after a");
    produces the desired

    Code:
    comes before but after
    but if the target string contains any of the characters escaped by preg_quote (i.e. \ + * ? [ ^ ] $ ( ) { } = ! < > | : etc... ) these are left in the remaining result:

    Code:
    echo preg_replace('/\b\S\b/', '', "A { comes } before c ? but : after a !");
    produces:

    Code:
    { comes } before ? but : after !
    Why is this? Yes, special characters have to be escaped in the regexp or you can expect unpredictable results, but if \S should just match any non-whitespace character (whitespace being defined as [ \t\n\r\f\v] according to the literature), why does it not match ? or ! or any other 'special' character?

    And, of course, if I want to remove a single character from a string as I'm trying to do here, what _is_ the proper way to do it?

    * Bangs head against wall *

  2. #2
    Join Date
    May 2006
    Location
    the netherlands
    Posts
    454
    I don't think it's so much a problem of the way \S is interpreted, but more of the way \b is handled. I think your special characters act as word boundaries, and that's probably why they're not matched.

    I think if you rewrite your regex to
    PHP Code:
    echo "[" . preg_replace('/\S/', '', "A { comes } before c ? but : after a !") . "]";
    you will see that \S also matches the special characters

  3. #3
    Join Date
    Jan 2006
    Location
    South Africa
    Posts
    62
    Quote Originally Posted by themarty View Post
    I don't think it's so much a problem of the way \S is interpreted, but more of the way \b is handled. I think your special characters act as word boundaries, and that's probably why they're not matched.
    Aha. You might very well have a point there. I hadn't thought of that. Thanks for clearing that one up!

    (Update: you are absolutely right! With the above info I did some more digging around on Google and I found this: "A word boundary (\b ) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order)..." and of course a special character is not a word character (\w)... so there you go.)
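
The quoted definition can be checked directly. This is a minimal sketch, assuming the sample string from post #1: a lone letter has a \w/\W transition on each side, while a lone "{" sits between two spaces and so has no \b next to it at all.

```php
<?php
// Sample string taken from the first post.
$subject = "A { comes } before c ? but : after a !";

// \b\S\b needs a word boundary on BOTH sides of the single character.
// A lone letter has a space (or the string edge) on each side, so it qualifies.
preg_match_all('/\b\S\b/', $subject, $m);
echo implode(',', $m[0]), "\n";               // A,c,a

// A lone "{" has \W (a space) on both sides, so no \b exists next to it.
echo preg_match('/\b\{\b/', $subject), "\n";  // 0

// Without the \b anchors, \S matches the brace as expected.
echo preg_match('/^\S$/', '{'), "\n";         // 1
```

So \S itself is behaving as documented; it is the two \b assertions that rule out the punctuation.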

    Quote Originally Posted by themarty View Post
    I think if you rewrite your regex to
    PHP Code:
    echo "[" . preg_replace('/\S/', '', "A { comes } before c ? but : after a !") . "]";
    you will see that \S also matches the special characters
    Yes... but then it removes all non-whitespace characters, of course, not just the single ones flanked by two word boundaries... Hm... What to do?

    * scratches head *

    // FvW
    Last edited by frankvw; 10-05-2010 at 02:26 AM.

  4. #4
    Join Date
    May 2006
    Location
    the netherlands
    Posts
    454
    Quote Originally Posted by frankvw View Post
    Yes... but then it removes all non-whitespace characters, of course, not just the single ones flanked by two word boundaries
    I know; it was not meant as a solution, but just to demonstrate \S does match the other characters.

    Hm... What to do?

    * scratches head *
    First you'll need to define for yourself, what the rules of replacement are. If that rule is, for example:

    "any single character that is not a white-space and that is enclosed by whitespaces"

    Then that is what you're going to convert into a regex. Which would then look something like this: /\s\S\s/

    But you need to start with the definition first

  5. #5
    Join Date
    Apr 2003
    Location
    Netherlands
    Posts
    21,654
    PHP Code:
    echo "[" . preg_replace('/[[:punct:]]|\b\S\b/', '', "A { comes } before c ? but : after a !") . "]";

  6. #6
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,518
    My interpretation based on what I think is the requirement:
    PHP Code:
    $result = preg_replace('/(?<=\s|^)\S(?=\s|$)/', '', $string);
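
A short run of that lookaround pattern against the sample string from post #1 might look like the sketch below. The whitespace clean-up step and the variable name $clean are additions for illustration, not part of the suggestion above.

```php
<?php
$string = "A { comes } before c ? but : after a !";

// Remove any single non-whitespace character that has whitespace
// (or the start/end of the string) on both sides.
$result = preg_replace('/(?<=\s|^)\S(?=\s|$)/', '', $string);

// Hypothetical clean-up step: collapse the leftover runs of spaces.
$clean = trim(preg_replace('/\s+/', ' ', $result));

echo $clean, "\n"; // comes before but after
```

Because the lookarounds only assert and do not consume, the surrounding spaces stay in place and neighbouring words are never glued together.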
    "Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
    ~ Terry Pratchett in Nation

    eBookworm.us

  7. #7
    Join Date
    Jan 2006
    Location
    South Africa
    Posts
    62

    Quote Originally Posted by themarty View Post
    I know; it was not meant as a solution, but just to demonstrate \S does match the other characters.
    And that was enough to eventually point me in the right direction, so it did help me out in the end!

    I did start with the definition, and it would have worked out fine, if not for the fact that special characters turn out to be interpreted as word boundaries. I totally agree with you about the 'think before you do' bit but even though I did think about the parameters (and then some) I got tripped up by insufficient experience with how the various parts of a regular expression can surprise the unwary. :-)

    In the end I arrived at the following:

    Code:
    preg_replace ('/\W|\b\S\b/', ' ', $string)
    which works fine for me. It also removes combinations of single and special characters (e.g. not only ' a ' and ' ! ' are removed but also ' a! ') which I actually had failed to consider, but turns out to be darn near perfect. :-) It does leave multiple white-spaces in the string, but those are not a problem and trivial to get rid of in any case.
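
For anyone trying this, here is a small sketch of that pattern together with the whitespace clean-up mentioned above. The helper name strip_single_chars is hypothetical. Note that \W also strips punctuation inside words, which may or may not be what you want.

```php
<?php
// Hypothetical helper name; the pattern is the one from the post above.
function strip_single_chars(string $string): string
{
    // Replace every non-word character and every isolated single character
    // with a space ...
    $spaced = preg_replace('/\W|\b\S\b/', ' ', $string);
    // ... then collapse the resulting runs of whitespace.
    return trim(preg_replace('/\s+/', ' ', $spaced));
}

echo strip_single_chars("A { comes } before c ? but : after a !"), "\n"; // comes before but after
echo strip_single_chars("a! don't stop"), "\n";                          // don stop
```

The second call shows the side effect: the apostrophe in "don't" is a \W character, so it is replaced, and the stranded "t" then counts as a single character too.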

    BTW the reason why I'm using something like \b\S\b and not \s\S\s is that single characters at the start or end of a string need to be removed as well. So if you use \s\S\s on something like 'a before b' neither 'a' nor 'b' is removed even though they are single chars.
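
A quick sketch of that difference, using made-up test strings:

```php
<?php
// \s\S\s needs a whitespace character on both sides, so nothing at the
// very start or end of the string can match.
echo preg_replace('/\s\S\s/', '', 'a before b'), "\n";       // a before b

// It also consumes the surrounding whitespace, gluing neighbours together.
echo preg_replace('/\s\S\s/', '', 'one a two'), "\n";        // onetwo

// \b\S\b matches the single letters at both edges and leaves the spaces alone.
echo '[', preg_replace('/\b\S\b/', '', 'a before b'), "]\n"; // [ before ]
```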

    Anyway. Your response did lead me to find the solution, so I owe you many thanks, fellow countryman! (I moved from Rotterdam to Johannesburg, South Africa, several years ago.)

    // FvW

  8. #8
    Join Date
    Jan 2006
    Location
    South Africa
    Posts
    62
    Quote Originally Posted by Fang View Post
    PHP Code:
    echo "[" . preg_replace('/[[:punct:]]|\b\S\b/', '', "A { comes } before c ? but : after a !") . "]";
    I fiddled with the POSIX syntax for quite a while, but it didn't work for me. Strangely enough. But then I got it sorted out using \W|\b\S\b so I sort of gave up at that point... :-)

    // FvW
