www.webdeveloper.com

Thread: Replacing characters with html codes, but excluding html tags

  1. #1
    Join Date
    Nov 2008
    Posts
    2,477

    Replacing characters with html codes, but excluding html tags

    I have a string which contains html markup. As an example:

    HTML Code:
    <p class="foo">Some text in a paragraph "which may be quoted & have special chars" in it</p>
    <p>It is true that <span>1</span> > 5, but not that 1 > 0</p>
    As you can see, there might be some special characters. Ultimately this HTML is going to be rendered, and I don't want plain characters like " and &; I want &quot; and &amp; etc.

    Obviously I can't just run this through htmlspecialchars() because that would convert the HTML tags too.

    I want to end up with this:

    HTML Code:
    <p class="foo">Some text in a paragraph &quot;which may be quoted &amp; have special chars&quot; in it</p>
    <p>It is true that <span>1</span> &gt; 5, but not that 1 &gt; 0</p>
    Does anyone know of a way of converting these characters when they are outside of an HTML tag? I'm thinking it might have to be done using a regex (which isn't my strong point!) since I can't guarantee that the code will be well-formed XML.
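
To illustrate the problem described in this post, here is a minimal sketch, assuming a made-up sample string modelled on the markup above. It only shows that escaping the whole string also escapes the tag delimiters and the attribute quotes.

```php
<?php
// Hypothetical sample input, modelled on the markup in the question.
$html = '<p class="foo">Fish & "chips"</p>';

// Escaping the whole string also escapes the tags and attribute quotes.
$escaped = htmlspecialchars($html);
echo $escaped, "\n";
// &lt;p class=&quot;foo&quot;&gt;Fish &amp; &quot;chips&quot;&lt;/p&gt;
```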

  2. #2
    Join Date
    Jan 2009
    Posts
    3,346
    You could use htmlspecialchars() and then switch the tag delimiters back with a search and replace on all the &gt; and &lt; entities.

  3. #3
    Join Date
    Nov 2008
    Posts
    2,477
    What about the quotes around attributes though? Or < and > within the actual content? The more I think about this the more I think I might just have to enforce well-formed content and use xml_parse().
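
A rough sketch of the escape-then-restore idea, to show where it goes wrong. The sample string is an assumption made up for illustration. The attribute quotes stay encoded, and a literal > in the content is restored along with the tag delimiters.

```php
<?php
// Hypothetical sample input with an attribute and a literal ">" in the text.
$html = '<p class="foo">1 > 0 & "quoted"</p>';

// Step 1: escape everything. Step 2: turn &lt; and &gt; back into < and >.
$escaped  = htmlspecialchars($html);
$restored = str_replace(['&lt;', '&gt;'], ['<', '>'], $escaped);

echo $restored, "\n";
// <p class=&quot;foo&quot;>1 > 0 &amp; &quot;quoted&quot;</p>
```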

  4. #4
    Join Date
    Jan 2009
    Posts
    3,346
    I didn't think about that. That's probably your best bet then.

  5. #5
    Join Date
    Nov 2008
    Posts
    2,477
    Hmmmm I'm still struggling with this.

    Using an XML parser doesn't work, because it falls over when it finds characters like & and < in the string. This is obviously a problem, as the entire point is to find those characters and fix them.

    There must be some way of taking some HTML source and replacing the special characters in the text between the tags with HTML entities?
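
A small sketch of the failure described here, assuming PHP's xml extension is available. The sample strings are made up for illustration. A bare & makes the parser report an error, while the same text with &amp; parses.

```php
<?php
// Returns true when the expat-style parser accepts the string as XML.
// isWellFormed() is a hypothetical helper name used only in this sketch.
function isWellFormed(string $xml): bool
{
    $parser = xml_parser_create();
    // xml_parse() returns 1 on success and 0 on failure.
    return xml_parse($parser, $xml, true) === 1;
}

$withBareAmpersand = isWellFormed('<p>Fish & chips</p>');
$withEntity        = isWellFormed('<p>Fish &amp; chips</p>');

var_dump($withBareAmpersand, $withEntity);
```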

  6. #6
    Join Date
    Jan 2005
    Location
    Alicante (Spain)
    Posts
    7,742
    I'm not saying it's bulletproof but maybe something like this:
    PHP Code:
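    $subject = preg_replace_callback(
        // Match each run of text that follows a '>' and stops before the next tag
        '/(?<=\>)((?![<]\/*[a-z][^>]*[>]).)+/is',
        // Anonymous function instead of create_function(), which was removed in PHP 8
        function ($matches) {
            return htmlspecialchars($matches[0]);
        },
        $subject
    );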
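
A usage sketch for the regex in this post, wrapped in a helper whose name (escapeTextBetweenTags) is made up for illustration. The input is an assumed sample. Because of the lookbehind, only text that follows a > is touched, so text before the first tag is left as it is.

```php
<?php
// Hypothetical helper wrapping the regex from the post above.
function escapeTextBetweenTags(string $html): string
{
    return preg_replace_callback(
        '/(?<=\>)((?![<]\/*[a-z][^>]*[>]).)+/is',
        function (array $matches): string {
            return htmlspecialchars($matches[0]);
        },
        $html
    );
}

$input  = '<p class="foo">Fish & "chips"</p><p><span>1</span> > 0</p>';
$output = escapeTextBetweenTags($input);
echo $output, "\n";
// <p class="foo">Fish &amp; &quot;chips&quot;</p><p><span>1</span> &gt; 0</p>

// Text with no preceding '>' is not matched at all.
$untouched = escapeTextBetweenTags('a & b');
```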

  7. #7
    Join Date
    Nov 2008
    Posts
    2,477
    Thanks a lot, that seems to work nicely from my initial tests. Is there an easy way of skipping any PHP content (i.e. between <?php and ?>) that may exist? This is the only issue I still seem to be having. So for example the following:

    PHP Code:
    $input = '<p class="test">A string with < and > and & and " characters.</p><p>PHP tags are <?php echo "broken"?> though...</p>';
    Is currently returned like this:

    Code:
    <p class="test">A string with &lt; and &gt; and &amp; and &quot; characters.</p> <p>PHP tags are &lt;?php echo &quot;broken&quot;; ?&gt; though...</p>
    I could just replace &lt;?php and ?&gt; with <?php and ?>, but it would be neater if they could be left untouched in the first place.

    EDIT: actually, replacing &lt;?php and ?&gt; would only be part of the problem anyway. I'd still have any & and " characters within the PHP itself, which would get converted.
    Last edited by Mindzai; 06-05-2009 at 08:50 AM.

  8. #8
    Join Date
    Jan 2005
    Location
    Alicante (Spain)
    Posts
    7,742
    You could try it like this:
    Code:
    /(?<=\>)((?![<](\?|\/)*[a-z][^>]*[>]).)+/is
    But it all depends what's between those PHP tags that might upset the regex. If it's just a basic echo statement like here though it should be ok.

    But php that contains html tags would break it. <?php echo '<html>' ?> for example.
    Last edited by bokeh; 06-05-2009 at 09:01 AM.
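
A sketch comparing the two cases mentioned in this post, using assumed sample strings. A simple PHP block is left alone, but a PHP block containing an HTML tag is cut at the tag's > and the remainder gets escaped.

```php
<?php
$pattern = '/(?<=\>)((?![<](\?|\/)*[a-z][^>]*[>]).)+/is';
$escape  = function (array $matches): string {
    return htmlspecialchars($matches[0]);
};

// Hypothetical inputs: one plain PHP block, one with a tag inside the PHP.
$simple = '<p>Tags are <?php echo "fine"; ?> here</p>';
$nested = '<p>Tags are <?php echo "<b>"; ?> here</p>';

$simpleResult = preg_replace_callback($pattern, $escape, $simple);
$nestedResult = preg_replace_callback($pattern, $escape, $nested);

echo $simpleResult, "\n"; // unchanged
echo $nestedResult, "\n"; // the tail of the PHP block is now escaped
```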

  9. #9
    Join Date
    Jan 2005
    Location
    Alicante (Spain)
    Posts
    7,742
    I think for anything more complex you would need to cut it up into different strings and use more specific regexes in each case (a two or three step process rather than a one size fits all regex).

  10. #10
    Join Date
    Nov 2008
    Posts
    2,477
    Hmmmm, yes, sometimes the PHP could be fairly complex. I think I might try what you suggest: if I split the string on the <?php and ?> tags, I could apply this callback to just the non-PHP portions and then stick it all back together afterwards.

  11. #11
    Join Date
    Nov 2008
    Posts
    2,477
    OK this is what I've got, fingers crossed it seems to work OK, thanks for the help:

    PHP Code:
    function cleanEntities($string) {
        // Decode any existing entities first so they are not encoded twice
        $string = htmlspecialchars_decode($string);

        // Split on PHP blocks, keeping the blocks themselves in the result
        $parts = preg_split('/(<\?.+?\?>)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

        $string = '';

        foreach ($parts as $part) {
            if (false === mb_strpos(trim($part), '<?')) {
                // Not a PHP block: escape the text between the HTML tags
                $string .= preg_replace_callback(
                    '/(?<=\>)((?![<](\?|\/)*[a-z][^>]*[>]).)+/is',
                    function ($matches) {
                        return htmlspecialchars($matches[0]);
                    },
                    $part
                );
            } else {
                // PHP block: leave it untouched
                $string .= $part;
            }
        }

        return $string;
    }
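
The splitting step used in this function can be looked at on its own. This is a sketch with an assumed sample string. With PREG_SPLIT_DELIM_CAPTURE the PHP blocks are returned as their own elements, so they can be skipped when escaping and then concatenated back in order.

```php
<?php
// Hypothetical sample: HTML, then a one-line PHP block, then more HTML.
$source = '<p>a & b</p><?php echo "x"; ?><p>c</p>';

// The capturing group makes preg_split() keep the delimiters in the result.
$parts = preg_split('/(<\?.+?\?>)/', $source, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($parts);
// [0] is the leading HTML, [1] is the PHP block, [2] is the trailing HTML.
```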

  12. #12
    Join Date
    Nov 2008
    Posts
    2,477
    Noticed a small bug but I'm unable to edit:

    PHP Code:
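    $parts = preg_split('/(<\?.+?\?>)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    Should be:

    PHP Code:
    $parts = preg_split('/(<\?.+?\?>)/us', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    The u and s modifiers needed to be added to the regex: s lets . match newlines, so PHP blocks that span several lines are found, and u treats the pattern and subject as UTF-8.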
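
A short sketch of why the s modifier matters, using an assumed multi-line sample. Without it, . does not match a newline, so a PHP block spread over several lines is never found and the string is not split at all.

```php
<?php
// Hypothetical sample with a PHP block that spans three lines.
$source = "<p>a</p><?php\necho 1;\n?><p>b</p>";

$withoutS = preg_split('/(<\?.+?\?>)/', $source, -1, PREG_SPLIT_DELIM_CAPTURE);
$withS    = preg_split('/(<\?.+?\?>)/us', $source, -1, PREG_SPLIT_DELIM_CAPTURE);

// One part (no split) without the modifier, three parts with it.
echo count($withoutS), ' ', count($withS), "\n";
```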

  13. #13
    Join Date
    Jan 2005
    Location
    Alicante (Spain)
    Posts
    7,742
    Code:
    <\?.+?\?>
    Maybe not important, but that will also split on an XML declaration:
    Code:
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>

  14. #14
    Join Date
    Nov 2008
    Posts
    2,477
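    Yes, I did think that, though I don't anticipate having any XML in the string. Maybe I'd be better off with something like this though (the inner group is non-capturing, because PREG_SPLIT_DELIM_CAPTURE would otherwise return it as an extra part):

    Code:
    /(<\?(?:php|=).+?\?>)/us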
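
A quick check of the narrower pattern against the broader one, on an assumed sample that contains both an XML declaration and a short echo tag. With PREG_SPLIT_DELIM_CAPTURE every capturing group is returned, so this sketch uses a non-capturing group, (?:php|=), for the alternation.

```php
<?php
// Hypothetical sample: XML declaration, HTML, a short echo tag, more HTML.
$source = '<?xml version="1.0"?><p>a</p><?= $x ?><p>b</p>';

$broad  = preg_split('/(<\?.+?\?>)/us', $source, -1, PREG_SPLIT_DELIM_CAPTURE);
$narrow = preg_split('/(<\?(?:php|=).+?\?>)/us', $source, -1, PREG_SPLIT_DELIM_CAPTURE);

// The broad pattern treats the XML declaration as a delimiter;
// the narrow one only splits on the PHP echo tag.
print_r($broad);
print_r($narrow);
```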

  15. #15
    Join Date
    Jan 2014
    Posts
    12
    Wow, well, it's easy: if you want to put a copyright sign, just write &copy; (with no spaces in it), or you can write the numeric code instead. If you want more updates you can join blogs.
