www.webdeveloper.com

Thread: [RESOLVED] A RegExp for cutting any JavaScript out from the code

  1. #1
    Join Date
    May 2006
    Location
    Somewhere behind your screen
    Posts
    1,648

    [RESOLVED] A RegExp for cutting any JavaScript out from the code

    Hello guys! First of all two things:

    1. Happy New Year to everyone, wish you tons of good, let this year bring you all only positive emotions and expectations!
    2. typeof(Santa) is myth

    so, here is my problem. i need a regular expression which cuts any javascript from the page html code. i mean only scripts between <script> and </script> tags, not the inline javascript.

    i wrote the function which disables scripts:

    Code:
    function noscripts(content){var s_path_1=/<script/ig,s_path_2=/\/script>/ig;content=content.replace(s_path_1,'<!--script').replace(s_path_2,'script-->');return content;}
    but i would like to fully remove this stuff from the html. i'm using jQuery $.get() and i need to remove scripts from the returned data before any manipulations with it, because i put a portion of the returned data into a temporary div to be able to search through its elements and i do not need scripts in there.

    i tried to use this one
    Code:
    /<\s*script[^>]*>[\w\W\s]*<\s*\/script>/ig;
    sometimes it matches but sometimes it does not. can anybody help? thanks in advance
    xxx: Guess Buddhist riddle: "What is the sound of one hand clapping?"
    yyy: facepalm

  2. #2
    Join Date
    Mar 2009
    Posts
    501
    Maybe I don't quite understand what you require, but if you put all of a block of HTML into a separate DIV, can't you do something like this:

    Code:
    var f = document.getElementById('myDivWithJavascriptTagsInIt');
    var scr = f.getElementsByTagName('script');
    var i, len = scr.length;
    for(i = 0; i < len; i++){
        scr[i].parentNode.removeChild(scr[i]);
    }
    Or is there something I'm not understanding here?

  3. #3
    Join Date
    May 2006
    Location
    Somewhere behind your screen
    Posts
    1,648
    thanks Tcobb, but i need to remove scripts from the returned data before i put it into the temp div. at that moment there are no elements yet, the data is just a large string. that's why i need a regular expression to remove the matching parts from the large string - something like this:
    Code:
    var noscripts=/bla-bla-bla/ig;
    $.get("somepage.php",function(data){
    data=data.replace(noscripts,'');/*this removes scripts from the data*/
    $('#temp_div').html(data);
    /*the data is clean for now and i can manipulate with it*/
    });

  4. #4
    Join Date
    Mar 2009
    Posts
    501
    NEVERMIND--I understand what you're talking about now. I'm removing the reply.
    Last edited by Tcobb; 01-11-2013 at 05:31 PM.

  5. #5
    Join Date
    May 2006
    Location
    Somewhere behind your screen
    Posts
    1,648
    ok thank you very much, i'll try it the way you recommend

  6. #6
    Join Date
    Mar 2009
    Posts
    501
    Sorry---

    I understand that the javascript may be self-executing, in which case my solution will not help you. I have had to do a similar thing in PHP, and here is the approximate process.

    (1) Loop through the string, replacing every '<(space)' with '<'
    (2) repeat the loop so long as the resultant length of the string is different
    (3) do the same thing for '(space)>'
    (4) use regexps to replace any variant of '<script', such as '<SCript', with '<script'
    (5) use regexps to replace any variant of '</script>', such as '</SCript>', with '</script>'
    (6) now use regexps to replace anything beginning with <script and ending with </script> with the empty string.
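
    The steps above can be sketched in JavaScript roughly as follows. This is only a sketch under assumptions: the function name stripScriptBlocks is made up for illustration, and it assumes that after the clean-up passes every closing tag reads exactly '</script>'.

```javascript
// Hypothetical helper sketching steps (1)-(6); not code posted in the thread.
function stripScriptBlocks(html) {
  var previous;
  // Steps (1)-(3): drop spaces just inside the tag brackets until nothing changes.
  do {
    previous = html;
    html = html.replace(/< /g, '<').replace(/ >/g, '>');
  } while (html !== previous);
  // Steps (4)-(5): normalise the case of the opening and closing script tags.
  html = html.replace(/<script/gi, '<script').replace(/<\/script>/gi, '</script>');
  // Step (6): remove each block, from '<script' up to the nearest '</script>'.
  return html.replace(/<script[\s\S]*?<\/script>/g, '');
}

console.log(stripScriptBlocks('<p>a</p>< SCript >alert(1)</SCRIPT ><p>b</p>'));
```

    One thing to keep in mind: the first loop also touches any '< ' or ' >' that appears in ordinary text, so a string such as 'a < b' comes back as 'a <b'.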

    I hope this is of some use to you.

    --Regards

  7. #7
    Join Date
    Jul 2008
    Location
    urbana, il
    Posts
    2,787
    <script> tags added to dom element via innerHTML will not execute, so it's safe to remove them after.
    scary i know, but trust me, it's safe.

    if still in doubt, replace /<script[^>]*>/gi with <script type='nothing'> to deactivate the tags first (the * and the i flag make sure a bare <script> and upper-case variants are caught too).
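
    A minimal sketch of that deactivation step, for reference. The function name deactivateScripts is hypothetical, and the pattern uses * plus the i flag so that a bare <script> and upper-case variants are rewritten as well. It relies on browsers treating a script element with an unrecognised type as inert data.

```javascript
// Hypothetical helper: rewrites every opening script tag so that the block
// is no longer recognised as JavaScript. Existing attributes (including src)
// are dropped by the replacement.
function deactivateScripts(html) {
  return html.replace(/<script\b[^>]*>/gi, "<script type='nothing'>");
}

console.log(deactivateScripts('<SCRIPT src="a.js"></SCRIPT><script>alert(1)</script>'));
```

    Closing tags are left untouched, so the markup stays balanced and the blocks can still be removed through the DOM afterwards.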

  8. #8
    Join Date
    May 2006
    Location
    Somewhere behind your screen
    Posts
    1,648
    rnd me, in the very first post i adduced the function for disabling scripts called noscripts

  9. #9
    Join Date
    May 2006
    Location
    Somewhere behind your screen
    Posts
    1,648
    what i am trying to find is just regular expression for removing all the scripts with their tags from the response data markup before it is added on the main page... i could not compose such regexp by myself even using The Regex Coach that's why i asked for help ))

  10. #10
    Join Date
    Aug 2008
    Location
    Sweden
    Posts
    227
    Happy new year Padonak!

    I changed your RegExp a bit so it would match more variations of script tags:
    Code:
    /<\s*script.*?>.*?(<\s*\/script.*?>|$)/ig
    Though, I would choose rnd's solution over using regular expressions - it seems to be the most secure way.
    New to web development or in need of a good reference? Check out the Mozilla Developer Network or W3Schools.

  11. #11
    Join Date
    May 2006
    Location
    Somewhere behind your screen
    Posts
    1,648
    Quote Originally Posted by refreezed
    Happy new year Padonak!

    I changed your RegExp a bit so it would match more variations of script tags:
    Code:
    /<\s*script.*?>.*?(<\s*\/script.*?>|$)/ig
    Though, I would choose rnd's solution over using regular expressions - it seems to be the most secure way.
    thanks man, but this regexp matches tags only and ignores everything between these tags. i do not need to remove or disable the script tags only, i am trying to find the way to cut off the script tags and all the code between them by replacing it with nothing

    data=data.replace(regexp goes here,'nothing');

    for example, let us try to remove all scripts from this code:
    Code:
    <meta http-equiv="content-type" content="text/html; charset=windows-1251">
    <meta http-equiv="Pragma" content="no-cache">
    <meta http-equiv="Expires" content="-1">
    
    <script type="text/javascript" src="/scripts/jquery-1.4.2.min_223.js"></script>
    <script type="text/javascript" src="/scripts/overlib.js"></script>
    <script type="text/javascript" src="/scripts/ressources.js"></script>
    <script type="text/javascript" src="/scripts/form.js"></script>
    </head>
    <body>
    <script type="text/javascript" language="javascript">
    var ress = new Array(2647618, 174070, 442978);
    var max = new Array(5525950,3069400,1709800);
    var production = new Array(32.011111111111, 10.041666666667, 5.6625);
    window.setInterval("res_online()",1000);
    </script>
    
    <form name="ress" id="ress" style="display:inline">
    <INPUT TYPE="hidden" ID="metall" value="0">
    <INPUT TYPE="hidden" ID="crystall" value="0">
    <INPUT TYPE="hidden" ID="deuterium" value="0">
    <INPUT TYPE="hidden" ID="bmetall" value="0">
    <INPUT TYPE="hidden" ID="bcrystall" value="0">
    <INPUT TYPE="hidden" ID="bdeuterium" value="0">
    </form>
    this is a fragment of the response data. before i put it in a temp div i need this code to become "script-free", smth like this:

    Code:
    <meta http-equiv="content-type" content="text/html; charset=windows-1251">
    <meta http-equiv="Pragma" content="no-cache">
    <meta http-equiv="Expires" content="-1">
    
    
    </head>
    <body>
    
    
    <form name="ress" id="ress" style="display:inline">
    <INPUT TYPE="hidden" ID="metall" value="0">
    <INPUT TYPE="hidden" ID="crystall" value="0">
    <INPUT TYPE="hidden" ID="deuterium" value="0">
    <INPUT TYPE="hidden" ID="bmetall" value="0">
    <INPUT TYPE="hidden" ID="bcrystall" value="0">
    <INPUT TYPE="hidden" ID="bdeuterium" value="0">
    </form>

  12. #12
    Join Date
    Mar 2009
    Posts
    501
    I haven't tried it, but this might do what you're looking for with just a string into string type operation.

    Code:
    function killScripts(str){  //takes string as argument containing the HTML
         //get rid of all outer spaces in tags
         var arr, inArr, i, oldLen, output, len = str.length;
         do{
           oldLen = len;
           str = str.replace(/< /ig, '<');
           str = str.replace(/ >/ig, '>');
           len = str.length;
          } while(len != oldLen);
          //dispose of case sensitivity
          str = str.replace(/<script/ig,'<script');
          str = str.replace(/<\/script/ig,'</script');
          //now take them out
          arr = str.split('<script');
          len = arr.length;
          output = arr[0];
          for(i = 1; i < len; i++){
            inArr = arr[i].split('</script>');
            //keep only what follows the closing tag; an unclosed script leaves nothing to keep
            output += inArr.slice(1).join('</script>');
           }
           return output;
    }
    Last edited by Tcobb; 01-12-2013 at 06:14 PM.

  13. #13
    Join Date
    Aug 2008
    Location
    Sweden
    Posts
    227
    Oh wow, I forgot that the dot doesn't match newlines... This works for your example:

    Code:
    /<\s*script[^>]*>[\s\S]*?(<\s*\/script[^>]*>|$)/ig
    And FYI, the last expression I posted did not ignore anything between the script tags - it only failed if the script contained newlines.
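
    For anyone who wants to try it, here is a small sketch that wraps the expression above in a helper and runs it on a shortened version of the fragment quoted earlier in the thread. The names SCRIPT_BLOCK, stripScripts and sample are made up for illustration.

```javascript
// The expression from this post, stored once and reused.
var SCRIPT_BLOCK = /<\s*script[^>]*>[\s\S]*?(<\s*\/script[^>]*>|$)/ig;

// Hypothetical helper: returns the markup with every script block removed.
function stripScripts(html) {
  return html.replace(SCRIPT_BLOCK, '');
}

// Shortened version of the response fragment from the thread.
var sample = [
  '<script type="text/javascript" src="/scripts/form.js"></script>',
  '</head>',
  '<body>',
  '<script type="text/javascript" language="javascript">',
  'var ress = new Array(2647618, 174070, 442978);',
  'window.setInterval("res_online()",1000);',
  '</script>',
  '<form name="ress" id="ress" style="display:inline">',
  '</form>'
].join('\n');

console.log(stripScripts(sample));
```

    Because of the |$ alternative, a script tag that is never closed swallows everything up to the end of the string, which is probably what you want for a cut-off response.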

    Edit: on a closer look, the biggest problem with your original expression is that you make a greedy search between the script's start and end tag, thus it'll match the very first <script>, the very last </script>, and everything in between (including other script end and start tags).
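
    A tiny sketch of that difference, using a made-up test string with two script blocks and a paragraph between them (the variable names are only for illustration):

```javascript
var html = '<script>one()</script><p>keep me</p><script>two()</script>';

// Greedy: [\w\W\s]* runs on to the last closing tag, so the paragraph is lost too.
var greedy = html.replace(/<\s*script[^>]*>[\w\W\s]*<\s*\/script>/ig, '');

// Lazy: [\s\S]*? stops at the first closing tag after each opening tag.
var lazy = html.replace(/<\s*script[^>]*>[\s\S]*?<\s*\/script[^>]*>/ig, '');

console.log(JSON.stringify(greedy)); // ""
console.log(JSON.stringify(lazy));   // "<p>keep me</p>"
```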
    Last edited by ReFreezed; 01-12-2013 at 07:52 PM.

  14. #14
    Join Date
    Jul 2008
    Location
    urbana, il
    Posts
    2,787
    i think that we might not need to be so defensive against malformed scripts.
    if i understand it, the scripts are coming from a trusted source, so you probably won't see something like:


    Code:
    <scRipt>alert(1);/*<script>alert(2)</script>*/;'</script>';alert(3)</script>
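
    To make the concern concrete, here is a small sketch of what the lazy expression posted earlier in the thread does with that input. The variable names are only for illustration.

```javascript
var tricky = "<scRipt>alert(1);/*<script>alert(2)</script>*/;'</script>';alert(3)</script>";

// The match ends at the first closing tag, which sits inside the JS comment.
var pattern = /<\s*script[^>]*>[\s\S]*?(<\s*\/script[^>]*>|$)/ig;
var leftover = tricky.replace(pattern, '');

console.log(leftover); // */;'</script>';alert(3)</script>
```

    The leftover has no opening script tag any more, so it should not run, but it ends up as stray text in the markup. String patterns cannot tell a real closing tag from one inside a comment or a string literal.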

    if you don't think you might be under attack, it's quite simple to remove all script tags:


    Code:
    function noscript(strHTML){
       var div=document.createElement("div");
       div.innerHTML = strHTML;
       var scripts = div.getElementsByTagName("script");
    
      for(var i=scripts.length;i--;){
         scripts[i].parentNode.removeChild(scripts[i]);
      }
    
      return div.innerHTML;
    
    }

    Last edited by rnd me; 01-12-2013 at 08:15 PM.

  15. #15
    Join Date
    May 2006
    Location
    Somewhere behind your screen
    Posts
    1,648
    many thanks to all you guys who tried to help me in this thread! i've always known that i could find help here ))

    ReFreezed, the second regexp edition works perfectly - here is the evidence matches_now.png, thanks!
    Tcobb, i haven't tried your code yet, but i'm going to try it and put it in my "must have" js folder if it works (it looks like it does), thanks!
    rnd me, thank you for trying to help me, i very much appreciate it! i know that removing elements through the DOM would be the simplest way, but it causes js-errors if i let these scripts stay in the response data, that's why i need string operations to sweep the scripts out of the data before i put it in the page. But anyway, thanks!
