www.webdeveloper.com

Thread: Finding Banned Words On A Page And Not Within Other Words!

  1. #1
    Join Date
    Oct 2016
    Posts
    152

    Finding Banned Words On A Page And Not Within Other Words!

    Php Lovers!

    I want to find banned words on a loaded page (meta tags and content), but only as whole words, NOT when they appear inside other words.

    And so, if I am looking for the word "****", then the word "****erel" should not trigger the filter.

    I just tested this code and, as expected, it works, but it burns through a lot of CPU. One moment the page loads; the next it goes grey and shows signs that it is taking too long to load. And all this on localhost. I can imagine what my web host would do!
    So we will have to come up with a better solution. Any ideas?
    How about not checking the loaded page for all the banned words? Instead, the script could halt as soon as one banned word is found, after echoing which banned word was found and where on the page (meta tags, body content, etc.).
    Any code suggestions?

    Here is what I got so far:

    Code:
    <?php
     
    /*
    ERROR HANDLING
    */
     
    // 1). $curl is going to be data type curl resource.
    $curl = curl_init();
     
    // 2). Set cURL options.
    curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true );
     
    // 3). Run cURL (execute http request).
    $result = curl_exec($curl);
    $response = curl_getinfo( $curl );
     
    if( $response['http_code'] == '200' )
       {
        //Set banned words.
        $banned_words = array("Prick","Dick","***");
     
        //Separate each words found on the cURL fetched page.
        $word = explode(" ", $result);
        
       //var_dump($word);
     
       for($i = 0; $i < count($word); $i++)
          {
          foreach ($banned_words as $ban) 
             {
             if (strtolower($word[$i]) == strtolower($ban))
                {
                 echo "word: $word[$i]<br />";
                 echo "Match: $ban<br>";
                }
             else
                {
                 echo "word: $word[$i]<br />";
                 echo "No Match: $ban<br>";  
                }
             }
          }
       }  
     
    // 4). Close cURL resource.
    curl_close($curl);
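
    The "halt at the first banned word" idea asked about above could be sketched roughly as follows. This is only a sketch under assumptions: find_first_banned() is a hypothetical helper name, the sample page string is made up, and the offset it reports is a byte position in the fetched HTML, not a meta-tag/body label.

```php
<?php
// Hypothetical helper, not part of the script above: returns the first banned
// word found as a whole word in $text plus its byte offset, or null if none.
function find_first_banned(string $text, array $banned_words): ?array
{
    if ($banned_words === []) {
        return null;
    }
    // preg_quote() escapes any regex metacharacters inside the words.
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $banned_words);
    // \b on both sides keeps "nut" from matching inside "peanut".
    $regex = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    // preg_match() stops scanning at the first match it finds.
    if (preg_match($regex, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
        return ['word' => $m[0][0], 'offset' => $m[0][1]];
    }
    return null;
}

// Made-up sample page for illustration.
$page = '<title>Peanut butter</title><p>The nut fell.</p>';
$hit = find_first_banned($page, ['blow', 'nut']);
if ($hit !== null) {
    echo "Banned word '{$hit['word']}' found at byte offset {$hit['offset']}<br>";
}
```

    Because a single preg_match() call replaces the nested loops, the page is scanned once and the scan ends at the first hit. To say whether the hit was in the meta tags or the body, one option would be to run the same helper separately on each part of the page.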

  2. #2
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    22,158
    You'll probably want to use regular expressions with the preg_*() functions so that you can assert for word boundaries, include optional plural forms, etc.

    http://php.net/preg_match
    http://php.net/manual/en/pcre.pattern.php
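
    As a rough illustration of the word-boundary and optional-plural idea, here is a minimal sketch. build_banned_regex() is a hypothetical name, and the "s"/"es" suffix rule is an assumed simplification of English plurals.

```php
<?php
// Hypothetical helper: builds one case-insensitive pattern that matches each
// banned word as a whole word, optionally followed by a plural "s" or "es".
function build_banned_regex(array $banned_words): string
{
    // Escape regex metacharacters so the words are matched literally.
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $banned_words);
    return '/\b(?:' . implode('|', $quoted) . ')(?:es|s)?\b/i';
}

$regex = build_banned_regex(['nut', 'blow']);
var_dump(preg_match($regex, 'Two nuts.'));   // plural form of a banned word
var_dump(preg_match($regex, 'A peanut.'));   // banned word inside another word
```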
    "Well done....Consciousness to sarcasm in five seconds!" ~ Terry Pratchett, Night Watch

    How to Ask Questions the Smart Way (not affiliated with this site, but well worth reading)

    My Blog
    cwrBlog: simple, no-database PHP blogging framework

  3. #3
    Join Date
    Oct 2017
    Location
    Lithuania
    Posts
    25
    I would suggest using preg_replace, because your code won't catch all the bad words.

    For example, let's say one of your banned words is "Lexus". Your code will only detect the banned word in a sentence like "I like my Lexus car". If the sentence is "I like my Lexus!" or "Wow...Lexus is so cool", your code will do nothing, because it expects each banned word to have a space on both sides.

    Hence, you should use preg_replace, which is more flexible than explode.
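
    The difference can be seen by running both approaches over the three "Lexus" sentences. This is a sketch only; found_by_explode() and censor_word() are hypothetical helper names introduced for the comparison.

```php
<?php
// Both helpers are hypothetical and exist only to compare the two approaches.

// explode() approach: a banned word matches only when it is an exact
// space-delimited token.
function found_by_explode(string $sentence, string $banned): bool
{
    $tokens = array_map('strtolower', explode(' ', $sentence));
    return in_array(strtolower($banned), $tokens, true);
}

// preg_replace() approach: \b also matches next to punctuation.
function censor_word(string $sentence, string $banned): string
{
    $regex = '/\b' . preg_quote($banned, '/') . '\b/i';
    return preg_replace($regex, '****', $sentence);
}

foreach (['I like my Lexus car', 'I like my Lexus!', 'Wow...Lexus is so cool'] as $s) {
    echo (found_by_explode($s, 'Lexus') ? 'explode: match' : 'explode: no match'),
         ' | ', censor_word($s, 'Lexus'), "<br>\n";
}
```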

  4. #4
    Join Date
    Oct 2016
    Posts
    152
    Thank you NogDog & PhpMillion,

    I see a complete blank page!
    Did I manage to do as you suggested NogDog ?
    Here's my latest update NogDog & PhpMillion:

    Code:
    <?php

    /*
    ERROR HANDLING
    */

    // 1). $curl is going to be data type curl resource.
    $curl = curl_init();

    // 2). Set cURL options.
    curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true );

    // 3). Run cURL (execute http request).
    $result = curl_exec($curl);
    $response = curl_getinfo( $curl );

    if( $response['http_code'] == '200' )
    {
        //Set banned words.
        $banned_words = array("Prick","Dick","***");

        //Separate each words found on the cURL fetched page.
        $word = explode(" ", $result);

        //var_dump($word);

        for($i = 0; $i < count($word); $i++)
        {
            foreach ($banned_words as $ban)
            {
                if (strtolower($word[$i]) == strtolower($ban))
                {
                    echo "word: $word[$i]<br />";
                    echo "Match: $ban<br>";

                    $regex = '/\b'; // The beginning of the regex string syntax
                    $regex .= implode('\b|\b', $banned_words); // joins all the banned words to the string with correct regex syntax
                    $regex .= '\b /i'; // Adds ending to regex syntax. Final i makes it case insensitive
                    $substitute = '****';
                    $cleanresult = preg_replace($regex, $substitute, $result);
                    echo $cleanresult;
                }
                else
                {
                    echo "word: $word[$i]<br />";
                    echo "No Match: $ban<br>";
                }
            }
        }
    }

    // 4). Close cURL resource.
    curl_close($curl);
    Last edited by uniqueideaman; 10-04-2017 at 06:15 PM.

  5. #5
    Join Date
    Oct 2017
    Location
    Lithuania
    Posts
    25
    First of all, you may want to fix your regex, because as written it only detects the same space-delimited words that your explode() loop already finds. Instead of a regex that also catches words followed by punctuation such as . ! ? (as at the end of most sentences), you end up searching for SPACEbanned_wordSPACE twice, which adds unnecessary load.
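
    To make the regex issue concrete: in the pattern built in post #4 there is a space between the final \b and the closing /i, and that space becomes part of the last alternative, so the last word in the list only matches when a space follows it. A small sketch (the word list is a placeholder chosen for illustration):

```php
<?php
$banned_words = ['Audi', 'Lexus'];   // placeholder list for illustration

// Pattern as built in post #4: note the space before the closing delimiter.
$with_space = '/\b' . implode('\b|\b', $banned_words) . '\b /i';
// Same pattern without the stray space.
$without_space = '/\b' . implode('\b|\b', $banned_words) . '\b/i';

$text = 'I like my Lexus!';
echo preg_replace($with_space, '****', $text), "<br>\n";     // text comes back unchanged
echo preg_replace($without_space, '****', $text), "<br>\n";  // "Lexus" is replaced
```

    With the space removed, a single preg_replace() call over the whole page would seem to be enough, which makes the explode() loop around it redundant.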

  6. #6
    Join Date
    Oct 2016
    Posts
    152
    PhpMillion,

    I look for code on Stack Overflow and find unfinished threads started by others. I sometimes start threads there too, but do not always get a solution. Since problems are not always fully solved there, I grab code from there and bring it to others (as I am doing now). If you want to know the full history of this particular code, you are welcome to read this short thread:
    https://stackoverflow.com/questions/...in-other-words

    Anyway, I will try your advice later when I return indoors.

    Thanks!

  7. #7
    Join Date
    Mar 2007
    Location
    localhost
    Posts
    5,257
    The big question is more like how often are you going to perform this operation? Once a day or once a second...

    That is something that you have to consider and what sort of impact it will have on your server and the resources to support multiple requests, filtering, etc.
    --> JavaScript Frameworks like JQuery, Angular, Node <--
    ... and please remember to wrap code with forum BBCode tags:-

    [CODE]...[/CODE] [HTML]...[/HTML] [PHP]...[/PHP]

    If you can't think outside the box, you will be trapped forever with no escape...

  8. #8
    Join Date
    Oct 2017
    Location
    Lithuania
    Posts
    25
    In my personal opinion, it's not really important how often code will be executed. Even if OP executes it once a year, it's a good practice (and just a common sense) to keep the code as small and efficient as possible. And executing the same operation twice simply doesn't look logical.

  9. #9
    Join Date
    Oct 2016
    Posts
    152
    I will get PHP to execute the code once every time a page loads on a user's screen.
    In other words, whenever the proxy fetches a page, it must check for banned words. The filter must get to work!

  10. #10
    Join Date
    Mar 2007
    Location
    localhost
    Posts
    5,257
    Quote Originally Posted by phpmillion View Post
    In my personal opinion, it's not really important how often code will be executed. Even if OP executes it once a year, it's a good practice (and just a common sense) to keep the code as small and efficient as possible. And executing the same operation twice simply doesn't look logical.
    The point of my question is to highlight the problem of the server having to filter every time the page is requested, so it is important. Dismissing the question like you have doesn't help the OP resolve the problem.

    IMHO, if the page is going to be requested multiple times, then the page should be parsed on import, and both the original and the redacted copy should be put in a database.
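
    A rough sketch of the "filter once on import, then serve the stored copy" idea follows. Everything here is an assumption made for illustration: get_redacted_page() is a hypothetical name, the $store array stands in for a database table keyed by URL, and $fetch is any callable that returns the page HTML (for example a cURL wrapper).

```php
<?php
// Hypothetical sketch: $store stands in for a database table keyed by URL.
function get_redacted_page(string $url, array &$store, callable $fetch, string $regex): string
{
    if (!isset($store[$url])) {
        $original = $fetch($url);
        // Filter once, on import, and keep both copies.
        $store[$url] = [
            'original' => $original,
            'redacted' => preg_replace($regex, '****', $original),
        ];
    }
    // Later requests are served from the stored redacted copy.
    return $store[$url]['redacted'];
}

$store = [];
$calls = 0;
// Fake fetcher that counts how often a page is actually fetched and filtered.
$fake_fetch = function (string $url) use (&$calls): string {
    $calls++;
    return '<p>A nut and a peanut.</p>';
};

echo get_redacted_page('https://example.com/a', $store, $fake_fetch, '/\bnut\b/i'), "<br>\n";
echo get_redacted_page('https://example.com/a', $store, $fake_fetch, '/\bnut\b/i'), "<br>\n";
echo "fetch + filter ran $calls time(s)<br>\n";
```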

  11. #11
    Join Date
    Oct 2017
    Posts
    24
    Quote Originally Posted by phpmillion View Post
    In my personal opinion, it's not really important how often code will be executed. Even if OP executes it once a year, it's a good practice (and just a common sense) to keep the code as small and efficient as possible. And executing the same operation twice simply doesn't look logical.
    I agree with you. Executing the code twice wastes resources.
    But I'm curious:
    Q1. Where in this code is anything forced to execute or loop twice?
    Code:
    <?php
    
    /*
    ERROR HANDLING
    */
    //declare(strict_types=1);
    ini_set('display_errors', '1');
    ini_set('display_startup_errors', '1');
    error_reporting(E_ALL);
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);
    
    
    //RESULT: Code working!
    
    // 1). Set banned words.
    $banned_words = array("blow", "nut", "bull****");
    // 2). $curl is going to be data type curl resource.
    $curl = curl_init();
    // 3). Set cURL options.
    curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true );
    // 4). Run cURL (execute http request).
    $result = curl_exec($curl);
    if (curl_errno($curl)) {
        echo 'Error:' . curl_error($curl);
    }
    $response = curl_getinfo( $curl );
    if($response['http_code'] == '200' )
    {
        $regex = '/\b';     
        $regex .= implode('\b|\b', $banned_words);   
        $regex .= '\b/i'; 
        $substitute = '****';
        $cleanresult = preg_replace($regex, $substitute, $result);
        echo $cleanresult;
    }
    curl_close($curl);
    ?>
    Q2. How would you code it ? I mean, may I see a code sample ?

    Q3. If I change this:

    Code:
    // 3). Set cURL options.
    curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
    to this:

    Code:
    // 3). Set cURL options.
    $url = "https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X";
    curl_setopt($curl, CURLOPT_URL, "$url");
    then it works fine. But, is it safe to do it like this ?
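
    On Q3: writing "$url" in double quotes only copies the string, so passing $url directly behaves the same. Whether it is safe depends mainly on where the URL comes from. If it ever comes from a proxy user, a check along these lines may help. This is a sketch, and is_safe_http_url() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: accepts only well-formed http/https URLs.
function is_safe_http_url(string $url): bool
{
    // FILTER_VALIDATE_URL rejects strings that are not URLs at all.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Allow only web schemes, so file:// or ftp:// URLs are refused.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

$url = 'https://example.com/page?x=1';   // placeholder URL for illustration
if (is_safe_http_url($url)) {
    // The variable can be passed as-is, with no quotes needed:
    // curl_setopt($curl, CURLOPT_URL, $url);
    echo "URL accepted<br>\n";
}
```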

    Q4. Where on Mini Proxy, on which line, would I add the banned words filter code (which you see in my Q1) or your banned words filter code so that, when banned words are found on the proxied pages, then the banned words are substituted ?

    @all members:
    You're welcome to add a few lines of code to Mini Proxy so that proxy users are blocked from viewing pages that contain banned words.
    https://github.com/joshdick/miniProx.../miniProxy.php
    Then, kindly attach the script here so that I can get hold of your update and install it on my site. You're welcome to use the proxy regularly, and you're welcome to invite others to use it too.
    I've googled but had no luck finding a shared web host who will allow me to run a web proxy, so recommendations are welcome. I'll only hire a dedicated server once I've picked up a lot of regular users.


  12. #12
    Join Date
    Oct 2017
    Posts
    24
    Does anyone know the answers to Q2 & Q4 ?

  13. #13
    Join Date
    Mar 2007
    Location
    localhost
    Posts
    5,257
    I would say that using MySQL for this operation to find words in a field by using a filter on a result set from another table would be one possible method of getting the actual words found as a result set.

    It's been a very long time since I have had to use complex results and filtering, so I am not about to bend my head relearning it. If you are going to be filtering lots of pages, I suggest you store the original document in a table, apply a filter based on another table by looking for those words in the target document to give a set of words as the result, and use that set to replace those words in the document.

    So step 1 is to obtain the word list: rather than using conventional scripting to loop through an array, you use the database, which is optimized for searching data.
    Step 2 is to use that data set in a second query to replace the words found and store a copy that you then serve to the client.

    This allows you to review the effectiveness of the filtering, look for anomalies, and improve the filtering process.

  14. #14
    Join Date
    Oct 2017
    Posts
    24

    Question

    Quote Originally Posted by \\.\ View Post
    I would say that using MySQL for this operation to find words in a field by using a filter on a result set from another table would be one possible method of getting the actual words found as a result set.

    It's been a very long time since I have had to use complex results and filtering, so I am not about to bend my head relearning it. If you are going to be filtering lots of pages, I suggest you store the original document in a table, apply a filter based on another table by looking for those words in the target document to give a set of words as the result, and use that set to replace those words in the document.

    So step 1 is to obtain the word list: rather than using conventional scripting to loop through an array, you use the database, which is optimized for searching data.
    Step 2 is to use that data set in a second query to replace the words found and store a copy that you then serve to the client.

    This allows you to review the effectiveness of the filtering, look for anomalies, and improve the filtering process.
    I am trying to add this filter to a web proxy, Mini Proxy.
    I don't think it is a good idea to store the HTML of every page fetched by the proxy, since the proxy would be publicly available; it would run out of space very fast. However, when a banned word is found on a page, that page's URL can be saved to a table (banned urls) in the database, and when a user requests a page, the proxy can check the banned urls list and prevent the page from loading if the URL is blacklisted. I can program it like that to save bandwidth and time. But first things first: I need to get the banned words filter to work on Mini Proxy, which is what I'm having trouble with right now:
    http://www.webdeveloper.com/forum/sh...-On-Which-Line
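
    The banned-URL list described above might look something like this in outline. It is a sketch under assumptions: page_or_block() is a hypothetical name, the $banned_urls array stands in for the banned urls table, and $fetch is any callable that returns the page HTML.

```php
<?php
// Hypothetical sketch: $banned_urls stands in for the "banned urls" table,
// stored here as an array whose keys are URLs already found to contain a banned word.
// Returns the page HTML, or null when the page must not be shown.
function page_or_block(string $url, array &$banned_urls, callable $fetch, string $regex): ?string
{
    if (isset($banned_urls[$url])) {
        return null;   // already blacklisted: no fetch, no filtering
    }
    $html = $fetch($url);
    if (preg_match($regex, $html) === 1) {
        $banned_urls[$url] = true;   // remember the URL, not the HTML
        return null;
    }
    return $html;
}

$banned_urls = [];
$fetches = 0;
// Fake fetcher that counts how often a page is actually requested.
$fake_fetch = function (string $url) use (&$fetches): string {
    $fetches++;
    return '<p>A nut fell.</p>';
};

var_dump(page_or_block('https://example.com/a', $banned_urls, $fake_fetch, '/\bnut\b/i'));
var_dump(page_or_block('https://example.com/a', $banned_urls, $fake_fetch, '/\bnut\b/i'));
echo "pages fetched: $fetches<br>\n";
```

    Only the URL is stored, so the table stays small, and a repeat request for a blacklisted page costs one lookup.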
