Problem with regular expressions when word not first on line


nemesis_256
10-13-2007, 06:37 PM
I'm working on a script that will find certain keywords in an html file, and then link them to a file with the same name as the keyword (spaces are replaced with dashes and an extension added). I'm looping through an array which reads the keywords to look for. Here's the line with the regular expressions.

$fString = preg_replace("/[^(<h\d?>)?.*(<\/h\d?>)?][^$keywordList[$count]]$keywords[$count2]{1}/i", "<a href=\"$links[$count2]\">$keywords[$count2]</a>", $fString, 1);

$keywords has the word, and $links has the file name.
/[^(<h\d?>)?.*(<\/h\d?>)?]
That part of the regular expression is meant to ignore any header tags.
[^$keywordList[$count]]
That part is meant to ignore the current keyword ($keywordList is an array with all the keywords, and each one has its own HTML file).
$keywords[$count2]{1}
The last part looks for the actual keyword.

So the problem is that it only works if the keyword is the first thing on a line. If there's some other word and a space before it, it doesn't find it and no link is created. Also sometimes it seems to erase some letters that are not part of the keyword. For example if I had "qwertykeyword", and it searched for "keyword" a few letters such as "rty" would disappear.

Thanks in advance.
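
For readers following along, the file-naming rule described in this post (spaces replaced with dashes, then an extension added) can be sketched in a few lines. The function name keywordToFile and the default .html extension are assumptions made for illustration; the thread does not show how $links was actually built.

```php
<?php
// Hypothetical helper: turn a keyword into the file name it should link to.
function keywordToFile(string $keyword, string $extension = '.html'): string
{
    // Collapse each run of whitespace into a single dash, then append the extension.
    return preg_replace('/\s+/', '-', trim($keyword)) . $extension;
}

echo keywordToFile('best file sharing program'), "\n"; // best-file-sharing-program.html
```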

scragar
10-13-2007, 10:09 PM
Yeah, I'm guessing that your ignore-header-tags part is the problem. Can you give me some sample text showing the header format and the content format so I can fix this for you? (Or if it's in HTML format let me know, that's an easy fix :P)

hyperlisk
10-14-2007, 05:19 AM
Try adding the 'm' modifier to the regular expression. 'm' turns on multiline mode, so ^ and $ match at the start and end of every line instead of only at the start and end of the whole string:
$fString = preg_replace("/[^(<h\d?>)?.*(<\/h\d?>)?][^$keywordList[$count]]$keywords[$count2]{1}/im", "<a href=\"$links[$count2]\">$keywords[$count2]</a>", $fString, 1);
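
A quick way to see what the "m" and "s" modifiers change is to count matches with and without them. This is a minimal sketch with made-up sample text: "m" only affects where ^ and $ can match, and "s" only affects whether the dot matches a newline, so neither one changes a pattern that has no anchors or dots outside a character class.

```php
<?php
$text = "best file sharing\nfile sharing program";

// Without "m", ^ matches only at the very start of the string.
$withoutM = preg_match_all('/^file/', $text);     // 0
// With "m", ^ also matches right after each newline.
$withM    = preg_match_all('/^file/m', $text);    // 1

// Without "s", the dot stops at a newline; with "s" it crosses it.
$withoutS = preg_match('/sharing.file/', $text);  // 0
$withS    = preg_match('/sharing.file/s', $text); // 1

// A pattern with no anchors and no dots behaves the same either way.
$plain    = preg_match_all('/file/', $text);      // 2
$plainMS  = preg_match_all('/file/ms', $text);    // 2

var_dump($withoutM, $withM, $withoutS, $withS, $plain, $plainMS);
```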

nemesis_256
10-14-2007, 08:40 AM
It is plain HTML, meaning there is a doctype, header, body, divs, paragraphs, and so on. Here's a small sample.
<h1>sdfjsdflsdjf file sharing sdfsdf sldk</h1>

<p>
best file sharing program dslfks dfsj sdf lksjd building a web page sdlfjsdfkjsdflsdf creating a web page sdfsdldsfk jsdfk sdlksj df file share program sdflksjdf lsdkfjsdflksdfj lksdfj file sharing application sdf sdfsf sdf df file sharing ds sdfs dfsdf sdfs df fsdsd f
</p>


I tried playing around with the header part of the regular expression, and I wasn't able to make it do anything different. What's also annoying is that it seems to link words that are within a header tag anyway.

Adding the m parameter didn't make any difference either.

If I add spaces to the regular expression (so it looks for spaces on both sides of the keyword) it seems to work better at finding keywords in the middle of a line/paragraph, but I still get the problem of it erasing a few letters before the beginning anchor tag.

I hope I can make this work. :(

nemesis_256
10-14-2007, 03:06 PM
Another update. I changed the regex for detecting the header tags so it would also work if that tag had attributes.

$fString = preg_replace("/[^<h\d\".+<\/h\d>][^$keywordList[$count]]$keywords[$count2]{1}/im", "<a href=\"$links[$count2]\">$keywords[$count2]</a>", $fString, 1);

I also did a bit more exploration with letters disappearing when a tag is added. I removed the two parts of the regex that do the ignoring (so it was only /$keywords[$count2]{1}/im) and at that point the letters did not get removed.

So something in the two square bracket areas is deleting a few letters. Is there another way I can tell it to ignore something? Like some kind of if statement that checks those before checking the keyword. (I hope this makes sense, it's kinda hard to explain).

nemesis_256
10-15-2007, 09:56 PM
anyone?

Kostas Zotos
10-16-2007, 11:31 AM
Hi,

I prepared a bit of code for this issue to suggest a possible workaround. (Please copy and paste the code, save it as .php and run it.)

In general, two different regular expressions are used to achieve the result.
The regexes use the "s" (dot-all) modifier together with "m" to replace all keyword occurrences (those that meet the regex conditions) with the corresponding links at once. $keywords, $links and $keywordList are filled in with simple examples for demonstration purposes.

<?php

$keywords =array("sharing", "file", "program", "php"); // Keywords to search and replace
$links =array( "dictionary", "http://www.domain.com/data", "wiki", "wiki/web"); // Keywords links locations - URIs
$keywordList=array( "backup"); // Keywords to not match, example the: "backup file" will be not replaced

// Our text as a long string (here a short HTML file for demonstration)
$Subject='
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html> <head>
<title>Preg Repace Test</title>

<META http-equiv="content-type" content="text/html; charset=iso-8859-1">
<META name="Keywords" content=" sharing, file, program, php, backup ">
<script language="JavaScript" type="text/javascript" src="Home/ScriptFile.js"></script></head>

<body topmargin="30" leftmargin="50" bgColor="#cecece" style="font:14px/18px verdana">
<center>
Regular Expression Test in PHP (preg_replace example)<br>
(automatically creates links to files for matched keywords)

<h2>sdfjsdflsdjf backup file sharing sdfsdf sldk</h2>
<p>
php best file sharing <b> backup program </b> dslfks dfsj sdf lksjd building a web page sdlfjsdfkjsdflsdf creating a web page sdfsdldsfk jsdfk sdlksj df file share

program sdflksjdf lsdkfjsdflksdfj lksdfj file sharing application sdf sdfsf sdf df <b> backup file </b> sharing ds sdfs dfsdf PHP sdfs df <b> backup sharing </b> fsdsd f
</p>

</center></body>
</html>
'; // The end single quot of our $Subject var.


$OriginalSubject=$Subject;
$count=0;

echo "<center>\n<h2>".'REGULAR EXPRESSIONS: TEXT REPLACEMENT IN PHP'."</h2>\n";
echo '( Example by K.Zotos &nbsp -10/2007- )'."\n</center>\n<br><hr><br>\n";

for ($count2=0; $count2<count($keywords); $count2++) {

$Subject=$OriginalSubject; // Initialize our text after each replacement.

// To avoid replacing keywords in the header, we temporarily remove everything from the start until the end of the head section (not the best solution).
// Note: if text exists after the head end and before the <body> start ( </head> ... <body> ), it is not removed and so can still get keyword links, which may be an issue.
$Matches = array(0 => "");
$Pattern = "/.+\/head\s*>/ims";
$Result = preg_match($Pattern, $Subject, $Matches); // If header section found $Matches[0] will contains the whole matched string

if ($Result >0) $Subject=preg_replace($Pattern, "", $Subject); // Temporarily replace the header contents with "" empty string
// we have stored them in $Matches array..

$Extension='.php'; // File extension
$File='/'.$keywords[$count2].$Extension; // File name with a forward slash in fron of it (will used to compose keywords URLs)
$Replacement='<a href="'.$links[$count2].$File.'">'.$keywords[$count2].'</a>';

// The following pattern matches a keyword that not preceded by words in $keywordList + one space.
// Eg. if search for the keyword "program" and the "not match" word is "backup" then the "backup program" will not replaced..
$Pattern = "/(?<!$keywordList[$count]\s)$keywords[$count2]/ims"; // i: case insensitive search, m:multiline, s:dot-all (. dot char can be anything, even new line)
$Subject = @$Matches[0].preg_replace($Pattern, $Replacement, $Subject); // Concatenates the previously removed init text (header) with the body text after replacement.

// Prepare and Output the results in HTML.
$Text="\n<b><span style='color:#115566'>".' -------------- TEXT REPLACEMENT &nbsp Array Element ( '.$count2." ) </span></b> ------------------- ";
$Text.= "\n\n<br><br>\n".'Keyword Replaced: <b>'.$keywords[$count2].'</b><br>';
$Text.= "\n".'Not Replaced: <b>'.$keywordList[$count].' '.$keywords[$count2].'</b>';
$Text.= "<br>\n".'File URL: <b>'.$links[$count2].$File.'</b>';
$Text.="<br>\n\n".$Subject. "\n\n\n<br><hr><br><br>";

echo $Text;
}

?>

Of course there are always better approaches, and any suggestion is welcome as ever :)

Kostas
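
One thing the script above does not do is escape the keywords before putting them into the pattern. That is harmless for plain words like "file", but a keyword containing regex metacharacters would change the meaning of the pattern. A minimal sketch using PHP's preg_quote follows; the keyword "web 2.0" and the sample sentences are made up for the demonstration.

```php
<?php
$keyword = 'web 2.0';

// Unescaped: the dot means "any character", so "web 2x0" matches too.
$loose  = preg_match("/$keyword/i", 'try web 2x0 today');                          // 1
// Escaped: preg_quote() turns the dot into a literal dot.
$strict = preg_match('/' . preg_quote($keyword, '/') . '/i', 'try web 2x0 today'); // 0
$exact  = preg_match('/' . preg_quote($keyword, '/') . '/i', 'try Web 2.0 today'); // 1

var_dump($loose, $strict, $exact);
```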

nemesis_256
10-17-2007, 08:34 AM
Thanks for the example, it's quite good, but not exactly what I'm trying to do.

First, can you explain what this part of the regex does?
(?<!

Also, I need a part of it that will ignore h1 through h6 tags. I tried putting what I have (which is [^<h\d\".+<\/h\d>]) into your code and I still got the problem of some characters disappearing in the replace. Not only that but it didn't even seem to work.

Thanks for the help, I think we're getting there...

Kostas Zotos
10-18-2007, 05:03 AM
Hello,
To be honest, I am not completely sure exactly what you want, but anyway..

First, can you explain what this part of the regex does?
(?<!

It is called a negative lookbehind assertion, and in general it is used when you want to exclude matches that are preceded by a given string.
An assertion is a way to look forward or backward from the current matching point and check that a string exists (positive assertion) or does not exist (negative assertion) to the left or right of it. Note also that the string inside the assertion is itself not consumed (it is not included in the match).

It is formed like this (parentheses with a special marker inside): (?=something) // (positive lookahead assertion) e.g. (test)(?=abc\s) // matches the word "test" when followed by "abc" and a whitespace character.
(?!something) // (negative lookahead assertion) matches when the current match point is not followed by the "something" string or subpattern.

To search before the matching point, lookbehind assertions are used:
(?<=something) // positive lookbehind assertion, when you want something to exist before (precede) your current matching point
(?<!something) // negative lookbehind assertion, when you want something to not exist before the matching point, e.g. (?<!\d)[A-Z]+ (matches one or more capital letters that are not preceded by a digit).
-------------------------
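
The four assertion types listed above can be tried out directly. This is only a small sketch; the sample strings and variable names are made up, and each result shows that the text inside the assertion is checked but never becomes part of the match.

```php
<?php
// Positive lookahead: "test" only when followed by "abc" and whitespace.
preg_match('/test(?=abc\s)/', 'testabc def', $m);
$lookahead = $m[0];                       // "test" (the "abc " is not included)

// Negative lookahead: "test" only when NOT followed by "abc".
preg_match('/test(?!abc)/', 'testabc testxyz', $m, PREG_OFFSET_CAPTURE);
$negLookaheadOffset = $m[0][1];           // 8 (the second "test")

// Positive lookbehind: digits only when preceded by a dollar sign.
preg_match('/(?<=\$)\d+/', 'id 7, price $42', $m);
$lookbehind = $m[0];                      // "42"

// Negative lookbehind: "program" only when NOT preceded by "backup ".
$negLookbehind = preg_replace(
    '/(?<!backup\s)program/i',
    '[program]',
    'backup program, best program'
);                                        // "backup program, best [program]"

echo $lookahead, "\n", $negLookaheadOffset, "\n", $lookbehind, "\n", $negLookbehind, "\n";
```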

A possible pattern to exclude the heading <h1-h6> tags could be:
"/.+^(<h\d.+<\/h\d>)/"

(Note: when you use [^<h\d\".+<\/h\d>] without any quantifier (*, +, etc.) after the character class [ ], it matches exactly one character that is not among the characters listed inside the class; it does not match or exclude the whole string. For example, the . (dot) metacharacter inside [.] represents just a literal dot (not any character); in other words, it has no special meaning inside a character class. Since that one character becomes part of the match, it is replaced along with the keyword, which is probably why some characters were trimmed or disappeared from your results.)
-----------------------
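
The disappearing letters can be reproduced in isolation. In this sketch the class [^abc] is only a stand-in for the interpolated $keywordList entry, and the link target is made up; the point is that each negated class consumes one real character, while a lookbehind checks the preceding text without consuming anything.

```php
<?php
$link = '<a href="keyword.html">keyword</a>';

// Two negated classes: two extra characters must be present, and both are
// part of the match, so they are replaced together with the keyword.
$eaten = preg_replace('/[^<h\d".+\/>][^abc]keyword/i', $link, 'qwertykeyword', 1);
echo $eaten, "\n"; // qwer<a href="keyword.html">keyword</a>   ("ty" is gone)

// A lookbehind inspects the text before the keyword but leaves it in place.
$kept = preg_replace('/(?<!backup\s)keyword/i', $link, 'qwertykeyword', 1);
echo $kept, "\n";  // qwerty<a href="keyword.html">keyword</a>
```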

I optimized the code; it now works as follows:

It searches the given HTML text for the keywords given in an array. The first occurrence (or, easily, all of them if you prefer) that is not before the body start tag (<body) or inside heading tags (<h1-h6>) is replaced with a corresponding link (from a second array). The file name is the keyword plus a custom extension (here .php), so it creates an anchor link (a href=) with the given path (from the array) plus "/keyword.php", e.g. www.any-domain.com/keyword.php.

This process is repeated for every keyword in the array, and each time a new link is created (as described previously). The results (formatted HTML with active links) are placed in the same HTML document sequentially: the first text with the first keyword as a link, then the same text with another keyword as a link, and so on. (This repetition is done mainly for preview purposes.)

You can fill the keywords, links and "words that should not be replaced" arrays with your own, and the program automatically detects the keywords according to the regular expressions and converts those words to active links.
In the example only one must-not-match word is used: "backup". So if the keyword to convert to a link is e.g. "program", then the first "program" will be converted to a link but "backup program" will not be converted (because "program" is preceded by "backup" and our regular expression was set not to match that pair of words).

The updated code follows in the next post (unfortunately the whole post is somewhat long, apologies for this).

Kostas Zotos
10-18-2007, 05:05 AM
Well, the updated code is here (for comments please see the previous post). Consider this an example rather than a solution with the exact behavior you may need; you will have to adapt and modify it, or simply take it as an idea of a possible workaround or approach:

<?php

$keywords =array("sharing", "file", "program", "php"); // Keywords to search and replace
$links =array( "dictionary", "http://www.domain.com/data", "wiki", "wiki/web"); // Keywords links locations - URIs
$keywordList=array( "backup"); // Keywords to not match, example the: "backup file" will be not replaced


/* -------------------------------------------- SAMPLE HTML TEXT FOR TEST ------------------------------------------- */


// Our text as a long string (here a short HTML file for demonstration).
$Subject='
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html> <head>
<title>Preg Repace Test</title>

<META http-equiv="content-type" content="text/html; charset=iso-8859-1">
<META name="Keywords" content=" sharing, file, program, php, backup ">
<script language="JavaScript" type="text/javascript" src="Home/ScriptFile.js"></script></head>

<style type="text/css">
<!--
h3 { color:#991111; }
h4 { color:#991111; }
-->
</style>


<!-- NOTE: This is just a comment, to test if replaced (or no), keywords that are located before body tag (normally no), keywords: sharing file php program.
The previous keywords must stay intact (not replaced by anchor <a href=.. > tags) since they are before the body tag end excluded from replacement -->

<body topmargin="30" leftmargin="50" bgColor="#cecece" style="font:14px/18px verdana">
<center>
<h3> HEADING TEXT 1: Test sharing file program php </h3>
Regular Expression Test in PHP (preg_replace example)<br>
(automatically creates links to files for matched keywords)

<h3>*HEADING TEXT 2: sdfjsdflsdjf backup file sharing sdfsdf sldk*</h3>
<p>
php best file sharing <b> backup program </b> dslfks dfsj sdf lksjd building a web page sdlfjsdfkjsdflsdf creating a web page sdfsdldsfk jsdfk sdlksj df file share

program sdflksjdf lsdkfjsdflksdfj lksdfj file sharing application sdf sdfsf sdf df <b> backup file </b> sharing ds sdfs dfsdf PHP sdfs df <b> backup sharing </b> fsdsd f
</p>
<br>
<h4>HEADING TEXT 3: This test tags and keywords sharing file program php</h4>

</center></body>
</html>
'; // The end single quot of our $Subject var.


$OriginalSubject=$Subject;
$count=0;

echo "<center>\n<h2>".'REGULAR EXPRESSIONS: TEXT REPLACEMENT IN PHP'."</h2>\n";
echo '( Example by K.Zotos &nbsp -10/2007- )'."\n</center>\n<br><hr><br>\n";


for ($count2=0; $count2<count($keywords); $count2++) {

$Subject=$OriginalSubject; // Initialize our text after each replacement.


/* --------------------------------------- REMOVE EVERYTHING BEFORE THE BODY TAG -------------------------------------- */


// To avoid replace keywords in header.. we temporarily remove everything from document start until the body start tag ( <body )
// in other word removes anything that precedes the "<body" tag.
$Matches = array(0 => "");
//$Pattern = "/.+\/head\s*>/ims"; // This matches anything from document start until (including) the end of header
$Pattern = "/.+(?=<body)/ims"; // Matches anything until the "<body" start tag
$Result = preg_match($Pattern, $Subject, $Matches); // If text found from document start until the "<body" tag, $Matches[0] will contains the whole matched string

if ($count2==0) echo "<br>\n\n<!-- ------ TEXT EXCLUDED FROM REPLACEMENT (START) ------- -->\n".$Matches[0]."<!-- ------ TEXT EXCLUDED FROM REPLACEMENT (END) ------- -->\n\n"; // To see the removed text (everything before the body start tag)

if ($Result >0) $Subject=preg_replace($Pattern, "", $Subject); // Temporarily replace the contents before the body tag with "" empty string
// we have stored them in $Matches array..


/* ------------------------------------------- REPLACE ALL HEADING (<H1 - H6>) TAGS ----------------------------------------- */

// Replace all heading tags with a custom text - place holder. After keyword matching, will restore the heading texts in their initial positions
$Pattern="/<h\d.+<\/h\d>/"; // The pattern to match the heading tags
$HeadingSets=array();
preg_match_All($Pattern, $Subject, $HeadingSets, PREG_SET_ORDER);
$Subject=preg_replace($Pattern,"*PLACE_HOLDER*", $Subject); // Replace the headings


// echo 'Headings Found: '.count($HeadingSets)."<br>\n";

$HeadingTags=array();
$HeadingElement="";
for($a=0;$a<count($HeadingSets);$a++){ // Each element of $HeadingSets array, is also an array with its first element to be the matchin heading
$HeadingElement=$HeadingSets[$a][0];
array_push($HeadingTags,$HeadingElement);
//echo $HeadingElement."<br>\n";
}

/* ---------------------------------------------- REPLACE KEYWORDS WITH LINKS ----------------------------------------------- */

$Extension='.php'; // File extension
$File='/'.$keywords[$count2].$Extension; // File name with a forward slash in fron of it (will used to compose keywords URLs)
$Replacement='<a href="'.$links[$count2].$File.'">'.$keywords[$count2].'</a>';

// $Pattern="/(^(<\s*head.+\/head\s*>))|(?<!$keywordList[$count])$keywords[$count2]/ims";

// The following pattern matches a keyword that not preceded by words in $keywordList + one space.
// Eg. if search for the keyword "program" and the "not match" word is "backup" then the "backup program" will not replaced..

$Pattern = "/(?<!$keywordList[$count]\s)$keywords[$count2]/ims"; // i: case insensitive search, m:multiline, s:dot-all (. dot char can be everything, even new line)


$Subject = @$Matches[0].preg_replace($Pattern, $Replacement, $Subject,1); // Concatenates the previously removed init text (header) with the replaced one.


/* ------------------------------------------- RESTORE ALL HEADING (<H1 - H6>) TAGS ----------------------------------------- */

// Replace the placeholder texts with the original heading tags as stored in the $HeadingTags array. For this we supply preg_replace
// with 2 arrays: the saved headings and another one with an equal number of pattern elements (the same string used previously as the placeholder)
$PlaceHolders=array();
$Pat="/\*PLACE_HOLDER\*/";
for($a=0;$a<count($HeadingTags);$a++) {
array_push($PlaceHolders, $Pat); // The $PlaceHolders array requires to have equal number of elements (patterns) to achieve a one by one replacement
}
// echo 'Patterns Count: '.count($PlaceHolders)."<br>\n";

$Subject=preg_replace($PlaceHolders, $HeadingTags, $Subject,1);


/* ------------------------------------------------------- PRINT OUT RESULTS ------------------------------------------------------ */


// Prepare and Output the results in HTML format.
$Text="\n<div style='color:#cc2211'><b><span style='color:#115566'>".' -------------- TEXT REPLACEMENT &nbsp Array Element: '.$count2.' => '.$keywords[$count2]." ------------------- </span></b> ";
$Text.= "\n\n<br><br>\n".'Headings Found: '.count($HeadingSets)."<br>";
$Text.= "\n".'Keyword Replaced: <b>'.$keywords[$count2].'</b><br>';
$Text.= "\n".'Not Replaced: <b>'.$keywordList[0].' '.$keywords[$count2].'</b>';
$Text.= "<br>\n".'File Link: <b>'.$links[$count2].$File.'</b></div>';
$Text.="<br>\n\n".$Subject. "\n\n\n<br><hr><br><br>";

echo $Text;
}

?>

I can't think of anything else at this time. The approach seems to work, although it relies on more than one regular expression and several manipulations to reach a result, so it may need further optimization and, why not, a complete redesign :)

Kostas
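
The protect-and-restore idea in the script above can also be expressed without placeholders, by splitting the text at the headings and only touching the pieces in between. This is a sketch under assumptions, not the code from the thread: the function name linkOutsideHeadings, the word-boundary matching and the use of preg_quote are all choices made here for illustration.

```php
<?php
// Hypothetical helper: link the first occurrence of $keyword that is not
// inside an <h1>..<h6> element.
function linkOutsideHeadings(string $html, string $keyword, string $url): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even index = ordinary text, odd index = a complete heading element.
    $parts = preg_split(
        '/(<h[1-6]\b[^>]*>.*?<\/h[1-6]>)/is',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $pattern = '/\b' . preg_quote($keyword, '/') . '\b/i';
    $done = false;

    foreach ($parts as $i => $part) {
        if ($done || $i % 2 === 1) {
            continue; // already linked once, or this piece is a heading
        }
        // The callback keeps the original casing of the matched text.
        $parts[$i] = preg_replace_callback(
            $pattern,
            function (array $m) use ($url): string {
                return '<a href="' . htmlspecialchars($url) . '">' . $m[0] . '</a>';
            },
            $part,
            1,
            $count
        );
        $done = $count > 0;
    }

    return implode('', $parts);
}

echo linkOutsideHeadings(
    '<h1>file sharing</h1><p>best file sharing</p>',
    'file sharing',
    'file-sharing.html'
), "\n";
```

Like the original script, this only protects headings; text inside other tags or attributes is still treated as ordinary text.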

Kostas Zotos
10-18-2007, 06:10 AM
One last addition:

For example, the output for the "sharing" keyword looks like this:

-------------- TEXT REPLACEMENT Array Element: 0 => sharing -------------------

Headings Found: 3
Keyword Replaced: sharing
Not Replaced: backup sharing
File Link: dictionary/sharing.php


HEADING TEXT 1: Test sharing file program php

Regular Expression Test in PHP (preg_replace example)
(automatically creates links to files for matched keywords)

*HEADING TEXT 2: sdfjsdflsdjf backup file sharing sdfsdf sldk*

php best file sharing (http://dictionary/sharing.php) backup program dslfks dfsj sdf lksjd building a web page sdlfjsdfkjsdflsdf creating a web page sdfsdldsfk jsdfk sdlksj df file share program sdflksjdf lsdkfjsdflksdfj lksdfj file sharing application sdf sdfsf sdf df backup file sharing ds sdfs dfsdf PHP sdfs df backup sharing fsdsd f

HEADING TEXT 3: This test tags and keywords sharing file program php

--------------------------------------------------------

Only the first keyword "sharing" is converted to a link.
"backup sharing" is not converted, since we require that the keyword must not be preceded by the word "backup" [or any other word(s)] plus a space.
Keywords inside headings <h1-h6> are also not converted.
Finally, keywords located before the body tag (<body) are not converted either (actually, nothing from the document start until the body start tag is processed).

This is repeated for all the keywords, and links that follow the previous conditions are created automatically.

I wrote too many details, I think.. :(

Regards

Kostas

nemesis_256
10-18-2007, 08:04 AM
Wow, thank you so much for all of that. I haven't tried it out yet but I hope I can make it work with this. I have one quick question about this regular expression.
"/.+^(<h\d.+<\/h\d>)/"

What does the ^ symbol do in there? I thought it only meant "NOT" when it was inside the square brackets [], which was the reason I included those. I noticed that you actually didn't use it in the actual code, so I assume it's not that important.

Kostas Zotos
10-19-2007, 08:58 PM
Don't mention it :)

I thought it only meant "NOT" when it was inside the square brackets []...
Yes, you are right, it was my mistake.

It negates the character class if it is the first character:
[^abc] //Matches any character except a,b,c

The circumflex ^ can also appear outside a character class to indicate that the match must be at the start of the subject; it can also appear at the start of alternative subpatterns, and with the multiline modifier "m" it also matches right after a newline character "\n". In all these cases it acts as an anchor, indicating a match at the start of the subject or at the start of a line.
Example:
"/^[A-Z]+.+/" // Matches only when the subject starts with one or more capital letters.

If you escape it, it means just the ^ character. Example:
"/\d\^\d/" // Will match a string like "5^2"

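The three roles of the circumflex described above can be checked with a few calls. This is just a small sketch; the sample strings and variable names are made up.

```php
<?php
// Inside a class, a leading ^ negates it: first character that is not a, b or c.
preg_match('/[^abc]/', 'abcd', $m);
$negated = $m[0];                                        // "d"

// Outside a class, ^ anchors the match to the start of the subject.
$anchored    = preg_match('/^[A-Z]+.+/', 'Hello world'); // 1
$notAnchored = preg_match('/^[A-Z]+.+/', 'hello World'); // 0

// Escaped, it is a literal circumflex character.
$literal = preg_match('/\d\^\d/', '5^2');                // 1

var_dump($negated, $anchored, $notAnchored, $literal);
```
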
I didn't use it because my aim was to match the headings and replace them.

This seems to be an easy way to exclude the headings
(since you need everything except the headings, you can temporarily replace them with an empty string ""):

$Matches=array();
$Subject="Before Heading* <h4>THIS IS HEADING TEXT</h4> *After Heading";
$Pattern="/<h\d.+<\/h\d>/i"; // The pattern to match the heading tags ( note the caseless modifier "i" )

$Subject=preg_replace($Pattern,"", $Subject); // Replace the headings with empty string

echo $Subject; // prints: Before Heading* *After Heading (All the heading replaced with "")
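
One caveat worth knowing about the heading pattern: .+ is greedy, so if two headings share a line, everything between the first opening tag and the last closing tag is removed. The sketch below contrasts it with a lazy variant that also ties the closing tag to the opening level; the lazy pattern and the sample string are assumptions for illustration, not something used in the thread.

```php
<?php
$html = '<h1>a</h1> keep <h2 class="x">b</h2> end';

// Greedy: .+ runs up to the LAST closing heading tag, so " keep " is lost.
$greedy = preg_replace('/<h\d.+<\/h\d>/i', '', $html);

// Lazy: .*? stops at the first closing tag whose level matches the opening
// one (\1); the "s" modifier lets a heading span several lines.
$lazy = preg_replace('/<h([1-6])\b[^>]*>.*?<\/h\1>/is', '', $html);

var_dump($greedy, $lazy);
```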


Also, here is the code with a couple of improvements:
<?php

$keywords =array("sharing", "file", "program", "php"); // Keywords to search and replace
$links =array( "dictionary", "http://www.domain.com/data", "wiki", "wiki/web"); // Keywords links locations - URIs
$keywordList=array( "backup"); // Keywords to not match, example the: "backup file" will be not replaced


/* -------------------------------------------- SAMPLE HTML TEXT FOR TEST ------------------------------------------- */


// Our text as a long string (here a short HTML file for demonstration).
$Subject='
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html> <head>
<title>Preg Repace Test</title>

<META http-equiv="content-type" content="text/html; charset=iso-8859-1">
<META name="Keywords" content=" sharing, file, program, php, backup ">
<script language="JavaScript" type="text/javascript" src="Home/ScriptFile.js"></script></head>

<style type="text/css">
<!--
h3 { color:#991111; }
h4 { color:#991111; }
-->
</style>


<!-- NOTE: This is just a comment, to test if replaced (or no), keywords that are located before body tag (normally no), keywords: sharing file php program.
The previous keywords must stay intact (not replaced by anchor <a href=.. > tags) since they are before the body tag end excluded from replacement -->

<body topmargin="30" leftmargin="50" bgColor="#cecece" style="font:14px/18px verdana">
<center>
<h3> HEADING TEXT 1: Test sharing file program php </h3>
Regular Expression Test in PHP (preg_replace example)<br>
(automatically creates links to files for matched keywords)

<H3>*HEADING TEXT 2: sdfjsdflsdjf backup file sharing sdfsdf sldk*</H3>
<p>
php best file sharing <b> backup program </b> dslfks dfsj sdf lksjd building a web page sdlfjsdfkjsdflsdf creating a web page sdfsdldsfk jsdfk sdlksj df file share

program sdflksjdf lsdkfjsdflksdfj lksdfj file sharing application sdf sdfsf sdf df <b> backup file </b> sharing ds sdfs dfsdf PHP sdfs df <b> backup sharing </b> fsdsd f
</p>
<br>
<h4>HEADING TEXT 3: This test tags and keywords sharing file program php</h4>

</center></body>
</html>
'; // The end single quot of our $Subject var.


$OriginalSubject=$Subject;
$count=0;

echo "<center>\n<h2>".'REGULAR EXPRESSIONS: TEXT REPLACEMENT IN PHP'."</h2>\n";
echo '( Example by K.Zotos &nbsp -10/2007- )'."\n</center>\n<br><hr><br>\n";


for ($count2=0; $count2<count($keywords); $count2++) {

$Subject=$OriginalSubject; // Initialize our text after each replacement.


/* --------------------------------------- REMOVE EVERYTHING BEFORE THE BODY TAG -------------------------------------- */


// To avoid replace keywords in header.. we temporarily remove everything from document start until the body start tag ( <body )
// in other word removes anything that precedes the "<body" tag.
$Matches = array(0 => "");
//$Pattern = "/.+\/head\s*>/ims"; // This matches anything from document start until (including) the end of header
$Pattern = "/.+(?=<body)/ims"; // Matches anything until the "<body" start tag
$Result = preg_match($Pattern, $Subject, $Matches); // If text found from document start until the "<body" tag, $Matches[0] will contains the whole matched string

if ($count2==0) echo "<br>\n\n<!-- ------ TEXT EXCLUDED FROM REPLACEMENT (START) ------- -->\n".$Matches[0]."<!-- ------ TEXT EXCLUDED FROM REPLACEMENT (END) ------- -->\n\n"; // To see the removed text (everything before the body start tag)

if ($Result >0) $Subject=preg_replace($Pattern, "", $Subject); // Temporarily replace the contents before the body tag with "" empty string
// we have stored them in $Matches array..


/* ------------------------------------------- REPLACE ALL HEADING (<H1 - H6>) TAGS ----------------------------------------- */

// Replace all heading tags with a custom text - place holder. After keyword matching, will restore the heading texts in their initial positions
$Pattern="/<h\d.+<\/h\d>/im"; // The pattern to match the heading tags
$HeadingSets=array();
preg_match_All($Pattern, $Subject, $HeadingSets, PREG_SET_ORDER);
$Subject=preg_replace($Pattern,"*PLACE_HOLDER*", $Subject); // Replace the headings


// echo 'Headings Found: '.count($HeadingSets)."<br>\n";

$HeadingTags=array();
$HeadingElement="";
for($a=0;$a<count($HeadingSets);$a++){ // Each element of $HeadingSets array, is also an array with its first element to be the matchin heading
$HeadingElement=$HeadingSets[$a][0];
array_push($HeadingTags,$HeadingElement);
//echo $HeadingElement."<br>\n";
}

/* ---------------------------------------------- REPLACE KEYWORDS WITH LINKS ----------------------------------------------- */

$Extension='.php'; // File extension
$File='/'.$keywords[$count2].$Extension; // File name with a forward slash in fron of it (will used to compose keywords URLs)
$Replacement='<a href="'.$links[$count2].$File.'">'.$keywords[$count2].'</a>';

// $Pattern="/(^(<\s*head.+\/head\s*>))|(?<!$keywordList[$count])$keywords[$count2]/ims";

// The following pattern matches a keyword that not preceded by words in $keywordList + one space.
// Eg. if search for the keyword "program" and the "not match" word is "backup" then the "backup program" will not replaced..

$Pattern = "/(?<!$keywordList[$count]\s)$keywords[$count2]/ims"; // i: case insensitive search, m:multiline, s:dot-all (. dot char can be anything, even new line)


$Subject = @$Matches[0].preg_replace($Pattern, $Replacement, $Subject,1); // Concatenates the previously removed init text (header) with the replaced one.


/* ------------------------------------------- RESTORE ALL HEADING (<H1 - H6>) TAGS ----------------------------------------- */

// Replace the placeholder texts with the original heading tags as stored in the $HeadingTags array. For this we supply preg_replace
// with 2 arrays: the saved headings and another one with an equal number of pattern elements (the same string used previously as the placeholder)
$PlaceHolders=array();
$Pat="/\*PLACE_HOLDER\*/";
for($a=0;$a<count($HeadingTags);$a++) {
array_push($PlaceHolders, $Pat); // The $PlaceHolders array requires to have equal number of elements (patterns) to achieve a one by one replacement
}
// echo 'Patterns Count: '.count($PlaceHolders)."<br>\n";

$Subject=preg_replace($PlaceHolders, $HeadingTags, $Subject,1);


/* ------------------------------------------------------- PRINT OUT RESULTS ------------------------------------------------------ */


// Prepare and Output the results in HTML format.
$Text="\n<div style='color:#cc2211'><b><span style='color:#115566'>".' -------------- TEXT REPLACEMENT &nbsp Array Element: '.$count2.' => '.$keywords[$count2]." ------------------- </span></b> ";
$Text.= "\n\n<br><br>\n".'Headings Found: '.count($HeadingSets)."<br>";
$Text.= "\n".'Keyword Replaced: <b>'.$keywords[$count2].'</b><br>';
$Text.= "\n".'Not Replaced: <b>'.$keywordList[$count].' '.$keywords[$count2].'</b>';
$Text.= "<br>\n".'File Link: <b>'.$links[$count2].$File.'</b></div>';
$Text.="<br>\n\n".$Subject. "\n\n\n<br><hr><br><br>";

echo $Text;
}

?>

Finally, if you want to replace a certain keyword with a link only if that keyword link does not already exist, I think it is better to compare the keyword with the existing links in your array and proceed with the replacement only if it is not there, rather than trying to test this condition with a regex that has to play a double role (matching from one point and avoiding a match at the same time), since this makes things more complex. This is my opinion, of course :)

Good luck with the project!

nemesis_256
10-20-2007, 07:01 PM
Alright, I'm making progress. I have the header tags being ignored as well as the current keyword (different for every page). I'm ignoring both of those with the same code you have for ignoring the headers.

Now I need to ignore words that are within an anchor tag. In other words, one that has already been linked. I'm doing this with an exact match with the following if
if (!preg_match("/<a.+>$keywords[$count2]<\/a>/i", $fString))
That works only for the exact keyword, but if I have the keyword "best file sharing program" linked, it will still be found when it looks for "file sharing" later on. So I end up with something like
<a href="best-file-sharing-program.html">best <a href="file-sharing.html">file sharing</a> program</a>

I'm trying to fix this problem with the negative look behind and look ahead. So I have a regex like this
$fString = preg_replace("/(?<!>(\w)+\s)$keywords[$count2](?!\s(\w)+<\/a>)/im", "<a href=\"$links[$count2]\">$keywords[$count2]</a>", $fString, 1);
It works for when there's exactly one word on each side of the keyword so it matches "best file sharing program" (and doesn't link it) but it doesn't match something like "file sharing program", meaning the "file sharing" within that still gets linked.

So is there a way I can make the negative look around work to have either only the look behind, only the look forward, or both? Even then there's still the problem of having yet another word before or after the one that is linked.

nemesis_256
10-21-2007, 06:02 PM
Nevermind, I found a solution!

It's somewhat messy, but I'm replacing all the spaces within the keyword with weird characters that will never be in the text in that specific sequence anyway, and then I put them back after all the links are done.

Thank you again for your help! I probably wouldn't have been able to do this without it.
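
The actual code for this solution was not posted, so the following is only a guess at how the trick described above might look. The array contents, the marker byte and the longest-first ordering are all assumptions: once a phrase is linked, the spaces in its link text are swapped for a marker so that shorter multi-word keywords can no longer match inside it, and the spaces are restored at the end.

```php
<?php
// Assumed data: keyword => target file.
$links = [
    'file sharing'              => 'file-sharing.html',
    'best file sharing program' => 'best-file-sharing-program.html',
];
$marker = "\x01"; // assumed never to occur in the page text

// Longest keywords first, so a phrase is linked before its sub-phrases are tried.
uksort($links, function ($a, $b) {
    return strlen($b) - strlen($a);
});

$text = 'the best file sharing program beats plain file sharing';

foreach ($links as $keyword => $file) {
    // Shield the spaces of the link text so later keywords cannot match inside it.
    $shielded = str_replace(' ', $marker, $keyword);
    $anchor   = '<a href="' . $file . '">' . $shielded . '</a>';
    $text     = preg_replace('/\b' . preg_quote($keyword, '/') . '\b/i', $anchor, $text, 1);
}

// Put the real spaces back once every keyword has been processed.
$text = str_replace($marker, ' ', $text);
echo $text, "\n";
```

Note that this only protects keywords that contain a space; a single-word keyword such as "file" could still match inside an existing link or its href and would need a separate check.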

Kostas Zotos
10-22-2007, 01:16 PM
Nevermind, I found a solution!
Fine!

Thanks also for your kind words!