extract matches


buggy
09-02-2004, 05:16 AM
I have taken in a web page and I would like to extract all the data that appears between > and < (names etc.): no commas, quotes, colons etc., just text. I have also removed the spaces.

I am using the following code with no success:

if (($filestring) = ($filestring =~ /(\>\w+\<)/)) {

$comp_string = $filestring;
open(fileout,">> NOutput.txt") or dienice("Couldn't open Output.txt for writing: $!");
flock(fileout,2);
seek(fileout,0,2);
print fileout $comp_string,"\n";
close(fileout);
$filestring = substr($filestring, index($filestring,($filestring =~ /(\>\w+\<)/))+6, length($filestring))
}
while(($filestring) = ($filestring =~ /(\>\w+\<)/))
{
$comp_string = $filestring;
open(fileout,">> NOutput.txt") or dienice("Couldn't open Output.txt for writing: $!");
flock(fileout,2);
seek(fileout,0,2);
print fileout $comp_string,"\n";
close(fileout);
$filestring = substr($filestring, index($filestring,($filestring =~ /(\>\w+\<)/))+6, length($filestring))
}

Any help would be great!!

cyber1
09-02-2004, 11:04 PM
I'm not sure why you have such an elaborate routine; a few more details about what you are trying to accomplish would be helpful.

Try the following:
$filestring =~ m/<.*?>(.*?)<.*?>/;

or if you want all of a given tag use:
@hrefs=($text =~ m|href\s*=\s*\"([^\"]+)\"|ig);

Here $text holds the contents of a file that was read in as an array and then joined into a single string.
The match then grabs all hrefs.

-Bill
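
For reference, the original loop assigns the first capture back to $filestring, so the rest of the page is thrown away after one match. A minimal sketch of the /g list-context approach suggested above, applied to the >word< pattern from the question, might look like the following. The sub name extract_names and the sample markup are assumptions made for illustration; they are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every run of word characters that sits
# directly between a '>' and a '<'.
sub extract_names {
    my ($html) = @_;
    # In list context a /g match returns all captures at once, so no
    # explicit loop, index() or substr() bookkeeping is needed.
    my @names = $html =~ />(\w+)</g;
    return @names;
}

# Made-up sample input; in the question, $filestring holds the fetched page.
my $filestring = '<td><b>Acme</b></td><td>Initech</td><p>not, a word</p>';

# "not, a word" is skipped because \w+ does not match commas or spaces.
print "$_\n" for extract_names($filestring);
```

To write to NOutput.txt as in the question, the file could be opened once before the print loop instead of once per match.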

buggy
09-03-2004, 05:46 AM
I am trying to get and print out all the company names from a web page (http://jobsearch.monster.ie/jobsearch.asp?brd=1&cy=IE&lid=955&fn=1&q=&sort=rv&vw=d). They are all in the same format.

Charles
09-03-2004, 12:51 PM
HTML is far too complicated for you to crack with a regular expression. You need a parser - HTML::Parser (http://www.perldoc.com/perl5.6/lib/HTML/Parser.html).
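
As a rough illustration of the parser route, here is a minimal sketch using the HTML::Parser version 3 handler API to collect the text between tags. It assumes HTML::Parser is installed from CPAN. The sub name text_segments and the sample markup are hypothetical, and picking out only the company names would still depend on the actual structure of the page, which is not shown in the thread.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns each non-empty piece of text found
# between tags, with entities decoded and outer whitespace trimmed.
sub text_segments {
    my ($html) = @_;
    my @segments;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the handler the entity-decoded text.
        text_h => [
            sub {
                my ($text) = @_;
                $text =~ s/^\s+|\s+$//g;
                push @segments, $text if length $text;
            },
            'dtext',
        ],
    );
    # Report each run of text in one piece rather than in chunks.
    $parser->unbroken_text(1);

    $parser->parse($html);
    $parser->eof;
    return @segments;
}

# Made-up sample row, not taken from the real page.
print "$_\n"
    for text_segments('<td><a href="/job/1">Acme Ltd</a></td><td>Dublin</td>');
```

Unlike the >\w+< pattern, this keeps text that contains spaces or punctuation, and it is not confused by attribute values that contain > or <.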