parsing


LLuaP
05-25-2005, 04:18 PM
It's always the seemingly simple problems that get you. I'm still a newbie, so this is probably peanuts for the monks here.

Background
First, I am using ActivePerl 5.8 on a WinXP machine. I have been coding in Perl for about a week now so I'm still very green.

The Problemo
I'm having a problem parsing a long text with a repeating pattern. I want to be able to extract some strings from a LONG string similar to this one:

$string = 'xxxxxgoodooo[foo 1bar]999,999SOMEWORDS[/foo bar]xxxgoodooo[foo 2bar]ooooo[/foo bar]xxxgoodooo[foo 3bar]ooooo[/foo bar]';

where:
"xxx" could be any alphanumeric in any amount.
"ooo" could be any alphanumeric in any amount.
"bar" could be any alphanumeric in any amount.

-----------
I want to extract and print only the blocks matching the pattern (goodooo[foo ] [/foo ]) whose contents include both numbers and alphabetic characters. In essence, I want to extract something like:

goodooo[foo bar]999,999 AND SOME ALPHABETICAL[/foo bar]

but NOT:

goodooo[foo bar]ALPHABETICAL ONLY[/foo bar]



I am able to extract the strings with ALPHABETICAL using \w with this code:


while($string =~ m/(good\w+\[foo\s+\w+]\w+\[\/foo\s+\w+])/ig)
{
print $1, "\n";
print "printed the ones without numbers in the middle!\n";
}



But when I try extracting the one with NUMBERS & ALPHABETICAL using a combination of \d and \w, I get nothing. Here is my code for that:


while($string =~ m/(good\w+\[foo\s+\w+]\d+\w+\[\/foo\s+\w+])/ig)
{
print $1, "\n";
print "printed the ones WITH numbers in the middle\n";
}


Summary

So my problem is that when I try to match a mixture of alphabetical and numeric, it doesn't extract it into the $1 variable. I thought \w represented alphanumeric.

What am I doing wrong? When I try to extract the numbers themselves I have no problem, but for some reason when I mix the numbers with alphabetical characters there's no match in that string. Could it be a bug in ActivePerl or in my reasoning? =)

I would appreciate any help. Thanks for reading!
LLuaP

Charles
05-25-2005, 06:17 PM
#!c:\perl\bin\perl.exe

use strict;

my $string = 'xxxxxgoodooo[foo 1bar]999,999SOMEWORDS[/foo bar]xxxgoodooo[foo 2bar]ooooo[/foo bar]xxxgoodooo[foo 3bar]ooooo[/foo bar]';

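# In list context, m//g returns every match. [\w,] allows the comma that \w alone rejects
# (\d inside the class is redundant, since \w already covers digits).
print "$_\n" foreach ($string =~ m|goodooo\[foo \d+bar\][\w,]+\[/foo bar\]|g);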

Nedals
05-25-2005, 11:53 PM
"I am able to extract the strings with ALPHABETICAL using \w with this code:"
\w will match [a-zA-Z0-9_]. So, in this case, you don't need the \d unless you want to match only [0-9].

Neither, however, will match a comma
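
To make that concrete, here is a minimal sketch (not from the original posts) of why \d+\w+ finds nothing and one way to let the comma through. The helper name blocks_with_digits and the lookahead are my own assumptions about what was wanted: keep only the blocks whose body contains at least one digit.

```perl
use strict;
use warnings;

my $string = 'xxxxxgoodooo[foo 1bar]999,999SOMEWORDS[/foo bar]xxxgoodooo[foo 2bar]ooooo[/foo bar]xxxgoodooo[foo 3bar]ooooo[/foo bar]';

# The original attempt: \d+ takes "999", then \w+ hits the comma and fails,
# and the other two blocks have no digits at all, so nothing matches.
my @original = $string =~ m/(good\w+\[foo\s+\w+]\d+\w+\[\/foo\s+\w+])/ig;
print "original pattern matched ", scalar(@original), " block(s)\n";

# Hypothetical helper: return every good...[foo ...]BODY[/foo ...] block
# whose BODY contains at least one digit.
sub blocks_with_digits {
    my ($text) = @_;
    return $text =~ m{
        (                       # capture the whole block
          good\w+               # "good" followed by word characters
          \[foo\s+\w+\]         # opening tag, e.g. [foo 1bar]
          (?=[^\[]*\d)          # body must contain a digit before the next "["
          [\w,]+                # body: word characters or commas
          \[/foo\s+\w+\]        # closing tag, e.g. [/foo bar]
        )
    }xig;
}

print "$_\n" for blocks_with_digits($string);
```

With the sample string from the question, this should print only the first block, the one containing 999,999SOMEWORDS. If the real data can contain other punctuation in the body, the [\w,] class would need to be widened accordingly.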

LLuaP
05-26-2005, 11:44 PM
Fantastic! Thank you so much, Nedals! This is the line that ended up doing the whole grueling parsing job for me:

print OUTPUT foreach ($string =~ m|(consolidated.{1,500}<table.*?>.+?</table.*?>)|sig);

I LOVE Regex and Perl!! Shazam!
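
For anyone reading along later, here is a small sketch of how a pattern like that behaves. The input below is made up for illustration (the real document was never posted), and I print to STDOUT instead of the OUTPUT filehandle so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical input; only the shape matters here.
my $html = <<'END';
<table><tr><td>unrelated</td></tr></table>
<p>Consolidated results
for the quarter</p>
<TABLE border="1">
<tr><td>42</td></tr>
</TABLE>
END

# /s lets "." match newlines, /i ignores case, /g collects every match.
# ".*?" and ".+?" are non-greedy, so the match ends at the first closing
# table tag it can reach. ".{1,500}" is greedy, so it runs forward to the
# last "<table" it can find within 500 characters.
my @tables = $html =~ m|(consolidated.{1,500}<table.*?>.+?</table.*?>)|sig;

print scalar(@tables), " match(es)\n";
print "$_\n" for @tables;
```

The table that appears before the word "Consolidated" is skipped, because the match can only begin at that word. One thing to watch: since .{1,500} is greedy, two tables within 500 characters of the keyword would both end up inside a single match.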