Problem with search function


*Tom*
12-17-2008, 09:49 AM
I'm trying to write a simple text search that ignores any non-alphanumeric characters. For example, I don't want a user to enter something like "find this" and not get a match because the string that's being searched contains something like "find-this", so I remove any unwanted characters from both strings before looking for a match. The problem is that I'd like to display the string afterwards with the matched part highlighted, and I want to display the original string, not the version with the characters removed.

Here's a simplified version of the code:


$string = 'the term "find-this" is contained within this string';
$search_term = 'find this';

$string =~ s/[^a-z0-9]//gi; # 'thetermfindthisiscontainedwithinthisstring'
$search_term =~ s/[^a-z0-9]//gi; # 'findthis'

if($string =~ /$search_term/i)
{
$string =~ s/($search_term)/\<b>$1<\/b>/gi; # what gets printed: 'theterm<b>findthis</b>iscontainedwithinthisstring'
print $string; # what I want printed: 'the term "<b>find-this</b>" is contained within this string'
}
else
{
print 'No match was found';
}

The only way I've managed to get what I wanted is, instead of removing the unwanted characters I replace them with a single character (in this example I've used a period) so the search can still find a match. I hold a list of all the removed characters in another variable so I can put them all back in place of the periods after the search.


$removedchars = $string;
$removedchars =~ s/[a-z0-9]//gi; # only the non-alphanumeric characters remain: '  "-"     '
$string =~ s/[^a-z0-9]/./gi; # 'the.term..find.this..is.contained.within.this.string'
$search_term =~ s/[^a-z0-9]/./gi; # 'find.this'

if($string =~ /$search_term/i)
{
$string =~ s/($search_term)/\<b>$1<\/b>/gi;
@string_chars = split(//, $string);
@removed_chars = split(//, $removedchars);

for(my $i=0,$j=0;$i<@string_chars;$i++)
{
$string_chars[$i] = $removed_chars[$j++] if($string_chars[$i] eq '.')
}
$string = join('', @string_chars);
print $string;
}
else
{
print 'No match was found';
}

This works, but it seems way over the top for something relatively simple. I'll be looking for matches within a few hundred lines of text with each search the user performs, so I'd rather have a more efficient method. Can anyone suggest a better way of doing this?
Thanks for any help.

dragle
12-19-2008, 12:00 PM
How 'bout:
#!/usr/bin/perl

use strict;
use warnings;

my $test_string = 'This string has "find-this" within it.';
my $fragment = 'find this';

(my $frag_comp = $fragment) =~ s%[^a-zA-Z0-9]+%\[\^a\-zA\-Z0\-9\]\+%g;
if ($test_string =~ /$frag_comp/) {
(my $print_string = $test_string) =~ s%($frag_comp)%<b>$1</b>%g;
print $print_string, "\n";
}
else {
print 'Term not found.', "\n";
}
I haven't benchmarked it, so I can't say whether it's any more or less efficient than what you have. The point is that, instead of a dot, you replace the characters in the search term that you don't care about with the pattern itself. Then just compare the text to the resulting pattern (no need to modify the text itself, so no need to worry about tracking the original contents). I'm assuming here that you want to match one or more non-alphanumeric characters in a row; if that's incorrect, just take the pluses out of the regex.

HTH,
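
A minimal sketch of the same idea packaged for reuse, assuming the search runs over many lines per query: the pattern is built once with qr// and then applied to each line. The sub names build_loose_pattern and highlight are made up for this illustration, and the i flag is an assumption carried over from Tom's original code (the snippet above matches case-sensitively).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a search term into a compiled pattern in which
# any run of non-alphanumeric characters matches any other such run.
sub build_loose_pattern {
    my ($term) = @_;
    # Keep only the alphanumeric chunks of the term.
    my @chunks = grep { length } split /[^a-zA-Z0-9]+/, $term;
    return unless @chunks;    # nothing searchable in the term
    my $pattern = join '[^a-zA-Z0-9]+', map { quotemeta($_) } @chunks;
    return qr/$pattern/i;     # case-insensitive: an assumption, see above
}

# Hypothetical helper: wrap every match in <b>...</b> on a copy of the text,
# so the original text and its punctuation are left untouched.
sub highlight {
    my ($text, $re) = @_;
    (my $copy = $text) =~ s{($re)}{<b>$1</b>}g;
    return $copy;
}

# Compile once, then reuse the pattern for every line being searched.
my $re = build_loose_pattern('find this');
for my $line ('the term "find-this" is here', 'nothing relevant') {
    print highlight($line, $re), "\n" if $line =~ $re;
}
```

Unlike the strip-everything approach in the first post, this pattern needs at least one separator between the chunks, so 'findthis' would not match 'find this'. Swapping the + for * in the joined pattern would allow that, if it is wanted.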

Nedals
12-19-2008, 02:42 PM
Here's another way to do it:

my $string = 'the term find-this is contained within this string';
my $search_term = 'find this';

# In search term, convert any non-alphanumeric chars to '.' (match any char)
$search_term =~ s/\W/\./g;

if($string =~ /$search_term/i) {
$string =~ s/($search_term)/<b>$1<\/b>/gi;
print "$string\n"; # prints 'the term <b>find-this</b> is contained within this string'
}
else {
print "No match was found\n";
}
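
One thing worth keeping in mind with the dot approach: an unescaped . matches exactly one character of any kind, letters and digits included. A small sketch comparing it with a character-class variant follows; the test strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $term = 'find this';

# Dot approach: each non-word character becomes '.' (any single character).
(my $dot = $term) =~ s/\W/./g;    # 'find.this'

# Class approach (assumption for comparison): each run of non-alphanumerics
# becomes a class that matches one or more non-alphanumerics.
(my $class = $term) =~ s/[^a-zA-Z0-9]+/[^a-zA-Z0-9]+/g;

for my $text ('find-this', 'findXthis', 'find - this') {
    printf "%-12s dot: %-3s class: %s\n", $text,
        ($text =~ /$dot/i   ? 'yes' : 'no'),
        ($text =~ /$class/i ? 'yes' : 'no');
}
```

Here the dot pattern accepts 'findXthis' but rejects 'find - this', while the class pattern does the opposite. Which behaviour is right depends on what the search is meant to tolerate.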

*Tom*
12-22-2008, 04:24 AM
Hi, thanks for your replies, the solution looks so obvious now :rolleyes:. You're right about me wanting to match more than one non-alpha character; that only occurred to me after I'd posted, but I couldn't work out how to edit my post (if that's even possible).

An unrelated question:

print "No match was found\n";
print 'No match was found', "\n";

Is one way any better than the other?

dragle
12-22-2008, 10:14 AM
print "No match was found\n";
print 'No match was found', "\n";

Is one way any better than the other?

I did that out of habit (not always a good one). Single quotes can be faster in many cases because they don't have to be interpolated (whereas double quotes do). But in this case the simpler:
print "No match was found.\n";
is the faster statement according to benchmarks. But note that in this example:
my $nf = 'No match was found.';
print $nf, "\n";
print "$nf\n";
the first print statement is the faster one.

Cheers!
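
For anyone who wants to check these timing claims on their own setup, here is a rough sketch using the core Benchmark module. The labels are made up, output is sent to the null device so the terminal isn't flooded, and results will vary with Perl version and hardware, so treat the numbers as indicative only.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Spec;

# Send the benchmarked output to the null device.
open my $null, '>', File::Spec->devnull
    or die "Cannot open null device: $!";

my $nf = 'No match was found.';

# Run each variant for at least one CPU second and print a comparison table.
my $rows = cmpthese(-1, {
    one_string => sub { print {$null} "No match was found.\n" },
    two_args   => sub { print {$null} 'No match was found.', "\n" },
    var_list   => sub { print {$null} $nf, "\n" },
    var_interp => sub { print {$null} "$nf\n" },
});

close $null;
```

The differences between these variants are likely to be tiny next to the cost of the regex search itself, so readability is a reasonable tie-breaker.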