Click to See Complete Forum and Search --> : Similar string searching to eliminate duplicates from database


Psytherium
06-21-2007, 06:32 PM
Hello,

I have a database of about 16,000 apartments that are for the most part user generated entries. Now I have another database of apartments that I have purchased of about 115,000 apartments. I want to add in these new entries to the old one, but I am concerned about duplicates. As you can imagine with the original content being user generated, there are some typos, non-official names, etc. I've generated this script to try and identify duplicates:

<?php
$states = array("'AK'", "'AL'", "'AR'", "'AZ'", "'CA'", "'CO'", "'CT'", "'DC'", "'DE'", "'FL'", "'GA'", "'HI'", "'IA'", "'ID'", "'IL'", "'IN'", "'KS'", "'KY'", "'LA'", "'MA'", "'MD'", "'ME'", "'MI'", "'MN'", "'MO'", "'MS'", "'MT'", "'NC'", "'ND'", "'NE'", "'NH'", "'NJ'", "'NM'", "'NV'", "'NY'", "'OH'", "'OK'", "'OR'", "'PA'", "'PR'", "'RI'", "'SC'", "'SD'", "'TN'", "'TX'", "'UT'", "'VA'", "'VT'", "'WA'", "'WI'", "'WV'", "'WY'");
$alpha = array("'a'", "'b'", "'c'", "'d'", "'e'", "'f'", "'g'", "'h'", "'i'", "'j'", "'k'", "'l'", "'m'", "'n'", "'o'", "'p'", "'q'", "'r'", "'s'", "'t'", "'u'", "'v'", "'w'", "'x'", "'y'", "'z'");
$db = mysql_connect ('localhost', '******', '********')
or die ("Unable to connect to Database Server");

mysql_select_db ('dev01', $db)
or die ("Could not select database");

$start = microtime(true);
$count = 0;
for($i = 0; $i < 50; $i++)
{
for( $j = 0; $j < 26; $j++)
{
$query = "SELECT apartmentname, apartmentcity, apartmentstate FROM apartmentinfo WHERE apartmentstate = {$states[$i]} AND LEFT(apartmentname,1) like {$alpha[$j]}";

$result = mysql_query($query, $db) // WHERE LEFT(apartmentname,1) like 'z' LIMIT
or die("Invalid query: " . mysql_error());

$num=mysql_numrows($result);
$array = array( array() );
for( $k=0; $k< $num; $k++ ) {
$array[$k] = mysql_fetch_array( $result, MYSQL_ASSOC );
}


for( $k=0; $k< $num-1; $k++ ) {
for( $l=$k+1; $l< $num; $l++ ) {
if($array[$k]['apartmentstate'] == $array[$l]['apartmentstate']
&& $array[$k]['apartmentcity'] == $array[$l]['apartmentcity'])
{
$percent = 0.0;
similar_text($array[$k]['apartmentname'], $array[$l]['apartmentname'], $percent);
if($percent > 80)
$count++;
}
}
}
}
}
echo $count;
echo "<br />";
echo (microtime(true) - $start);
?>

Am I on the right track, or should I go about this differently? I would love to make this as accurate as possible so I can automate some of the elimination of duplicates, or else I will just have to settle with flagging possible duplicates and changing them myself.

Thanks!

Sheldon
06-21-2007, 11:26 PM
try this, it is untested.

<?php

mysql_connect ('localhost', '******', '********') or die ("Unable to connect to Database Server");

mysql_select_db ('dev01') or die ("Could not select database");

$count = 0;
$delete = 0;
$query = "SELECT * FROM apartmentinfo ORDER BY apartmentname ASC";
$result = mysql_query($query) or die ("Invalid query: " . mysql_error());
$number = mysql_num_rows($result);

while($row = mysql_fetch_assoc($result)){
while($new = mysql_fetch_assoc($result)){ /* If teh two record databases are in a different table then you will need to create a different query. */
$count = ($count + 1);
if(($row['apartmentname'] == $new['apartmentname']; ) and
($row['apartmentcity'] == $new['apartmentcity']; ) and
($row['apartmentstate'] == $new$row['apartmentstate'];)){
// We have a duplicate
mysql_query("DELETE FROM apartmentinfo WHERE id= '{$new['id']}' LIMIT 1") or die ("Could not delete duplicate record." . mysql_error());
/* The line above assumes you have set a unique id for each record. */
echo($row['apartmentname'] . " matches. Second record deleted");
$delete = ($delete + 1);

}else{
echo("No match for: " . $row['apartmentname']);
}

}
echo("Completed<br />{$count} records found. <br />{$delete} records deleted");
?>

Psytherium
06-22-2007, 07:15 AM
Thanks for the reply!

I've taken care of all of the duplicates that are exact matches and now I need to take care of the duplicates that are near exact matches. Here are a couple of examples:

Houston Apartments
Huoston Apartments

Sun & Sky Apartments
Sun and Sky Apartments

University Park Apartments
University Pk Apartments

Stuff like that. That's why I'm using similar_text(). If anyone has any ideas, suggestions, algorithms for me to look into, criticism, or anything, please feel free to respond!

Thanks,
Jon