Click to See Complete Forum and Search --> : php or perl to import large tab delimited text files to mysql


thesailor
09-05-2007, 11:52 AM
Hey everyone... I am an old BBS guy and military geek turned real estate broker who tries too hard sometimes learning - but I do want to understand - so please bear with me... I am a rather visual learner... that said - I've got a basic on perl, xml, php, and mysql.... here is what my project is for the next couple of days...

Goal: Setup duplicate database structure (except using mysql on my servers) and data tables for about 15 tables. The provider of the data is a real estate related MLS service. The data is everything from the various listings (homes commercial etc) to office id's and other structures I might use in the future... I want to focus my learning on 2 or 3 right now...

1) data is provided to me via ftp in the form of tab delimited text files. It appears that for each table/data group - (for instance listings-residential.txt) is a straight dump as of like 4AM or so each day... file is .gz'd... this particular file has some 6,000 daily rows and about 80 collumns, many of which I am actually interested in saving long term... this file does not have any updates on which listings/rows were removed from day to day... this will need to be addressed in our script as well to keep my db accurate..

2) picture files associated with each of these listings are zipped on the same ftp server daily, the past 4 to 7 days of ziped pictures are available, noted in the zip file name itself is the date it was zipped... as well as a pictures-old.gz that is about 750MB containing all the older pics we need to grab and squeeze into a server directory for our scripts to search... the pictures are jpg mostly and are in the format of MLS#(relating to the mls # of the listing)x i.e. main picture is MLS###### .jpg then additional pictures (if provided) are MLS######A.jpg, MLS######B.jpg etc... this provider allows upto an additional 8 pictures so the last in the series is H... ( I mention all this as below I will ask a better way to concantonate the MLSprefix and trailing A-H as appropriate on the actual number part of the MLS number we select from the tables we create with the actual mls data part)....

3)I have a lot of ideas - and I know there are a lot of long term DB maintenance items to not overlook... here is what I have seen done elsewhere and what I am looking at - basically I want a cron script to run daily to grab these txt files and picture files, extract them to appropriate directory and parse the text file import to existing mysql databases.. but I do not yet have the table structures all setup right - I do have a text file from the provider that shows the various collumn names and categories as well as lenght... so far I have just been using manual table creation - very slow - and setting each collumn as varchar and 255... perhaps I should be looking for better ways there too...

So recap - I need to setup the tables from the txt files first... then write script to download and extract then import to mysql... if we get that far I will probably ask to hire ya for a short while to write some more stuff with me to actually use the data once we get it setup and storing the data daily... like I have the ability to receive data dumps from the county offices - I would like to import that data to a database, and offer web interface for users searching the mls data to also cross associate to the corresponding tax data - i.e. the script when displaying mls info will try to automatically associate mls address as given with the data we receive from the county office - if no exact match is found, then offer "near match or possible drop own selection to end user to pick or to browse/search to get correct data from the county)... but thats a ways down the road yet...

Any takers on this project of mine who don't mind taking donations now and then and can help me learn a little along the way?

I will log in from dev pc to post code in a moment for your review...

I am a bit lost as to what is the better way to go - bash script to get the files and extract to proper directory or just use perl or perhaps php? I am game for learning new things... been looking at ruby and python... but I am certainly no expert in any of them...

thesailor
09-05-2007, 12:33 PM
this appears to do a lot of what I want... care to give feedback on it? I also have seen a bash script for the same... I will post here for review...

This is Google's cached copy of mysql_txt_import_last.txt from
http://data.getafreelancer.com/project/23338/mysql_txt_import_last.txt

Google is neither affiliated with the authors of this page nor responsible for its content.



http://data.getafreelancer.com/project/23338/

br.sql
mysql_txt_import_last.txt

#!/usr/bin/perl -w
use strict;
use DBI;
use CGI;
use Net::SMTP;
my @files;
my @dbtables;

####### CONFIG Section Variables ###########################
my $debug='yes'; # yes/no if dubug = yes if you want status of execution of script to be shown throught web.
my $dbname='kathi_jaykali'; # mysql database name
my $host='localhost'; # ip address or host of Msql server
my $port= '3306'; # mysql port
my $user= 'kathi_jaykali'; # mysql user
my $pass= '*******'; # mysql password
my $ftphost= 'idx.fnismls.com'; # ftp host for getting the files
my $ftp_path = '/idx'; # ftp path to the files
my $ftp_user = 'carmls_bollsmedia'; # ftp user
my $ftp_pass = '*******'; # ftp_pass
$files[0] = 'users.txt.gz'; # Users file
$dbtables[0] = 'users'; # SQL Table for users
$files[1] = 'features.txt.gz'; # Features file
$dbtables[1] = 'Features'; # SQL Table for Features
$files[2] = 'featurecategories.txt.gz'; # Features Cat. file
$dbtables[2] = 'FeatureCategories'; # SQL Table for Features Cat.
$files[3] = 'organizations.txt.gz'; # Organization File
$dbtables[3] = 'Organizations'; # SQL Table for Organization
$files[4] = 'Listings-Multi-Family.txt.gz'; # Multi-Family File
$dbtables[4] = 'Multi Family'; # SQL Table for Multi-Family
$files[5] = 'Listings-Residential.txt.gz'; # Residential File
$dbtables[5] = 'Residential'; # SQL Table for Residential
$files[6] = 'Listings-Commerical - Industrial.txt.gz'; # Commerical - Industrial File
$dbtables[6] = 'Commercial Industrial'; # SQL Table for Commerical - Industrial
$files[7] = 'Listings-Vacant Land.txt.gz'; # Vacant Land File
$dbtables[7] = 'Vacant Land'; # SQL Table for Vacant Land
my @tar_names = ('/pics-residential-','/pics-multi-family-','/pics-vacant land-','/pics-commerical - industrial-');
my $pics_path = '/www/kathimerritt.com/pics'; # path for storing pictures
my $mailhost = 'localhost'; # SMTP address, if it's empty there will be no mail report
my $mailto = '*******@gmail.com'|| $ENV{USER};# recieving mail address for mail report of status
my $mailfrom = 'admin@kathimerritt.com' || $ENV{USER}; # sender mail address optional
my $subject = 'csv2sql report'; # mail subject

########################### END CONFIG ##############



my $smtp;
if (defined ($mailhost) and $mailhost ne ''){
$smtp = Net::SMTP->new($mailhost);
$smtp->mail($mailfrom);
$smtp->to($mailto);
$smtp->data();
$smtp->datasend("To: $mailto\n");
my $time = localtime(time);
$smtp->datasend("Subject: $subject $time\n");
}
my $err=0;
my $sth;
my $dbh = DBI->connect ("DBI:mysql:dbname=$dbname;host=$host;
port=$port", $user, $pass,{
PrintError => 0,
AutoCommit => 0
}) or $err="$DBI::errstr";
if (defined ($mailhost) and $mailhost ne '' and $err ne '0'){
$smtp->datasend("\n Error Can't connect to mysql : $err");
$smtp->dataend();
$smtp->quit;

}
if ($debug eq 'yes'){
my $cgi = new CGI();
print $cgi->header (-expires => 'now');
print "<html><body>\n";
if ($err ne '0'){
print "<strong>Error Can't connect to mysql : $err</strong><br />\n";
print "</body></html>\n";
exit
} else {
print "<strong>Connect to mysql OK</strong><br />\n";
}
}

my $n=0;
foreach my $table (@dbtables){
$err='';
$sth=$dbh->do ("DELETE FROM `$table` ",undef) or $err=$DBI::errstr;
$dbh->commit;
if ($err eq ''){
if (defined ($mailhost) and $mailhost ne ''){
$smtp->datasend("\n$table is empty");
}
if ($debug eq 'yes'){
print "<strong>$table is empty</strong><br />\n";
}
} else {
if (defined ($mailhost) and $mailhost ne ''){
$smtp->datasend("\n$table delete error: $err");
}
if ($debug eq 'yes'){
print "<strong>$table delete error: $err</strong><br />\n";
}
}
}
$err=0;

foreach my $file (@files){
if ($n == 0){
$sth=$dbh->prepare ("INSERT INTO `$dbtables[$n]` VALUES(?,?,?,?,?)");
} elsif ($n == 1) {
$sth=$dbh->prepare ("INSERT INTO `$dbtables[$n]` VALUES(?,?,?)");
} elsif ($n == 2) {
$sth=$dbh->prepare ("INSERT INTO `$dbtables[$n]` VALUES(?,?,?)");
} elsif ($n == 3) {
$sth=$dbh->prepare ("INSERT INTO `$dbtables[$n]` VALUES(?,?,?,?,?,?,?,?)");
} elsif ($n == 4) {
$sth=$dbh->prepare ("INSERT INTO `$dbtables[$n]` VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?, ?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
} elsif ($n == 5) {
$sth=$dbh->prepare ("INSERT INTO `$dbtables[$n]` VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?, ?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?, ?,?,?,?,?,?,?,?,?,?,?,?)");
} elsif ($n == 6) {
$sth=$dbh->prepare ("INSERT INTO `$dbtables[$n]` VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?, ?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
} else {
$sth=$dbh->prepare ("INSERT INTO `$dbtables[$n]` VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?, ?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
}
my $cmd="wget -q -O - ftp://$ftp_user\:$ftp_pass\@$ftphost$ftp_path"."/\"$file\" 2>/dev/null | gzip -dc 2>/dev/null";
my @array=qx/$cmd/;
print @array;
if (defined ($array[0]) and $array[0] ne ''){
if (defined ($mailhost) and $mailhost ne ''){
$smtp->datasend("\n$file was extracted OK");
}
if ($debug eq 'yes'){
print "<strong>$file was extracted OK</strong><br />\n";
}
foreach my $tmp (@array){
my @arr=split ("\t",$tmp);
my @tmp_arr;
if ($n < 4){
@tmp_arr= @arr;
} elsif ($n == 4){
@tmp_arr= (@arr[0..7], $arr[9], @arr[11..31], $arr[34], $arr[40], @arr[48..50], $arr[52], $arr[53], @arr[55..64], $arr[67], $arr[70], @arr[73..75], $arr[77], $arr[81], $arr[82], $arr[95], $arr[96]);
} elsif ($n == 5){
@tmp_arr= (@arr[0..7], $arr[9], @arr[11..31], $arr[34], $arr[40], @arr[48..52], $arr[54], $arr[55], $arr[57], $arr[58], @arr[60..83], $arr[85], @arr[87..90], $arr[103],$arr[113]);
} elsif ($n == 6){
@tmp_arr= (@arr[0..7], $arr[9], @arr[11..31], $arr[34], $arr[40], $arr[48], @arr[51..53], @arr[55..56], @arr[58..63], @arr[67..68], $arr[81], $arr[82]);
} else {
@tmp_arr= (@arr[0..7], $arr[9], @arr[11..31],$arr[34], $arr[40], $arr[49], @arr[51..52], @arr[54..64], @arr[66..67], $arr[80], $arr[81]);
}
$sth->execute (@tmp_arr) or $err=$DBI::errstr;
if (defined ($mailhost) and $mailhost ne '' and $err ne '0'){
$smtp->datasend("\n$file import error : $err");
}
if ($debug eq 'yes'){
if ($err ne '0'){
print "<strong>$file import error : $err</strong><br />\n";
}
tar -xvC }

$err=0;
}
} else {
if (defined ($mailhost) and $mailhost ne ''){
$smtp->datasend("\n$file wasn't extracted");
}
if ($debug eq 'yes'){
if ($err ne '0'){
print "<strong>$file wasn't extracted</strong><br />\n";
}
}
}
$n++;
}
my @tmpdate=localtime(time);
my $year=$tmpdate[5] + 1900;
my $month=$tmpdate[4] + 1;
my $day=$tmpdate[3] - 1;
foreach my $name (@tar_names){
if ( -e $pics_path$name){
$name.="$year$month$day.tar";
my $cmd="wget -q -O - ftp://$ftp_user\:$ftp_pass\@$ftphost$ftp_path$name 2>/dev/null| tar -xvC $pics_path 2>/dev/null";
my $status=qx/$cmd/;
} else {
my $f=$name;
$name.='old.tar';
my $cmd="wget -q -O - ftp://$ftp_user\:$ftp_pass\@$ftphost$ftp_path$name 2>/dev/null| tar -xvC $pics_path 2>/dev/null";
my $status=qx/$cmd/;
if (defined($status) and $status ne ''){
qx/touch $pics_path$f/;
}
}
if (defined ($mailhost) and $mailhost ne ''){
if (!defined($status) ){
$smtp->datasend("\nThere is no new pictures");
} elsif ( $status eq ''){
$smtp->datasend("\n$name wasn't extracted");
} else {
$smtp->datasend("\n$name was extracted OK");
}
$smtp->dataend();
$smtp->quit;
}
if ($debug eq 'yes'){
if (!defined($status)){
print "<strong>There is no new pictures</strong><br />\n";
} elsif ( $status eq ''){
print "<strong>$name wasn't extracted</strong><br />\n";
} else {
print "<strong>$tar_file was extracted OK</strong><br />\n";
}
print "</body></html>\n";
}
}
if ($debug eq 'yes'){
print "</body></html>\n";
}
$dbh->commit;
$dbh->disconnect;