thesailor
09-05-2007, 11:52 AM
Hey everyone... I am an old BBS guy and military geek turned real estate broker who tries too hard sometimes learning - but I do want to understand - so please bear with me... I am a rather visual learner... that said - I've got a basic on perl, xml, php, and mysql.... here is what my project is for the next couple of days...
Goal: Setup duplicate database structure (except using mysql on my servers) and data tables for about 15 tables. The provider of the data is a real estate related MLS service. The data is everything from the various listings (homes commercial etc) to office id's and other structures I might use in the future... I want to focus my learning on 2 or 3 right now...
1) data is provided to me via ftp in the form of tab delimited text files. It appears that for each table/data group - (for instance listings-residential.txt) is a straight dump as of like 4AM or so each day... file is .gz'd... this particular file has some 6,000 daily rows and about 80 collumns, many of which I am actually interested in saving long term... this file does not have any updates on which listings/rows were removed from day to day... this will need to be addressed in our script as well to keep my db accurate..
2) picture files associated with each of these listings are zipped on the same ftp server daily, the past 4 to 7 days of ziped pictures are available, noted in the zip file name itself is the date it was zipped... as well as a pictures-old.gz that is about 750MB containing all the older pics we need to grab and squeeze into a server directory for our scripts to search... the pictures are jpg mostly and are in the format of MLS#(relating to the mls # of the listing)x i.e. main picture is MLS###### .jpg then additional pictures (if provided) are MLS######A.jpg, MLS######B.jpg etc... this provider allows upto an additional 8 pictures so the last in the series is H... ( I mention all this as below I will ask a better way to concantonate the MLSprefix and trailing A-H as appropriate on the actual number part of the MLS number we select from the tables we create with the actual mls data part)....
3)I have a lot of ideas - and I know there are a lot of long term DB maintenance items to not overlook... here is what I have seen done elsewhere and what I am looking at - basically I want a cron script to run daily to grab these txt files and picture files, extract them to appropriate directory and parse the text file import to existing mysql databases.. but I do not yet have the table structures all setup right - I do have a text file from the provider that shows the various collumn names and categories as well as lenght... so far I have just been using manual table creation - very slow - and setting each collumn as varchar and 255... perhaps I should be looking for better ways there too...
So recap - I need to setup the tables from the txt files first... then write script to download and extract then import to mysql... if we get that far I will probably ask to hire ya for a short while to write some more stuff with me to actually use the data once we get it setup and storing the data daily... like I have the ability to receive data dumps from the county offices - I would like to import that data to a database, and offer web interface for users searching the mls data to also cross associate to the corresponding tax data - i.e. the script when displaying mls info will try to automatically associate mls address as given with the data we receive from the county office - if no exact match is found, then offer "near match or possible drop own selection to end user to pick or to browse/search to get correct data from the county)... but thats a ways down the road yet...
Any takers on this project of mine who don't mind taking donations now and then and can help me learn a little along the way?
I will log in from dev pc to post code in a moment for your review...
I am a bit lost as to what is the better way to go - bash script to get the files and extract to proper directory or just use perl or perhaps php? I am game for learning new things... been looking at ruby and python... but I am certainly no expert in any of them...
Goal: Setup duplicate database structure (except using mysql on my servers) and data tables for about 15 tables. The provider of the data is a real estate related MLS service. The data is everything from the various listings (homes commercial etc) to office id's and other structures I might use in the future... I want to focus my learning on 2 or 3 right now...
1) data is provided to me via ftp in the form of tab delimited text files. It appears that for each table/data group - (for instance listings-residential.txt) is a straight dump as of like 4AM or so each day... file is .gz'd... this particular file has some 6,000 daily rows and about 80 collumns, many of which I am actually interested in saving long term... this file does not have any updates on which listings/rows were removed from day to day... this will need to be addressed in our script as well to keep my db accurate..
2) picture files associated with each of these listings are zipped on the same ftp server daily, the past 4 to 7 days of ziped pictures are available, noted in the zip file name itself is the date it was zipped... as well as a pictures-old.gz that is about 750MB containing all the older pics we need to grab and squeeze into a server directory for our scripts to search... the pictures are jpg mostly and are in the format of MLS#(relating to the mls # of the listing)x i.e. main picture is MLS###### .jpg then additional pictures (if provided) are MLS######A.jpg, MLS######B.jpg etc... this provider allows upto an additional 8 pictures so the last in the series is H... ( I mention all this as below I will ask a better way to concantonate the MLSprefix and trailing A-H as appropriate on the actual number part of the MLS number we select from the tables we create with the actual mls data part)....
3)I have a lot of ideas - and I know there are a lot of long term DB maintenance items to not overlook... here is what I have seen done elsewhere and what I am looking at - basically I want a cron script to run daily to grab these txt files and picture files, extract them to appropriate directory and parse the text file import to existing mysql databases.. but I do not yet have the table structures all setup right - I do have a text file from the provider that shows the various collumn names and categories as well as lenght... so far I have just been using manual table creation - very slow - and setting each collumn as varchar and 255... perhaps I should be looking for better ways there too...
So recap - I need to setup the tables from the txt files first... then write script to download and extract then import to mysql... if we get that far I will probably ask to hire ya for a short while to write some more stuff with me to actually use the data once we get it setup and storing the data daily... like I have the ability to receive data dumps from the county offices - I would like to import that data to a database, and offer web interface for users searching the mls data to also cross associate to the corresponding tax data - i.e. the script when displaying mls info will try to automatically associate mls address as given with the data we receive from the county office - if no exact match is found, then offer "near match or possible drop own selection to end user to pick or to browse/search to get correct data from the county)... but thats a ways down the road yet...
Any takers on this project of mine who don't mind taking donations now and then and can help me learn a little along the way?
I will log in from dev pc to post code in a moment for your review...
I am a bit lost as to what is the better way to go - bash script to get the files and extract to proper directory or just use perl or perhaps php? I am game for learning new things... been looking at ruby and python... but I am certainly no expert in any of them...