Hi, I was wondering if anyone has experience parsing large XML files (up to 5 GB) into MySQL.
So far I haven't been able to come up with efficient code that will do that.
The last solution I came up with was to cut the file into multiple 100 MB files and process them in batches, but that takes too much time, so I have to come up with something else. The major problem is that I'm on a shared host, so if the script takes too many server resources, the process gets killed automatically.
I just don't think XML feeds are efficient for storing large amounts of information.
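One option that avoids both splitting the file and loading it all into memory is PHP's built-in XMLReader, which streams the document node by node. Here's a minimal sketch; the feed structure (`<products>`/`<product>`/`<sku>`/`<price>`) is made up for illustration, and a tiny inline sample stands in for the real file — for a 5 GB feed you'd use `$reader->open('feed.xml')` instead of `XMLReader::XML()`:

```php
<?php
// Hypothetical feed layout; only one <product> record is ever in memory.
$xml = '<products>
  <product><sku>A1</sku><price>9.99</price></product>
  <product><sku>B2</sku><price>4.50</price></product>
</products>';

$reader = XMLReader::XML($xml);   // real file: (new XMLReader())->open('feed.xml')
$rows = [];

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'product') {
        // Expand just this one record into SimpleXML for convenient access.
        $node = simplexml_import_dom($reader->expand(new DOMDocument()));
        $rows[] = [(string) $node->sku, (string) $node->price];
        // In the real script: buffer rows and flush a multi-row INSERT
        // every few thousand records instead of one query per record.
    }
}
$reader->close();
print_r($rows);
```

Because the cursor only ever holds one node, peak memory stays flat no matter how big the file is, which also helps with the shared-host resource limits.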
I'm not sure there's any way to handle text files that large in PHP that will qualify as "efficient". Reading the whole file into memory via file() or file_get_contents() is likely going to cause memory usage problems, and reading a line at a time is probably going to be too slow.
For that matter, I'm not sure that any language is going to efficiently read and parse a 5 GB XML file (at least not without a high-quality, dedicated server to do the work). It might be time to give the whole concept a careful look and determine whether it makes sense, or whether there's a simpler way to get the data you need. (For instance, how much duplicate and/or unwanted data is included in each XML file?)
"Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
~ Terry Pratchett in Nation
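To make the memory point above concrete, here's a tiny sketch of the line-at-a-time style (a small temp file stands in for the real feed — with 5 GB, `file_get_contents()` would need roughly 5 GB of RAM, while the `fgets()` loop only ever holds one line):

```php
<?php
// A throwaway temp file stands in for the real feed.
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents($path, "row1\nrow2\nrow3\n");

// Line-at-a-time: only one line is in memory at any moment.
$count = 0;
$fh = fopen($path, 'rb');
while (($line = fgets($fh)) !== false) {
    $count++;                     // real script: parse and insert here
}
fclose($fh);
unlink($path);
echo $count;                      // 3
```

As the answer says, this keeps memory flat but may be too slow for a 5 GB file on shared hosting; it's a trade-off, not a cure.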
No, we don't store our data in XML format.
But we display data from other websites, and most of them provide feeds in XML format. I have no problem parsing regular comma- or tab-delimited text feeds, even big ones. I'm dumping some feeds from HALF.com; they provide them as regular text files around 500 MB, and there's no problem there.
Maybe I should convert the XML data to a regular text file first?
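That two-step idea can work well: stream the XML once into a tab-delimited file, then let MySQL bulk-load it with `LOAD DATA INFILE`, which is far faster than row-by-row INSERTs. A rough sketch, with a made-up feed layout and column names (a tiny inline sample stands in for the real file; use `$reader->open('feed.xml')` there, and note that `LOAD DATA` needs the right privileges on a shared host):

```php
<?php
// Hypothetical feed; the real one would be opened from disk instead.
$xml = '<products>
  <product><sku>A1</sku><price>9.99</price></product>
  <product><sku>B2</sku><price>4.50</price></product>
</products>';

$tsv    = tempnam(sys_get_temp_dir(), 'feed') . '.tsv';
$out    = fopen($tsv, 'wb');
$reader = XMLReader::XML($xml);

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'product') {
        $node = simplexml_import_dom($reader->expand(new DOMDocument()));
        // One tab-delimited line per record.
        fwrite($out, $node->sku . "\t" . $node->price . "\n");
    }
}
fclose($out);
$reader->close();

echo file_get_contents($tsv);     // two tab-separated rows

// Then let MySQL ingest the flat file in one shot:
// LOAD DATA LOCAL INFILE '/path/to/feed.tsv' INTO TABLE products
//   FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' (sku, price);
```

This splits the work into two cheap passes (stream-convert, then bulk-load), which also plays nicer with shared-host time limits than one long PHP script doing thousands of INSERTs.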
Well, how exactly are you parsing it? I hope not by hand, as that is very inefficient from both a coding and a processing standpoint. I don't see how any site on the web could possibly send 5 GB of data via XML and expect 1) anyone to parse it, and 2) not to waste exorbitant amounts of bandwidth.