XML - identify identical elements


zlatan
01-07-2009, 06:55 PM
Hello,

Just want to say I am a complete and utter n00bie with XML, so talk to me like a noobie :)

I have a massive XML file. Within this XML file I have

<COMPANY_NAME>NAME OF COMPANY</COMPANY_NAME> tags

There are about 4000 COMPANY_NAME elements in the XML file in total.

I know 5 of these COMPANY_NAME elements contain the same data

Basically I need to identify which ones they are (I must identify the duplicates and not just remove them, because their child elements contain different data).

How can I do this in a simple way? Does anyone know of a program that can do this for me, or can I write a very simple script to pull out the duplicate data? I prefer the software method as it (usually) makes things easier to see in a clear way.

Your help is much appreciated.

Charles
01-07-2009, 07:05 PM
The whole point of XML is to give you lots of tools for attacking any problem. That's also its greatest weakness. There are so many ways to tackle the problem it's hard to know which one to recommend.

One simple method would be to open the file in a text editor and use the search feature. If the file is on a web server, you would use whatever scripting language you have there. If it's a local file sitting on a Windows box, I would use JScript and ActiveX to parse the file and grab the data. If the data isn't more than two-dimensional, you should be able to connect to it as if it were a database and then run queries in SQL.

So, what exactly is your situation and what tools do you have on hand?

zlatan
01-08-2009, 08:27 AM
Let me try and explain a bit further, as it may help to understand what I am trying to achieve.


I have exactly 4023 entries in this huge XML file (4023 x <COMPANY_NAME> elements).
I pass the XML file through a program, and this program chops up the large XML file into 4017 small XML files. Each of these XML files gets named uniquely according to the <COMPANY_NAME> value.
Therefore I am missing 6 files.
The reason I am missing 6 files is that 6 of the <COMPANY_NAME> values are identical. So when the program extracts the small XML files, it just overwrites any existing XML file that has the same name.
So essentially, if I can find the 6 <COMPANY_NAME> values that are identical, then I can modify them slightly, thus producing 4023 output XML files instead of 4017.
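
The check described above (list every <COMPANY_NAME> value that occurs more than once) could be sketched roughly as follows. This is only an illustration in Python, which nobody in the thread has mentioned having installed. The sample data, the wrapper element names (COMPANIES, COMPANY, CITY) and the function name find_duplicate_names are made up, since the real file layout is not shown here.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_duplicate_names(xml_text, tag="COMPANY_NAME"):
    """Return {value: count} for every tag value that occurs more than once."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so it does not matter how deeply the tag is nested.
    names = [(el.text or "").strip() for el in root.iter(tag)]
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical sample data; the structure around COMPANY_NAME is an assumption.
sample = """<COMPANIES>
  <COMPANY><COMPANY_NAME>Acme</COMPANY_NAME><CITY>Leeds</CITY></COMPANY>
  <COMPANY><COMPANY_NAME>Bolt</COMPANY_NAME><CITY>York</CITY></COMPANY>
  <COMPANY><COMPANY_NAME>Acme</COMPANY_NAME><CITY>Hull</CITY></COMPANY>
</COMPANIES>"""

duplicates = find_duplicate_names(sample)
print(duplicates)
```

For a file on disk, ET.parse("yourfile.xml").getroot() would take the place of ET.fromstring(sample). One caveat, which is a guess because the splitting program is not described: if that program changes the names when it builds file names (for example by ignoring case or dropping characters that are not allowed in Windows file names), values that differ in the XML could still end up overwriting each other, and an exact-match count like this one would not catch them.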


I have EditPad Pro at my disposal, which is a text editor that can handle regular expressions.
I have no clue about JS or XSLT, but am willing to put some effort into those as long as it isn't a very long process. I would need somebody to point me in the correct direction.

I can't use a search option in a text editor because I don't know which of the COMPANY_NAME elements are identical, and I don't really fancy doing 4023 searches. I have about 12 of these massive XML files to process.


I know a little about SQL so maybe this is an option, but again I need someone to point me in the right direction.
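
As a rough sketch of what the SQL route might look like: load the names into a one-column table, then group and count them. The example below uses Python's built-in sqlite3 module purely for illustration. The table name companies, the column name and the sample data are all assumptions, and the same GROUP BY / HAVING query should work on a MySQL table once the names have been loaded into it.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real file layout is not shown in the thread.
sample = """<COMPANIES>
  <COMPANY><COMPANY_NAME>Acme</COMPANY_NAME></COMPANY>
  <COMPANY><COMPANY_NAME>Bolt</COMPANY_NAME></COMPANY>
  <COMPANY><COMPANY_NAME>Acme</COMPANY_NAME></COMPANY>
</COMPANIES>"""

root = ET.fromstring(sample)

# An in-memory database is enough for a one-off check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
conn.executemany(
    "INSERT INTO companies (name) VALUES (?)",
    [((el.text or "").strip(),) for el in root.iter("COMPANY_NAME")],
)

# Names that appear more than once, with the number of occurrences.
rows = conn.execute(
    "SELECT name, COUNT(*) AS n FROM companies "
    "GROUP BY name HAVING COUNT(*) > 1 ORDER BY name"
).fetchall()
conn.close()

print(rows)
```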

Charles
01-08-2009, 08:30 AM
I need to know: Where is this file now? On your local box? What OS are you using? Windows? What scripting do you have available?

zlatan
01-08-2009, 11:30 AM
It's an XML file on my PC, Windows XP.

Scripting: I have, err, a text editor and half a brain to learn some scripting. I also have webspace with PHP enabled and a MySQL database.

Scriptage
01-10-2009, 12:25 PM
What's the program written in?