bogocles
11-20-2008, 12:34 PM
First off,
I'm just guessing that this is the proper forum for my question. Let me apologize in advance for the length of this question, as well.
Anyway, I've got a simple task to perform at work. Actually, we're already performing it, but my boss wants to me to explore other means of reaching the same end. Specifically, I need to take an input XML file, and bulk insert it into a SQL Server 2005 database. Currently we do this with the SQL-XML Bulk Load architecture. My job is to see if SSIS can do it better (read: faster).
The (sample) XML File
<?xml version="1.0" encoding="utf-8"?>
<countries>
<country name="Canada" district-type="province">
<district name="Ontario">
<city>Toronto</city>
<city>Ottawa</city>
<city>St. Catharines</city>
</district>
<district name="British Columbia">
<city>Vancouver</city>
<city>Victoria</city>
</district>
</country>
<country name="United States" district-type="state">
<district name="California">
<city>Los Angeles</city>
<city>Bakersfield</city>
<city>San Clemente</city>
</district>
<district name="Arizona">
<city>Mesa</city>
<city>Flagstaff</city>
<city>Sedona</city>
</district>
</country>
<country name="Mexico" district-type="state">
<district name="Federal District">
<city>Mexico City</city>
</district>
<district name="Baja California">
<city>Mexicali</city>
<city>Tijuana</city>
</district>
<district name="Chihuahua">
<city>Chihuahua</city>
<city>Ciudad Juarez</city>
</district>
</country>
</countries>
SSIS Data Flow Components
My source of course is an XML Source component.
My destination will then be an OLE DB Connection.
The Problem
Because I'm using an XML Source component, and the XML input file is relatively complex (I guess ...), I end up with three separate tables: country, district, and city. The country table is automatically given a country_Id column to preserve its hierarchical relationship with district. Likewise, district is given a country_Id column as well, and also a district_Id column. Lastly, city is also given a district_Id column to preserve its relationship with district.
The OLE DB Destination component can only accept a single input--or table in this case. But I've got three, so I have to amalgamate them into a single one. In SQL, this is easy. I could just write a query like ...
SELECT
*
FROM
country
INNER JOIN district
ON country.country_Id = district.country_Id
INNER JOIN city
ON district.district_Id = city.district_Id
... to a single table with all the data I need.
In SSIS, however, I don't think its possible to just give the package a literal SQL query. Everywhere I've looked online, people say to use Merge Joins to transform the multiple tables resulting from an XML Source object into a single table. The problem with Merge Joins is that they only accept two inputs (tables), and I have three. Therefore: I figured I would just use two Merge Joins. This is how I did it (ordered from first to last action):
XML Source
Set the isSorted property of country, district, and city tables to TRUE.
Set the SortKeyPosition on country.country_Id, district.country_id and city.district_Id to 1 (notice this last value)
Set the SortKeyPosition on district.district_Id table to 2
First Merge Join
Set the inputs of the first Merge Join to country (LEFT) and district (RIGHT)
Selected all columns except country.country_Id for the output
Result: a table INNER JOINed on country_Id
Sort
This output needs to be sorted before being used, so I sent it to Sort component, and sorted by country_Id (1) and district_Id (2)
Second Merge Join
Set the first input (LEFT) to the output of the Sort component
Set the second input (RIGHT) to the city table from the XML Source component
In order to make this work without the IDE throwing errors at me, I had to set city.district_Id 's SortKeyPosition to 1. This yields a table that is joined wrong. In fact, the same output table would result from the following query:
SELECT
*
FROM
country
INNER JOIN district
ON country.country_Id = district.country_Id
INNER JOIN city
ON district.country_Id = city.district_Id
This, of course, is semantically wrong. I want it to join on the District ID's of the District and City tables, like in the query near the top of this post.
Now, here's what I tried:
1. Set the SortKeyPosition of city.district_Id to 2. This results in an IDE error that reads:
Error at Data Flow Task [DTS.Pipeline]: The IsSorted property of output "city" (45) is set to TRUE, but the absolute values of the non-zero output column SortKeyPositions do not form a monotonically increasing sequence, starting at one.
2. Set the SortKeyPosition of city.district_Id to 0. This results in a warning AND an error:
Validation warning. Data Flow Task: DTS.Pipeline: The IsSorted property of output "city" (45) is set to TRUE, but all of its output columns' SortKeyPositions are set to zero.
Validation error. Data Flow Task: Merge Join 1 [523]: The input is not sorted. The "input "Merge Join Right Input" (525)" must be sorted.
And that's it! I don't know what else to do. Can anyone help me figure out how to get these tables to join properly? I wish I could just push an actual query into the SSIS package, but sadly I don't think its possible.
Here's a link to a screen-shot I took of the (almost) finished product (http://www.imagehostingsite.com/gallery.php?entry=images/mtvumyj3ym2whtw2ijln.jpg)
I'm just guessing that this is the proper forum for my question. Let me apologize in advance for the length of this question, as well.
Anyway, I've got a simple task to perform at work. Actually, we're already performing it, but my boss wants to me to explore other means of reaching the same end. Specifically, I need to take an input XML file, and bulk insert it into a SQL Server 2005 database. Currently we do this with the SQL-XML Bulk Load architecture. My job is to see if SSIS can do it better (read: faster).
The (sample) XML File
<?xml version="1.0" encoding="utf-8"?>
<countries>
<country name="Canada" district-type="province">
<district name="Ontario">
<city>Toronto</city>
<city>Ottawa</city>
<city>St. Catharines</city>
</district>
<district name="British Columbia">
<city>Vancouver</city>
<city>Victoria</city>
</district>
</country>
<country name="United States" district-type="state">
<district name="California">
<city>Los Angeles</city>
<city>Bakersfield</city>
<city>San Clemente</city>
</district>
<district name="Arizona">
<city>Mesa</city>
<city>Flagstaff</city>
<city>Sedona</city>
</district>
</country>
<country name="Mexico" district-type="state">
<district name="Federal District">
<city>Mexico City</city>
</district>
<district name="Baja California">
<city>Mexicali</city>
<city>Tijuana</city>
</district>
<district name="Chihuahua">
<city>Chihuahua</city>
<city>Ciudad Juarez</city>
</district>
</country>
</countries>
SSIS Data Flow Components
My source of course is an XML Source component.
My destination will then be an OLE DB Connection.
The Problem
Because I'm using an XML Source component, and the XML input file is relatively complex (I guess ...), I end up with three separate tables: country, district, and city. The country table is automatically given a country_Id column to preserve its hierarchical relationship with district. Likewise, district is given a country_Id column as well, and also a district_Id column. Lastly, city is also given a district_Id column to preserve its relationship with district.
The OLE DB Destination component can only accept a single input--or table in this case. But I've got three, so I have to amalgamate them into a single one. In SQL, this is easy. I could just write a query like ...
SELECT
*
FROM
country
INNER JOIN district
ON country.country_Id = district.country_Id
INNER JOIN city
ON district.district_Id = city.district_Id
... to a single table with all the data I need.
In SSIS, however, I don't think its possible to just give the package a literal SQL query. Everywhere I've looked online, people say to use Merge Joins to transform the multiple tables resulting from an XML Source object into a single table. The problem with Merge Joins is that they only accept two inputs (tables), and I have three. Therefore: I figured I would just use two Merge Joins. This is how I did it (ordered from first to last action):
XML Source
Set the isSorted property of country, district, and city tables to TRUE.
Set the SortKeyPosition on country.country_Id, district.country_id and city.district_Id to 1 (notice this last value)
Set the SortKeyPosition on district.district_Id table to 2
First Merge Join
Set the inputs of the first Merge Join to country (LEFT) and district (RIGHT)
Selected all columns except country.country_Id for the output
Result: a table INNER JOINed on country_Id
Sort
This output needs to be sorted before being used, so I sent it to Sort component, and sorted by country_Id (1) and district_Id (2)
Second Merge Join
Set the first input (LEFT) to the output of the Sort component
Set the second input (RIGHT) to the city table from the XML Source component
In order to make this work without the IDE throwing errors at me, I had to set city.district_Id 's SortKeyPosition to 1. This yields a table that is joined wrong. In fact, the same output table would result from the following query:
SELECT
*
FROM
country
INNER JOIN district
ON country.country_Id = district.country_Id
INNER JOIN city
ON district.country_Id = city.district_Id
This, of course, is semantically wrong. I want it to join on the District ID's of the District and City tables, like in the query near the top of this post.
Now, here's what I tried:
1. Set the SortKeyPosition of city.district_Id to 2. This results in an IDE error that reads:
Error at Data Flow Task [DTS.Pipeline]: The IsSorted property of output "city" (45) is set to TRUE, but the absolute values of the non-zero output column SortKeyPositions do not form a monotonically increasing sequence, starting at one.
2. Set the SortKeyPosition of city.district_Id to 0. This results in a warning AND an error:
Validation warning. Data Flow Task: DTS.Pipeline: The IsSorted property of output "city" (45) is set to TRUE, but all of its output columns' SortKeyPositions are set to zero.
Validation error. Data Flow Task: Merge Join 1 [523]: The input is not sorted. The "input "Merge Join Right Input" (525)" must be sorted.
And that's it! I don't know what else to do. Can anyone help me figure out how to get these tables to join properly? I wish I could just push an actual query into the SSIS package, but sadly I don't think its possible.
Here's a link to a screen-shot I took of the (almost) finished product (http://www.imagehostingsite.com/gallery.php?entry=images/mtvumyj3ym2whtw2ijln.jpg)