A New Twist for Hierarchical Data
If you've ever had to work with a database model where rows of data represent hierarchical relationships then SQL Server 2005 has a hidden gem you may be interested. Here are some domain examples to help you spot the type of problem this can solve:
- Tasks: where a task may be part of a larger task or can be made up of subtasks
- Bill of Materials: in a manufacturing environment where an item is made up of many other component items
- Family trees: Father has a Grandfather and Children but all are people
- Company Org Chart: the company org chart where employee Y is managed by employee X and manages employees z1,z2 and z3
Before I introduce the old approach here are a few ways you might have handled it.
- Joe Celko's nested set model: don't be surprised if you haven't heard of this approach - it's generally acknowledged as the fastest way to access hierarchical datasets but the complexity of implementing it scares many programmers away. Using this approach requires you to re-architect the way the data is stored, accessed, inserted and updated so if your database already exists this would be a big undertaking. If your data is both wide and deep, and you routinely access the complete dataset then read one of Joe's many articles on nested sets or buy his book "SQL for Smarties".
- Recursive stored procedures or functions: write a stored procedure or function to cursor over the children of the node you are looking at and then recursively call itself to go deeper in the recordset to build a table (need to write a function) or process each row (need to use a stored proc).
The new feature of SQL Server 2005 is that it supports recursive queries. It replaces the need to write a recursive table valued user defined function with a construct to allow you to quickly access the recordset you need to work with. This recordset could then be returned by the procedure or used to drive a cursor to iterate over each node in the sub-tree in a particular order.
Enough talk - here is a concrete example based on the family tree notion:
/* Get the base case */
SELECT personid, parent_personid, 0 as generation
FROM family_members
WHERE parent_personid IS NULL
/* Union the recursive case */
UNION ALL
/* Find all family members (fm) who are descendants (d) of someone already in the descendants temp table */
SELECT fm.personid, fm.parent_personid, generation+1
FROM family_members fm
INNER JOIN descendants d
ON fm.parent_personid = d.personid
)
SELECT personid, parent_personid, generation
FROM descendants
/* Order the rows by generation ascending so that grandfather comes before father */
ORDER BY generation ASC
What is going on here?
The WITH table_name(column_list) AS (QUERY) construct allows you to define an inline view and can be used for non-recursive purposes too but in this case it defines a base case and then joins against itself using temporary tables until there are no more joins to be made or a criteria is met (we could limit it to WHERE dateDeceased IS NOT NULL to find only deceased generations). The first generation is the base case and represents our starting points in the tree. Note that this example will grab everyone who has no parent but we could just as easily have asked for a particular ancestor and found descendants of that person.
Still no clearer?
If our table family members contained 3 rows:
Person Parent_Personid Name
1 NULL Grandfather
2 1 Father
3 2 Son
We could represent this with SQL as shown below but this only works for the top 3 levels. Clearly this would be unmanageable for more layers. This is very similar to the way the recursive query works - it builds each recordset working it's way down adding rows to a temp table and then using that temp table to get the next set of rows to add.
FROM family_members
WHERE personid = 1
UNION
SELECT personid, parent_personid, 0 as generation
FROM family_members
WHERE parent_personid = 1
UNION
SELECT personid, parent_personid, 0 as generation
FROM family_members
WHERE parent_personid IN (
SELECT personid, parent_personid, 0 as generation
FROM family_members
WHERE parent_personid = 1
)
Performance Counts
This approach will work as fast, if not slightly faster than any other recursive approach. As Steve Nelson told me yesterday - there is no magic going on here. What it does is give you a more convenient way to write and think about recursive queries. If you are looking for < 10ms calculation times all the time at any cost then the Celko way is the way to go. If you write some use cases for your application and test this approach it may be that you are filtering the dataset enough to get great execution times anyway. For example, in the family example we are interested in just one ancestor, the query times shouldn't be much slower than the Celko approach but if you are routinely looking for all people with no parent_personid for vast datasets then you will see big performance problems. As always, test each approach yourself and find out what works for you.