Adam Howitt's Blog

Nov 29
2007

A New Twist for Hierarchical Data

If you've ever had to work with a database model where rows of data represent hierarchical relationships then SQL Server 2005 has a hidden gem you may be interested.  Here are some domain examples to help you spot the type of problem this can solve:

  1. Tasks: where a task may be part of a larger task or can be made up of subtasks
  2. Bill of Materials: in a manufacturing environment where an item is made up of many other component items
  3. Family trees: Father has a Grandfather and Children but all are people
  4. Company Org Chart: the company org chart where employee Y is managed by employee X and manages employees z1,z2 and z3
Many ways to skin a cat
Before I introduce the old approach here are a few ways you might have handled it.  
  1. Joe Celko's nested set model: don't be surprised if you haven't heard of this approach - it's generally acknowledged as the fastest way to access hierarchical datasets but the complexity of implementing it scares many programmers away.  Using this approach requires you to re-architect the way the data is stored, accessed, inserted and updated so if your database already exists this would be a big undertaking.  If your data is both wide and deep, and you routinely access the complete dataset then read one of Joe's many articles on nested sets or buy his book "SQL for Smarties".
  2. Recursive stored procedures or functions: write a stored procedure or function to cursor over the children of the node you are looking at and then recursively call itself to go deeper in the recordset to build a table (need to write a function) or process each row (need to use a stored proc).  
The new rock and roll: recursive queries
The new feature of SQL Server 2005 is that it supports recursive queries.  It replaces the need to write a recursive table valued user defined function with a construct to allow you to quickly access the recordset you need to work with.  This recordset could then be returned by the procedure or used to drive a cursor to iterate over each node in the sub-tree in a particular order.

Enough talk - here is a concrete example based on the family tree notion:
WITH descendants(personid, parent_personid, generation) AS (
    /* Get the base case */
    SELECT personid, parent_personid, 0 as generation
    FROM family_members
    WHERE parent_personid IS NULL
    /* Union the recursive case */
    UNION ALL
    /* Find all family members (fm) who are descendants (d) of someone already in the descendants temp table */
    SELECT fm.personid, fm.parent_personid, generation+1
    FROM family_members fm
    INNER JOIN descendants d
    ON fm.parent_personid = d.personid
)
SELECT personid, parent_personid, generation
FROM descendants
/* Order the rows by generation ascending so that grandfather comes before father */
ORDER BY generation ASC


What is going on here?
The WITH table_name(column_list) AS (QUERY) construct allows you to define an inline view and can be used for non-recursive purposes too but in this case it defines a base case and then joins against itself using temporary tables until there are no more joins to be made or a criteria is met (we could limit it to WHERE dateDeceased IS NOT NULL to find only deceased generations).  The first generation is the base case and represents our starting points in the tree. Note that this example will grab everyone who has no parent but we could just as easily have asked for a particular ancestor and found descendants of that person.

Still no clearer?
If our table family members contained 3 rows:

Person  Parent_Personid   Name
1       NULL              Grandfather
2       1                 Father
3       2                 Son

We could represent this with SQL as shown below but this only works for the top 3 levels.  Clearly this would be unmanageable for more layers.  This is very similar to the way the recursive query works - it builds each recordset working it's way down adding rows to a temp table and then using that temp table to get the next set of rows to add.
SELECT personid, parent_personid, 0 as generation
FROM family_members
WHERE personid = 1
UNION
SELECT personid, parent_personid, 0 as generation
FROM family_members
WHERE parent_personid = 1
UNION
SELECT personid, parent_personid, 0 as generation
FROM family_members
WHERE parent_personid IN (
    SELECT personid, parent_personid, 0 as generation
    FROM family_members
    WHERE parent_personid = 1
)


Performance Counts
This approach will work as fast, if not slightly faster than any other recursive approach.  As Steve Nelson told me yesterday - there is no magic going on here.  What it does is give you a more convenient way to write and think about recursive queries.  If you are looking for < 10ms calculation times all the time at any cost then the Celko way is the way to go.  If you write some use cases for your application and test this approach it may be that you are filtering the dataset enough to get great execution times anyway.  For example, in the family example we are interested in just one ancestor, the query times shouldn't be much slower than the Celko approach but if you are routinely looking for all people with no parent_personid for vast datasets then you will see big performance problems. As always, test each approach yourself and find out what works for you.

Comments (Comment Moderation is enabled. Your comment will not appear until approved.)
[Add Comment] [Subscribe to Comments]
  1. Yes, these are called Common Table Expressions (CTE) and are awesome!

    You should also include an "option (maxrecursion XX)" clause in the final select where "XX" is the top number of levels you expect. It defaults to 100, but if you know you don't have more than 10 or 15, then it helps to stop more quickly when you have a parent of a child that points back to a child of that child (infinite loop).

    Of course, you can also use "0", which means no limit! But use very carefully....

  2. Hi Paul, First thanks for the tip on the option clause and second - did you go to Mile End School? Sounds random I know but your name is the same as a kid I went to school with in England and I would expect it's not a very common name... Adam

  3. Hi Adam. You are welcome. I learned my lesson on the maxrecursion the hard way! No, I attended schools in the U.S. I have met a few Paul Carney's in the world.

  4. Hi Adam,

    Great article & example. Keep up the good work.

  5. Raymond - thanks for the note!

[Add Comment]