XML in SQL Server 2005

Dr Michael Rys, Microsoft Corporation

SQL Server 2005 includes deep XML integration, including a native, indexed XML datatype along with commands that accept or return XML. Tim Anderson interviews Dr Michael Rys about the thinking behind the new features along with guidance on when to use them.

Michael Rys on XML in SQL Server 2005

Michael Rys is one of two program managers responsible for the XML features in SQL Server. He also represents Microsoft on the W3C XQuery Working Group, and on the ANSI working group for SQL. I asked him what is distinctive about the XML support in SQL Server 2005, as opposed to that found in rival database management systems. As Rys acknowledges, "all the three major vendors, Oracle, IBM and Microsoft, are moving towards second or third generation support, which means that you get full XML native support, native in the sense that you actually preserve the whole XML document, and provide some form of XQuery support over the data." However, a design goal with SQL Server is that XML is just another datatype. "You have the exact same programming model. With our approach you can evolve the structure of the data over time seamlessly." The idea, explains Rys, is that "you can arbitrarily manage your data regardless of whether it fits into a relational framework or not. And we try to make it really easy to use the functionality in a performant and scaleable way."

That sounds good, but it begs the question of when it is approriate to stuff XML into the database. On the face of it, it breaks the tidy and disciplined relational structure. Is there a risk that using an XML data type will complicate data management?

"I think it makes it harder to model the data," says Rys. "People have to get down to understanding what it means to model hierarchical data in a relational context. Actually, this discussion has been going on since the early 1980’s, at least in the research community. In the early 1980s some people came up with what is called the non first normal form database, which basically is doing nested tables. Some people said that in order to do clear design in the database you have to have first normal form. Other people, including the professor I did my PhD with, said no, the first normal form is not required for doing clear database design. You can use second normal form and third normal form and BCNF [Boyce-Codd Normal Form] and all these other normal forms, without necessarily having to have a first normal form available. But it adds another dimension to your modelling, which means that you have to be very careful to understand how you can utilise it.

"If your data fits the relational model, you’re probably better off probably still taking the XML and shredding into relational form, because all the data really is relational data and you want to process it that way.

"On the other hand, if you actually have a collection of values, if you have object structures of things that you want to treat almost as an atomic item, but which you might want to query to find individual values, then having the ability to manage that in an accessible way inside the database is good. Look at arrays. If you have data that fits into an array paradigm, like a series of measurements for example, you might want to manage that series of measurements as an unique whole. Trying to do that design on the relational level is more problematic than if you can store it within one of these non first normal datatypes."

A new way to store data in SQL Server

What this means is that the XML datatype is more than merely a more convenient way to store XML. It means that SQL Server can effectively model a wider range of real-world data. In some ways it is catching up with Oracle and DB2. Oracle supports a nested table column type, while DB2 has a similar feature called a structured type, so both these database managers already support non first normal form to an extent. However, the XML datatype is far richer than these. Of course you could always store XML, or any type you like, in a SQL Server text or binary column, but in doing so you severely limit the capability to index and search the data efficiently.

Departing from the pure relational model may be a culture shock. Rys notes that our typical approach to modeling data does not take account of all possibilities. "The modelling approach has to be extended to take into account that if you have an object that has no separate functional dependencies to things outside that object, you might want to treat it separately. So the question becomes, do you need a new theory of normalization? There are researchers working on that. These researchers often only look at modelling functional dependencies within an XML document, but I think the interesting part is where you have multiple different documents as well as relational data and you want to try to find out how to best model it. That’s an interesting research topic. But you still can start out with relational modelling, look at your XML documents and determine whether you want to break them up because they fit the relational model, or keep them together because they are more like objects."

One implication of the XML datatype is that storing objects of an arbitrary size in the database will become more common. Traditionally, the advice has been to store such objects in the file system for better peformance. I asked Rys if these technical issues have been solved?

"I don’t think all the technical problems have been solved, but certainly Microsoft, and also I think the other major relational database vendors, are working hard to make management of really large objects inside the database as good as managing smaller text or binary objects. If you look at SQL Server 2005, it now has new nvarchar(max) and varbinary(max) extensions of the varchar and varbinary types to replace these dreadful ntext and image types that we are deprecating. But you will still have the issue that if the row gets bigger than 8K, or whatever your buffer size is, you will have to deal with in-row and out-of-row storage of the data. In turn you are gaining all the benefits of storing it in the database, which means you get concurrent access against the data, backup and recovery, logging of operations, and being able to undo operations on such types, which you don’t get if you keep them out of the database.

"However if your main goal is just to have fast file-type access to these large objects, then obviously file system is still more performant. One of the main problems people have in general with large objects is that it’s hard to unlock the information in the data. With XML being a structured format, with a query language, you have additional benefits from putting the XML into the database versus keeping it on the file system. You can query into the data, and potentially even do partial updates, which will be much harder to do if you have it outside the database in the file system."

Tim Anderson's ITWriting