Daniel Lemire's blog

, 1 min read

Indexing XML

It is often important to index XPath queries. Not only is XPath useful on its own, but it is also the basis for the FLWOR expressions in XQuery.

A typical XPath expression will select only a small fraction of any XML document (such as the value of a particular attribute). Thus, a sensible strategy is to represent the XML documents as tables. There are several possible maps from XML documents to tables. One of the most common is ORDPATH.

In the ORDPATH model, the root node receives the identifier 1, the first node contained in the root node receives the identifier 1.1, the second one receives the identifier 1.2, and so on. Given the ORDPATH identifiers, we can easily determine whether two nodes are neighbors, or whether they have a child-parent relationship.

As an example, here’s an XML document and its (simplified) ORDPATH representation:


<liste temps="janvier" >
<bateau />
<bateau >
<canard />
</bateau>
</liste>

ORDPATH name type value
1 liste element
1.1 temps attribute janvier
1.2 bateau element
1.3 bateau element
1.3.1 canard element

Given a table, we can easily index it using standard indexes such as B trees or hash tables. For example, if we index the value column, we can quickly process the XPath expression @temps=”janvier”.

Effectively, we can map XPath and XQuery queries into SQL. This leaves relatively little room for XML-specific indexes. I am certain that XML database designers have even smarter strategies, but do they work significantly better?

Reference: P. O’Neil, et al.. ORDPATHs: insert-friendly XML node labels. 2004.

Further reading: Native XML databases: have they taken the world over yet?