This post is the first in a series about XPathDocument and XPathNavigator. I will highlight the strengths and drawbacks of the standard .Net implementations and go through the design and development of a new implementation that better fits my needs.
First, what is an XPathDocument?
An XPathDocument is used when you need to query XML data using XPath. For example, you can get a list of article ids and ordered quantities from the following XML file:

    <order>
      <article id="1">
        <quantity>12</quantity>
      </article>
      <article id="5">
        <quantity>8</quantity>
      </article>
      <article id="6">
        <quantity>1</quantity>
      </article>
    </order>

using the following code:

    using System;
    using System.Xml.XPath;

    class Program
    {
        static void Main(string[] args)
        {
            XPathDocument doc = new XPathDocument(@"order.xml");
            XPathNavigator nav = doc.CreateNavigator();
            var articles = nav.Select("//article");
            foreach (XPathNavigator article in articles)
            {
                // Print "<id>:<quantity>" for each article.
                Console.WriteLine("{0}:{1}", article.SelectSingleNode("@id").Value, article.SelectSingleNode("quantity").Value);
            }
        }
    }

XPath, once you get the hang of it, is very powerful and flexible for accessing XML data. It allows for complex queries and computation.
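To make the "queries and computation" point concrete, here is a minimal sketch that runs a filtering query and an aggregate over the same order document. The XML is inlined as a string so the example needs no file, and the names `OrderQueries`, `LargeOrderIds` and `TotalQuantity` are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

static class OrderQueries
{
    // The order document from above, inlined so the sketch needs no file.
    const string OrderXml =
        "<order>" +
        "<article id=\"1\"><quantity>12</quantity></article>" +
        "<article id=\"5\"><quantity>8</quantity></article>" +
        "<article id=\"6\"><quantity>1</quantity></article>" +
        "</order>";

    static XPathNavigator Load() =>
        new XPathDocument(new StringReader(OrderXml)).CreateNavigator();

    // Query with a predicate: ids of the articles with a quantity above 5.
    public static List<string> LargeOrderIds()
    {
        var ids = new List<string>();
        foreach (XPathNavigator id in Load().Select("/order/article[quantity > 5]/@id"))
            ids.Add(id.Value);
        return ids;
    }

    // Computation: XPath numbers come back from Evaluate() as double.
    public static double TotalQuantity() =>
        (double)Load().Evaluate("sum(/order/article/quantity)");

    static void Main()
    {
        Console.WriteLine(string.Join(",", LargeOrderIds())); // 1,5
        Console.WriteLine(TotalQuantity());                   // 21
    }
}
```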
Where’s the catch?
This is great, but there is a drawback. As per the documentation, XPathDocument "provides a fast, read-only, in-memory representation of an XML document by using the XPath data model". This does not scale well with file size: if your files grow to tens of MB or larger, all of that data will be loaded in memory. I recently built a mapping utility based on XPath. The initial requirement was to handle lots of small files. It turned out that, once in the field, clients were feeding it a small number of large files instead. Loading these large files in memory caused a lot of issues, from poor responsiveness due to excessive swapping to plain OutOfMemoryException.
The good news is that we can do something about it.
How does XPath work in .Net?
In the above code snippet, you can see that the only reference to XPathDocument is to create it. We then use it only once to create an XPathNavigator. The rest of the XPath querying involves only XPathNavigators.
XPathNavigator class
An XPathNavigator is a cursor on an XML data structure. As a cursor, it provides basic operations to move itself, to query information about the data it points to and to clone itself. XPathNavigator can also be used to update the underlying data if the implementation supports it.
The data of the node pointed to by a navigator can be accessed using a set of properties. The most commonly used are LocalName, NamespaceURI and Prefix, but most importantly Value and its typed variants. NodeType is also important: the type of a node determines which move operations are allowed.

    public abstract string BaseURI { get; }
    public virtual bool HasAttributes { get; }
    public virtual bool HasChildren { get; }
    public virtual string InnerXml { get; set; }
    public abstract bool IsEmptyElement { get; }
    public abstract string LocalName { get; }
    public abstract string Name { get; }
    public abstract string NamespaceURI { get; }
    public abstract XmlNameTable NameTable { get; }
    public abstract XPathNodeType NodeType { get; }
    public virtual string OuterXml { get; set; }
    public abstract string Prefix { get; }
    public virtual IXmlSchemaInfo SchemaInfo { get; }
    public override object TypedValue { get; }
    public virtual object UnderlyingObject { get; }
    public override bool ValueAsBoolean { get; }
    public override DateTime ValueAsDateTime { get; }
    public override double ValueAsDouble { get; }
    public override int ValueAsInt { get; }
    public override long ValueAsLong { get; }
    public override Type ValueType { get; }
    public virtual string XmlLang { get; }
    public override XmlSchemaType XmlType { get; }

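As a small sketch of how these properties change as the cursor moves, the following reads NodeType, LocalName and Value at three positions of a tiny inlined document. `NodeProperties`, `Describe` and `LoadSample` are hypothetical helpers written for this example only.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

static class NodeProperties
{
    // Hypothetical helper: summarises the node a navigator currently points to.
    public static string Describe(XPathNavigator nav) =>
        string.Format("{0} '{1}' = '{2}'", nav.NodeType, nav.LocalName, nav.Value);

    public static XPathNavigator LoadSample() =>
        new XPathDocument(new StringReader(
            "<order><article id=\"1\"><quantity>12</quantity></article></order>"))
            .CreateNavigator();

    static void Main()
    {
        XPathNavigator nav = LoadSample();
        Console.WriteLine(Describe(nav));       // Root '' = '12'

        nav.MoveToFirstChild();                 // now on <order>
        nav.MoveToFirstChild();                 // now on <article>
        Console.WriteLine(Describe(nav));       // Element 'article' = '12'

        nav.MoveToFirstAttribute();             // now on id="1"
        Console.WriteLine(Describe(nav));       // Attribute 'id' = '1'
        Console.WriteLine(nav.ValueAsInt + 1);  // typed access to the value: prints 2
    }
}
```

Note that the Value of an element (or of the root) is the concatenated text of its descendants, which is why both the root and `<article>` report `12` here.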
To move a navigator around, the following methods can be used. Notice that all but one of them return a boolean, which indicates whether the move operation succeeded. True means the navigator now points to the new node; false means it has not moved and still points to the original node. The only method that does not return a boolean is MoveToRoot(), because it always succeeds.
An operation may fail for various reasons. For example, MoveToNext() will fail if the current node has no next sibling (e.g. the last element of a sequence) or if the current node is an attribute. MoveToChild() will fail if there is no child of the current node that satisfies the conditions.

    public abstract bool MoveTo(XPathNavigator other);
    public virtual bool MoveToAttribute(string localName, string namespaceURI);
    public virtual bool MoveToChild(XPathNodeType type);
    public virtual bool MoveToChild(string localName, string namespaceURI);
    public virtual bool MoveToFirst();
    public abstract bool MoveToFirstAttribute();
    public abstract bool MoveToFirstChild();
    public bool MoveToFirstNamespace();
    public abstract bool MoveToFirstNamespace(XPathNamespaceScope namespaceScope);
    public virtual bool MoveToFollowing(XPathNodeType type);
    public virtual bool MoveToFollowing(string localName, string namespaceURI);
    public virtual bool MoveToFollowing(XPathNodeType type, XPathNavigator end);
    public virtual bool MoveToFollowing(string localName, string namespaceURI, XPathNavigator end);
    public abstract bool MoveToId(string id);
    public virtual bool MoveToNamespace(string name);
    public abstract bool MoveToNext();
    public virtual bool MoveToNext(XPathNodeType type);
    public virtual bool MoveToNext(string localName, string namespaceURI);
    public abstract bool MoveToNextAttribute();
    public bool MoveToNextNamespace();
    public abstract bool MoveToNextNamespace(XPathNamespaceScope namespaceScope);
    internal bool MoveToNonDescendant();
    public abstract bool MoveToParent();
    public abstract bool MoveToPrevious();
    public virtual void MoveToRoot();

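The success and failure cases described above can be observed directly. The sketch below walks a two-article document and prints the result of each move; the `MoveResults` class and its inlined sample document are assumptions made for the example.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

static class MoveResults
{
    public static XPathNavigator LoadSample() =>
        new XPathDocument(new StringReader(
            "<order><article id=\"1\"/><article id=\"5\"/></order>"))
            .CreateNavigator();

    static void Main()
    {
        XPathNavigator nav = LoadSample();

        Console.WriteLine(nav.MoveToFirstChild());      // True: root -> <order>
        Console.WriteLine(nav.MoveToFirstChild());      // True: <order> -> first <article>
        Console.WriteLine(nav.MoveToNext());            // True: second <article>
        Console.WriteLine(nav.MoveToNext());            // False: no next sibling
        Console.WriteLine(nav.GetAttribute("id", ""));  // 5: the failed move left us in place

        Console.WriteLine(nav.MoveToFirstAttribute());  // True: now on the id attribute
        Console.WriteLine(nav.MoveToNext());            // False: the current node is an attribute
        Console.WriteLine(nav.MoveToParent());          // True: back on <article>

        nav.MoveToRoot();                               // returns void: it cannot fail
        Console.WriteLine(nav.NodeType);                // Root
    }
}
```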
XPath queries?
That’s all very good but you might ask ‘what about XPath queries?’. XPath queries can be executed using the following functions:

    public virtual object Evaluate(string xpath);
    public virtual object Evaluate(XPathExpression expr);
    public virtual object Evaluate(string xpath, IXmlNamespaceResolver resolver);
    public virtual object Evaluate(XPathExpression expr, XPathNodeIterator context);
    public virtual bool Matches(string xpath);
    public virtual bool Matches(XPathExpression expr);
    public virtual XPathNodeIterator Select(string xpath);
    public virtual XPathNodeIterator Select(XPathExpression expr);
    public virtual XPathNodeIterator Select(string xpath, IXmlNamespaceResolver resolver);
    public virtual XPathNodeIterator SelectAncestors(XPathNodeType type, bool matchSelf);
    public virtual XPathNodeIterator SelectAncestors(string name, string namespaceURI, bool matchSelf);
    public virtual XPathNodeIterator SelectChildren(XPathNodeType type);
    public virtual XPathNodeIterator SelectChildren(string name, string namespaceURI);
    public virtual XPathNodeIterator SelectDescendants(XPathNodeType type, bool matchSelf);
    public virtual XPathNodeIterator SelectDescendants(string name, string namespaceURI, bool matchSelf);
    public virtual XPathNavigator SelectSingleNode(string xpath);
    public virtual XPathNavigator SelectSingleNode(XPathExpression expression);
    public virtual XPathNavigator SelectSingleNode(string xpath, IXmlNamespaceResolver resolver);

Evaluate() returns a value whose type depends on the XPath query: a number (as a double), a boolean, a string or a node set. Matches() tells you whether the current node satisfies the conditions expressed as an XPath expression. The Select() methods return a node iterator over their result, that is, a sequence of XPathNavigators each pointing to a node in the result set of the XPath expression.
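Here is a short sketch of the three families side by side, again on the order document inlined as a string. The `QueryKinds` class and its `Load` helper are hypothetical names for this example.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

static class QueryKinds
{
    const string OrderXml =
        "<order>" +
        "<article id=\"1\"><quantity>12</quantity></article>" +
        "<article id=\"5\"><quantity>8</quantity></article>" +
        "<article id=\"6\"><quantity>1</quantity></article>" +
        "</order>";

    public static XPathNavigator Load() =>
        new XPathDocument(new StringReader(OrderXml)).CreateNavigator();

    static void Main()
    {
        XPathNavigator nav = Load();

        // Evaluate(): the runtime type of the result depends on the expression.
        object number = nav.Evaluate("count(/order/article)");
        object flag = nav.Evaluate("boolean(/order/article[quantity > 10])");
        object text = nav.Evaluate("string(/order/article[1]/@id)");
        object nodes = nav.Evaluate("/order/article");
        Console.WriteLine(number.GetType().Name);       // Double
        Console.WriteLine(flag.GetType().Name);         // Boolean
        Console.WriteLine(text.GetType().Name);         // String
        Console.WriteLine(nodes is XPathNodeIterator);  // True

        // Matches(): does the current node satisfy the expression?
        XPathNavigator first = nav.SelectSingleNode("/order/article");
        Console.WriteLine(first.Matches("article[@id='1']"));  // True
        Console.WriteLine(first.Matches("article[@id='5']"));  // False

        // Select(): an iterator over navigators, one per result node.
        XPathNodeIterator articles = nav.Select("/order/article");
        Console.WriteLine(articles.Count);              // 3
    }
}
```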
So how do we solve our problem?
The key to solving our scaling issue lies in the implementation of the XPath querying methods (Evaluate, Matches and Select). Their implementation is actually expressed in terms of Move() operations and property checks on XPathNavigators.
The following example finds the <article> nodes of the root element <order> in two ways: first with Select(), then with a series of Move() operations.

    private static void Example2Select()
    {
        XPathDocument doc = new XPathDocument(@"order.xml");
        XPathNavigator nav = doc.CreateNavigator();
        var articles = nav.Select("/order/article");
        foreach (XPathNavigator article in articles)
        {
            Console.WriteLine("node outer xml : {0}", article.OuterXml);
        }
    }

    private static void Example2Move()
    {
        XPathDocument doc = new XPathDocument(@"order.xml");
        XPathNavigator nav = doc.CreateNavigator();
        // Move to element order, then to its first element article.
        // Stop if either move fails, otherwise we would print the wrong node.
        if (!nav.MoveToChild("order", "") || !nav.MoveToChild("article", ""))
            return;
        do
        {
            Console.WriteLine("node outer xml : {0}", nav.OuterXml);
        } while (nav.MoveToNext("article", ""));
    }

All XPath queries can be expressed as a series of Move() and Clone() operations, and this is exactly what Select() does behind the scenes. This is where the design of the XPathNavigator class shines: because Select() is implemented exclusively in terms of Move() and Clone() operations, any implementation of XPathNavigator that supports these operations can benefit from the XPath query language.
Did you notice earlier that some of the Move() operations are virtual and others abstract? In the same way that XPath queries can be expressed as a series of Move() operations, most Move() operations can be expressed as a series of more basic move operations. For example, the default implementation of MoveToRoot() is simply

    while (this.MoveToParent()) {}

Properties of XPathNavigator follow the same pattern: virtual properties have a default implementation that relies on the abstract members.
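To illustrate the layering, here is a sketch of how a named-child move and a HasChildren check could be written using only the abstract primitives. It is not the framework's source code; `DerivedMoves`, `MoveToChildNamed` and `HasChildren` are hypothetical helpers written for this post.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

static class DerivedMoves
{
    // Hypothetical equivalent of MoveToChild(localName, namespaceURI),
    // built only from MoveToFirstChild, MoveToNext and MoveToParent.
    public static bool MoveToChildNamed(XPathNavigator nav, string localName, string namespaceUri)
    {
        if (!nav.MoveToFirstChild())
            return false;                   // no children at all
        do
        {
            if (nav.NodeType == XPathNodeType.Element &&
                nav.LocalName == localName &&
                nav.NamespaceURI == namespaceUri)
                return true;                // found: stay on this child
        } while (nav.MoveToNext());
        nav.MoveToParent();                 // not found: restore the original position
        return false;
    }

    // Same idea for a property: probe on a clone so the caller's cursor is untouched.
    public static bool HasChildren(XPathNavigator nav) => nav.Clone().MoveToFirstChild();

    public static XPathNavigator LoadSample() =>
        new XPathDocument(new StringReader(
            "<order><article id=\"1\"/><article id=\"5\"/></order>"))
            .CreateNavigator();

    static void Main()
    {
        XPathNavigator nav = LoadSample();
        Console.WriteLine(MoveToChildNamed(nav, "order", ""));    // True
        Console.WriteLine(MoveToChildNamed(nav, "missing", ""));  // False
        Console.WriteLine(nav.LocalName);                         // order: position restored
        Console.WriteLine(HasChildren(nav));                      // True
    }
}
```

Any navigator that implements the primitives correctly gets helpers like these for free, which is the property the rest of this series relies on.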
This design helps a lot in our case. We can replace the default .Net-provided XPathNavigator implementation without changing how we use it. We will create a new implementation that does not load all the XML data in memory; instead, it will cache this information to disk. Of course, since disk IO will occur, our implementation will probably be slower. We will see what we can do about it in a later post.
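Below is the limited list of methods and properties that must be implemented in order to support XPath querying. Note that a concrete class must also implement the abstract Clone() and IsSamePosition() methods, as well as the Value property inherited from XPathItem, which this listing omits.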

    public abstract string BaseURI { get; }
    public abstract bool IsEmptyElement { get; }
    public abstract string LocalName { get; }
    public abstract string Name { get; }
    public abstract string NamespaceURI { get; }
    public abstract XmlNameTable NameTable { get; }
    public abstract XPathNodeType NodeType { get; }
    public abstract string Prefix { get; }

    public abstract bool MoveTo(XPathNavigator other);
    public abstract bool MoveToFirstAttribute();
    public abstract bool MoveToFirstChild();
    public abstract bool MoveToFirstNamespace(XPathNamespaceScope namespaceScope);
    public abstract bool MoveToId(string id);
    public abstract bool MoveToNext();
    public abstract bool MoveToNextAttribute();
    public abstract bool MoveToNextNamespace(XPathNamespaceScope namespaceScope);
    public abstract bool MoveToParent();
    public abstract bool MoveToPrevious();

As you can see, the minimum interface we need to support is not as big as we might have thought. We still have a lot to do though. We have to design our solution and implement it but, more importantly, we need to write tests for it.
Conclusion
In the next post, we will set up a series of unit tests. These tests will be run against both the standard implementation (to ensure we understand the requirements correctly) and our new implementation (to make sure we stick to the requirements).
The design of XPathNavigator is quite clever. Basing the implementation of XPath queries on the abstract primitive Move() and Clone() operations enables implementors to keep their internal representation of the data completely decoupled. An implementation could very well provide an XML-compatible view on a data structure completely unrelated to XML. For instance, it is quite simple to expose the information of a tree of POCOs using Reflection. Another example would be to expose other data formats such as JSON to XPath-only consumers.
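As a rough sketch of that last idea, the navigator below exposes a plain object tree to XPath. It is deliberately minimal and rests on several assumptions: `Item`, `ItemNavigator` and `ItemDemo` are hypothetical types invented for this example, every object is presented as an element, and attributes, namespaces, ids and text nodes are not supported. It is not the disk-backed implementation this series will build.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.XPath;

// Hypothetical plain object tree: nothing in it is XML.
sealed class Item
{
    public readonly string Name;
    public readonly string Text;
    public Item Parent;
    public readonly List<Item> Children = new List<Item>();
    public Item(string name, string text = "") { Name = name; Text = text; }
    public Item Add(Item child) { child.Parent = this; Children.Add(child); return this; }
}

// Sketch only: every Item is exposed as an element; attributes, namespaces,
// ids and text nodes are deliberately not supported.
sealed class ItemNavigator : XPathNavigator
{
    private static readonly XmlNameTable Names = new NameTable();
    private readonly Item _root;   // synthetic node that plays the document root
    private Item _current;

    public ItemNavigator(Item top) { _root = new Item("").Add(top); _current = _root; }
    private ItemNavigator(Item root, Item current) { _root = root; _current = current; }

    // Cursor primitives: this is all the XPath engine needs to walk the tree.
    public override XPathNavigator Clone() => new ItemNavigator(_root, _current);
    public override bool IsSamePosition(XPathNavigator other) =>
        other is ItemNavigator n && n._current == _current;
    public override bool MoveTo(XPathNavigator other) =>
        other is ItemNavigator n && n._root == _root && Go(n._current);
    public override bool MoveToFirstChild() =>
        _current.Children.Count > 0 && Go(_current.Children[0]);
    public override bool MoveToParent() => _current.Parent != null && Go(_current.Parent);
    public override bool MoveToNext() => MoveToSibling(+1);
    public override bool MoveToPrevious() => MoveToSibling(-1);
    private bool Go(Item target) { _current = target; return true; }
    private bool MoveToSibling(int step)
    {
        if (_current.Parent == null) return false;
        List<Item> siblings = _current.Parent.Children;
        int index = siblings.IndexOf(_current) + step;
        return index >= 0 && index < siblings.Count && Go(siblings[index]);
    }

    // Unsupported axes simply report failure.
    public override bool MoveToFirstAttribute() => false;
    public override bool MoveToNextAttribute() => false;
    public override bool MoveToFirstNamespace(XPathNamespaceScope scope) => false;
    public override bool MoveToNextNamespace(XPathNamespaceScope scope) => false;
    public override bool MoveToId(string id) => false;

    // Node properties.
    public override XPathNodeType NodeType =>
        _current == _root ? XPathNodeType.Root : XPathNodeType.Element;
    public override string LocalName => _current.Name;
    public override string Name => _current.Name;
    public override string NamespaceURI => "";
    public override string Prefix => "";
    public override string BaseURI => "";
    public override bool IsEmptyElement => false;
    public override XmlNameTable NameTable => Names;
    public override string Value => Concat(_current);

    // An element's value is its own text followed by that of its descendants.
    private static string Concat(Item item)
    {
        string text = item.Text;
        foreach (Item child in item.Children) text += Concat(child);
        return text;
    }
}

static class ItemDemo
{
    public static ItemNavigator Sample() => new ItemNavigator(
        new Item("order")
            .Add(new Item("article").Add(new Item("id", "1")).Add(new Item("quantity", "12")))
            .Add(new Item("article").Add(new Item("id", "5")).Add(new Item("quantity", "8"))));

    static void Main()
    {
        XPathNavigator nav = Sample();
        foreach (XPathNavigator article in nav.Select("/order/article[quantity > 10]"))
            Console.WriteLine(article.SelectSingleNode("id").Value);
        Console.WriteLine(nav.Evaluate("sum(/order/article/quantity)"));
    }
}
```

The point of the sketch is that Select() and Evaluate() are inherited untouched: only the cursor primitives and node properties are written by hand.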