Monthly Archives: May 2013

XPathDocument and whitespaces

Writing code is fun. At least it is for me. But sometimes it gets irritating. You know, you’re busy on something, you write the code, you know it’s right, but it doesn’t work… You keep your focus on that one piece of code you just wrote and it keeps on not working. Sometimes the reason is obvious, but sometimes you keep reviewing your code and its surroundings, you debug your way through several variants of your solution, and it keeps on not working…

I just had one of those moments…

And then, bang! The solution jumps out at me, and it’s so obvious I almost feel ashamed :-|

I was writing the unit tests in preparation for my next article on creating an XPathNavigator implementation. The code basically boils down to this:

I am testing MoveToParent() from a whitespace node. "/root/text()" is expected to give me an XPathNavigator positioned on the whitespace node inside the <root> element. And nav just keeps on being null. Since I had been busy writing pairs of XML samples and XPath queries to put my test in each situation I needed to cover, I immediately assumed my XPath query was not correct. I just kept on tweaking here and there. nav is null still…

After some time, I decided not to put more effort into it and to come back to it later, once I could take the necessary step back. I posted a question on stackoverflow.com and worked on something else.

I was busy on something completely different when it struck me. One of those “Aha” moments. XPathDocument has constructor overloads that take an XmlSpace enum value. By default, if you don’t specify it, XPathDocument simply skips all non-significant whitespace nodes.
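The difference can be seen in a minimal sketch like the one below. The XML sample and the variable names are assumptions made for illustration; they are not the code from my test suite.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
// Hypothetical sample: <root> holds whitespace-only text nodes around <child/>.
const string xml = "<root>\n  <child/>\n</root>";

// Default loading (XmlSpace.Default): whitespace-only text nodes are dropped.
var strippedDoc = new XPathDocument(new StringReader(xml));
var navDefault = strippedDoc.CreateNavigator().SelectSingleNode("/root/text()");

// XmlSpace.Preserve keeps those nodes, so the same query can find one.
using var reader = XmlReader.Create(new StringReader(xml));
var preservedDoc = new XPathDocument(reader, XmlSpace.Preserve);
var navPreserved = preservedDoc.CreateNavigator().SelectSingleNode("/root/text()");

Console.WriteLine(navDefault == null ? "default: no node" : "default: " + navDefault.NodeType);
Console.WriteLine(navPreserved == null ? "preserve: no node" : "preserve: " + navPreserved.NodeType);
```

Only the second navigator is expected to be non-null, which is what a MoveToParent() test starting from a whitespace node needs.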

That’s it … annoying.

So what’s wrong with XPathDocument?

This post is the first in a series of posts related to XPathDocument and XPathNavigator. I will highlight the qualities and drawbacks of the standard .NET implementations and go through the design and development of a new implementation that better fits my needs.

First, what is an XPathDocument?

An XPathDocument is used when you need to query XML data using XPath. For example, you can get a list of article ids and ordered quantities from the following XML file:

using the following code:
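A minimal sketch of such a query, assuming a hypothetical order document whose <article> elements carry id and quantity attributes (the attribute names and values are illustrative assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
// Hypothetical order file; the attribute names are assumptions.
const string orderXml = @"<order>
  <article id=""A-1"" quantity=""2"" />
  <article id=""B-7"" quantity=""10"" />
</order>";

// The document is only used to create the navigator.
var document = new XPathDocument(new StringReader(orderXml));
XPathNavigator navigator = document.CreateNavigator();

// Select every <article> under <order> and read its attributes.
var lines = new List<(string Id, int Quantity)>();
XPathNodeIterator articles = navigator.Select("/order/article");
while (articles.MoveNext())
{
    var article = articles.Current;
    lines.Add((article.GetAttribute("id", ""), int.Parse(article.GetAttribute("quantity", ""))));
}

foreach (var line in lines)
    Console.WriteLine($"{line.Id}: {line.Quantity}");
```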

XPath, once you get the hang of it, is very powerful and flexible for accessing XML data. It allows for complex queries and computation.

Where’s the catch?

This is great but there is a drawback. As per the documentation, XPathDocument provides a fast, read-only, in-memory representation of an XML document by using the XPath data model. This does not scale well with file size: if your files grow to tens of MB or larger, that much data is loaded into memory. I recently built a mapping utility based on XPath. The initial requirement was to handle lots of small files. It turned out that, once in the field, clients were feeding it a small number of large files instead. Loading these large files into memory caused a lot of issues, from poor responsiveness due to excessive swapping to plain OutOfMemoryException.

The good news is that we can do something about it.

How does XPath work in .NET?

In the above code snippet, you can see that the only reference to XPathDocument is to create it. We then use it only once to create an XPathNavigator. The rest of the XPath querying involves only XPathNavigators.

XPathNavigator class

An XPathNavigator is a cursor on an XML data structure. As a cursor, it provides basic operations to move the cursor, to query information about the data it points to, and to clone itself. XPathNavigator can also be used to update the underlying data if the implementation supports it.

The data of the node pointed to by a navigator can be accessed using a set of properties. The most commonly used are LocalName, NamespaceURI and Prefix but, most importantly, Value and its variants. NodeType is also important: the type of a node determines which move operations are allowed.
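As a small sketch of these properties and of Clone(), assuming a hypothetical namespaced order document (the namespace URI and the content are made up for illustration):

```csharp
using System;
using System.IO;
using System.Xml.XPath;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
// Hypothetical sample; the namespace URI is an assumption.
const string xml =
    "<o:order xmlns:o=\"urn:example:orders\"><o:article id=\"A-1\">Widget</o:article></o:order>";

XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
nav.MoveToFirstChild();   // root node -> <o:order>
nav.MoveToFirstChild();   // <o:order> -> <o:article>

Console.WriteLine($"{nav.NodeType} {nav.Prefix}:{nav.LocalName} in {nav.NamespaceURI} = {nav.Value}");

// Clone() gives an independent cursor: moving the copy leaves nav in place.
XPathNavigator attribute = nav.Clone();
attribute.MoveToFirstAttribute();
Console.WriteLine($"{attribute.NodeType} {attribute.LocalName} = {attribute.Value}");
Console.WriteLine(nav.IsSamePosition(attribute));
```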

To move a navigator around, the following methods can be used. Notice that all but one of them return a boolean, which indicates whether the move operation succeeded. True means the navigator now points to the new node; false means it has not moved and still points to the original node. The only method that does not return a boolean is MoveToRoot(), because it always succeeds.

An operation may fail for various reasons. For example, MoveToNext() will fail if the current node has no next sibling (e.g. the last element of a sequence) or if the current node is an attribute. MoveToChild() will fail if there is no child of the current node that satisfies the conditions.
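The boolean contract can be observed with a short sketch. The XML sample is a hypothetical one, chosen so that both failure cases mentioned above occur:

```csharp
using System;
using System.IO;
using System.Xml.XPath;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
// Hypothetical sample document.
const string xml = "<order><article id=\"A-1\" /><article id=\"B-7\" /></order>";
XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();

bool toOrder  = nav.MoveToFirstChild();    // root node -> <order>
bool toFirst  = nav.MoveToFirstChild();    // <order> -> first <article>
bool toSecond = nav.MoveToNext();          // first <article> -> second <article>
bool pastLast = nav.MoveToNext();          // no next sibling: the cursor stays put
string stillOn = nav.GetAttribute("id", "");

bool toAttribute   = nav.MoveToFirstAttribute();  // <article> -> its id attribute
bool fromAttribute = nav.MoveToNext();            // MoveToNext() fails on an attribute

nav.MoveToRoot();                          // returns void: it cannot fail

Console.WriteLine($"{toOrder} {toFirst} {toSecond} {pastLast} {stillOn}");
Console.WriteLine($"{toAttribute} {fromAttribute} {nav.NodeType}");
```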

XPath queries?

That’s all very good but you might ask ‘what about XPath queries?’. XPath queries can be executed using the following methods:

Evaluate() returns a value whose type depends on the XPath query: a number (returned as a double), a string, a boolean or a node set. Matches() tells you whether the current node satisfies the conditions expressed as an XPath expression. The Select() methods return a node iterator over their result, that is, a set of XPathNavigators, each pointing to a node in the result set of the XPath expression.
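A short sketch of the three kinds of call, again on a hypothetical order document (the element and attribute names are assumptions):

```csharp
using System;
using System.IO;
using System.Xml.XPath;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
// Hypothetical sample document.
const string xml =
    "<order><article id=\"A-1\" quantity=\"2\" /><article id=\"B-7\" quantity=\"10\" /></order>";
XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();

// Evaluate(): the runtime type of the result follows the expression.
double totalQuantity = (double)nav.Evaluate("sum(/order/article/@quantity)");
bool hasLargeLine = (bool)nav.Evaluate("boolean(/order/article[@quantity > 5])");
string firstId = (string)nav.Evaluate("string(/order/article[1]/@id)");

// Select(): an iterator of navigators, one per node in the result.
XPathNodeIterator articles = nav.Select("/order/article");
int articleCount = articles.Count;

// Matches(): tests the node the navigator currently points to.
XPathNavigator first = nav.Clone();
first.MoveToFirstChild();   // root node -> <order>
first.MoveToFirstChild();   // <order> -> first <article>
bool isA1 = first.Matches("article[@id='A-1']");
bool isB7 = first.Matches("article[@id='B-7']");

Console.WriteLine($"{totalQuantity} {hasLargeLine} {firstId} {articleCount} {isA1} {isB7}");
```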

So how do we solve our problem?

The key element that will help us solve our scaling issue lies in the implementation of the XPath querying methods (Evaluate, Matches and Select). Their implementation is actually expressed in terms of Move() operations and property checks on XPathNavigators.

The following example finds the <article> nodes of the root element <order> in two ways: first with Select(), then with a series of Move() operations.
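A sketch of that comparison could look like this. The sample document, the id attribute and the helper names IdsWithSelect and IdsWithMoves are hypothetical, introduced only for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.

// Hypothetical helper: collect article ids with an XPath query.
List<string> IdsWithSelect(XPathNavigator start)
{
    var ids = new List<string>();
    XPathNodeIterator it = start.Select("/order/article");
    while (it.MoveNext())
        ids.Add(it.Current.GetAttribute("id", ""));
    return ids;
}

// Hypothetical helper: collect the same ids using only Clone() and Move() calls.
List<string> IdsWithMoves(XPathNavigator start)
{
    var ids = new List<string>();
    XPathNavigator cursor = start.Clone();      // leave the caller's cursor untouched
    cursor.MoveToRoot();
    if (cursor.MoveToChild("order", "") && cursor.MoveToChild("article", ""))
    {
        do
        {
            ids.Add(cursor.GetAttribute("id", ""));
        } while (cursor.MoveToNext("article", ""));   // skips siblings with other names
    }
    return ids;
}

// Hypothetical sample with non-article children mixed in.
const string xml =
    "<order><customer /><article id=\"A-1\" /><note /><article id=\"B-7\" /></order>";
XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();

List<string> viaSelect = IdsWithSelect(nav);
List<string> viaMoves = IdsWithMoves(nav);
Console.WriteLine(string.Join(", ", viaSelect));
Console.WriteLine(string.Join(", ", viaMoves));
```

Both helpers are expected to return the same ids in document order.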

All XPath queries can be expressed as a series of Move() and Clone() operations, and this is exactly what Select() does behind the scenes. This is where the design of the XPathNavigator class shines: because Select() is implemented exclusively in terms of Move() and Clone() operations, any implementation of XPathNavigator that supports these operations can benefit from the XPath query language.

Did you notice earlier that some of the Move() operations are virtual and others abstract? In the same way that XPath queries can be expressed as a series of Move() operations, most Move() operations can be expressed in terms of a few basic ones. For example, the default implementation of MoveToRoot() is simply while (this.MoveToParent()) {}. Properties of XPathNavigator follow the same pattern: virtual properties have a default implementation that relies on the abstract properties.

This design helps a lot in our case. We can get rid of the default .NET-provided XPathNavigator implementation without changing our usage. We will create a new implementation that does not load all the XML data into memory; instead, it will cache this information on disk. Of course, since disk IO will occur, our implementation will probably be slower. We will see what we can do about it in a later post.

Below is the short list of methods and properties that must be implemented in order to support XPath querying.
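One way to obtain such a list yourself, rather than trusting a hand-written one, is to ask reflection which public members of XPathNavigator are abstract. This is only a sketch: the exact output depends on the framework version you run it against.

```csharp
using System;
using System.Linq;
using System.Reflection;
using System.Xml.XPath;

// Top-level statements (C# 9 or later); wrap in a Main method on older compilers.
Type navigatorType = typeof(XPathNavigator);

// Abstract methods, excluding property accessors (listed separately below).
string[] abstractMethods = navigatorType
    .GetMethods(BindingFlags.Public | BindingFlags.Instance)
    .Where(m => m.IsAbstract && !m.IsSpecialName)
    .Select(m => m.Name)
    .Distinct()
    .OrderBy(name => name, StringComparer.Ordinal)
    .ToArray();

// Abstract properties, including those inherited from XPathItem.
string[] abstractProperties = navigatorType
    .GetProperties(BindingFlags.Public | BindingFlags.Instance)
    .Where(p => p.GetGetMethod()?.IsAbstract == true)
    .Select(p => p.Name)
    .Distinct()
    .OrderBy(name => name, StringComparer.Ordinal)
    .ToArray();

Console.WriteLine("Methods:    " + string.Join(", ", abstractMethods));
Console.WriteLine("Properties: " + string.Join(", ", abstractProperties));
```

Members with a default implementation, such as MoveToRoot() and Select(), should not appear in the output.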

As you can see, the minimum interface we need to support is not as big as we could have thought. We still have a lot to do though. We have to design our solution and implement it but more importantly, we need to write tests for it.

Conclusion

In the next post, we will set up a series of unit tests. These tests will be run against both the standard implementation (to ensure we understand the requirements correctly) and our new implementation (to make sure we stick to the requirements).

The design of XPathNavigator is quite clever. Basing the implementation of XPath queries on abstract, primitive Move() and Clone() operations enables implementors to keep their internal representation of the data completely decoupled. An implementation could very well provide an XML-compatible view on a data structure completely unrelated to XML. For instance, it is quite simple to expose the information of a tree of POCOs using Reflection. Another example would be to expose other data formats such as JSON to XPath-only consumers.