20121206

XML tuning

Going off on a whole different path than normal, I'm going to discuss something technical. My work involves a lot of text processing, which means a lot of XML parsing. Since we have a lot of Groovy code, we started off using its XmlSlurper for parsing. And from a theoretical perspective, it's a great parser. It does everything we need, and it was fairly simple to extend to track offsets into a document (i.e., the values in node <node> go from offset 25 to 75), but it has turned out to be fairly slow on large documents.

We have some files that regularly reach 8MB, and we might have to pull a few thousand values out of each of them. After parsing, access was quite speedy, but the initial parse was taking seven to eight minutes. This was well over 99% of our execution time, so I finally went searching for alternatives a couple of days ago (I got on a bit of a performance kick after we got dressed down over the problems our lack of it was causing).

So, after doing some searching, I found VTD-XML, by XimpleWare. It claims to be fast and small, but with little or no support for namespaces. That suits our needs, at least for those documents, quite well. And it works by tracking offsets, which means no additional book-keeping for me. Sweet.

It took a bit of work to get it right; the documentation isn't great, and focuses on getting Strings out of the document, not offsets.

It's fairly minimal in how many classes you need (four, really, including the main Exception class), which is good. But the javadoc has entries (sometimes unexplained) for just about everything, including all the internal stuff. So this guide is pretty useful, but it glosses over some stuff I needed.

What I found out was that what I needed was VTDNav.getContentFragment() for nodes (and some bit-shifting/masking on its result). For some reason, though, this provides no useful values for attributes. For those, I needed VTDNav.getTokenOffset(int) and some String.indexOf(int, int) calls. But when I finally figured that out, the code turned out to be very small and fast, indeed. Just over one hundred lines of code on my part (most of which was fitting things into my existing interfaces), and parse time on those same documents dropped from seven or eight minutes to about two seconds. Not too shabby, especially since the refactoring of my code to use it was just changing a couple of class references to interface references.
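For anyone wondering what the bit-shifting and the indexOf calls amount to, here's a rough sketch in plain Java. It's not my production code, and it doesn't touch VTD-XML at all, so you can run it as-is. It rests on two assumptions taken from my reading of the javadoc: that the long from getContentFragment() packs the offset into the lower 32 bits and the length into the upper 32 bits, and that the token offset for an attribute value points at the first character inside the quotes. The class name OffsetSketch and all of its methods are made up for illustration, and pack() only exists to fake a return value.

```java
// Hypothetical helper; none of these names come from VTD-XML.
final class OffsetSketch {

    // Fake a packed value the way getContentFragment() is ASSUMED to
    // build it: length in the upper 32 bits, offset in the lower 32.
    static long pack(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    // Lower 32 bits: where the content starts.
    static int offsetOf(long packed) {
        return (int) (packed & 0xFFFFFFFFL);
    }

    // Upper 32 bits: how long the content is.
    static int lengthOf(long packed) {
        return (int) (packed >>> 32);
    }

    // End offset (exclusive) of the content.
    static int endOf(long packed) {
        return offsetOf(packed) + lengthOf(packed);
    }

    // For attributes: given the document text and the offset of the first
    // character of the value (ASSUMED to sit just after the opening quote),
    // find the matching closing quote with String.indexOf(int, int).
    // Returns {start, end}, end exclusive, or null if no closing quote.
    static int[] attributeValueRange(String doc, int valueStart) {
        char quote = doc.charAt(valueStart - 1); // either " or '
        int end = doc.indexOf(quote, valueStart);
        return end < 0 ? null : new int[] { valueStart, end };
    }

    public static void main(String[] args) {
        long packed = pack(25, 50);
        System.out.println(offsetOf(packed) + "-" + endOf(packed)); // prints 25-75

        String doc = "<node id=\"abc\">text</node>";
        int[] range = attributeValueRange(doc, 10);
        System.out.println(doc.substring(range[0], range[1])); // prints abc
    }
}
```

One thing to watch: this treats offsets as indices into a Java String, which only lines up with the parser's offsets if every character in the document takes the same number of bytes. For anything else, you'd need to map between the two.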

I can't wait to see people's faces when this code makes it into production.

(One minor word of warning to people thinking of using this code: it dumps some error messages via System.out.println(). Not sure what they were thinking, there. Thankfully, that's not a big problem for us, but it will be for some people. Also, make sure namespaces aren't going to trip you up.)
