Daniel Lemire's blog


You probably misunderstand XML

When I took my current position, I was invited to teach a course on unstructured data. It is a sensible topic for a course: some say that between 80% and 90% of all enterprise data is unstructured. But I objected to the title for marketing reasons. How many students would take a course on unstructured data? I can hear the students asking “what’s that course about?” Thus, I proposed a better title for the course: information retrieval and filtering. Indeed, everyone wants to filter and retrieve data, right?

Meanwhile, there were already courses on structured data (that is, on databases and information systems). However, there was no course on semi-structured data. So I proposed one. But I couldn’t call it semi-structured data as hardly any student would know what the title meant. Instead, I proposed a course which, roughly translated, is called “Information Management with XML.”

Immediately, I got into trouble: how could I dare omit SOAP and web services from a course on XML? I was annoyed by these comments. With some sense of irony, I decided to start dumping some SOAP examples on my students so that they could see the “beauty” [I’m being ironic] of using XML for data exchange on the web. So, there I was, trying to teach my students about semi-structured data, and I was asked to tell them about remote procedure calls, an irrelevant topic for my purposes.

Thankfully, it appears that history is on my side. Developers got tired of these annoying XML payloads. In time, they started using JSON, a much more appropriate format for passing small loads of structured data between a server and an ECMAScript client. It uses fewer bytes and, more importantly, parsing JSON is often an order of magnitude faster than parsing XML. When you ask on Stack Overflow whether you should use SOAP, you are told to avoid it at all costs. The developers have spoken. And as a result, the organization behind the SOAP stack decided to close shop.
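To make the "fewer bytes" point concrete, here is a minimal Python sketch that serializes the same small record both ways. The record, its field names, and the `getPostResponse` element are made up for illustration. Only the SOAP 1.1 envelope namespace is real. It is a single toy payload, not a benchmark, and it says nothing about parsing speed.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record; the field names are invented for this example.
record = {"id": 42, "title": "You probably misunderstand XML", "tags": ["xml", "json"]}

# Compact JSON encoding of the record.
json_payload = json.dumps(record, separators=(",", ":"))

# The same record wrapped in a SOAP-style envelope.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
ET.register_namespace("soap", SOAP_NS)
envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
post = ET.SubElement(body, "getPostResponse")  # hypothetical operation name
ET.SubElement(post, "id").text = str(record["id"])
ET.SubElement(post, "title").text = record["title"]
tags = ET.SubElement(post, "tags")
for tag in record["tags"]:
    ET.SubElement(tags, "tag").text = tag
xml_payload = ET.tostring(envelope, encoding="unicode")

# Compare the encoded sizes of the two payloads.
json_size = len(json_payload.encode("utf-8"))
xml_size = len(xml_payload.encode("utf-8"))
print(json_size, "bytes as JSON")
print(xml_size, "bytes as XML")
```

For a flat record like this one, the envelope and the closing tags account for most of the difference.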

Where does that leave XML? Precisely where it started. XML is a great meta-example of how to deal with semi-structured data. And it is just as useful as ever. Want to deal with documents? DocBook and OpenDocument are great formats. Want to add semantic information to web pages? Microformats can do it. You want to exchange complex business data? The Universal Business Language probably does what you need. Some people are having luck with the SVG image format. You want to subscribe to my blog? Grab my Atom feed. For these applications, you couldn’t easily replace XML with flat files or JSON. Nor should you try.
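One way to see why documents resist flat files or JSON is mixed content, where running text and markup interleave in a meaningful order. The following Python sketch walks a tiny fragment with the standard library. The element names (`para`, `em`, `code`) are chosen for illustration and are not taken from DocBook or any other real vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical document fragment with text and inline markup interleaved.
doc = "<para>XML suits <em>documents</em>, where text and <code>markup</code> interleave.</para>"

para = ET.fromstring(doc)

# Collect the mixed content in document order:
# leading text, then each child element followed by its trailing text.
pieces = []
if para.text:
    pieces.append(("text", para.text))
for child in para:
    pieces.append((child.tag, child.text))
    if child.tail:
        pieces.append(("text", child.tail))

for kind, value in pieces:
    print(f"{kind:>5}: {value!r}")

# The plain text is still recoverable by concatenating every text node.
plain = "".join(para.itertext())
print(plain)
```

A JSON encoding of the same paragraph would need an explicit ordered list of text and element nodes, which amounts to reinventing the document model that XML already provides.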

Alas, we ended up torturing XML by applying it to ill-suited purposes. We must learn how to select the best format. Does your data look like a table? Can a flat file do the job? Do you need a key-value format like JSON? Or maybe a simple text file? Or is your data more like an XML document? Take a good look at your data before picking a format for it.

Further reading: Indexing XML and Native XML databases: have they taken the world over yet?

Update: I don’t include configuration files in my list of proper XML applications.