XML is a hierarchical markup language. It uses opening and closing tags to define data. It's used to store and exchange data, and because of its extreme flexibility, it's used for everything from documentation to graphics.
Here's a sample XML document:
<xml>
  <os>
    <linux>
      <distribution>
        <name>Fedora</name>
        <release>8</release>
        <codename>Werewolf</codename>
      </distribution>
      <distribution>
        <name>Slackware</name>
        <release>12.1</release>
        <mascot>
          <official>Tux</official>
          <unofficial>Bob Dobbs</unofficial>
        </mascot>
      </distribution>
    </linux>
  </os>
</xml>
Reading the sample XML, you might find there's an intuitive quality to the format. You can probably understand the data in this document whether you're familiar with the subject matter or not. This is partly because XML is considered verbose. It uses lots of tags, the tags can have long and descriptive names, and the data is ordered in a hierarchical manner that helps explain the data's relationships. You probably understand from this sample that the Fedora distribution and the Slackware distribution are two different and unrelated instances of Linux because each one is "contained" inside its own independent <distribution> tag.
XML is also extremely flexible. Unlike HTML, there's no predefined list of tags. You are free to create whatever data structure you need to represent.
Components of XML
Data exists to be read, and when a computer "reads" data, the process is called parsing. Using the sample XML data again, here are the terms that most XML parsers consider significant.
- Document: The <xml> tag opens a document, and the </xml> tag closes it.
- Node: The <os>, <distribution>, and <mascot> are nodes. In parsing terminology, a node is a tag that contains other tags.
- Element: An entity such as <name>Fedora</name> and <official>Tux</official>, from the first < to the last >, is an element.
- Content: The data between two element tags is considered content. In the first <name> element, the string Fedora is the content.
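As a rough illustration of these terms, here is a minimal sketch that loads the sample with Python's standard xml.etree.ElementTree module. The variable names are made up for this example, and the "node" test (a tag with at least one child tag) follows this article's definition rather than any formal one.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
  <os>
    <linux>
      <distribution>
        <name>Fedora</name>
        <release>8</release>
        <codename>Werewolf</codename>
      </distribution>
      <distribution>
        <name>Slackware</name>
        <release>12.1</release>
        <mascot>
          <official>Tux</official>
          <unofficial>Bob Dobbs</unofficial>
        </mascot>
      </distribution>
    </linux>
  </os>
</xml>"""

# Document: parsing the whole string returns the outermost <xml> tag.
document = ET.fromstring(SAMPLE)

# Node (in this article's sense): any tag that contains other tags.
nodes = [el.tag for el in document.iter() if len(el) > 0]

# Element: a tag and everything from its first < to its last >.
first_name = document.find("./os/linux/distribution/name")
element_text = ET.tostring(first_name, encoding="unicode").strip()

# Content: the data between the opening and closing tags.
content = first_name.text

print(nodes)
print(element_text)
print(content)
```

The tags that hold only text, such as <name> and <release>, are left out of the nodes list because they contain no other tags.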
XML schema
The set of tags in an XML document, and the rules for which tags may contain which, are known as its schema.
Some schemas are made up as you go (for example, the sample XML code in this article was purely improvised), while others are strictly defined by a standards group. For example, the Scalable Vector Graphics (SVG) schema is defined by the W3C, while the DocBook schema is maintained by the DocBook Technical Committee at OASIS, with Norman Walsh as a principal author.
A schema enforces consistency. The most basic schemas are usually also the most restrictive. In my example XML code, it wouldn't make sense to place a distribution name within the <mascot> node because the implied schema of the document makes it clear that a mascot must be a "child" element of a distribution.
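To make the idea of an implied schema concrete, here is a small sketch that writes the sample's parent and child rules down as a Python dictionary and checks a document against them. The ALLOWED_CHILDREN table and the find_violations function are hypothetical names invented for this example. Real schemas are written in dedicated languages such as DTD, XML Schema (XSD), or RELAX NG, which this sketch does not attempt to reproduce.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: each tag maps to the child tags it may contain.
ALLOWED_CHILDREN = {
    "xml": {"os"},
    "os": {"linux"},
    "linux": {"distribution"},
    "distribution": {"name", "release", "codename", "mascot"},
    "mascot": {"official", "unofficial"},
}


def find_violations(element):
    """Return (parent, child) tag pairs that break the implied schema."""
    violations = []
    # Tags missing from the table are treated as allowing no child tags.
    allowed = ALLOWED_CHILDREN.get(element.tag, set())
    for child in element:
        if child.tag not in allowed:
            violations.append((element.tag, child.tag))
        violations.extend(find_violations(child))
    return violations


good = ET.fromstring(
    "<xml><os><linux><distribution><name>Slackware</name>"
    "<mascot><official>Tux</official></mascot>"
    "</distribution></linux></os></xml>"
)
# Same data, but the distribution name is misplaced inside <mascot>.
bad = ET.fromstring(
    "<xml><os><linux><distribution>"
    "<mascot><name>Slackware</name></mascot>"
    "</distribution></linux></os></xml>"
)

print(find_violations(good))
print(find_violations(bad))
```

The first document passes with no violations, while the second reports that <name> is not an allowed child of <mascot>.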
Document Object Model (DOM)
Talking about XML would get confusing if you had to constantly describe tags and positions (e.g., "the name tag of the second distribution tag in the Linux part of the OS section"), so parsers use the concept of a Document Object Model (DOM) to represent XML data. The DOM places XML data into a sort of "family tree" structure, starting from the root element (in my sample XML, that's the outermost <xml> tag, whose only child is the <os> tag) and including each tag.
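This same XML data structure can be expressed as paths, just like files in a Linux system or the location of web pages on the internet. For instance, the path to the <mascot> tag can be represented as //os/linux/distribution/mascot. A path is built from tag names only: Slackware is the content of a <name> tag, not a tag itself, so it does not appear as a step in the path.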
The path to both <distribution> tags can be represented as //os/linux/distribution. Because there are two distribution nodes, a parser loads both nodes (and the contents of each) into an array that can be queried.
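Path queries like these can be tried with Python's standard xml.etree.ElementTree module, as in the sketch below. ElementTree supports only a limited subset of XPath, and its paths are relative to the element you search from, so the queries here start with ./os because the parsed root is the outermost <xml> tag. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
  <os>
    <linux>
      <distribution>
        <name>Fedora</name>
        <release>8</release>
        <codename>Werewolf</codename>
      </distribution>
      <distribution>
        <name>Slackware</name>
        <release>12.1</release>
        <mascot>
          <official>Tux</official>
          <unofficial>Bob Dobbs</unofficial>
        </mascot>
      </distribution>
    </linux>
  </os>
</xml>"""

root = ET.fromstring(SAMPLE)

# Both <distribution> nodes come back together as a list.
distributions = root.findall("./os/linux/distribution")
names = [d.findtext("name") for d in distributions]

# A predicate narrows the match to the distribution named Slackware.
mascot = root.find("./os/linux/distribution[name='Slackware']/mascot")
official = mascot.findtext("official")

print(len(distributions))
print(names)
print(official)
```

The predicate in square brackets selects by content, which is how a query singles out one distribution when several share the same tag name.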
Strict XML
XML is also known for being strict. This means that most applications are designed to intentionally fail when they encounter errors in XML. That may sound problematic, but it's one of the things developers appreciate most about XML because unpredictable things can happen when applications try to guess how to resolve an error. For example, back before HTML was well defined, most web browsers included a "quirks mode" so that when people tried to view poor HTML code, the web browser could load what the author probably intended. The results were wildly unpredictable, especially when one browser guessed differently than another.
XML disallows this by intentionally failing when there's an error. This lets the author fix errors until they produce valid XML. Because XML is well-defined, there are validator plugins for many applications and standalone commands like xmllint and xmlstarlet to help you locate errors early.
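The same fail-early behavior can be seen from a program. In this minimal sketch, Python's standard xml.etree.ElementTree parser refuses a document with an unclosed <name> tag instead of guessing at a repair. The try_parse helper is a hypothetical name for this example, and the check covers only well-formedness, not conformance to any particular schema.

```python
import xml.etree.ElementTree as ET


def try_parse(text):
    """Report whether the parser accepts the text; it never attempts a repair."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return f"rejected: {err}"
    return "accepted"


# The <name> tag is never closed, so the document is not well-formed.
broken = "<distribution><name>Fedora</distribution>"
fixed = "<distribution><name>Fedora</name></distribution>"

result_broken = try_parse(broken)
result_fixed = try_parse(fixed)

print(result_broken)
print(result_fixed)
```

The error message includes the line and column where parsing stopped, which is usually enough to find the mistake.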
Transforming XML
Because XML is often used as an interchange format, it's common to transform XML into some other data format or into some other XML schema. Classic examples include xsltproc, xmlto, and pandoc, but there are many other applications designed, at least in part, to convert XML.
In fact, LibreOffice uses XML to lay out its word processor and spreadsheet documents, so any time you export or convert a file from LibreOffice, you're transforming XML.
Ebooks in the open source EPUB format use XML, so any time you convert a document into an EPUB or from an EPUB, you're transforming XML.
Inkscape, the vector-based illustration application, saves its files in SVG, which is an XML schema designed for graphics. Any time you export an image from Inkscape as a PNG file, you're transforming XML.
The list could go on and on. XML is a data storage format, and it's designed to ensure that your data, whether it's points and lines on a canvas, nodes on a chart, or just words in a document, can be easily and accurately extracted, updated, and converted.
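As a small example of a transformation, the sketch below converts the sample's distributions into JSON using only Python's standard library. It is not how any of the tools named above work. The to_dict helper is a hypothetical name, and it assumes that no tag repeats among the children of a single parent (true inside each <distribution> here), because repeated tags would overwrite each other in a dictionary.

```python
import json
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
  <os>
    <linux>
      <distribution>
        <name>Fedora</name>
        <release>8</release>
        <codename>Werewolf</codename>
      </distribution>
      <distribution>
        <name>Slackware</name>
        <release>12.1</release>
        <mascot>
          <official>Tux</official>
          <unofficial>Bob Dobbs</unofficial>
        </mascot>
      </distribution>
    </linux>
  </os>
</xml>"""


def to_dict(element):
    """Turn a tag into plain data: text-only tags become strings, others dicts."""
    if len(element) == 0:
        return element.text
    # Assumption: child tag names are unique within this parent.
    return {child.tag: to_dict(child) for child in element}


root = ET.fromstring(SAMPLE)

# The repeated <distribution> tags become a list, one entry per distribution.
records = [to_dict(d) for d in root.findall("./os/linux/distribution")]
as_json = json.dumps(records, indent=2)

print(as_json)
```

Every value comes out as a string, including the release numbers, because XML content carries no type information unless a schema supplies it.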
Learning XML
Writing XML is a lot like writing HTML. Thanks to the hard work of Jay Nick, free and fun XML lessons are available online that teach you how to create graphics with XML.
In general, very few special tools are required to explore XML. Thanks to the close relationship between HTML and XML, you can view XML using a web browser. In addition, open source text editors like QXmlEdit, NetBeans, and Kate make typing and reading XML easy with helpful prompts, autocompletion, syntax verification, and more.
Choose XML
XML may look like a lot of data at first, but it's not that much different from HTML (in fact, HTML has been reimplemented as XML in the form of XHTML). XML has the unique benefit that the components forming its structure also happen to be metadata providing information about what it's storing. A well-designed XML schema contains and describes your data, allowing a user to understand it at a glance, and enabling developers to parse it efficiently with convenient programming libraries.