Maintained by: Elisa E. Beshero-Bondar (ebb8 at pitt.edu) Creative Commons License

So. . . What exactly is "XML" anyway? When It's Useful and How It Works

XML: Structured, Informational Markup

XML is short for eXtensible Markup Language, and it's a standard system for storing and accessing information that is used practically everywhere around the world. It's the informational markup (or "code") inside Microsoft Office files, and it's the foundation of many online network applications. For our purposes as researchers, it's an excellent method for storing information, and for preparing to share it with the public. XML is independent of proprietary software applications, which means that what you write in XML is freely exchangeable between computers of different kinds (across platforms, as in Macs and PCs). It outlasts software obsolescence, because it's a standard that can be read universally.

You may have heard of HTML (hypertext markup language), which is the code that makes web pages presentable in web browsers. HTML is a close cousin of XML (it can even be written as XML, in a form called XHTML), designed specifically and only for presentation and publication on the world-wide web. HTML is for presentation and display, but XML is primarily for storage of information, so we can call it informational markup, as opposed to presentational markup. We can write code to take information written in XML and transform it for presentation online, and you'll gain experience with doing that as we proceed with our class this semester.

XML is great for researchers in the Humanities and Social Sciences because it's very effective at storing and cataloging information systematically. You can write XML to set up hierarchies (or nested structures) of information, and also to locate and extract that information later when you need it. So, if we were going to store a book in XML, we'd pay attention to the way it's structured, maybe with chapters. Inside those chapters we'd have chapter titles and paragraphs, and inside those, sentences, and then individual words and punctuation. If we wanted to, we could systematically mark all the action verbs in those sentences, and all the exclamation points, if this was important, and we could design a hierarchical system using XML to capture and hold the information we want to collect. (Sidenote: You might be surprised how important this kind of detailed tagging is, for example, in legal cases about who authored a book: Is J. K. Rowling the author of a crime novel under another name? Can an individual's writing style be identified as unique to them, based on their patterned use of certain kinds of words, like prepositions?)

When we do research in the humanities, we're working with documents written by human beings, and XML is useful for preserving them for reading and studying, and for extracting information from them later. We can do this close-up (through "close reading"), by reading documents with our eyes, one by one. Or we can code documents systematically (which also involves "close reading"), in order to step BACK and view them from a distance: to let a computer discover patterns we couldn't so easily see on our own. In Digital Humanities, this practice of working with computers to make them show us patterns across enormous, complicated texts, or across many, many texts, is called "distant reading." XML helps us prepare texts for this, for two reasons:

  1. XML is a formal model that represents an orderly structure—a hierarchy of information. To the extent that human documents are ordered in a systematic way, this can be represented and described using XML.
  2. Computers work very quickly on orderly hierarchies of information. So if we model the documents we want to study as hierarchies in XML, we can use computers to count related things and help us find patterns.

We have to start by studying our documents to see how they're structured, and identify what matters to us in describing a structure. This practice is called document analysis, and it's basically what you're doing when you have to make decisions about how to code a recipe, a voyage log, a poem, or a letter in our first XML exercises in this course.

XML is Nested Boxes, or a "Tree"

In technical terms, we can think of every XML document as a tree, sprouting from a single root, which contains and identifies the whole thing. That outermost layer is the root element's start-tag and end-tag, the alpha and the omega of an XML file. I tend to think of this as a single box that contains everything else, with all its branching complexity inside.

<vendor>Grocery Store</vendor>
<vendor>Clothing Store</vendor>

XML marks a structure, or the hierarchy of a document, by using elements, such as shopping_list and food_item. Each element consists of three parts: a start tag, its content, and an end tag.

A start tag is defined with angle brackets, and an end tag looks like a start tag, except it has a forward slash after the opening angle bracket. When we refer to tags, we're talking about those start and end tags. When we talk about an element, we're referring to the whole thing: start-tag, CONTENT, and end-tag. Make sense?
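One way to see the tree for yourself is to hand a small XML document to a parser and ask it what it found. Here is a minimal sketch using Python's standard xml.etree.ElementTree module. The shopping list itself is made up for illustration (the food items, and the way vendor and food_item sit side by side, are assumptions), and the describe function is just a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# A made-up shopping list: one root element containing everything else.
SHOPPING_LIST = """
<shopping_list>
    <vendor>Grocery Store</vendor>
    <food_item>apples</food_item>
    <food_item>bread</food_item>
</shopping_list>
"""


def describe(element, depth=0):
    """Return one line per element, indented to show how deeply it is nested."""
    text = (element.text or "").strip()
    lines = ["  " * depth + f"<{element.tag}> {text}".rstrip()]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(SHOPPING_LIST)
outline = describe(root)
print("\n".join(outline))
```

The indentation in the printed outline mirrors the nesting: shopping_list is the single box, and every other element sits inside it.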

Elements may also include something called attributes: additional markup that gives supplementary information about an element. So, say we had ingredient names in French and Spanish in our shopping list, and we wanted to mark those. One option for doing this would be:

<foreign language="French">escargot</foreign>

<foreign language="Spanish">sofrito</foreign>

See how this works: Attributes are written inside the start tag of an element (but NOT inside the end tag). They consist of an attribute name and an attribute value. The attribute name, here, is language, and the attribute value is (you guessed it!) "French" or "Spanish." (Attributes are sort of like adjectives, or descriptive modifiers!) Notice there's a rule for HOW to write attributes: attribute values must be enclosed inside quotation marks. These can be either double straight quotation marks (") or single straight apostrophes ('). Either one works, but try to use them consistently. Later on, when we're writing other kinds of code that reads and extracts from your XML, you'll find that you need to work with single quotes to refer to attribute values (more on that later). For now, as you write XML, double quotes are what we commonly use. Note that these are straight quotation marks (") and not the curly ones that you see in a word processor.
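If you'd like to test the quotation-mark rule rather than take it on faith, a parser will tell you. This is a small sketch, again assuming Python's standard xml.etree.ElementTree module; the variable names are invented for the example. It reads the two foreign elements from above (one rewritten with single quotes) and then tries the same markup with curly quotes.

```python
import xml.etree.ElementTree as ET

# The same kind of attribute written with double and with single straight quotes.
double_quoted = ET.fromstring('<foreign language="French">escargot</foreign>')
single_quoted = ET.fromstring("<foreign language='Spanish'>sofrito</foreign>")

for element in (double_quoted, single_quoted):
    # .get() looks up an attribute value by its attribute name.
    print(element.text, "is marked as", element.get("language"))

# Curly "smart" quotes (\u201c and \u201d) are not accepted around attribute values.
try:
    ET.fromstring("<foreign language=\u201cFrench\u201d>escargot</foreign>")
    curly_quotes_accepted = True
except ET.ParseError:
    curly_quotes_accepted = False
print("Curly quotes accepted?", curly_quotes_accepted)
```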

"Well Formed" and "Well Formedness" in XML

XML must be "well formed" in order to be parsed by a computer. That means it must follow the syntax rules for writing XML: It must have a single root element, and its elements must be nested, without any overlap. Also, where attributes are used, these need to be signalled according to expected XML syntax (as above). These rules are necessary for the document to be XML at all. Well-formed XML is, simply, correct XML.

The following example is NOT well-formed XML. Can you tell why not? (There are multiple reasons!)


This is NOT well-formed XML either. Why not?

<paragraph>He responded emphatically in French: <emph><foreign language="french">oui</emph></foreign>!</paragraph>
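Once you've made your own guess, you can let a parser confirm it. The sketch below assumes Python's standard xml.etree.ElementTree module; is_well_formed is a hypothetical helper written for this example, and the repaired version is only one possible fix, not the only correct one.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if a parser can read xml_text as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The example from above: <emph> closes before <foreign> does, so they overlap.
overlapping = (
    '<paragraph>He responded emphatically in French: '
    '<emph><foreign language="french">oui</emph></foreign>!</paragraph>'
)

# One possible repair: close <foreign> first, so it nests fully inside <emph>.
nested = (
    '<paragraph>He responded emphatically in French: '
    '<emph><foreign language="french">oui</foreign></emph>!</paragraph>'
)

print(is_well_formed(overlapping))
print(is_well_formed(nested))
```

The parser refuses the first version outright and accepts the second, which is what we mean when we say that well-formedness is required for XML to be parsed at all.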

Special Reserved Characters in XML

Computers (as well as people) need to be able to read XML and tell tags apart from text, to distinguish elements from their content. So, we run into well-formedness problems when we want to represent certain characters, like left and right angle brackets, AS text. What if you want to write, as I'm doing here, about code and you need to represent tagging AS text? View my page source, look for the example passages, and you'll see that I've used a work-around solution. What we need to do is escape the special characters (or the "reserved characters") that indicate to a computer that these are tags. We do this by replacing them with character entities, which tell the computer to display these characters as text only. There are three special characters that we always escape: the left angle bracket (<, written as &lt;), the right angle bracket (>, written as &gt;), and the ampersand (&, written as &amp;), which needs escaping because it begins every entity. We'll show you in class how to do this.
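To watch escaping happen, here is a short sketch that assumes Python's standard library: the escape function from xml.sax.saxutils does the replacing, and xml.etree.ElementTree reads the result back. The sample sentence is invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Text that talks ABOUT markup, so it contains reserved characters.
raw_text = "Use <emph> & </emph> to mark emphasis."

# escape() replaces &, < and > with the entities &amp;, &lt; and &gt;.
escaped_text = escape(raw_text)
print(escaped_text)

# Once escaped, the text sits safely inside an element as content...
element = ET.fromstring(f"<paragraph>{escaped_text}</paragraph>")

# ...and the parser hands back the original characters as plain text.
print(element.text)
```

Notice that the paragraph element ends up with no child elements: the escaped angle brackets are read as text, not as tags.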

Validity in XML: Based on Schema Rules

XML is adaptable and flexible for organizing information, because it is up to the person writing it HOW they want to define their elements, and what they want to call them. When people work in XML in communities, though, they'll work with specific tagging conventions in order to easily connect and communicate with each other. For our Pacific and Mitford projects, we're working with one of those community languages within XML, called "TEI." TEI is both a community and a language within XML with a standard set of rules, called a schema. If you work together with a group on an XML project, one thing you'll need to do is define your project's schema (or work within a pre-existing schema like the TEI's) so that you're all coding consistently. When a project's XML is correct according to its defined schema, we say that the XML is "valid," and we run what's called a "validity check" to determine this, in which we check the XML against the project's schema file. You'll be learning a little later how to write your own schemas for XML using a language called Relax NG, but for now, we'd like you to get used to the actual writing of XML code and to learn some key concepts about it: well-formedness, nesting hierarchy, working around special reserved characters, and validity.
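To get a feel for the difference between well-formed and valid, here is a deliberately tiny sketch in Python. It is NOT Relax NG and not how real validity checks are run: the "schema" is a made-up dictionary saying which elements may sit inside which, and TOY_SCHEMA and find_problems are hypothetical names invented for this illustration.

```python
import xml.etree.ElementTree as ET

# A toy "schema" (hypothetical, NOT Relax NG): for each element name,
# the set of child element names it is allowed to contain.
TOY_SCHEMA = {
    "shopping_list": {"vendor", "food_item"},
    "vendor": set(),
    "food_item": {"foreign"},
    "foreign": set(),
}


def find_problems(element, schema):
    """Return a list of messages describing where the XML breaks the toy rules."""
    problems = []
    if element.tag not in schema:
        problems.append(f"unknown element <{element.tag}>")
        return problems
    for child in element:
        if child.tag not in schema[element.tag]:
            problems.append(f"<{child.tag}> is not allowed inside <{element.tag}>")
        else:
            problems.extend(find_problems(child, schema))
    return problems


# Both documents are well formed, but only the first follows the toy rules.
valid_doc = ET.fromstring(
    "<shopping_list><vendor>Grocery Store</vendor>"
    "<food_item>apples</food_item></shopping_list>"
)
invalid_doc = ET.fromstring(
    "<shopping_list><vendor><food_item>apples</food_item></vendor></shopping_list>"
)

print(find_problems(valid_doc, TOY_SCHEMA))
print(find_problems(invalid_doc, TOY_SCHEMA))
```

Both documents parse without complaint, so both are well formed. Only the second breaks the project's own rules, and that is the kind of problem a validity check is meant to catch.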