Monday, February 16, 2009

XML Orientation Tutorial

The following are fragments from my article, also available at http://www.assistars.com

XML is a Markup Language.

XML stands for eXtensible Markup Language. If you have read through the source of any HTML page, you probably know what a markup language is about. If not, read on - we will explain. Imagine a plain text document - it is built from words, sentences, paragraphs and chapters. Now, if you would like to print this document and make some parts stand out from the rest of the text (by printing them in bold, for example), you need to give the printer the information it needs to do so. The simplest and most effective way to achieve that is to put special markers into the text (themselves text, but with a special meaning) which will not be printed but will tell the printer to switch mode and print with an increased font weight. Such 'tags' form a markup language. HTML was developed that way, to annotate the text presented on a web page with formatting and presentation instructions. The very same idea is one of the foundation stones of XML.

There is a more general pattern here: XML builds on the same tag-based ideas as HTML (a well-tested and widely adopted format; both trace back to SGML), but with a different objective in mind: instead of using tags to say how to display the data, let's use them to represent the logical structure of the data.

Let's take a look at an example of XML:




<contact>
<firstname>John</firstname>
<surname>Doe</surname>
<email>John.Doe@assistars.com</email>
<phone country="UK">020 1234 4321</phone>
</contact>


The first thing you may notice is that you can understand the document just by reading it: it is pretty self-descriptive, and the names of the tags suggest how the content should be interpreted.
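To show that a program can read the document as easily as a person can, here is a minimal sketch in Python using the standard library module xml.etree.ElementTree. The names CONTACT_XML and describe_contact are made up for this illustration and are not part of any API.

```python
import xml.etree.ElementTree as ET

# The contact document from the example above, kept as a plain string.
CONTACT_XML = """
<contact>
<firstname>John</firstname>
<surname>Doe</surname>
<email>John.Doe@assistars.com</email>
<phone country="UK">020 1234 4321</phone>
</contact>
"""


def describe_contact(xml_text):
    """Parse the text and pull out the pieces by tag name (hypothetical helper)."""
    root = ET.fromstring(xml_text)   # root element: <contact>
    phone = root.find("phone")       # first child element named <phone>
    return {
        "root": root.tag,
        "children": [child.tag for child in root],  # nested elements, in document order
        "name": root.findtext("firstname") + " " + root.findtext("surname"),
        "email": root.findtext("email"),
        "phone": phone.text,
        "phone_country": phone.get("country"),      # value of the country attribute
    }


if __name__ == "__main__":
    for key, value in describe_contact(CONTACT_XML).items():
        print(key, "=", value)
```

Note that the code finds the data by the same tag names a human reader relies on; nothing in it depends on line breaks or on the position of the text in the file.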

After a brief reading of the XML example you may notice the following:

- tags are written between '<' and '>' brackets, for example '<contact>' (similar to the notation used in HTML)

- each tag has a corresponding 'closing' tag, which marks the end of the element (written with '/' before the tag name), for example '</contact>'

- the elements of the document form a tree-like structure (tags are nested), which makes it easy to represent part-to-whole relationships in the data. For example, '<firstname>' is a part of '<contact>'.

- from the names chosen for the tags we can deduce that XML does not prescribe a fixed set of tag names (a name only has to follow a few syntax rules, for example it cannot contain spaces)

which leads us to the next very important characteristic:

XML is eXtensible.

There are no limits to the kind of information an XML document can represent, or to how it represents it. You are free (and responsible) to invent your own tags and to decide how to decompose your data into a hierarchical structure. That's because you will be the one reading and interpreting it. Why was this impossible for HTML, in which all tags are defined by the format? Because HTML had to be understood by general-purpose browsers. With XML, as long as all intended readers of the data agree on the format, the sky is the limit.
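As a rough illustration of inventing your own tags, the Python sketch below builds a contact element and then extends it with tags that exist nowhere else. The helpers build_contact and read_name, and the tags 'nickname' and 'favourite-colour', are assumptions made up for this example; only xml.etree.ElementTree comes from the standard library.

```python
import xml.etree.ElementTree as ET


def build_contact(firstname, surname, extra=None):
    """Build a <contact> element; 'extra' maps invented tag names to their text."""
    contact = ET.Element("contact")
    ET.SubElement(contact, "firstname").text = firstname
    ET.SubElement(contact, "surname").text = surname
    for tag, value in (extra or {}).items():
        # Any valid XML name can be used here; no central list of tags exists.
        ET.SubElement(contact, tag).text = value
    return contact


def read_name(contact):
    """A reader that knows only the two original tags and ignores the rest."""
    return contact.findtext("firstname") + " " + contact.findtext("surname")


extended = build_contact("John", "Doe", {"nickname": "JD", "favourite-colour": "green"})

if __name__ == "__main__":
    print(ET.tostring(extended, encoding="unicode"))
    print(read_name(extended))
```

A reader written before the new tags were introduced, such as read_name here, keeps working because it looks elements up by name and simply never asks for the ones it does not know.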

XML - the purpose.

The main purpose of XML is data storage and exchange.

The following characteristics make it very good for this task:

- XML is flexible - you can model pretty much anything you can think of; tree structures are powerful models for representing data (see, for example, the folder structure on a disk or the hyperlink structure of a website)

- XML is extensible - you can evolve your models as you gather or require additional data

- XML is text - because it is defined as plain text, there is nothing platform-specific about it, and a document is completely platform independent

- XML is human readable - you can understand it from reading it

- XML is simple - you understand the concept right away

And that's why this format was so widely accepted and adopted, and became a standard.
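To make "storing and exchange" a little more concrete, here is a small Python sketch of a round trip: a record is turned into XML bytes (which could be written to a file or sent over a network) and then read back. The function names to_wire and from_wire and the flat record layout are assumptions for this illustration, not an established convention.

```python
import xml.etree.ElementTree as ET


def to_wire(record):
    """Serialize a flat dict of strings into UTF-8 encoded XML bytes."""
    root = ET.Element("contact")
    for tag, value in record.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="utf-8")


def from_wire(payload):
    """Parse the bytes back into a dict, on any platform that can read text."""
    root = ET.fromstring(payload)
    return {child.tag: child.text for child in root}


record = {"firstname": "John", "surname": "Doe", "email": "John.Doe@assistars.com"}
payload = to_wire(record)      # what would be stored or transmitted
restored = from_wire(payload)  # what the receiving side reconstructs

if __name__ == "__main__":
    print(payload.decode("utf-8"))
    print(restored == record)
```

The sender and the receiver share nothing except the agreed tag names, which is exactly the kind of agreement on format described earlier.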

XML - what it is and what it isn't.

It is important to take XML for what it is: a file format which accommodates a hierarchical structure of data, using tags to represent the logical dependencies and meaning of the data. Nothing more. There is such a buzz around XML that people tend to take it for something more. XML itself is just a file format. Period.

Very powerful because of its simplicity.