HXL tagging conventions (version 1.0 alpha)
Note: this is an out-of-date version of the HXL tagging conventions, preserved for historical reference. Please see the latest version if you are planning to use or support HXL.
Release 1.0alpha (2014-04-07)
1. Introduction
This document is part of the Humanitarian Exchange Language (HXL), a proposed standard for increasing the efficiency and effectiveness of data exchange during humanitarian crises. The HXL standard consists of two parts:
- HXL tagging conventions (this document) — instructions for adding HXL tags to spreadsheets.
- HXL tag dictionary — a list of hashtags for identifying humanitarian data fields.
1.1. Target audience
HXL is designed for exchanging tabular-style data within the humanitarian community. Our primary audience is information-management specialists who are familiar with spreadsheets or relational databases; our second audience is computer programmers and database specialists looking to consume data produced by those information-management specialists.
1.2. Terms of use
HXL is available as an open standard — we have created it especially for use with humanitarian data, but you are welcome to use it for any purpose you want, as long as you don’t claim any support or endorsement from any members of the HXL working group or the organisations for which they work. We offer no warranty of any kind, so please use the standard at your own risk.
1.3. Creators
HXL is a group effort of many people and organisations, especially (but not exclusively) within the United Nations family of organisations, ably supported by a wider community of over 100 members of the hxlproject@googlegroups.com public mailing list.
As of June 2014, the core HXL Working Group included the following members: Albert Gembara (USAID), Andrej Verity (OCHA), Andrew Alspach (UNHCR), Gavin Wood (UNICEF), John Crowley (World Bank), Lauren Burns (Save the Children), Maurizio Blasilli (WFP), Muhammad Rizki (IOM), and Paul Currion (Humanitarian Innovation Fund). David Megginson (OCHA) served as standards lead and chair, while Sarah Telford (OCHA) was project manager. Generous funding for the first year of HXL research and development came from the Humanitarian Innovation Fund.
2. Base data model
This section describes the basic underlying idiom for HXL’s data model. HXL builds on a tabular data model familiar from spreadsheets and relational databases, rather than an hierarchical data model (e.g. XML, JSON, mind maps, file systems) or a graph data model (e.g. RDF, the World Wide Web). Tabular data models cover the majority of data that humanitarians exchange in the field, and have the most widely-deployed tool support via spreadsheet applications like Microsoft Excel, Google Spreadsheets, and OpenOffice Calc.
2.1. Constraints
- HXL data will be in tabular form (e.g., spreadsheets, CSV files).
- One or more rows of each data table will consist of human-readable headers.
- Immediately after the last row of headers, there will be a row of HXL tags, where each cell is either empty or begins with the character “#”.
- All rows after the HXL tag row consist of data.
2.2. (Non-HXL) Table model example
Except for the additional row of HXL tags (described in detail later), the above constraints match the most-common spreadsheet conventions, with data arranged like this:
A | B | C | D |
001 | xaax | 11 | 5.7 |
002 | xbbx | 23 | 3.1 |
003 | xccx | 3 | 7.8 |
In this case, there are three items described, each of which has four properties (“A”, “B”, “C”, and “D”). The first item has the assignments A=001, B=xaax, C=11, and D=5.7; the second item has A=002, B=xbbx, C=23, and D=3.1; the third item has A=003, B=xccx, C=3, and D=7.8.
HXL builds on top of this model by adding a row of machine-readable tags, as explained below.
3. Tagging conventions
Standardising field names (column headings) in data spreadsheets is complicated for three reasons:
- Different humanitarian country teams prefer different common working languages (e.g., Spanish in Latin America, French in West Africa, Arabic in the Levant and northern Africa, Russian in parts of Central Asia).
- Sometimes it is necessary to include data in more than one language in the same spreadsheet; for example, in Afghanistan, it’s usual to include both Dari and Pashto versions of any text.
- Different naming conventions make sense in different contexts; for example, the first administrative level in Syria is a “Governorate,” while in the Philippines, it’s a “Region.”
As a result, relying on matching ordinary text to identify each field/column is unlikely to be effective.
3.2. Language codes
For multilingual content, HXL tags allow an optional “/” followed by an ISO 639-1 language code. For example, #activity is the HXL tag for an activity or project name, so you can add “/en” or “/fr” to distinguish the English and French versions:
Project title | Titre du projet |
#activity/en | #activity/fr |
Malaria treatments | Traitement du paludisme |
Teacher training | Formation des enseignant(e)s |
3.3. Repeated fields
A tabular format works poorly for repeated fields (e.g. an activity taking place in more than one location); however, using HXL tags, it is possible to design a spreadsheet format that allows for a fixed amount of repetition.
For example, the HXL tag for a geographical code is #loc_id. A 3W spreadsheet for a specific country could allow room for up to three geocodes like this:
P-code 1 | P-code 2 | P-code 3 |
#loc_id | #loc_id | #loc_id |
020503 | ||
060107 | 060108 | |
173219 | ||
530012 | 530013 | 530015 |
279333 |
By reading the HXL tag, processing software can easily recognize that the three columns represent (up to) three values for the same field, even though the full column titles differ, and even if the authors of the processing software knew nothing about the specific conventions in use in this country.
3.4. Compact disaggregated data and the “+” notation
Consider the following (non-HXL) spreadsheet example listing the number of people injured by a crisis each year in two different regions:
Region | 2008 | 2009 | 2010 | 2011 |
Coast District | 0 | 30 | 100 | 250 |
Mountain District | 15 | 75 | 30 | 45 |
This table does not fit the standard HXL data model of one row for each record, because some of the column headers actually represent data (the year). HXL provides a special notation for this situation, using the “+” character (optionally surrounded by whitespace) to combine two HXL tags, one representing the text of the column header, and one representing the values under the header:
Region | 2008 | 2009 | 2010 | 2011 |
#adm1 | #period_date + #injured_num | #period_date + #injured_num | #period_date + #injured_num | #period_date + #injured_num |
Coast District | 0 | 30 | 100 | 250 |
Mountain District | 15 | 75 | 30 | 45 |
The tag+tag notation provides a compact way of labelling a large amount of data: in this case, the tags “#period_date+#injured_num” mean that the column header contains the Time period applicable (“#period_date”) while the cells beneath contain the Number of people injured (“#injured_num”).
Repeated codes work differently with the tag+tag notation; instead of repeating fields within the same record, each repeated “+” tag with a different header generates a new virtual data record. The following table shows the same data in a more-conventional HXL format:
Region | Year | People injured |
#adm1 | #period.date[a] | #injured_num |
Coast District | 2008 | 0 |
Mountain District | 2008 | 15 |
Coast District | 2009 | 30 |
Mountain District | 2009 | 75 |
Coast District | 2010 | 100 |
Mountain District | 2010 | 30 |
Coast District | 2011 | 250 |
Mountain District | 2011 | 45 |
To a computer system processing HXL, the two tables should result in identical parsed data.
4. Implementation notes
4.1. Detecting the tag row
HXL tags may not always appear after the first row of a spreadsheet. There are situations where spreadsheets contain multiple header rows, often for grouping, as in the following example:
Who | What | Where | ||
Organisation | Sector | Subsector | Department | Municipality |
UNICEF | Education | Teacher training | Chocó | Quidbó |
UNICEF | Education | Teacher training | Chocó | Bojayá |
Since the HXL tag row must always come after the headers, it is possible to detect the tag row (and the beginning of the data content) by looking for the first row where every cell is either empty or begins with the character “#”, as in the following example:
Who | What | Where | ||
Organisation | Sector | Subsector | Department | Municipality |
#org | #sector | #subsector | #adm1 | #adm2 |
UNICEF | Education | Teacher training | Chocó | Quidbó |
UNICEF | Education | Teacher training | Chocó | Bojayá |
A HXL parsing library could scan (for example) the first 20 rows of a spreadsheet, looking for a row that matches the constraints mentioned above. In this case, the first two rows would fail to match, but in the third row, every cell either begins with “#” or is empty, so it must be the HXL tag row, and all the rows after it are data rows.
4.2. Formal grammar of a HXL tag
The following Backus-Naur Form grammar (with regular expressions) defines the allowed content of a HXL tag (terminals are in upper case):
<hxl-tag> ::= WS <single-tag> WS | WS <compound-tag> WS <compound-tag> ::= <single-tag> WS "+" WS <single-tag> <single-tag> ::= <base-tag> | <base-tag> "/" LANGUAGE <base-tag> ::= "#" TOKEN TOKEN ::= /[a-zA-Z][a-zA-Z0-9_]*/ LANGUAGE ::= /[a-z]{2}/ WS ::= /[ \t\n\r]*/