HXL tagging conventions

Note: there are some significant changes and simplifications proposed for HXL 1.0 beta. See the Draft specification for details (open for public commenting).

Release 1.0alpha (2014-04-07)

1. Introduction

This document is part of the Humanitarian Exchange Language (HXL), a proposed standard for increasing the efficiency and effectiveness of data exchange during humanitarian crises. The HXL standard consists of two parts:

  1. HXL tagging conventions (this document) — instructions for adding HXL tags to spreadsheets.
  2. HXL tag dictionary — a list of hashtags for identifying humanitarian data fields.

1.1. Target audience

HXL is designed for exchanging tabular-style data within the humanitarian community. Our primary audience is information-management specialists who are familiar with spreadsheets or relational databases; our second audience is computer programmers and database specialists looking to consume data produced by those information-management specialists.

1.2. Terms of use

HXL is available as an open standard — we have created it especially for use with humanitarian data, but you are welcome to use it for any purpose you want, as long as you don’t claim any support or endorsement from any members of the HXL working group or the organisations for which they work. We offer no warranty of any kind, so please use the standard at your own risk.

1.3. Creators

HXL is a group effort of many people and organisations, especially (but not exclusively) within the United Nations family of organisations, ably supported by a wider community of over 100 members of the hxlproject@googlegroups.com public mailing list.

As of June 2014, the core HXL Working Group included the following members: Albert Gembara (USAID), Andrej Verity (OCHA), Andrew Alspach (UNHCR), Gavin Wood (UNICEF), John Crowley (World Bank), Lauren Burns (Save the Children), Maurizio Blasilli (WFP), Muhammad Rizki (IOM), and Paul Currion (Humanitarian Innovation Fund). David Megginson (OCHA) served as standards lead and chair, while Sarah Telford (OCHA) was project manager. Generous funding for the first year of HXL research and development came from the Humanitarian Innovation Fund.

2. Base data model

This section describes the basic underlying idiom for HXL’s data model. HXL builds on a tabular data model familiar from spreadsheets and relational databases, rather than an hierarchical data model (e.g. XML, JSON, mind maps, file systems) or a graph data model (e.g. RDF, the World Wide Web). Tabular data models cover the majority of data that humanitarians exchange in the field, and have the most widely-deployed tool support via spreadsheet applications like Microsoft Excel, Google Spreadsheets, and OpenOffice Calc.

2.1. Constraints

  • HXL data will be in tabular form (e.g., spreadsheets, CSV files).
  • One or more rows of each data table will consist of human-readable headers.
  • Immediately after the last row of headers, there will be a row of HXL tags, where each cell is either empty or begins with the character “#”.
  • All rows after the HXL tag row consist of data.

2.2. (Non-HXL) Table model example

Except for the additional row of HXL tags (described in detail later), the above constraints match the most-common spreadsheet conventions, with data arranged like this:

A B C D
001 xaax 11 5.7
002 xbbx 23 3.1
003 xccx 3 7.8

In this case, there are three items described, each of which has four properties (“A”, “B”, “C”, and “D”). The first item has the assignments A=001, B=xaax, C=11, and D=5.7; the second item has A=002, B=xbbx, C=23, and D=3.1; the third item has A=003, B=xccx, C=3, and D=7.8.

HXL builds on top of this model by adding a row of machine-readable tags, as explained below.

3. Tagging conventions

Standardising field names (column headings) in data spreadsheets is complicated for three reasons:

  1. Different humanitarian country teams prefer different common working languages (e.g., Spanish in Latin America, French in West Africa, Arabic in the Levant and northern Africa, Russian in parts of Central Asia).
  2. Sometimes it is necessary to include data in more than one language in the same spreadsheet; for example, in Afghanistan, it’s usual to include both Dari and Pashto versions of any text.
  3. Different naming conventions make sense in different contexts; for example, the first administrative level in Syria is a “Governorate,” while in the Philippines, it’s a “Region.”

As a result, relying on matching ordinary text to identify each field/column is unlikely to be effective.

3.1. HXL tags

Instead of relying on matching header text, HXL defines a short tag for each standard type of spreadsheet column, and includes those tags in addition to human-readable text headers. The HXL tags resemble hashtags used in social media (all starting with “#”), and are defined in the HXL tag dictionary.

For example, the tag #aff_num refers to “number of people affected.” The tag always makes it clear what we’re talking about, no matter what text appears in the main header

  • Number affected
  • # de personnes concernées
  • Afectadas/os
  • عدد الأشخاص المتضررين

Here is a simplified example of HXL tags added to a spreadsheet:

Location name Location code People affected
#loc #loc_id #aff_num
Town A 01000001 2000
Town B 01000002 750
Town C 01000003 1920

A computer can process the column automatically simply by reading the HXL tag, no matter what text appears in the header above. That allows individual crisis teams and clusters to adapt to local conventions, while still allowing for automated data processing, merging, and validation.

3.2. Language codes

For multilingual content, HXL tags allow an optional “/” followed by an ISO 639-1 language code. For example, #activity is the HXL tag for an activity or project name, so you can add “/en” or “/fr” to distinguish the English and French versions:

Project title Titre du projet
#activity/en #activity/fr
Malaria treatments Traitement du paludisme
Teacher training Formation des enseignant(e)s

3.3. Repeated fields

A tabular format works poorly for repeated fields (e.g. an activity taking place in more than one location); however, using HXL tags, it is possible to design a spreadsheet format that allows for a fixed amount of repetition.

For example, the HXL tag for a geographical code is #loc_id. A 3W spreadsheet for a specific country could allow room for up to three geocodes like this:

P-code 1 P-code 2 P-code 3
#loc_id #loc_id #loc_id
020503
060107 060108
173219
530012 530013 530015
279333

By reading the HXL tag, processing software can easily recognize that the three columns represent (up to) three values for the same field, even though the full column titles differ, and even if the authors of the processing software knew nothing about the specific conventions in use in this country.

3.4. Compact disaggregated data and the “+” notation

Consider the following (non-HXL) spreadsheet example listing the number of people injured by a crisis each year in two different regions:

Region 2008 2009 2010 2011
Coast District 0 30 100 250
Mountain District 15 75 30 45

This table does not fit the standard HXL data model of one row for each record, because some of the column headers actually represent data (the year). HXL provides a special notation for this situation, using the “+” character (optionally surrounded by whitespace) to combine two HXL tags, one representing the text of the column header, and one representing the values under the header:

Region 2008 2009 2010 2011
#adm1 #period_date + #injured_num #period_date + #injured_num #period_date + #injured_num #period_date + #injured_num
Coast District 0 30 100 250
Mountain District 15 75 30 45

The tag+tag notation provides a compact way of labelling a large amount of data: in this case, the tags “#period_date+#injured_num” mean that the column header contains the Time period applicable (“#period_date”) while the cells beneath contain the Number of people injured (“#injured_num”).

Repeated codes work differently with the tag+tag notation; instead of repeating fields within the same record, each repeated “+” tag with a different header generates a new virtual data record. The following table shows the same data in a more-conventional HXL format:

Region Year People injured
#adm1 #period.date[a] #injured_num
Coast District 2008 0
Mountain District 2008 15
Coast District 2009 30
Mountain District 2009 75
Coast District 2010 100
Mountain District 2010 30
Coast District 2011 250
Mountain District 2011 45

To a computer system processing HXL, the two tables should result in identical parsed data.

4. Implementation notes

4.1. Detecting the tag row

HXL tags may not always appear after the first row of a spreadsheet. There are situations where spreadsheets contain multiple header rows, often for grouping, as in the following example:

Who What Where
Organisation Sector Subsector Department Municipality
UNICEF Education Teacher training Chocó Quidbó
UNICEF Education Teacher training Chocó Bojayá

Since the HXL tag row must always come after the headers, it is possible to detect the tag row (and the beginning of the data content) by looking for the first row where every cell is either empty or begins with the character “#”, as in the following example:

Who What Where
Organisation Sector Subsector Department Municipality
#org #sector #subsector #adm1 #adm2
UNICEF Education Teacher training Chocó Quidbó
UNICEF Education Teacher training Chocó Bojayá

A HXL parsing library could scan (for example) the first 20 rows of a spreadsheet, looking for a row that matches the constraints mentioned above. In this case, the first two rows would fail to match, but in the third row, every cell either begins with “#” or is empty, so it must be the HXL tag row, and all the rows after it are data rows.

4.2. Formal grammar of a HXL tag

The following Backus-Naur Form grammar (with regular expressions) defines the allowed content of a HXL tag (terminals are in upper case):

<hxl-tag> ::= WS <single-tag> WS
                  | WS <compound-tag> WS

<compound-tag> ::= <single-tag> WS "+" WS <single-tag>

<single-tag> ::= <base-tag>
                     | <base-tag> "/" LANGUAGE

<base-tag> ::= "#" TOKEN

TOKEN ::= /[a-zA-Z][a-zA-Z0-9_]*/

LANGUAGE ::= /[a-z]{2}/

WS ::= /[ \t\n\r]*/