HXL tagging conventions

Release 1.0 final, 2016-03-18

1. Introduction

This document is part of the Humanitarian Exchange Language (HXL), a standard for increasing the efficiency and effectiveness of data exchange during humanitarian crises.

The HXL standard consists of two normative parts:

  1. HXL tagging conventions (this document) — instructions for adding HXL tags to spreadsheets.
  2. HXL hashtag dictionary — a list of hashtags for identifying humanitarian data fields.

There are also two non-normative annexes:

  1. HXL postcards — two-sided 10×15 cm (4×6 in) cards in multiple languages, for quick reference.
  2. Classification codes — recommended taxonomies and code lists for use in HXL-encoded datasets.

1.1. Design philosophy

HXL is a cooperative rather than a competitive standard. Most data standards dictate to users how they should collect and format their data; HXL, on the other hand, encourages organisations to add hashtags to their existing datasets, without requiring new skills, software tools, or business processes.

The primary focus of HXL is tabular-style data such as spreadsheets or API output from database tables, which represent the vast majority of the operational data collected in the humanitarian sphere; however, HXL hashtags can potentially have other applications, including labelling attributes for map layers or identifying data types in SMS messages.

1.2. Target audience

The standard’s primary audience is information-management specialists who are familiar with spreadsheets or relational databases; its secondary audience is computer programmers and database specialists looking to consume data produced by those information-management specialists.

1.3. Terms of use

HXL is available as an open standard — the working groups have designed it for use with humanitarian data, but people and organisations are welcome to use it for any purpose they choose, as long as they do not claim support or endorsement from any members of the HXL working group or the organisations for which they work. The authors offer no warranty of any kind, so implementors use the standard at their own risk.

The text of the standard itself is released into the public domain.

2. Adding HXL tags to data

Consider the following simple spreadsheet:

Location name    Location code    Number affected
Camp A           01000001         2000
Camp B           01000002         750
Camp C           01000003         1920

Datasets like this — longer, of course, and with more columns — are the backbone of humanitarian information management, and they provide the input for most reports, maps, and visualisations coming out of a crisis. Unfortunately, creating those data products is time-consuming, and responders have to duplicate the work from crisis to crisis and even dataset to dataset, because it is hard to build reusable software tools that can understand the many different ways responders may choose to label their data. For example, the text header of the last column could have appeared in dozens of variants, and in several different languages:

  • Number affected
  • Affected
  • People affected
  • # de personnes concernées
  • Afectadas/os
  • عدد الأشخاص المتضررين

Software tools need to be able to recognise that the figures in the third column refer to the number of people affected — regardless of how the data provider has decided to label it in the spreadsheet — so HXL adds a second header row containing short hashtags:

Location name    Location code    Number affected
#loc+name        #loc+code        #affected
Camp A           01000001         2000
Camp B           01000002         750
Camp C           01000003         1920

Now, whether the text at the top of the column reads “Number affected” or “عدد الأشخاص المتضررين”, software for cleaning, validating, analysing, mapping, or visualising the data can automatically recognise the hashtag #affected and use the figures below accordingly.

More than one row of headers may appear above the HXL hashtag row — the hashtags themselves act as a marker to show automated systems where the headers end and the data begins:

Camp information                  Needs
Location name    Location code    Number affected
#loc+name        #loc+code        #affected
Camp A           01000001         2000
Camp B           01000002         750
Camp C           01000003         1920

HXL software should expect to find the hashtag row anywhere within the first 25 rows of a dataset.
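As a rough illustration of how a tool might locate the hashtag row, the following Python sketch scans the first 25 rows for a row whose non-empty cells all look like HXL tags. The function name `find_hashtag_row`, the "every non-empty cell must be a tag" rule, and the list-of-lists input format are assumptions made for this example; the standard itself only says where the row may appear.

```python
import re

# Assumption: a cell counts as a hashtag if it matches the tag syntax defined
# in this document ("#" plus a token, then optional "+" attributes).
HASHTAG_CELL = re.compile(
    r"^\s*#[A-Za-z][A-Za-z0-9_]*(\s*\+[A-Za-z][A-Za-z0-9_]*)*\s*$"
)


def find_hashtag_row(rows, max_rows=25):
    """Return the index of the first row that looks like a hashtag row, or None."""
    for index, row in enumerate(rows[:max_rows]):
        cells = [cell for cell in row if cell and cell.strip()]
        # Hypothetical rule: every non-empty cell in the row must be a hashtag.
        if cells and all(HASHTAG_CELL.match(cell) for cell in cells):
            return index
    return None


rows = [
    ["Camp information", "", "Needs"],
    ["Location name", "Location code", "Number affected"],
    ["#loc+name", "#loc+code", "#affected"],
    ["Camp A", "01000001", "2000"],
]
index = find_hashtag_row(rows)
headers, data = rows[:index], rows[index + 1:]
```

Because the pattern requires a letter immediately after the "#", a text header such as "# de personnes concernées" is not mistaken for a hashtag.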

3. Structure of a HXL tag

The root of a HXL hashtag follows the same syntactic rules as a Twitter hashtag: it begins with the octothorpe/pound sign (“#”) and contains only unaccented Roman alphabetic characters (so-called “ASCII letters,” “a” to “z”), Arabic numerals (“0” to “9”), and the underscore symbol (“_”). The first character after the “#” must be alphabetic, and character case does not matter (#ADM1 and #adm1 are the same hashtag, though lower case is preferred for stylistic reasons). Here are some examples of syntactically valid HXL hashtags:

  • #sector
  • #org
  • #households
  • #impact

3.1. Hashtag attributes

The core shared HXL hashtags describe high-level concepts like an organisation (#org), geographical coordinates (#geo), a humanitarian cluster or sector (#sector), the number of people affected (#affected), or a subdivision of a country (#adm1). Humanitarian datasets, however, often need to make finer-grained distinctions. For example, is an organisation the funder (donor) or the implementing agency? Does a column contain the name of an administrative subdivision or its code?

HXL allows data providers to make these distinctions by attaching attributes to a hashtag.

3.1.1. Attribute syntax

Attributes follow the same syntactic rules as hashtags, except that they begin with a plus sign (“+”) rather than the octothorpe/pound sign (“#”). They follow the hashtag, optionally separated from it and from each other by whitespace. A hashtag may have any number of attributes, and order does not matter, so #org+funder+code has exactly the same meaning as #org+code+funder. The following examples show attributes attached to hashtags to refine their meaning:

  • #org+funder — the funding organisation (e.g. a donor)
  • #org+funder+code — a code for a funding organisation
  • #adm1+name+fr — the name of an administrative level-one subdivision, in French
  • #adm1+code+pcode — the p-code (place code) of an administrative level-one subdivision (adding +pcode to +code to further refine the code type)
  • #adm1+code+iso — the ISO code of an administrative level-one subdivision (adding +iso to +code to further refine the code type)

Software processing HXL data may ignore any attributes it does not recognise and simply process the core hashtag.
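Since attribute order carries no meaning, one convenient in-memory representation is a tag plus an unordered set of attributes. The following Python sketch assumes that representation; `parse_tag` is a hypothetical helper and does not validate its input against the syntax rules.

```python
def parse_tag(text):
    """Split a HXL tag such as "#org +funder+code" into (tag, attributes)."""
    tag, *attributes = text.split("+")
    # Case does not matter, so fold everything to lower case.
    return tag.strip().lower(), frozenset(a.strip().lower() for a in attributes)


# Attribute order does not matter, so both spellings compare equal.
same = parse_tag("#org+funder+code") == parse_tag("#org+code+funder")

# A tool that only understands the core hashtag can look at the first element.
tag, attributes = parse_tag("#adm1 +code +pcode")
```

Here `same` is true, `tag` is "#adm1", and `attributes` contains "code" and "pcode".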

3.1.2. Common attributes

Data providers may invent their own attributes to suit their local data needs; however, there are some recommended common attributes that will be useful across many data types. There is a full list in the HXL core tagset, of which the following are some highlights:

+displaced +idp +injured +reached +refugees
Classifications for counts or descriptions of people, e.g. #affected+idp for the number of internally displaced people.
+code
The value is a unique, machine-readable code, e.g. #adm1+code for an administrative level-one P-code.
+f +m +i
The value (usually a number) refers specifically to people of one gender, e.g. #affected+f for the number of female people affected.
+start +end
The value refers to the beginning or end, e.g. #date+end for the end date of an activity.

Note that none of these attributes is two letters long. Attributes consisting of two alphabetic letters, such as +ar or +es, are reserved for a special purpose, as described in the next section.

3.1.3. Attributes for languages

Humanitarian crises often take place in multicultural areas, where different local groups speak different languages; furthermore, international responders helping with a crisis may need to work in their own languages as well. As a result, humanitarian datasets are sometimes multilingual, listing the same information in e.g. French and Arabic, or Dari and Pashto.

To make it easy to identify languages in HXL, the standard recommends that all two-character alphabetic attributes be reserved to represent ISO 639-1 language codes, such as +en for English. Data providers can use the attributes to mark the language of a column:

Project title         Titre du projet
#activity+en          #activity+fr
Malaria treatments    Traitement du paludisme
Teacher training      Formation des enseignant(e)s

The following language attributes (not a comprehensive list) are examples of those that might appear in international humanitarian datasets:

+en    English    +fa    Dari / Farsi / Persian
+fr    French     +ps    Pashto
+ar    Arabic     +ms    Malay
+es    Spanish    +ur    Urdu
+ru    Russian    +tl    Tagalog
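Because two-letter attributes are reserved for languages, software can pick out the language of a column without a lookup table. The Python sketch below shows one way to do that; `language_of` is a hypothetical helper, and it does not check that the two letters are an assigned ISO 639-1 code.

```python
import re


def language_of(tag_text):
    """Return the first two-letter attribute of a tag, or None if there is none."""
    for attribute in tag_text.lower().split("+")[1:]:
        attribute = attribute.strip()
        if re.fullmatch(r"[a-z]{2}", attribute):
            return attribute
    return None


hashtags = ["#activity+en", "#activity+fr"]
row = ["Malaria treatments", "Traitement du paludisme"]

# Pick the French title, if the dataset provides one.
french = [value for tag, value in zip(hashtags, row) if language_of(tag) == "fr"]
```

Attributes such as +code or +end are longer than two letters, so they are never mistaken for languages.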

3.2. Creating extension tags

The HXL core tags include tags that will be generally applicable to many humanitarian datasets, but it is impossible to anticipate every tag for every humanitarian need. This standard makes the following four recommendations for extending HXL tags and attributes:

  1. Whenever possible, take an existing tag with a broader meaning, and narrow it down with an attribute, e.g. #loc+hospital.
  2. When there is no applicable core tag, begin an extension tag with “x_”, so that it will not conflict with any future HXL core hashtags, e.g. #x_toxicity.
  3. When software finds a HXL hashtag that it does not recognise, e.g. #x_toxicity, it should simply ignore the column of data.
  4. When software finds a HXL attribute that it does not recognise, e.g. #loc+hospital, it should ignore the attribute but still process the tag and any other attributes it does recognise: in this case, it should behave as if the dataset had contained simply #loc.

Note: software designers may choose to warn about unrecognised tags and attributes, to help with error detection and quality control.
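Recommendations 3 and 4, together with the note on warnings, might be implemented along the following lines. This Python sketch is an assumption-laden illustration: the vocabularies `KNOWN_TAGS` and `KNOWN_ATTRIBUTES` are placeholders, and `interpret` is a hypothetical function, not part of any HXL library.

```python
# Hypothetical vocabulary for illustration only; a real tool would decide for
# itself which tags and attributes it supports.
KNOWN_TAGS = {"#loc", "#org", "#affected"}
KNOWN_ATTRIBUTES = {"name", "code", "funder"}


def interpret(tag_text):
    """Return (tag, attributes, warnings); tag is None if the column is skipped."""
    tag, *attributes = [part.strip().lower() for part in tag_text.split("+")]
    warnings = []
    if tag not in KNOWN_TAGS:
        # Recommendation 3: ignore the whole column.
        warnings.append(f"unrecognised tag {tag}: column ignored")
        return None, frozenset(), warnings
    kept = set()
    for attribute in attributes:
        if attribute in KNOWN_ATTRIBUTES:
            kept.add(attribute)
        else:
            # Recommendation 4: drop the attribute but keep processing the tag.
            warnings.append(f"unrecognised attribute +{attribute}: attribute ignored")
    return tag, frozenset(kept), warnings


print(interpret("#x_toxicity"))
print(interpret("#loc+hospital"))
```

With this vocabulary, #x_toxicity yields no tag at all, while #loc+hospital is treated as plain #loc with one warning.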

4. Special cases

This section describes how to use HXL to deal with special cases that do not normally fit well into a tabular data model.

4.1. Repeating fields

A tabular format works poorly for repeated fields (e.g. an activity taking place in more than one location); however, using HXL tags, it is possible to design a spreadsheet format that allows for a fixed amount of repetition.

For example, the HXL tag for a generic geographical code (like a P-code) is #loc+code. A 3W spreadsheet for a specific country could allow room for up to three geocodes like this:

P-code 1     P-code 2     P-code 3
#loc+code    #loc+code    #loc+code
060107       060108
530013       530015       279333

By reading the HXL tag, processing software can easily recognise that the three columns represent (up to) three values for the same field, even though the full column titles differ, and even if the authors of the processing software knew nothing about the specific conventions in use in this country.
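A minimal Python sketch of this idea follows. It gathers the non-empty values from every column carrying the same hashtag; `values_for` is a hypothetical helper, and for brevity it compares hashtags as exact strings rather than normalising case or attribute order.

```python
def values_for(tag, hashtags, row):
    """Collect the non-empty values from every column whose hashtag equals `tag`."""
    return [
        value
        for hashtag, value in zip(hashtags, row)
        if hashtag == tag and value.strip()
    ]


hashtags = ["#loc+code", "#loc+code", "#loc+code"]
rows = [
    ["060107", "060108", ""],
    ["530013", "530015", "279333"],
]
codes = [values_for("#loc+code", hashtags, row) for row in rows]
# codes == [["060107", "060108"], ["530013", "530015", "279333"]]
```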

4.2. “Wide” (series) data

“Wide” datasets are optimised for reading rather than machine processing. They place a series of data (usually numbers) across a row, showing how information varies over time, geographical area, demographic groups, or some other criterion. Here is a simple, non-HXL example listing the number of people of concern in each region during four different years.

Region               2008    2009    2010    2011
Coast District       0       30      100     250
Mountain District    15      75      30      45

This table presents some special challenges for tagging, because the columns headed “2008” to “2011” all represent the same kind of data, but in different years. Tagging them all as simply #affected loses important information about the series. The solution is to use an extra attribute, +label, to specify that the information in the header is a label for the data series:

Region               2008               2009               2010               2011
#adm1                #affected+label    #affected+label    #affected+label    #affected+label
Coast District       0                  30                 100                250
Mountain District    15                 75                 30                 45
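One way processing software could use +label is to unpivot the series into one record per cell, taking the label from the text header above each tagged column. The Python sketch below assumes that interpretation; the helper names and the "label" key in the output records are choices made for this example, not part of the standard.

```python
def split_label(tag):
    """Return (tag without +label, True) for a series column, else (tag, False)."""
    root, *attributes = [part.strip() for part in tag.lower().split("+")]
    if "label" in attributes:
        attributes.remove("label")
        return "+".join([root, *attributes]), True
    return tag, False


def unpivot(headers, hashtags, rows):
    """Emit one record per series cell, carrying the header text as its label."""
    records = []
    for row in rows:
        # Columns without +label (here, #adm1) describe the whole row.
        keys = {
            tag: value
            for tag, value in zip(hashtags, row)
            if not split_label(tag)[1]
        }
        for header, tag, value in zip(headers, hashtags, row):
            base, is_series = split_label(tag)
            if is_series:
                records.append({**keys, "label": header, base: value})
    return records


headers = ["Region", "2008", "2009", "2010", "2011"]
hashtags = ["#adm1"] + ["#affected+label"] * 4
rows = [
    ["Coast District", "0", "30", "100", "250"],
    ["Mountain District", "15", "75", "30", "45"],
]
records = unpivot(headers, hashtags, rows)
# records[0] == {"#adm1": "Coast District", "label": "2008", "#affected": "0"}
```

The two wide rows become eight records, each of which keeps the region, the year taken from the header, and the figure.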

Appendix A: Changes from previous versions

Changes from 1.0 beta to 1.0 final:

  • Explicitly state that the text of the standard is released into the public domain.
  • Many minor copy-editing and text-formatting changes.

Changes from 1.0 alpha to 1.0 beta:

  • Removed compact disaggregated syntax and language extensions.
  • Added tag attributes.
  • Added recommendations for language attributes.

Appendix B: Formal grammar of a HXL tag

The following Backus-Naur Form grammar, with regular expressions, defines the allowed content of a HXL tag (terminals are in uppercase):

    <hxl-tag>       ::= <tag>
                      | <tag> <attributes>
                      | <tag> WHITESPACE <attributes>

    <attributes>    ::= <attribute>
                      | <attributes> <attribute>
                      | <attributes> WHITESPACE <attribute>

    <tag>           ::= "#" TOKEN

    <attribute>     ::= "+" TOKEN

    TOKEN           ::= /[a-zA-Z][a-zA-Z0-9_]*/

    WHITESPACE      ::= /[ \t\n\r]+/
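The grammar is regular, so it can be collapsed into a single regular expression: a tag followed by zero or more attributes, each optionally preceded by whitespace. The Python sketch below is one possible translation; the function name `is_valid_hxl_tag` is an assumption for this example.

```python
import re

TOKEN = r"[a-zA-Z][a-zA-Z0-9_]*"
WHITESPACE = r"[ \t\n\r]+"

# <hxl-tag>: "#" TOKEN, then any number of ("+" TOKEN), each of which may be
# preceded by WHITESPACE.
HXL_TAG = re.compile(rf"#{TOKEN}(?:(?:{WHITESPACE})?\+{TOKEN})*")


def is_valid_hxl_tag(text):
    """Return True if the whole string matches the <hxl-tag> production."""
    return HXL_TAG.fullmatch(text) is not None


print(is_valid_hxl_tag("#adm1+code+pcode"))    # True
print(is_valid_hxl_tag("#org +funder +code"))  # True
print(is_valid_hxl_tag("#org+"))               # False: "+" without a token
print(is_valid_hxl_tag("#1org"))               # False: token starts with a digit
```

The grammar does not allow whitespace between "+" and its token, or around the tag as a whole, so a tool applying it to spreadsheet cells may want to trim each cell first.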

Appendix C: Credits

HXL is a group effort of many people and organisations, including a wider community of over 100 members of the hxlproject@googlegroups.com public mailing list. Chad Hendrix (OCHA) was the founder of the HXL standards effort, and Carsten Keßler (Hunter College) was the original technical lead. Since 2013, Sarah Telford (OCHA) has been overall programme manager and David Megginson (OCHA) has served as standards lead and chair, while John Crowley helped with outreach and governance in late 2014 and early 2015, and Aidan McGuire (ScraperWiki) has provided project management since 2015. Generous funding for HXL research and development in 2014 came from the Humanitarian Innovation Fund, and OCHA has supported continuing work. The Paul Allen Foundation has generously agreed to fund HXL work through 2016.

During 2014, the HXL Working Group included Albert Gembara (USAID), Andrej Verity (OCHA), Andrew Alspach (UNHCR), David Megginson (OCHA), Gavin Wood (UNICEF), John Crowley (World Bank), Lauren Burns (Save the Children), Maurizio Blasilli (WFP), Muhammad Rizki (IOM, later replaced by Ivan Vukovic), and Paul Currion (Humanitarian Innovation Fund).

Beginning in 2015, the HXL Working Group has included Andrej Verity (OCHA), David Megginson (OCHA), John Adams (DFID), John Crowley (originally, World Bank), Justine Mackinnon (Standby Task Force), Laurent Pitoiset (UNHCR), Sara-Jayne Terp (ThoughtWorks), and Simon Johnson (British Red Cross).