HXL tagging conventions
Note: this is an out-of-date version of the HXL tagging conventions, preserved for historical reference. Please see the latest version if you are planning to use or support HXL.
Release 1.0 beta, last updated 2015-06-03 (see previous release)
1. Introduction
This document is part of the Humanitarian Exchange Language (HXL), a standard for increasing the efficiency and effectiveness of data exchange during humanitarian crises.
The HXL standard consists of two parts:
- HXL tagging conventions (this document) — instructions for adding HXL tags to spreadsheets.
- HXL core tags — a list of hashtags for identifying humanitarian data fields.
1.1. Design philosophy
HXL is a cooperative rather than a competitive standard. Most data standards attempt to dictate to users how they should collect and format their data; HXL, on the other hand, is designed so that organisations can add tags to their existing datasets without requiring new skills, software tools, or business processes.
The primary focus of HXL is tabular-style data such as spreadsheets and API output from database tables, which represent the vast majority of the data collected in the humanitarian sphere; however, while not specified here, HXL hashtags can potentially have other applications, including labelling attributes for map layers or identifying data types in SMS messages.
1.2. Target audience
Our primary audience is information-management specialists who are familiar with spreadsheets or relational databases; our secondary audience is computer programmers and database specialists looking to consume data produced by those information-management specialists.
1.3. Terms of use
HXL is available as an open standard — we have created it especially for use with humanitarian data, but you are welcome to use it for any purpose you want, as long as you don’t claim any support or endorsement from any members of the HXL working group or the organisations for which they work. We offer no warranty of any kind, so please use the standard at your own risk.
2. Adding HXL tags to data
Consider the following simple spreadsheet:
Location name | Location code | Number affected |
---|---|---|
Camp A | 01000001 | 2000 |
Camp B | 01000002 | 750 |
Camp C | 01000003 | 1920 |
Datasets like these — longer, of course, and with more columns — are the backbone of humanitarian information management, and they provide the input for most reports, maps, and visualisations coming out of a crisis. Unfortunately, creating those products is time-consuming, and responders have to duplicate the work from crisis to crisis and even dataset to dataset, because it is hard to build reusable software tools that can understand the column headers. For example, the text header of the last column could have appeared in dozens of variants, and in several different languages:
- Number affected
- Affected
- People affected
- # de personnes concernées
- Afectadas/os
- عدد الأشخاص المتضررين
Software tools need to be able to recognise that the figures in the third column refer to the number of people affected regardless of how people have decided to label it in the spreadsheet, so HXL adds a second header row containing short hashtags:
Location name | Location code | Number affected |
---|---|---|
#loc | #loc+code | #affected |
Camp A | 01000001 | 2000 |
Camp B | 01000002 | 750 |
Camp C | 01000003 | 1920 |
Now, whether the text at the top of the column reads “Number affected” or “عدد الأشخاص المتضررين”, software for cleaning, validating, analysing, mapping, or visualising the data can automatically recognise the hashtag #affected and use the figures below accordingly.
More than one row of headers may appear above the HXL hashtag row — the hashtags themselves act as a marker to show automated systems where the headers end and the data begins:
Camp information | | Needs |
---|---|---|
Location name | Location code | Number affected |
#loc | #loc+code | #affected |
Camp A | 01000001 | 2000 |
Camp B | 01000002 | 750 |
Camp C | 01000003 | 1920 |
HXL software should expect to find the hashtag row anywhere within the first 25 rows of a dataset.
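The following is a minimal sketch of how a consumer might locate the hashtag row, assuming the rows have already been parsed into lists of strings. The function name `find_hashtag_row` and the rule that every non-empty cell in the row must look like a hashtag are illustrative assumptions, not requirements of this standard.

```python
import re

# Loose pattern for a cell that looks like a HXL hashtag with optional attributes.
HASHTAG_CELL = re.compile(
    r"\s*#[A-Za-z][A-Za-z0-9_]*(\s*\+[A-Za-z][A-Za-z0-9_]*)*\s*"
)

MAX_ROWS_TO_SCAN = 25  # the limit stated in the text above


def find_hashtag_row(rows):
    """Return the index of the first row that looks like a hashtag row, or None."""
    for index, row in enumerate(rows[:MAX_ROWS_TO_SCAN]):
        cells = [cell for cell in row if cell and cell.strip()]
        # Assumption: a hashtag row has at least one non-empty cell,
        # and every non-empty cell in it is a hashtag.
        if cells and all(HASHTAG_CELL.fullmatch(cell) for cell in cells):
            return index
    return None


rows = [
    ["Camp information", "", "Needs"],
    ["Location name", "Location code", "Number affected"],
    ["#loc", "#loc+code", "#affected"],
    ["Camp A", "01000001", "2000"],
]
tag_row = find_hashtag_row(rows)
headers, data = rows[:tag_row], rows[tag_row + 1:]
```

Everything above the returned index can be treated as human-readable headers and everything below it as data.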
3. Structure of a HXL tag
The root of a HXL hashtag follows the same syntactic rules as a Twitter hashtag: it begins with the octothorpe/pound sign (“#”) and contains only unaccented Roman alphabetic characters (so-called “ASCII letters,” “a” to “z”), Arabic numerals (“0” to “9”), and the underscore symbol (“_”). The first character after the “#” must be alphabetic, and character case does not matter (#ADM1 and #adm1 are the same hashtag, though lower case is preferred for stylistic reasons). Here are some examples of syntactically-valid HXL hashtags:
#sector
#org
#households
#impact
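As a rough illustration, the rules above can be checked with a regular expression. The helper names `is_valid_root_hashtag` and `normalise_hashtag` are hypothetical; only the character rules come from the text.

```python
import re

# "#", then an ASCII letter, then any mix of ASCII letters, digits and underscores.
ROOT_HASHTAG = re.compile(r"#[A-Za-z][A-Za-z0-9_]*")


def is_valid_root_hashtag(text):
    """True if text is a syntactically valid root hashtag (no attributes)."""
    return ROOT_HASHTAG.fullmatch(text) is not None


def normalise_hashtag(text):
    """Fold to lower case, since character case does not matter."""
    return text.lower()
```

With these helpers, `#ADM1` and `#adm1` normalise to the same string, while a tag starting with a digit or containing an accented letter is rejected.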
3.1. Hashtag attributes
The core shared HXL hashtags describe high-level concepts like an organisation (#org), geographical coordinates (#geo), a humanitarian cluster or sector (#sector), the number of people affected (#affected), or a subdivision of a country (#adm1). Humanitarian datasets, however, often need to make finer-grained distinctions. For example, is an organisation the funder (donor) or the implementing agency? Does a column contain the name of an ADM1 or its p-code?
HXL allows dataset creators to make finer distinctions by attaching attributes to a hashtag.
3.1.1. Attribute syntax
Attributes follow the same syntactic rules as hashtags, except that they begin with plus (“+”) rather than the octothorpe/pound sign (“#”), and follow the hashtag immediately, with optional whitespace separating the attributes. A hashtag may have any number of attributes, and order does not matter, so #org+funder+code has exactly the same meaning as #org+code+funder. The following examples show attributes attached to hashtags to refine their meaning:
- #org+funder: the funding organisation (e.g. a donor)
- #org+funder+code: a code for a funding organisation
- #adm1+name+fr: the name of an administrative level-one subdivision, in French
- #adm1+code+pcode: the p-code (place code) of an administrative level-one subdivision (adding +pcode to +code to further refine the code type)
- #adm1+code+iso: the ISO code of an administrative level-one subdivision (adding +iso to +code to further refine the code type)
Software processing HXL data may ignore any attributes it does not recognise and simply process the core hashtag.
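One way to honour the "order does not matter" rule is to parse a tag specification into a root hashtag plus an unordered set of attributes. This is only a sketch: `parse_tag_spec` is a hypothetical name, and folding attributes to lower case assumes that the case rule for hashtags applies to attributes as well.

```python
import re

# A root hashtag followed by zero or more attributes, with optional whitespace.
TAG_SPEC = re.compile(
    r"\s*(#[A-Za-z][A-Za-z0-9_]*)((?:\s*\+[A-Za-z][A-Za-z0-9_]*)*)\s*"
)
ATTRIBUTE = re.compile(r"\+([A-Za-z][A-Za-z0-9_]*)")


def parse_tag_spec(text):
    """Split a tag specification into (root hashtag, frozenset of attributes)."""
    match = TAG_SPEC.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid HXL tag specification: {text!r}")
    tag = match.group(1).lower()
    # A frozenset discards ordering, so attribute order cannot affect equality.
    attributes = frozenset(name.lower() for name in ATTRIBUTE.findall(match.group(2)))
    return tag, attributes
```

Because the attributes are held in a set, `parse_tag_spec("#org+funder+code")` and `parse_tag_spec("#org +code +funder")` compare equal, and software that does not recognise an attribute can simply leave it out of its own lookups.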
3.1.2. Common attributes
Data creators may invent their own attributes to suit their local data needs; however, there are some recommended common attributes that will be useful across many data types. There is a full list in the HXL core tagset, of which the following are some highlights:
+displaced +idp +injured +reached +refugees
- Classifications for counts or descriptions of people, e.g. #affected+idp for the number of internally-displaced people.
+code
- The value is a unique, machine-readable code, e.g. #adm1+code for an administrative level-one P-code.
+f +m
- The value (usually a number) refers specifically to females or males, e.g. #affected+f for the number of females affected.
+start +end
- The value refers to the beginning or end, e.g. #date+end for the end date of an activity.
Note that none of these attributes is two letters long. Attributes consisting of two alphabetic letters, such as +ar or +es, are reserved for a special purpose, as described in the next section.
3.1.3. Attributes for languages
Humanitarian crises often take place in multicultural areas, where different local groups speak different languages; furthermore, international responders helping with a crisis may need to work in their own languages as well. As a result, humanitarian datasets are sometimes multilingual, listing the same information in e.g. French and Arabic, or Dari and Pashto.
To make it easy to identify languages in HXL, the standard recommends that all two-character alphabetic attributes be reserved to represent ISO 639-1 language codes, such as “en” for English. Dataset creators can use the attributes to mark the language of a column:
Project title | Titre du projet |
---|---|
#activity+en | #activity+fr |
Malaria treatments | Traitement du paludisme |
Teacher training | Formation des enseignant(e)s |
The following language attributes (not a comprehensive list) are examples of those that might appear in international humanitarian datasets:
Attribute | Language | Attribute | Language |
---|---|---|---|
+en | English | +fa | Dari / Farsi / Persian |
+fr | French | +ps | Pashto |
+ar | Arabic | +ms | Malay |
+es | Spanish | +ur | Urdu |
+ru | Russian | +tl | Tagalog |
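Since two-letter alphabetic attributes are reserved for languages, software can pick them out by shape alone. The sketch below assumes tag specifications are available as plain strings; the function name `language_codes` is hypothetical, and no check is made that a code is actually registered in ISO 639-1.

```python
import re

# "+" followed by exactly two ASCII letters, and no further token characters.
LANGUAGE_ATTRIBUTE = re.compile(r"\+([A-Za-z]{2})(?![A-Za-z0-9_])")


def language_codes(tag_spec):
    """Return the two-letter attributes of a tag specification, lower-cased."""
    return [code.lower() for code in LANGUAGE_ATTRIBUTE.findall(tag_spec)]


tags = ["#activity+en", "#activity+fr"]
row = ["Malaria treatments", "Traitement du paludisme"]

# Index the row's values by language code.
titles = {language_codes(tag)[0]: value for tag, value in zip(tags, row)}
```

Longer attributes such as +funder or +code, and one-letter attributes such as +f, are not treated as language codes.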
3.2. Creating extension tags
The HXL core tags include tags that will be generally applicable to many humanitarian datasets, but it is impossible to anticipate every tag for every humanitarian need. This standard makes the following four recommendations for extending HXL tags and attributes:
- Whenever possible, take an existing tag with a broader meaning, and narrow it down with an attribute, e.g. #loc+hospital.
- When there is no applicable core tag, begin an extension tag with “x_”, so that it will not conflict with any future HXL core hashtags, e.g. #x_toxicity.
- When software finds a HXL hashtag that it does not recognise, e.g. #x_toxicity, it should simply ignore the column of data.
- When software finds a HXL attribute that it does not recognise, e.g. #loc+hospital, it should ignore the attribute but still process the tag (as if the dataset had contained simply #loc).
Note: some types of software may warn about unrecognised tags, to help with error detection and quality control.
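The last two recommendations might be implemented along the following lines. The sets `KNOWN_TAGS` and `KNOWN_ATTRIBUTES` are small hypothetical examples standing in for whatever vocabulary a given tool supports, and `interpret_column` is an illustrative name.

```python
# Hypothetical vocabulary of a particular tool, not the full HXL core tagset.
KNOWN_TAGS = {"#loc", "#org", "#affected"}
KNOWN_ATTRIBUTES = {"code", "funder"}


def interpret_column(tag_spec):
    """Return (tag, recognised attributes), or None if the column should be ignored."""
    # Remove optional whitespace and fold case before splitting on "+".
    parts = "".join(tag_spec.split()).lower().split("+")
    tag, attributes = parts[0], set(parts[1:])
    if tag not in KNOWN_TAGS:
        return None  # unrecognised hashtag: ignore the whole column
    # Unrecognised attributes are dropped, but the tag is still processed.
    return tag, attributes & KNOWN_ATTRIBUTES
```

Under these assumptions a #x_toxicity column is skipped, while #loc+hospital is handled as plain #loc. A tool that wants to warn about unrecognised tags could log them at the two points where they are discarded.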
4. Special cases
This section describes how to use HXL to deal with special cases that do not normally fit well into a tabular data model.
4.1. Repeating fields
A tabular format works poorly for repeated fields (e.g. an activity taking place in more than one location); however, using HXL tags, it is possible to design a spreadsheet format that allows for a fixed amount of repetition.
For example, the HXL tag for a generic geographical code (like a P-code) is #loc+code. A 3W spreadsheet for a specific country could allow room for up to three geocodes like this:
P-code 1 | P-code 2 | P-code 3 |
---|---|---|
#loc+code | #loc+code | #loc+code |
020503 | | |
060107 | 060108 | |
173219 | | |
530012 | | |
530013 | 530015 | 279333 |
By reading the HXL tag, processing software can easily recognise that the three columns represent (up to) three values for the same field, even though the full column titles differ, and even if the authors of the processing software knew nothing about the specific conventions in use in this country.
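A minimal sketch of reading repeated fields, assuming the hashtag row and data rows are already available as lists of strings. The helper name `values_for_tag` is hypothetical, and tags are compared as exact strings for simplicity.

```python
def values_for_tag(tags, row, wanted):
    """Collect the non-empty values from every column carrying the wanted tag."""
    return [
        value
        for tag, value in zip(tags, row)
        if tag == wanted and value.strip()
    ]


tags = ["#loc+code", "#loc+code", "#loc+code"]
rows = [
    ["020503", "", ""],
    ["060107", "060108", ""],
    ["530013", "530015", "279333"],
]

# One list of P-codes per data row, however many columns were filled in.
codes = [values_for_tag(tags, row, "#loc+code") for row in rows]
```

The column titles ("P-code 1", "P-code 2", and so on) play no part in the lookup; only the repeated hashtag does.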
4.2. “Wide” (series) data
“Wide” datasets are optimised for reading rather than machine processing. They place a series of data (usually numbers) across a row, showing how information varies over time, geographical area, demographic groups, or some other criterion. Here is a simple, non-HXL example listing the number of people of concern in each region during four different years.
Region | 2008 | 2009 | 2010 | 2011 |
---|---|---|---|---|
Coast District | 0 | 30 | 100 | 250 |
Mountain District | 15 | 75 | 30 | 45 |
This table presents some special challenges for tagging, because the columns headed “2008” to “2011” all represent the same kind of data, but in different years. Tagging them all as simply #affected loses important information about the series. The solution is to use an extra attribute, +label, to specify that the information in the header is a label for the data series:
Region | 2008 | 2009 | 2010 | 2011 |
---|---|---|---|---|
#adm1 | #affected+label | #affected+label | #affected+label | #affected+label |
Coast District | 0 | 30 | 100 | 250 |
Mountain District | 15 | 75 | 30 | 45 |
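To show how software might use +label, here is a rough sketch that turns the wide table into one record per region and year. It assumes the text header row directly above the hashtag row holds the series labels; the function name `unpivot` and the output key "label" are hypothetical choices.

```python
def unpivot(headers, tags, rows):
    """Produce one record per +label column, carrying the header text as the label."""
    records = []
    for row in rows:
        # Columns without +label are copied into every output record.
        fixed = {
            tag: value
            for tag, value in zip(tags, row)
            if "label" not in tag.split("+")[1:]
        }
        for header, tag, value in zip(headers, tags, row):
            root, *attributes = tag.split("+")
            if "label" in attributes:
                records.append({**fixed, "label": header, root: value})
    return records


headers = ["Region", "2008", "2009", "2010", "2011"]
tags = ["#adm1"] + ["#affected+label"] * 4
rows = [
    ["Coast District", "0", "30", "100", "250"],
    ["Mountain District", "15", "75", "30", "45"],
]
records = unpivot(headers, tags, rows)
```

Each resulting record pairs a region with a single year and value, which is usually easier for software to filter and aggregate than the wide layout.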
Appendix A: Major changes from HXL 1.0 alpha
- Removed compact disaggregated syntax and language extensions.
- Added tag attributes.
- Added recommendations for language attributes.
Appendix B: Formal grammar of a HXL tag
The following Backus-Naur Form grammar, with regular expressions, defines the allowed content of a HXL tag (terminals are in uppercase):
<hxl-tag> ::= <tag> | <tag> <attributes> | <tag> WHITESPACE <attributes>
<attributes> ::= <attribute> | <attributes> <attribute> | <attributes> WHITESPACE <attribute>
<tag> ::= "#" TOKEN
<attribute> ::= "+" TOKEN
TOKEN ::= /[a-zA-Z][a-zA-Z0-9_]*/
WHITESPACE ::= /[ \t\n\r]+/
Appendix C: Credits
HXL is a group effort of many people and organisations, including a wider community of over 100 members of the hxlproject@googlegroups.com public mailing list. Chad Hendrix (OCHA) was the founder of the HXL standards effort, and Carsten Keßler (Hunter College) was the original technical lead. Since 2013, Sarah Telford (OCHA) has been overall programme manager and David Megginson (OCHA) has served as standards lead and chair, while John Crowley (World Bank) has helped with outreach and governance beginning in late 2014. Generous funding for HXL research and development in 2014 came from the Humanitarian Innovation Fund, and OCHA has supported continuing work.
During 2014, the HXL Working Group included Albert Gembara (USAID), Andrej Verity (OCHA), Andrew Alspach (UNHCR), David Megginson (OCHA), Gavin Wood (UNICEF), John Crowley (World Bank), Lauren Burns (Save the Children), Maurizio Blasilli (WFP), Muhammad Rizki (IOM, later replaced by Ivan Vukovic), and Paul Currion (Humanitarian Innovation Fund).
To date in 2015, the HXL Working Group has included Andrej Verity (OCHA), David Megginson (OCHA), John Adams (DFID), John Crowley (World Bank), Justine Mackinnon (Standby Task Force), Laurent Pitoiset (UNHCR), Sara-Jayne Terp (Ushahidi), and Simon Johnson (British Red Cross).