HXL tagging conventions (version 1.1)

Release 1.1, 2018-04-30 (permalink, previous release) Part of the HXL 1.1 standard.

1. Introduction

This document is part of the Humanitarian Exchange Language (HXL) version 1.1, a standard for increasing the efficiency and effectiveness of data exchange during humanitarian crises. This new version is fully backwards-compatible with data produced using HXL 1.0 (released 18 March 2016), and adds several new features, including JSON-based encodings and a standard way to refer to taxonomies/controlled vocabularies. There are also several new hashtags and attributes in the hashtag dictionary. The intended audience for this specification is information-management professionals and software developers who require a formal definition of the HXL syntax. Most users who simply want to add hashtags to their data may prefer the HXL postcards and the tutorial information at hxlstandard.org, as well as interactive HXL tool support under development at the Humanitarian Data Exchange (HDX). The HXL standard consists of two normative parts:

HXL tagging conventions (this document) — instructions for adding HXL hashtags to spreadsheets.
HXL hashtag dictionary — a list of hashtags for identifying humanitarian data fields.

1.1. Design philosophy

HXL is a lightweight standard by design. Most data standards dictate to users how they should collect and format their data; HXL, on the other hand, encourages organisations to add hashtags to their existing datasets, without requiring new skills or software tools, and interferes as little as possible in their current ways of working. The primary focus of HXL is tabular-style data such as spreadsheets or API output from database tables, which represent the vast majority of the operational data collected in the humanitarian sphere; however, HXL hashtags can potentially have other applications, including labelling attributes for map layers or identifying data types in SMS messages.

1.2. Target audience

The standard’s primary audiences are information-management specialists who are familiar with spreadsheets or relational databases, and computer programmers and database specialists looking to consume data produced by those information-management specialists.

1.3. Terms of use

HXL is available as an open standard — the working groups have designed it for use with humanitarian data, but people and organisations are welcome to use it for any purpose they choose. Note, however, that users may not claim support or endorsement from any members of the HXL working group or the organisations for which they work. The authors offer no warranty of any kind, so implementors use the standard at their own risk. The text of the standard itself is released into the public domain.

2. Adding HXL hashtags to data

2.1 Spreadsheet Data eg. csv, xls, xlsx

Consider the following simple spreadsheet:

LOCATION NAME	LOCATION CODE	NUMBER AFFECTED
Camp A	01000001	2000
Camp B	01000002	750
Camp C	01000003	1920

Datasets like this — longer, of course, and with more columns — are the backbone of humanitarian information management, and they provide the input for most reports, maps, and visualisations coming out of a crisis. Unfortunately, creating those data products is time-consuming, and responders have to duplicate the work from crisis to crisis and even dataset to dataset, because it is hard to build reusable software tools that can understand the many different ways responders may choose to label their data. For example, the text header of the last column could have appeared in dozens of variants, and in several different languages:

Number affected
Affected
People affected
# de personnes concernées
Afectadas/os
عدد الأشخاص المتضررين

Software tools need to be able to recognise that the figures in the third column refer to the number of people affected — regardless of how the data provider has decided to label it in the spreadsheet — so HXL adds a second header row containing short hashtags:

LOCATION NAME	LOCATION CODE	NUMBER AFFECTED
#loc +name	#loc +code	#affected
Camp A	01000001	2000
Camp B	01000002	750
Camp C	01000003	1920

Now, whether the text at the top of the column reads “Number affected” or “عدد الأشخاص المتضررين”, software for cleaning, validating, analysing, mapping, or visualising the data can automatically recognise the hashtag #affected and use the figures below accordingly. More than one row of headers may appear above the HXL hashtag row — the hashtags themselves act as a marker to show automated systems where the headers end and the data begins:

CAMP INFORMATION		NEEDS
LOCATION NAME	LOCATION CODE	NUMBER AFFECTED
#loc +name	#loc +code	#affected
Camp A	01000001	2000
Camp B	01000002	750
Camp C	01000003	1920

HXL software should expect to find the hashtag row anywhere within the first 25 rows of a dataset and should assume that all rows below the hashtag row contain data.

2.2 JSON data

It is becoming increasingly common for organisations to share data through APIs. HXL is well placed to add interoperability to that data through its support for JSON, the format most-commonly used by APIs. HXL is purposely restricted to a simplified subset of the full JSON specification. In this simplified subset, the data must be laid out in a non-hierarchical and tabular form. Two such forms are currently supported.

2.2.1. Array of objects JSON style

This is a very common way for data to be presented where each row is a lookup between a hashtag key and a value:

[
  {
    "#hashtag": value,
    "#hashtag": value
  },
  {
    "#hashtag": value,
    "#hashtag": value
  }
]

An example of this is shown below:

[
  {
    "#event+id": 1,
    "#affected+killed": 1,
    "#region": "Mediterranean",
    "#meta+source+reliability": "Verified",
    "#date+reported": "05/11/2015",
    "#geo+lat": 36.891500,
    "#geo+lon": 27.287700
  },
  {
    "#event+id": 3,
    "#affected+killed": 1,
    "#region": "Central America incl. Mexico",
    "#meta+source+reliability": "Partially Verified",
    "#date+reported": "03/11/2015",
    "#geo+lat": 15.956400,
    "#geo+lon": -93.663100
  }
]

For repeated same named hashtags eg. to express multiple sectors using repeated #sector columns, the equivalent in this format is a comma separated list of sectors (see 4.1.1. The +list attribute), e.g.

"#sector": "WASH,health"

Note that the array of objects form does not allow for human-readable headers. If there is a demand for these — and the array of arrays form outlined below does not suffice — then they will appear in a future version of the standard. Note: HXL allows hashtag attributes to appear in any order, case-insensitive, with or without whitespace separating them, so these are all considered equivalent: “#affected+f+children”, “#affected +children +f”, and “#affected+Children+F”. In JSON objects, it is essential that the property names be consistent, so you should take the following steps when converting a HXL hashtag specification (hashtag and attributes) for use as a JSON object property:

Convert to lowercase.
Remove all whitespace.
Present the attributes in US-ASCII alphabetical order.

Following these rules, the JSON property-name representation of the above HXL hashtag specification will always be “#affected+children+f”.

2.2.2. Array of arrays JSON style

Although not widely used, this form is ideally suited to visualisations because it is significantly more compact than the Array of Objects format as the hashtags are only defined once in the first element of the Array:

[
  ["#hashtag", "#hashtag"],
  [value,value],
  [value,value]
]

Below is an example:

[
  ["#event+id","#affected+killed","#region","#meta+source+reliability", "#date+reported","#geo+lat","#geo+lon"],
  [1, 1, "Mediterranean", "Verified", "2015-11-05", 36.891500,27.287700],
  [3, 1, "Central America incl. Mexico", "Partially Verified", "2015-11-03", 15.956400, -93.663099]
]

If headers are needed, they can be added as an extra array prior to the hashtags.

3. Structure of a HXL hashtag

The root of a HXL hashtag follows the same syntactic rules as a Twitter hashtag: it begins with the octothorpe/pound sign (“#”) and contains only unaccented Roman alphabetic characters (so-called “ASCII letters,” “a” to “z”), Arabic numerals (“0” to “9”), and the underscore symbol (“_”). The first character must be alphabetic, and character case does not matter (#ADM1 and #adm1 are the same hashtag, though lowercase is preferred for stylistic reasons). Here are some examples of syntactically-valid HXL hashtags:

#sector
#org
#households
#impact

3.1. Hashtag attributes

The core shared HXL hashtags describe high-level concepts like an organisation (#org), geographical coordinates (#geo), a humanitarian cluster or sector (#sector), the number of people affected (#affected), or a subdivision of a country (#adm1). Humanitarian datasets, however, often need to make finer-grained distinctions. For example, is an organisation the funder (donor) or the implementing agency? Does a column contain the name of an administrative subdivision or its code? HXL allows data providers to make these distinctions by attaching attributes to a hashtag.

3.1.1. Attribute syntax

Attributes follow the same syntactic rules as hashtags, except that they begin with plus (“+”) rather than the octothorpe/pound sign (“#”), and follow the hashtag , with optional whitespace separating the attributes. A hashtag may have any number of attributes, and order does not matter, so #org +funder +code has exactly the same meaning as #org +code +funder. The following examples show attributes attached to hashtags to refine their meaning:

#org +funder — the funding organisation (e.g. a donor)
#org +funder +code — a machine-readable code (of some type) for a funding organisation
#adm1 +name +fr — the name of an administrative level-one subdivision, in French
#adm1 +code +v_pcode — the P-code (place code) of an administrative level-one subdivision (adding +v_pcode to +code to further refine the code type)
#adm1 +code +v_iso3 — the ISO code of an administrative level-one subdivision (adding +v_iso3 to +code to further refine the code type)

Software processing HXL data may ignore any attributes it does not recognise and simply process the core hashtag. For more information on +v_ attributes, see 3.1.4. Attributes for controlled vocabularies. Note: all attributes beginning with a single alphabetic character followed by an underscore (e.g. +v_ and +x_) are reserved for special use in HXL.

3.1.2. Common attributes

Data providers may invent their own attributes to suit their local data needs; however, there are some recommended common attributes that will be useful across many data types. There is a full list in the HXL core tagset^[a], of which the following are some highlights:

+displaced +idps +injured +reached +refugees

Classifications for counts or descriptions of people, e.g. #affected +idps for the number of internally-displaced people.

+code

The value is a unique, machine-readable code, e.g. #adm1+code for an administrative level-one P-code.

+f +m +i

The value (usually a number) refers specifically to people of a specific gender (or +i for non-binary), e.g. +affected +f for the number of female people affected.

+start +end

The value refers to the beginning or end, e.g. #date+end for the end date of an activity.

3.1.3. Attributes for languages

Humanitarian crises often take place in multicultural areas, where different local groups speak different languages; furthermore, international responders helping with a crisis may need to work in their own languages as well. As a result, humanitarian datasets are sometimes multilingual, listing the same information in e.g. French and Arabic, or Dari and Pashto. To make it easy to identify languages in HXL, the standard recommends that all two-character alphabetic attributes be reserved to represent ISO 639-1 language codes, such as +i_en for English. Data providers can use the attributes to mark the language of a column:

PROJECT TITLE	TITRE DU PROJET
#activity +i_en	#activity +i_fr
Malaria treatments	Traitement du paludisme
Teacher training	Formation des enseignant(e)s

The following language attributes (not a comprehensive list) are examples of those that might appear in international humanitarian datasets:

+i_en	English	+i_fa	Dari / Farsi / Persian
+i_fr	French	+i_ps	Pashto
+i_ar	Arabic	+i_ms	Malay
+i_es	Spanish	+i_ur	Urdu
+i_ru	Russian	+i_tl	Tagalog

3.1.4. Attributes for controlled vocabularies

New in version 1.1. While the +code HXL attribute indicates that a value is a machine-readable code of some kind, it does not tell exactly what vocabulary, code list, or taxonomy is in use. Beginning with release 1.1, any HXL attribute beginning in +v_ represents a short identifier for a controlled vocabulary, which you can look up (if desired) from a master HXL dataset at [url]^[b]. For example, the HXL hashtag specification #country +code +v_iso3 indicates that the column contains country codes from the “v_iso3” vocabulary. To find more information about that vocabulary, software (or a human reader) may look it up in the master dataset, and find information like this:

Vocabulary identifier	Vocabulary name	Controlling org	Home page
#vocab +att	#vocab +name	#org +controlling	#vocab +url +home
+v_iso3	ISO 3166-1 alpha 3: Codes for the representation of names of countries and their subdivisions — Part 1: Country codes (3-letter identifiers)	International Organization for Standardization (ISO)	https://www.iso.org/standard/63545.html

The #vocab hashtag is specific to this purpose, and not part of the core HXL hashtags. The maintainers will be adding additional columns to the dataset in the future, but these four core columns, at a minimum, should always be present. To request registering a new vocabulary identifier, please post a message to the public hxlproject@googlegroups.com mailing list for discussion.

3.2. Creating extension hashtags

The HXL core hashtags include hashtags that will be generally applicable to many humanitarian datasets, but it is impossible to anticipate every hashtag for every humanitarian need. This standard makes the following four recommendations^[c]^[d]^[e] for extending HXL hashtags and attributes:

Whenever possible, take an existing hashtag with a broader meaning, and narrow it down with an attribute, e.g. #loc +hospital.
When there is no applicable core hashtag, begin an extension hashtag with “x_”, so that it will not conflict with any future HXL core hashtags, e.g. #x_toxicity.
When software finds a HXL hashtag that it does not recognise, e.g. #x_toxicity, it should simply ignore the column of data.
When software finds a HXL attribute that it does not recognise, e.g. #loc+hospital, it should ignore the attribute but still process the hashtag and any other attributes it does recognise: in this case, it should behave as if the dataset had contained simply #loc.

Note: software designers may choose to warn about unrecognised hashtags and attributes, to help with error detection and quality control.

4. Special cases

This section describes how to use HXL to deal with special cases that do not normally fit well into a tabular data model.

4.1. Repeating fields

A tabular format works poorly for repeated fields (e.g. an activity taking place in more than one location); however, using HXL hashtags, it is possible to design a spreadsheet format that allows for a fixed upper-limit of repetition. For example, the HXL hashtag for a generic geographical code (like a P-code) is #loc+code. A 3W spreadsheet for a specific country could allow room for up to three geocodes, like this:

P-CODE 1	P-CODE 2	P-CODE 3
#loc+code	#loc+code	#loc+code
020503
060107	060108
173219
530012
530013	530015	279333

This approach is not currently possible with the Array of objects JSON style, since any key may appear only once in a JSON object. Using the Array of arrays JSON encoding, the above example would look like this:

[
  ["P-CODE 1", "P-CODE 2", "P-CODE 3"],              
  ["#loc+code", "#loc+code", "#loc+code"],
  ["020503"],
  ["060107", "060108"],
  ["173219"],
  ["530013", "530015", "279333"]
]

By reading the HXL hashtag, processing software can easily recognize that the three columns represent (up to) three values for the same field, even though the full column titles differ, and even if the authors of the processing software knew nothing about the specific conventions in use in this country.

4.1.1. The +list attribute

New in version 1.1. As an alternative, HXL also supports the inclusion of multiple values in a single field, such as a spreadsheet cell, separated by a comma (or optionally, other punctuation). Note that this method—while popular with spreadsheet users—is not as reliable as using separate columns, and will not allow you to use standard spreadsheet functions for filtering and sorting while creating data. It may also not be as well-supported by HXL-aware tools, such as mapping and other visualisation services. To include multiple values in a single cell, add the +list attribute to the hashtag. Here is the earlier example recast with multiple values in a single cell:

P-CODE
#loc +code +list
020503
060107, 060108
173219
530012
530013, 530015, 279333

This is the only approach currently possible for representing repeating fields using the Array of objects JSON style:

[
  {
    "#loc+code+list": "020503"
  },
  {
    "#loc+code+list": "060107,060108"
  },
  {
    "#loc+code+list": "173219"
  },
  {
    "#loc+code+list": "530012"
  },
  {
    "#loc+code+list": "530013,530015,279333"
  }
]

With the Array of arrays JSON style, the above example would look much like the spreadsheet table:

[
  ["#loc+code"],
  ["020503"],
  ["060107,060108"],
  ["173219"],
  ["530013,530015,279333"]
]

Note, again, that some HXL-aware software might not process this approach correctly, and that it can affect the display of reports, charts, and maps.

4.2. “Wide” (series) data

“Wide” datasets are optimised for reading rather than machine processing. They place a series of data (usually numbers) across a row, showing how information varies over time, geographical area, demographic groups, or some other criterion. Here is a simple, non-HXL example listing the number of people of concern in each region during four different years.

REGION	2008	2009	2010	2011
Coast District	0	30	100	250
Mountain District	15	75	30	45

This table presents some special challenges for tagging, because the columns headed “2008” to “2011” all represent the same kind of data, but in different years. Tagging them all as simply #affected loses important information about the series. The solution is to use an extra attribute, +label, to specify that the information in the header is a label for the data series:

REGION	2008	2009	2010	2011
#adm1 +name	#affected+label	#affected+label	#affected+label	#affected+label
Coast District	0	30	100	250
Mountain District	15	75	30	45

Appendix A: Changes from previous versions

Major changes from 1.0 final to 1.1 final:

Added JSON encodings.
Added +v_* vocabulary attributes.
Prefix language attributes with +i_ so that +fr in HXL 1.0 becomes +i_fr in HXL 1.1.
Added +list attribute and convention for multiple values in a single cell.
Reserved all attributes beginning with a single letter followed by underscore for future use.

Changes from 1.0 beta to 1.0 final:

Explicitly state that the text of the standard is released into the public domain.
Many minor copy-editing and text-formatting changes.

Changes from 1.0 alpha to 1.0 beta:

Removed compact disaggregated syntax and language extensions.
Added hashtag attributes.
Added recommendations for language attributes.

Appendix B: Formal grammar of a HXL hashtag

The following Backus-Naur Form grammar, with regular expressions, defines the allowed content of a HXL hashtag (terminals are in uppercase):

<hxl-tagspec>           ::= <hashtag>
    | <hashtag> <attributes>
    | <hashtag> WHITESPACE <attributes>

<attributes>            ::= <attribute>
    | <attributes> <attribute>
    | <attributes> WHITESPACE <attribute>

<hashtag>               ::=    "#" TOKEN

<attribute>             ::= "+" TOKEN

TOKEN                   ::=      /[a-zA-Z][a-zA-Z0-9_]*/

WHITESPACE              ::=    /[ \t\n\r]+/

Appendix C: Credits

HXL is a group effort of many people and organisations, including a wider community of over 100 members of the hxlproject@googlegroups.com public mailing list. C.J. Hendrix (OCHA) was the founder of the HXL standards effort, and Carsten Keßler (Hunter College) was the original technical lead. Since 2013, Sarah Telford (OCHA) has been overall programme manager and David Megginson (OCHA) has served as standards lead and chair, while John Crowley helped with outreach and governance in late 2014 and early 2015, and Aidan McGuire (ScraperWiki) has provided project management since 2015. Generous funding for HXL research and development in 2014 came from the Humanitarian Innovation Fund, and OCHA has supported continuing work. The Paul Allen Foundation has generously agreed to fund HXL work through 2016. During 2014, the HXL Working Group included Albert Gembara (USAID), Andrej Verity (OCHA), Andrew Alspach (UNHCR), David Megginson (OCHA), Gavin Wood (UNICEF), John Crowley (World Bank), Lauren Burns (Save the Children), Maurizio Blasilli (WFP), Muhammad Rizki (IOM, later replaced by Ivan Vukovic), and Paul Currion (Humanitarian Innovation Fund). From 2015 to the time of the HXL 1.1 release, the HXL Working Group has at various times included Aidan McGuire (OCHA), Andrej Verity (OCHA), CJ Hendrix (OCHA), David Megginson (OCHA), Guillaume Nanin, Helen Campbell (IFRC), Jan Rapp (INSO), John Adams (DFID), John Crowley (originally, World Bank), Justine Mackinnon, Kashif Nadeem (IOM), Laurent Pitoiset (UNHCR), Mike Rans (OCHA), Sara-Jayne Terp (ThoughtWorks), Simon Johnson (British Red Cross), Vincent Trousseau (CaLP and World Vision), and Wesley DeWitt (World Bank). Thank you also to the Standby Taskforce for their participation.

OCHA Services