HXL tagging conventions, version 1.1 beta

Release 1.1 beta, 2017-12-22

1. Introduction

This document is part of the Humanitarian Exchange Language (HXL) version 1.1, a standard for increasing the efficiency and effectiveness of data sharing during humanitarian crises. This new version is fully backwards-compatible with data produced using HXL 1.0 (released 18 March 2016), and adds several new features, including JSON-based encodings and a standard way to refer to taxonomies/controlled vocabularies. There are also several new hashtags and attributes in the hashtag and attribute dictionary.

The intended audience for this specification is information-management professionals and software developers who require a formal definition of the HXL syntax. Most users who simply want to add hashtags to their data may prefer the HXL postcards and the tutorial information at hxlstandard.org, as well as interactive HXL tool support under development at the Humanitarian Data Exchange (HDX).

The HXL standard consists of two normative parts:

  1. HXL tagging conventions (this document) — instructions for adding HXL hashtags to spreadsheets.
  2. HXL hashtag and attribute dictionary — a list of core hashtags and attributes for identifying humanitarian data fields.

There are also two non-normative annexes:

  1. HXL postcards — two-sided 10×15 cm (4×6 in) cards in multiple languages, for quick reference.
  2. Vocabulary registry — recommended taxonomies and code lists for use in HXL-encoded datasets.

1.1. Design philosophy

HXL is a lightweight standard by design. Most data standards dictate to users how they should collect and format their data; HXL, on the other hand, encourages organisations to add hashtags to their existing datasets, without requiring new skills or software tools, and interferes as little as possible in their current ways of working.

The primary focus of HXL is tabular-style data such as spreadsheets or API output from database tables, which represent the vast majority of the operational data collected in the humanitarian sphere; however, HXL hashtags can potentially have other applications, including labelling attributes for map layers or identifying data types in SMS messages. Starting with version 1.1, HXL also supports simple JSON-based representations of tabular data.

1.2. Terms of use

HXL is available as an open standard — the working group has designed it for use with humanitarian data, but people and organisations are welcome to use it for any purpose they choose. Note, however, that users may not claim support or endorsement from any members of the HXL working group or the organisations for which they work. The authors offer no warranty of any kind, so implementors use the standard at their own risk.

The text of the standard itself is released into the public domain.

2. Adding HXL hashtags to data

Updated in version 1.1 to add JSON encodings.

This section includes instructions for adding HXL hashtags to spreadsheet-style (tabular) or JSON data.

2.1. Spreadsheet-style (tabular) data

Consider the following simple spreadsheet:

LOCATION NAME  LOCATION CODE  NUMBER AFFECTED
Camp A         01000001       2000
Camp B         01000002       750
Camp C         01000003       1920

Datasets like this — longer, of course, and with more columns — are the backbone of humanitarian information management, and they provide the input for most reports, maps, and visualisations coming out of a crisis. Unfortunately, creating those data products is time-consuming, and responders have to duplicate the work from crisis to crisis and even dataset to dataset, because it is hard to build reusable software tools that can understand the many different ways responders may choose to label their data. For example, the text header of the last column could have appeared in dozens of variants, and in several different languages:

  • Number affected
  • Affected
  • People affected
  • # de personnes concernées
  • Afectadas/os
  • عدد الأشخاص المتضررين

Humanitarian software tools need to be able to recognise that the figures in the third column refer to the number of people affected — regardless of how the data provider has decided to label it in the spreadsheet — so HXL adds a second header row containing short hashtags:

LOCATION NAME  LOCATION CODE  NUMBER AFFECTED
#loc +name     #loc +code     #affected
Camp A         01000001       2000
Camp B         01000002       750
Camp C         01000003       1920

Now, whether the text at the top of the column reads “Number affected” or “عدد الأشخاص المتضررين”, software for cleaning, validating, analysing, mapping, or visualising the data can automatically recognise the hashtag #affected and use the figures below accordingly. A full list of core hashtags appears in the HXL hashtag and attribute dictionary.

Some of the hashtags have extra attributes like +name and +code to refine their meaning. See Hashtag attributes for more information.

More than one row of headers may appear above the HXL hashtag row — the hashtags themselves act as a marker to show automated systems where the headers end and the data begins:

CAMP INFORMATION              NEEDS
LOCATION NAME  LOCATION CODE  NUMBER AFFECTED
#loc +name     #loc +code     #affected
Camp A         01000001       2000
Camp B         01000002       750
Camp C         01000003       1920

HXL software should expect to find the hashtag row anywhere within the first 25 rows of a dataset and should assume that all rows below the hashtag row contain data.
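
The detection rule above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes the dataset is already loaded as a list of rows, the function name find_hashtag_row is hypothetical, and the test used here (every non-empty cell in the row must be a valid hashtag specification) is one possible heuristic that the standard does not prescribe.

```python
import re

# One cell of a hashtag row: a hashtag followed by optional attributes.
TAGSPEC = re.compile(r"^\s*#[A-Za-z][A-Za-z0-9_]*(\s*\+[A-Za-z][A-Za-z0-9_]*)*\s*$")

MAX_HEADER_ROWS = 25  # the hashtag row should appear within the first 25 rows


def find_hashtag_row(rows):
    """Return the index of the hashtag row, or None if none is found."""
    for index, row in enumerate(rows[:MAX_HEADER_ROWS]):
        cells = [str(cell) for cell in row if cell is not None and str(cell).strip()]
        # Assumption: a hashtag row has at least one non-empty cell,
        # and every non-empty cell is a valid hashtag specification.
        if cells and all(TAGSPEC.match(cell) for cell in cells):
            return index
    return None


rows = [
    ["CAMP INFORMATION", "", "NEEDS"],
    ["LOCATION NAME", "LOCATION CODE", "NUMBER AFFECTED"],
    ["#loc +name", "#loc +code", "#affected"],
    ["Camp A", "01000001", 2000],
]
header_index = find_hashtag_row(rows)
data_rows = rows[header_index + 1:]  # everything below the hashtag row is data
```

A more tolerant implementation might accept a row in which only most cells are hashtags, so that a stray untagged column does not hide the hashtag row.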

2.2. JSON data

New in version 1.1.

It is becoming increasingly common for web-based applications to share data with other applications online through application programming interfaces (APIs). A popular data input/output representation for APIs is JavaScript Object Notation (JSON), which is difficult for humans to use, but very simple for software to ingest.

Humanitarian organisations providing APIs for machine-to-machine data sharing can add HXL hashtags to their JSON data using the conventions in this section. Note that, as with the spreadsheet-style representation, HXL’s goal is not to cover every possible variation and edge case; instead, the standard defines two JSON HXL hashtagging styles that will work for a wide range of typical API use cases:

  1. Adding hashtags to a JSON array of objects (dictionary style).
  2. Adding hashtags to a JSON array of arrays (tabular style).

2.2.1. Array of objects JSON style

New in version 1.1.

With this style, the HXL-hashtagged dataset appears as a JSON array of “objects” (aka dictionaries, hash tables), where the unique keys are HXL hashtags and attributes, and the values represent the contents of one of the rows in the dataset:

[
  {
    "#event+id": 1,
    "#affected+killed": 1,
    "#region": "Mediterranean",
    "#date+reported": "2015-11-05",
    "#geo+lat": 36.891500,
    "#geo+lon": 27.287700
  },
  {
    "#event+id": 3,
    "#affected+killed": 1,
    "#region": "Central America incl. Mexico",
    "#date+reported": "2015-11-05",
    "#geo+lat": 15.956400,
    "#geo+lon": -93.663100
  }
]

This JSON style repeats the hashtags and attributes in each row. The data in the above example is exactly equivalent to the following tabulated HXL data:

#event +id  #affected +killed  #region                       #date +reported  #geo +lat  #geo +lon
1           1                  Mediterranean                 2015-11-05       36.891500  27.287700
3           1                  Central America incl. Mexico  2015-11-05       15.956400  -93.663100

Caveats: while this JSON style aligns most closely with common web API use, it does lack some abilities of a tabular data style:

  1. You cannot include human-readable headers such as “Geographical region” in addition to the HXL hashtags.
  2. You cannot repeat a hashtag/attribute combination to represent repeated values (though you can still use the +list attribute as a work-around).
  3. You do not have the same flexibility to include HXL attributes in any order, with or without whitespace (see details below).

Note on attributes: HXL normally allows hashtag attributes to appear in any order, case-insensitive, with or without whitespace separating them, so these are all considered equivalent: #affected+f+children, #affected +children +f, and #affected+Children+F. In JSON objects, it is essential that the property names be consistent, so you should take the following steps when converting a HXL hashtag specification (hashtag and attributes) for use as a JSON object property:

  1. Convert to lowercase.
  2. Remove all whitespace.
  3. Present the attributes in US-ASCII alphabetical order.

Following these rules, the JSON property-name representation of the above HXL hashtag specification will always be #affected+children+f:

{
  "#affected+children+f": 27
}
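
The three conversion steps can be sketched in Python as below. The function name to_json_key is hypothetical, and the sketch assumes its input is already a syntactically valid hashtag specification.

```python
def to_json_key(tagspec):
    """Normalise a HXL hashtag specification for use as a JSON property name."""
    # Steps 1 and 2: convert to lowercase and remove all whitespace.
    compact = "".join(tagspec.lower().split())
    # The first piece is the hashtag; the rest are attributes.
    hashtag, *attributes = compact.split("+")
    # Step 3: sort the attributes. Python compares strings by code point,
    # which matches US-ASCII order for the characters allowed in attributes.
    return "+".join([hashtag] + sorted(attributes))


key = to_json_key("#affected +F +Children")
```

Applying the same function when writing and when reading JSON keeps property names consistent regardless of how the original spreadsheet spelled the hashtag.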

2.2.2. Array of arrays JSON style

New in version 1.1.

An alternative JSON style is much closer to the tabular structure of a spreadsheet. Web APIs do not use this style as often as the array of objects style, but it has the advantage of being more compact and easier to import into a spreadsheet or database. In this style, each row of data—including the header row(s) and HXL hashtag row—simply appears as a JSON array of scalar values:

[
  ["#event +id", "#affected +killed", "#region", "#date +reported", "#geo +lat", "#geo +lon"],
  [1, 1, "Mediterranean", "2015-11-05", 36.891500, 27.287700],
  [3, 1, "Central America incl. Mexico", "2015-11-05", 15.956400, -93.663100]
]

The example above is, again, exactly equivalent to the following tabulated HXL data:

#event +id  #affected +killed  #region                       #date +reported  #geo +lat  #geo +lon
1           1                  Mediterranean                 2015-11-05       36.891500  27.287700
3           1                  Central America incl. Mexico  2015-11-05       15.956400  -93.663100

This JSON representation does not share the disadvantages of the array of objects approach: it allows multiple rows of human-readable headers, repeated columns, and flexibility in ordering and whitespace for attributes. However, because it is less common, it may not align as well with existing API output.
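
Because the two JSON styles carry the same information, converting between them is mechanical. The following Python sketch goes from the array of arrays style to the array of objects style. It assumes the hashtag row is the first row (no text headers above it), and the function names are hypothetical. Repeated hashtag specifications raise an error, since they cannot be represented as keys of one JSON object.

```python
def normalise(tagspec):
    """Lowercase, strip whitespace, and sort attributes (see section 2.2.1)."""
    hashtag, *attributes = "".join(tagspec.lower().split()).split("+")
    return "+".join([hashtag] + sorted(attributes))


def arrays_to_objects(rows):
    """Convert array-of-arrays rows (hashtag row first) to array-of-objects."""
    hashtags, *data = rows
    keys = [normalise(tag) for tag in hashtags]
    if len(set(keys)) != len(keys):
        raise ValueError("repeated hashtag specifications cannot be JSON object keys")
    return [dict(zip(keys, row)) for row in data]


rows = [
    ["#event +id", "#affected +killed", "#region", "#date +reported", "#geo +lat", "#geo +lon"],
    [1, 1, "Mediterranean", "2015-11-05", 36.8915, 27.2877],
    [3, 1, "Central America incl. Mexico", "2015-11-05", 15.9564, -93.6631],
]
objects = arrays_to_objects(rows)
```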

3. Structure of a HXL hashtag

The HXL hashtag itself follows the same syntactic rules as a Twitter hashtag: it begins with the octothorpe/pound sign (“#”) and contains only unaccented Roman alphabetic characters (so-called “ASCII letters,” “a” to “z”), Arabic numerals (“0” to “9”), and the underscore symbol (“_”). The first character must be alphabetic, and character case does not matter (#ADM1 and #adm1 are the same hashtag, but you should use lowercase for stylistic reasons). Here are some examples of syntactically-valid HXL hashtags:

  • #sector
  • #org
  • #affected
  • #impact

A full list of core hashtags appears in the HXL hashtag and attribute dictionary. You may also create your own hashtags if you need them: see Creating extension hashtags for best practices.

3.1. Hashtag attributes

The core shared HXL hashtags describe high-level concepts like an organisation (#org), geographical coordinates (#geo), a humanitarian cluster or sector (#sector), the number of people affected (#affected), or a subdivision of a country (#adm1). Humanitarian datasets, however, often need to make finer-grained distinctions. For example, is an organisation the funder (donor) or the implementing agency? Does a column contain the name of an administrative subdivision or its code?

HXL allows data providers to make these distinctions by attaching attributes to a hashtag. Think of attributes as tags for the hashtags.

A full list of core attributes appears in the HXL hashtag and attribute dictionary. You may also create your own attributes if you need them: see Creating extension hashtags for best practices.

3.1.1. Attribute syntax

Attributes follow the same syntactic rules as hashtags, except that they begin with plus (“+”) rather than the octothorpe/pound sign (“#”). They follow the hashtag, with optional whitespace separating the attributes. A hashtag may have any number of attributes, and order does not matter (except in the case of the JSON array of objects style), so #org +funder +code has exactly the same meaning as #org +code +funder. The following examples show attributes attached to hashtags to refine their meaning:

#org +funder
  The funding organisation (e.g. a donor). The +funder attribute tags #org to give more information.
#org +funder +code
  Machine-readable code (of some type) for a funding organisation.
#adm1 +name +i_fr
  The name of an administrative level-one subdivision (such as a province), in French.
#adm1 +code +v_pcode
  The P-code (place code) of an administrative level-one subdivision (adding +v_pcode to +code to further refine the code type).

Software processing HXL data may ignore any attributes it does not recognise and simply process the core hashtag. For example, if you find #org +xyzzy and can’t interpret the +xyzzy attribute, just treat it as #org.

For more information on “+v_” attributes, see 3.1.4. Attributes for controlled vocabularies.

Note: all attributes beginning with a single alphabetic character followed by an underscore (e.g. “+i_”, “+v_”, and “+x_”) are reserved for special use in HXL.
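
As an illustration of these rules, the Python sketch below compares hashtag specifications without regard to attribute order, case, or whitespace, and detects the reserved attribute prefixes. The function names are hypothetical, and the sketch assumes syntactically valid input rather than validating it.

```python
def parse_tagspec(tagspec):
    """Return (hashtag, frozenset of attributes), lowercased, without the # or + signs."""
    compact = "".join(tagspec.lower().split())
    hashtag, *attributes = compact.lstrip("#").split("+")
    # A set discards ordering, so attribute order cannot affect comparisons.
    return hashtag, frozenset(attributes)


def same_meaning(first, second):
    """True if two hashtag specifications have the same hashtag and attributes."""
    return parse_tagspec(first) == parse_tagspec(second)


def is_reserved(attribute):
    """True for attributes such as +i_fr or +v_pcode: one ASCII letter, then an underscore."""
    name = attribute.lstrip("+").lower()
    return len(name) >= 2 and name[0].isascii() and name[0].isalpha() and name[1] == "_"


equivalent = same_meaning("#org +funder +code", "#ORG+code +funder")
```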

3.1.2. Common attributes

Data providers may invent their own attributes to suit their local data needs (see Creating extension hashtags for best practices); however, there are some recommended common attributes that will be useful across many data types. There is a full list in the HXL hashtag and attribute dictionary, of which the following are some highlights:

+displaced +idps +injured +reached +refugees +abducted
  Classifications for counts or descriptions of people, e.g. #affected +idps for the number of internally-displaced people.
+code
  The value is a unique, machine-readable code, e.g. #adm1 +code for an administrative level-one P-code.
+f +m +i
  The value (usually a number) refers specifically to people of a specific gender (or +i for non-binary), e.g. #affected +f for the number of female people affected.
+start +end
  The value refers to the beginning or end of a time period, e.g. #date +end for the end date of an activity.

3.1.3. Attributes for languages

Changed in version 1.1 to add “+i_” before language codes.

Humanitarian crises often take place in multicultural areas, where different local groups speak different languages; furthermore, international responders helping with a crisis may need to work in their own languages as well. As a result, humanitarian datasets are sometimes multilingual, listing the same information in e.g. French and Arabic, or Dari and Pashto.

To make it easy to identify languages in HXL, the standard recommends attribute names beginning with “+i_” followed by the two-character ISO 639-1 language code, such as +i_en for English. Data providers can use the attributes to mark the language of a column:

PROJECT TITLE       TITRE DU PROJET
#activity +i_en     #activity +i_fr
Malaria treatments  Traitement du paludisme
Teacher training    Formation des enseignant(e)s

The following language attributes (not a comprehensive list) are examples of those that might appear in international humanitarian datasets:

+i_en  English    +i_fa  Dari / Farsi / Persian
+i_fr  French     +i_ps  Pashto
+i_ar  Arabic     +i_ms  Malay
+i_es  Spanish    +i_ur  Urdu
+i_ru  Russian    +i_tl  Tagalog

Compatibility note: for best interoperability with older data, HXL-aware software should also accept language codes not prefixed by “+i_” such as +fr for French or +ar for Arabic. However, these are deprecated, and you should not use them in new HXL-hashtagged datasets.
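
One way software might honour both forms is sketched below in Python. The function name and the whitelist of legacy codes are assumptions made for illustration. A whitelist is used because a bare two-letter attribute is ambiguous: +id, for example, is a common identifier attribute but is also the ISO 639-1 code for Indonesian.

```python
# Assumption: a small, illustrative whitelist of legacy (HXL 1.0) bare language codes.
LEGACY_LANGUAGE_CODES = {"en", "fr", "ar", "es", "ru", "fa", "ps", "ms", "ur", "tl"}


def language_of(attributes):
    """Return the ISO 639-1 code declared by a collection of attributes, or None."""
    names = [attribute.lower().lstrip("+") for attribute in attributes]
    for name in names:
        # Preferred HXL 1.1 form: "i_" followed by a two-letter language code.
        if len(name) == 4 and name.startswith("i_") and name[2:].isalpha():
            return name[2:]
    for name in names:
        # Deprecated HXL 1.0 form: a bare language code such as +fr.
        if name in LEGACY_LANGUAGE_CODES:
            return name
    return None


current = language_of(["+name", "+i_fr"])
legacy = language_of(["+name", "+ar"])
```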

3.1.4. Attributes for controlled vocabularies

New in version 1.1.

While the +code HXL attribute indicates that a value is a machine-readable code of some kind, it does not tell exactly what vocabulary, code list, or taxonomy is in use. Beginning with version 1.1, any HXL attribute beginning with “+v_” represents a short identifier for a controlled vocabulary, which you can look up (if desired) from a master HXL dataset available at https://data.humdata.org/dataset/hxl-master-vocabulary-list.

For example, the HXL hashtag specification #country +code +v_iso3 indicates that the column contains country codes from the “v_iso3” vocabulary. To find more information about that vocabulary, software (or a human reader) may look it up in the master dataset, and find information like this:

Attribute Vocabulary name Controlling org Home page
#vocab +att #vocab +name #org +controlling #vocab +url +home
+v_iso3 ISO 3166-1 alpha 3: Codes for the representation of names of countries and their subdivisions — Part 1: Country codes (3-letter identifiers) International Organization for Standardization (ISO) https://www.iso.org/standard/63545.html

The #vocab hashtag is specific to this purpose, and not part of the core HXL hashtags. The maintainers will be adding additional columns to the dataset in the future, but these four core columns, at a minimum, should always be present.

Vocabulary attributes are hints to HXL-enabled software to help with processing and validation, but do not mandate any specific processing model or changes to the underlying HXL data.
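
A lookup of this kind could be sketched in Python as follows. The sketch assumes the master dataset has already been downloaded and loaded as an array of arrays with the hashtag row first. The single entry below is copied from the example above with the vocabulary name shortened, and the function name is hypothetical.

```python
# Assumption: stand-in for the master dataset, holding only the example entry.
VOCAB_ROWS = [
    ["#vocab +att", "#vocab +name", "#org +controlling", "#vocab +url +home"],
    [
        "+v_iso3",
        "ISO 3166-1 alpha 3",  # name shortened for this sketch
        "International Organization for Standardization (ISO)",
        "https://www.iso.org/standard/63545.html",
    ],
]


def vocabularies_for(tagspec, vocab_rows):
    """Return the registry entries for every +v_ attribute in a hashtag specification."""
    hashtags, *entries = vocab_rows
    att_column = hashtags.index("#vocab +att")
    registry = {entry[att_column]: dict(zip(hashtags, entry)) for entry in entries}
    attributes = ["+" + name for name in "".join(tagspec.lower().split()).split("+")[1:]]
    # Unregistered +v_ attributes are skipped: vocabulary attributes are only hints.
    return [registry[a] for a in attributes if a.startswith("+v_") and a in registry]


matches = vocabularies_for("#country +code +v_iso3", VOCAB_ROWS)
```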

To request registering a new vocabulary identifier, please post a message to the public hxlproject@googlegroups.com mailing list for discussion.

3.2. Creating extension hashtags

Updated in version 1.1.

The HXL core hashtags include hashtags that will be generally applicable to many humanitarian datasets, but it is impossible to anticipate every hashtag for every humanitarian need. This standard makes the following five recommendations for extending HXL hashtags and attributes:

  1. Whenever possible, take an existing hashtag with a broader meaning, and narrow it down with an attribute, e.g. #loc +hospital.
  2. When tagging similar types of information from different controlled vocabularies, use the “+v_” attributes to make distinctions, e.g. #sector +code +v_ocha_clusters.
  3. When there is no applicable core hashtag, begin an extension hashtag with “x_”, so that it will not conflict with any future HXL core hashtags, e.g. #x_toxicity (it is not necessary to use “x_” for extension attributes).
  4. When software finds a HXL hashtag that it does not recognise, e.g. #x_toxicity, it should simply ignore the column of data.
  5. When software finds a HXL attribute that it does not recognise, e.g. #loc +hospital, it should ignore the attribute but still process the hashtag and any other attributes it does recognise: in this case, it should behave as if the dataset had contained simply #loc.

Note: software designers may choose to warn about unrecognised hashtags and attributes, to help with error detection and quality control. However, HXL-enabled software should never reject a dataset because of an unrecognised hashtag or attribute, as long as it is able to process the remaining data in a meaningful way.
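
Recommendations 4 and 5, together with the note above, might be implemented along the lines of this Python sketch. The two sets of known names are tiny illustrative subsets, not the real dictionary, and the function name is hypothetical. Warnings are collected rather than raised, so that an unrecognised name never causes the dataset to be rejected.

```python
# Assumption: illustrative subsets only; real software would load the full dictionary.
KNOWN_HASHTAGS = {"loc", "org", "affected", "adm1", "sector"}
KNOWN_ATTRIBUTES = {"name", "code", "funder", "f", "m", "i"}


def interpret(hashtag, attributes):
    """Return (hashtag or None, recognised attributes, warnings) for one column."""
    warnings = []
    if hashtag not in KNOWN_HASHTAGS:
        # Recommendation 4: ignore the whole column.
        warnings.append(f"ignoring column with unrecognised hashtag #{hashtag}")
        return None, set(), warnings
    recognised = set()
    for attribute in sorted(attributes):
        if attribute in KNOWN_ATTRIBUTES:
            recognised.add(attribute)
        else:
            # Recommendation 5: drop the attribute but keep processing the hashtag.
            warnings.append(f"ignoring unrecognised attribute +{attribute} on #{hashtag}")
    return hashtag, recognised, warnings


hospital = interpret("loc", {"hospital", "name"})
toxicity = interpret("x_toxicity", set())
```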

4. Special cases

This section describes how to use HXL to deal with special cases that do not normally fit well into a tabular data model.

4.1. Repeating fields

A tabular format works poorly for repeated fields (e.g. an activity taking place in more than one location); however, using HXL hashtags, it is possible to design a spreadsheet format that allows for a fixed upper-limit of repetition in multiple columns (e.g. up to five organisations or sectors), or that allows multiple values in a single field.

4.1.1. Multiple columns

For example, the HXL hashtag for a generic geographical code (like a P-code) is #loc +code. A 3W spreadsheet for a specific country could allow room for up to three geocodes, like this:

P-CODE 1    P-CODE 2    P-CODE 3
#loc +code  #loc +code  #loc +code
020503
060107      060108
173219
530012
530013      530015      279333

This approach is not currently possible with the Array of objects JSON style, since any key may appear only once in a JSON object. Using the Array of arrays JSON encoding, the above example would look like this:

[
  ["P-CODE 1", "P-CODE 2", "P-CODE 3"],
  ["#loc +code", "#loc +code", "#loc +code"],
  ["020503"],
  ["060107", "060108"],
  ["173219"],
  ["530012"],
  ["530013", "530015", "279333"]
]

By reading the HXL hashtag, processing software can easily recognise that the three columns represent (up to) three values for the same field, even though the full column titles differ, and even if the authors of the processing software knew nothing about the specific conventions in use in this country.
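
A Python sketch of that processing step follows. It assumes the array of arrays layout with exactly one text header row above the hashtag row, and it matches hashtag specifications by exact string comparison for brevity. The function name is hypothetical.

```python
def values_for(rows, tagspec):
    """Collect, for each data row, the non-empty values from every column tagged tagspec."""
    headers, hashtags, *data = rows
    columns = [i for i, tag in enumerate(hashtags) if tag == tagspec]
    result = []
    for row in data:
        # Rows may be shorter than the hashtag row, so guard the index.
        result.append([row[i] for i in columns if i < len(row) and row[i] not in ("", None)])
    return result


rows = [
    ["P-CODE 1", "P-CODE 2", "P-CODE 3"],
    ["#loc +code", "#loc +code", "#loc +code"],
    ["020503"],
    ["060107", "060108"],
    ["173219"],
    ["530012"],
    ["530013", "530015", "279333"],
]
codes = values_for(rows, "#loc +code")
```

The text headers ("P-CODE 1" and so on) are never consulted: the grouping relies on the hashtag row alone.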

4.1.2. The +list attribute

New in version 1.1.

As an alternative, HXL also supports the inclusion of multiple values in a single field, such as a spreadsheet cell, separated by a comma (or optionally, other punctuation). Note that this method—while popular with spreadsheet users—is not as reliable as using separate columns, and will not allow you to use standard spreadsheet functions for filtering and sorting while creating data. It may also not be as well-supported by HXL-aware tools, such as mapping and other visualisation services.

To include multiple values in a single cell, add the +list attribute to the hashtag. Here is the earlier example recast with multiple values in a single cell:

P-CODE
#loc +code +list
020503
060107, 060108
173219
530012
530013, 530015, 279333

This is the only approach currently possible for representing repeating fields using the Array of objects JSON style:

[
  {
    "#loc+code+list": "020503"
  },
  {
    "#loc+code+list": "060107,060108"
  },
  {
    "#loc+code+list": "173219"
  },
  {
    "#loc+code+list": "530012"
  },
  {
    "#loc+code+list": "530013,530015,279333"
  }
]

With the Array of arrays JSON style, the above example would look much like the spreadsheet table:

[
  ["#loc +code +list"],
  ["020503"],
  ["060107,060108"],
  ["173219"],
  ["530012"],
  ["530013,530015,279333"]
]

Some HXL-aware software might not process this approach correctly, and using it may affect the display of reports, charts, and maps. The +list attribute is nothing more than a hint to software that fields in a column may contain lists of values inline; it does not mandate any approach to parsing those lists or to using their values.
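
Since the standard leaves parsing open, the following Python sketch shows only one possible approach: split on a separator (a comma by default) and trim surrounding whitespace. The function name is hypothetical, and the sketch does not handle values that themselves contain the separator.

```python
def split_list(value, separator=","):
    """Split the contents of a +list cell into trimmed, non-empty items."""
    return [item.strip() for item in str(value).split(separator) if item.strip()]


cells = ["020503", "060107, 060108", "173219", "530012", "530013,530015,279333"]
parsed = [split_list(cell) for cell in cells]
```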

4.2. “Wide” (series) data

“Wide” datasets are optimised for reading rather than machine processing. They place a series of data (usually numbers) across a row, showing how information varies over time, geographical area, demographic groups, or some other criterion. Here is a simple, non-HXL example listing the number of people of concern in each region during four different years.

REGION             2008  2009  2010  2011
Coast District     0     30    100   250
Mountain District  15    75    30    45

This table presents some special challenges for tagging, because the columns headed “2008” to “2011” all represent the same kind of data, but in different years. Tagging them all as simply #affected loses important information about the series. The solution is to use an extra attribute, +label, to specify that the information in the header is a label for the data series:

REGION             2008              2009              2010              2011
#adm1 +name        #affected +label  #affected +label  #affected +label  #affected +label
Coast District     0                 30                100               250
Mountain District  15                75                30                45

As with the +list attribute, +label is simply a hint to HXL software that this convention is in use. It does not mandate any specific processing model.
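
One processing model that software could choose is to unpivot the series into one record per label, as in the Python sketch below. The function names and the output keys "label" and "value" are assumptions for illustration, and the sketch assumes exactly one text header row above the hashtag row.

```python
def attributes_of(tagspec):
    """Return the lowercased attribute names of a hashtag specification."""
    return "".join(tagspec.lower().split()).split("+")[1:]


def wide_to_long(rows):
    """Turn each wide row into one record per +label column."""
    headers, hashtags, *data = rows
    series = [i for i, tag in enumerate(hashtags) if "label" in attributes_of(tag)]
    fixed = [i for i in range(len(hashtags)) if i not in series]
    long_rows = []
    for row in data:
        for i in series:
            # Columns without +label are repeated on every output record.
            record = {hashtags[j]: row[j] for j in fixed}
            record["label"] = headers[i]  # the text header supplies the series label
            record["value"] = row[i]
            long_rows.append(record)
    return long_rows


rows = [
    ["REGION", "2008", "2009", "2010", "2011"],
    ["#adm1 +name", "#affected +label", "#affected +label", "#affected +label", "#affected +label"],
    ["Coast District", 0, 30, 100, 250],
    ["Mountain District", 15, 75, 30, 45],
]
long_rows = wide_to_long(rows)
```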

Appendix A: Changes from previous versions

Major changes from 1.0 final to 1.1 beta:

  • Added JSON encodings.
  • Added “+v_” vocabulary attributes.
  • Prefix language attributes with “+i_” so that +fr in HXL 1.0 becomes +i_fr in HXL 1.1.
  • Added +list attribute and convention for multiple values in a single cell.
  • Reserved all attributes beginning with a single letter followed by underscore for future use.

Changes from 1.0 beta to 1.0 final:

  • Explicitly state that the text of the standard is released into the public domain.
  • Many minor copy-editing and text-formatting changes.

Changes from 1.0 alpha to 1.0 beta:

  • Removed compact disaggregated syntax and language extensions.
  • Added hashtag attributes.
  • Added recommendations for language attributes.

Appendix B: Formal grammar of a HXL hashtag

The following Backus-Naur Form grammar, with regular expressions, defines the allowed content of a HXL hashtag (terminals are in uppercase):

<hxl-tagspec>           ::= <hashtag>
    | <hashtag> <attributes>
    | <hashtag> WHITESPACE <attributes>

<attributes>            ::= <attribute>
    | <attributes> <attribute>
    | <attributes> WHITESPACE <attribute>

<hashtag>               ::= "#" TOKEN

<attribute>             ::= "+" TOKEN

TOKEN                   ::= /[a-zA-Z][a-zA-Z0-9_]*/

WHITESPACE              ::= /[ \t\n\r]+/
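
For readers who prefer to experiment, the grammar can be approximated by a single regular expression, as in this Python sketch. The names TAGSPEC and is_valid_tagspec are hypothetical. The expression is a hand translation of the productions above; it does not allow leading or trailing whitespace, because the grammar does not.

```python
import re

TOKEN = r"[a-zA-Z][a-zA-Z0-9_]*"
WHITESPACE = r"[ \t\n\r]+"

# <hashtag>, then zero or more attributes, each optionally preceded by whitespace.
TAGSPEC = re.compile(rf"#{TOKEN}(?:(?:{WHITESPACE})?\+{TOKEN})*")


def is_valid_tagspec(text):
    """True if the whole string matches the <hxl-tagspec> production."""
    return TAGSPEC.fullmatch(text) is not None


valid = [s for s in ["#org", "#org +funder +code", "#adm1+name +i_fr"] if is_valid_tagspec(s)]
invalid = [s for s in ["org", "#1org", "#org+", "# org", "#org funder"] if not is_valid_tagspec(s)]
```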

Appendix C: Credits

HXL is a group effort of many people and organisations, including a wider community of over 100 members of the hxlproject@googlegroups.com public mailing list. CJ Hendrix (UN OCHA) was the founder of the HXL standards effort, and Carsten Keßler (Hunter College) was the original technical lead. Since 2013, Sarah Telford (UN OCHA) has been overall programme manager and David Megginson (UN OCHA) has served as standards lead and working group chair.

Generous funding for HXL research and development has come from UN OCHA, the Humanitarian Innovation Fund, the UK Department for International Development (DFID), the Paul G. Allen Family Foundation, the USAID Global Development Lab, and the Dutch Ministry of Foreign Affairs. HXL is now a workstream of the Centre for Humanitarian Data in The Hague.

The following people have been members of the HXL working group during the development of the 1.1 release: Aidan McGuire (UN OCHA), Andrej Verity (UN OCHA), CJ Hendrix (UN OCHA), David Megginson (UN OCHA), John Adams (UK DFID), Helen Campbell (British Red Cross; earlier UN OCHA), Jan Rapp (INSO), Guillaume Nanin (in a private capacity), John Crowley (IFRC; earlier World Bank), Justine Mackinnon, Laurent Pitoiset (UNHCR), Michael Rans (UN OCHA), Sara-Jayne Terp (ThoughtWorks; earlier Ushahidi), Simon Johnson (British Red Cross), and Wesley DeWitt (World Bank).

Past members of the HXL Working Group have included Albert Gembara (USAID), Andrew Alspach (UNHCR), Gavin Wood (UNICEF), Ivan Vukovic (IOM), Lauren Burns (Save the Children), Maurizio Blasilli (WFP), Muhammad Rizki (IOM), and Paul Currion (Humanitarian Innovation Fund). Their contributions to previous releases of the HXL standard support this release as well.