Historical & statistical data (item.historical)

This module contains the code that implements the iTEM Open Data project, the broader aims of which are described on the main iTEM website.

See also the Glossary, which gives precise terminology used on this page.

Sources

These are listed in sources.yaml, loaded as SOURCES, from the iTEM metadata repository.

Input data is retrieved via the OpenKAPSARC and SDMX APIs, according to the type supported by each data source. See item.remote.

Processing

The general function process() applies common cleaning steps to each dataset, while loading and making use of dataset-specific checks, processing steps, and configuration from a submodule like T001, as listed in MODULES. See the documentation of process() for a detailed description of the steps.

Previously, input data sets were cleaned and transformed by IPython notebooks in the item/historical/scripts directory, as listed in SCRIPTS. The notebook name corresponds to the input data set which it handles, e.g. T001.ipynb.

Diagnostics

diagnostic contains two kinds of diagnostics, which are descriptions of part or all of a data set:

  • coverage concerns which areas (countries or regions), time periods, and measures are included, or not.

  • quality includes sanity checks, such as computed/derived statistics for data, and their comparison to reference values.

Automated diagnostics

These can be run using the CLI command ixmp historical diagnostic FOLDER. Output is produced in a new folder named FOLDER.

On our continuous integration infrastructure, these diagnostics are run automatically for every build and uploaded to cloud storage for reference.

Code reference

item.historical.REGION

Map from ISO 3166 alpha-3 code to region name.

item.historical.SOURCES

Contents of sources.yaml.

The current version of the file is always accessible at https://github.com/transportenergy/metadata/blob/master/historical/sources.yaml

item.historical.COUNTRY_NAME = {'Bosnia-Herzegovina': 'Bosnia and Herzegovina', 'Korea': 'Korea, Republic of', 'Montenegro, Republic of': 'Montenegro', 'Serbia, Republic of': 'Serbia', 'The former Yugoslav Republic of Macedonia': 'North Macedonia'}

Non-ISO names appearing in one or more data sets. iso_and_region() replaces these names before looking them up using pycountry.

item.historical.MODULES = {0: <module 'item.historical.scripts.T000' from '/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/historical/scripts/T000.py'>, 1: <module 'item.historical.scripts.T001' from '/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/historical/scripts/T001.py'>, 3: <module 'item.historical.scripts.T003' from '/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/historical/scripts/T003.py'>, 4: <module 'item.historical.scripts.T004' from '/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/historical/scripts/T004.py'>, 9: <module 'item.historical.scripts.T009' from '/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/historical/scripts/T009.py'>}

Submodules usable with process().

item.historical.OUTPUT_PATH = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/data/historical/output')

Path for output from process().

item.historical.SCRIPTS = ['T002', 'T005', 'T006', 'T007', 'T008']

List of data processing Jupyter/IPython notebooks.

item.historical.cache_results(id_str, df)

Write df to OUTPUT_PATH in two file formats.

The files written are:

  • id_str-clean.csv, in long or ‘programming-friendly’ format, i.e. with a ‘Year’ column.

  • id_str-clean-wide.csv, in wide or ‘user-friendly’ format, with one column per year.
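The difference between the two layouts can be illustrated with a minimal, standard-library sketch. The sample rows and the helper name long_to_wide are hypothetical; the documented function works on a DataFrame and writes CSV files, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical sample in long format: one row per (country, year) observation
long_rows = [
    {"Country": "Canada", "Year": 2000, "Value": 1.0},
    {"Country": "Canada", "Year": 2001, "Value": 1.5},
    {"Country": "Mexico", "Year": 2000, "Value": 2.0},
]


def long_to_wide(rows):
    """Pivot long rows so that each year becomes its own column."""
    years = sorted({r["Year"] for r in rows})
    by_country = defaultdict(dict)
    for r in rows:
        by_country[r["Country"]][r["Year"]] = r["Value"]
    # Observations missing for a given year are left as None
    return [
        {"Country": c, **{str(y): values.get(y) for y in years}}
        for c, values in sorted(by_country.items())
    ]


wide_rows = long_to_wide(long_rows)
```

The long layout is easier to filter and group in code, while the wide layout is easier to scan by eye in a spreadsheet.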

item.historical.fetch_source(id, use_cache=True)

Fetch data from source id.

The remote data is fetched using the API for the particular source. A network connection is required.

Parameters

use_cache (bool, optional) – If True (the default), use a cached local file, if available. No check of cache validity is performed.
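The caching behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: fetch_with_cache and fake_fetch are hypothetical names, and the real function calls the API of the particular source rather than a local stand-in.

```python
import tempfile
from pathlib import Path


def fetch_with_cache(cache_path, fetch, use_cache=True):
    """Return cached text if allowed and present; otherwise fetch and store it.

    `fetch` is a hypothetical zero-argument callable standing in for the
    source-specific API request.
    """
    cache_path = Path(cache_path)
    if use_cache and cache_path.exists():
        # No validity check: the cached file is trusted as-is
        return cache_path.read_text()
    text = fetch()
    cache_path.write_text(text)
    return text


calls = []


def fake_fetch():
    """Stand-in for a network request; records each invocation."""
    calls.append(1)
    return "Country,Year,Value\n"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "T000.csv"
    first = fetch_with_cache(path, fake_fetch)   # fetches and writes the cache
    second = fetch_with_cache(path, fake_fetch)  # served from the cache
```

Because the cache is never validated, a stale file is returned until it is deleted or use_cache=False is passed.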

item.historical.input_file(id: int)

Return the path to a cached, raw input data file for data source id.

CSV files are located in the ‘historical input’ data path. If more than one file has a name beginning with “T{id}”, the last sorted file is returned.
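The "last sorted file" rule can be sketched as below. The helper name last_matching_file, the three-digit zero padding (taken from names such as T001), and the sample file names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def last_matching_file(directory, id):
    """Return the last path, sorted by name, whose name begins with T{id}."""
    # Assumption: IDs are zero-padded to three digits, as in 'T001'
    candidates = sorted(
        Path(directory).glob(f"T{id:03d}*.csv"), key=lambda p: p.name
    )
    if not candidates:
        raise FileNotFoundError(f"no input file for data source {id}")
    return candidates[-1]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; two of them match data source 1
    for name in ("T001.csv", "T001_2020-01-01.csv", "T002.csv"):
        (Path(tmp) / name).touch()
    chosen = last_matching_file(tmp, 1).name
```

With date-stamped file names, choosing the last sorted name selects the most recent download.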

item.historical.iso_and_region(name)

Return (ISO 3166 alpha-3 code, iTEM region) for a country name.

Parameters

name (str) – Country name. This is looked up in the pycountry ‘name’, ‘official_name’, or ‘common_name’ field. Replacements from COUNTRY_NAME are applied.
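A minimal sketch of the replace-then-look-up sequence follows. To stay self-contained it does not call pycountry; ALPHA_3 and the region labels are hypothetical stand-ins for the pycountry database and for REGION, and iso_and_region_sketch is not the package function.

```python
# Replacements copied from COUNTRY_NAME (subset)
COUNTRY_NAME = {
    "Korea": "Korea, Republic of",
    "Bosnia-Herzegovina": "Bosnia and Herzegovina",
}

# Hypothetical stand-in for the pycountry lookup by name
ALPHA_3 = {"Korea, Republic of": "KOR", "Bosnia and Herzegovina": "BIH"}

# Placeholder region labels; the real REGION mapping is not reproduced here
REGION = {"KOR": "Example region A", "BIH": "Example region B"}


def iso_and_region_sketch(name):
    """Return (alpha-3 code, region) after applying name replacements."""
    name = COUNTRY_NAME.get(name, name)  # leave names without a replacement as-is
    code = ALPHA_3[name]
    return code, REGION[code]


result = iso_and_region_sketch("Korea")
```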

item.historical.process(id)

Process a data set given its id.

Performs the following common processing steps:

  1. Load the data from cache.

  2. Load a module defining dataset-specific processing steps. This module is in a file named e.g. T001.py.

  3. Call the dataset’s (optional) check() method. This method receives the input data frame as an argument, and can make one or more assertions to ensure the data is in the expected format.

  4. Drop columns in the dataset’s (optional) COLUMNS['drop'] list.

  5. Call the dataset-specific (required) process() method. This method receives the data frame from step (4), and performs any additional processing.

  6. Assign ISO 3166 alpha-3 codes and the iTEM region based on a column containing country names; either COLUMNS['country_name'] or the default, ‘Country’. See iso_and_region().

  7. Assign common dimensions from the dataset’s (optional) COMMON_DIMS dict.

  8. Order columns according to ColumnName.

  9. Output data to two files. See cache_results().

Parameters

id (int) – Data source id, as listed in MODULES. E.g. 0 imports data from file T000.csv.

Returns

DataFrame – In addition to writing the files described under cache_results(), returns the data set as a DataFrame.

Return type

pandas.DataFrame
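Steps 3, 4, 5, 7 and 8 of the list above can be sketched with plain Python data structures. This is an illustration only: run_common_steps, DIM_TO_COLUMN, the abbreviated COLUMN_ORDER and the example module are hypothetical, rows are dicts rather than a DataFrame, and loading, country lookup and file output are omitted.

```python
from types import SimpleNamespace

# Abbreviated column order; the full order is defined by ColumnName
COLUMN_ORDER = ["Source", "Country", "Variable", "Value", "Year"]

# Assumed mapping from COMMON_DIMS keys to output column names
DIM_TO_COLUMN = {"source": "Source", "variable": "Variable"}


def run_common_steps(rows, module):
    # Step 3: optional, dataset-specific check
    check = getattr(module, "check", None)
    if check is not None:
        check(rows)
    # Step 4: drop unwanted columns, if any are listed
    drop = set(getattr(module, "COLUMNS", {}).get("drop", []))
    rows = [{k: v for k, v in r.items() if k not in drop} for r in rows]
    # Step 5: required, dataset-specific processing
    rows = module.process(rows)
    # Step 7: assign dimensions that do not vary across the data set
    dims = getattr(module, "COMMON_DIMS", {})
    common = {DIM_TO_COLUMN[k]: v for k, v in dims.items()}
    rows = [{**r, **common} for r in rows]
    # Step 8: order columns
    return [{c: r[c] for c in COLUMN_ORDER if c in r} for r in rows]


def _check(rows):
    assert all("Value" in r for r in rows), "missing Value column"


# Hypothetical dataset-specific module
example = SimpleNamespace(
    COLUMNS={"drop": ["Flags"]},
    COMMON_DIMS={"source": "Example source", "variable": "Passenger Activity"},
    check=_check,
    process=lambda rows: [r for r in rows if r["Value"] is not None],
)

raw = [
    {"Country": "Canada", "Year": 2000, "Value": 1.0, "Flags": ""},
    {"Country": "Canada", "Year": 2001, "Value": None, "Flags": "M"},
]
result = run_common_steps(raw, example)
```

Keeping the optional pieces behind getattr() lets a dataset module define only what it needs, which matches the "(optional)" and "(required)" labels in the step list.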

item.historical.source_str(id)

Return the canonical string name (e.g. ‘T001’) for a data source.

Parameters

id (int or str) – Integer ID of the data source.
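A one-line sketch of this formatting, assuming three zero-padded digits as in ‘T001’. How the real function treats string input is not stated here; this hypothetical version simply converts digit strings to int.

```python
def source_str_sketch(id):
    """Format an integer-like ID as 'T' followed by three zero-padded digits."""
    return f"T{int(id):03d}"


names = [source_str_sketch(1), source_str_sketch("9")]
```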

Diagnostics for historical data sets.

item.historical.diagnostic.coverage(df, area='COUNTRY', measure='VARIABLE', period='TIME_PERIOD')

Return information about the coverage of a data set.
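The kind of information a coverage diagnostic might report can be sketched as follows. The return format of the real function is not documented on this page, so the dict keys, the helper name coverage_sketch and the sample rows are all assumptions; only the three keyword defaults come from the signature above.

```python
def coverage_sketch(rows, area="COUNTRY", measure="VARIABLE", period="TIME_PERIOD"):
    """Summarize which areas, measures and periods appear in `rows`."""
    periods = [r[period] for r in rows]
    return {
        "areas": len({r[area] for r in rows}),
        "measures": len({r[measure] for r in rows}),
        "first_period": min(periods),
        "last_period": max(periods),
    }


# Hypothetical observations
rows = [
    {"COUNTRY": "CAN", "VARIABLE": "Passenger Activity", "TIME_PERIOD": 2000},
    {"COUNTRY": "CAN", "VARIABLE": "Passenger Activity", "TIME_PERIOD": 2005},
    {"COUNTRY": "MEX", "VARIABLE": "Passenger Activity", "TIME_PERIOD": 2003},
]
summary = coverage_sketch(rows)
```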

item.historical.diagnostic.run_all(output_path)

Run all diagnostics.

class item.historical.scripts.util.managers.dataframe.ColumnName(value)

Bases: enum.Enum

Column names for processed historical data.

The order of definition below is the standard order.

SOURCE = 'Source'
COUNTRY = 'Country'
ISO_CODE = 'ISO Code'
ITEM_REGION = 'Region'
VARIABLE = 'Variable'
UNIT = 'Unit'
SERVICE = 'Service'
MODE = 'Mode'
VEHICLE_TYPE = 'Vehicle Type'
TECHNOLOGY = 'Technology'
FUEL = 'Fuel'
VALUE = 'Value'
YEAR = 'Year'
ID = 'ID'
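Because an Enum iterates in definition order, the standard column order can be derived directly from the class. The sketch below redefines the enumeration locally so that it runs on its own; order_columns is a hypothetical helper, and placing unknown names last is an assumption.

```python
from enum import Enum


class ColumnName(Enum):
    SOURCE = "Source"
    COUNTRY = "Country"
    ISO_CODE = "ISO Code"
    ITEM_REGION = "Region"
    VARIABLE = "Variable"
    UNIT = "Unit"
    SERVICE = "Service"
    MODE = "Mode"
    VEHICLE_TYPE = "Vehicle Type"
    TECHNOLOGY = "Technology"
    FUEL = "Fuel"
    VALUE = "Value"
    YEAR = "Year"
    ID = "ID"


# Iterating over an Enum yields members in definition order
STANDARD_ORDER = [member.value for member in ColumnName]


def order_columns(columns):
    """Sort column names into the standard order; unknown names go last."""
    rank = {name: i for i, name in enumerate(STANDARD_ORDER)}
    return sorted(columns, key=lambda c: rank.get(c, len(rank)))


ordered = order_columns(["Year", "Value", "Country", "Source"])
```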

Specific data sets

T000

Data cleaning code and configuration for T000.

T000:
  url: https://stats.oecd.org/index.aspx?queryid=79863
  name: "Passenger transport: Inland passenger transport"
  fetch:
    type: SDMX
    source: OECD
    resource_id: ITF_PASSENGER_TRANSPORT
    key: .T-PASS-TOT-INLD+T-PASS-RL-TOT+T-PASS-RD-TOT+T-PASS-RD-CAR+T-PASS-RD-BUS
    validate: false
  active: true
item.historical.scripts.T000.COLUMNS = {'drop': ['COUNTRY', 'VARIABLE', 'YEAR', 'Unit', 'Unit Code', 'PowerCode Code', 'PowerCode', 'Reference Period Code', 'Reference Period', 'Flag Codes', 'Flags']}

Columns to drop from the raw data.

item.historical.scripts.T000.COMMON_DIMS = {'fuel': 'All', 'service': 'Passenger', 'source': 'International Transport Forum', 'technology': 'All', 'unit': '10^9 passenger-km / yr', 'variable': 'Passenger Activity'}

Dimensions and attributes which do not vary across this data set.

item.historical.scripts.T000.mode_and_vehicle_type(variable_name)

Determine ‘mode’ and ‘vehicle type’ from ‘variable’.

The rules implemented are:

Variable                                        Mode   Vehicle type
----------------------------------------------  -----  ------------
Rail passenger transport                        Rail   All
Road passenger transport by buses and coaches   Road   Bus
Road passenger transport by passenger cars      Road   LDV
Total inland passenger transport                All    All
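The rules above amount to a lookup from variable name to a (mode, vehicle type) pair. A minimal sketch, assuming a plain dict lookup; the actual function may be implemented differently, and the names RULES and mode_and_vehicle_type_sketch are hypothetical.

```python
# (mode, vehicle type) for each variable name, as in the rules above
RULES = {
    "Rail passenger transport": ("Rail", "All"),
    "Road passenger transport by buses and coaches": ("Road", "Bus"),
    "Road passenger transport by passenger cars": ("Road", "LDV"),
    "Total inland passenger transport": ("All", "All"),
}


def mode_and_vehicle_type_sketch(variable_name):
    """Return (mode, vehicle type); raises KeyError for unknown variables."""
    return RULES[variable_name]


bus = mode_and_vehicle_type_sketch("Road passenger transport by buses and coaches")
```

Raising on unknown names, rather than returning a default, makes a new upstream variable visible immediately.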

item.historical.scripts.T000.process(df)

Process data set T000.

T001

Data cleaning code and configuration for T001.

This module:

  • Detects and corrects #32, a data error in the upstream source where China observation values for years 1990 to 2001 inclusive are too low by 2 orders of magnitude.
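The correction step alone can be sketched as below: values that are too low by two orders of magnitude are multiplied by 100. The column names, the helper name correct_china and the sample values are assumptions, and the detection step is omitted because its criteria are not described on this page.

```python
def correct_china(rows):
    """Multiply China values for 1990 to 2001 inclusive by 100."""
    fixed = []
    for r in rows:
        if r["ISO Code"] == "CHN" and 1990 <= r["Year"] <= 2001:
            r = {**r, "Value": r["Value"] * 100}  # copy; leave the input untouched
        fixed.append(r)
    return fixed


# Hypothetical observations
rows = [
    {"ISO Code": "CHN", "Year": 1995, "Value": 2.5},
    {"ISO Code": "CHN", "Year": 2002, "Value": 300.0},
    {"ISO Code": "USA", "Year": 1995, "Value": 40.0},
]
values = [r["Value"] for r in correct_china(rows)]
```

Applying such a scaling unconditionally would over-correct if the upstream source were ever fixed, which is presumably why the module detects the error before correcting it.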

T001:
  name: Coastal Transport
  fetch:
    type: SDMX
    source: OECD
    resource_id: ITF_GOODS_TRANSPORT
    key: .T-SEA-CAB
    validate: false
  active: true
item.historical.scripts.T001.COLUMNS = {'drop': ['COUNTRY', 'VARIABLE', 'YEAR', 'Flag Codes', 'Flags', 'PowerCode Code', 'PowerCode', 'Reference Period Code', 'Reference Period', 'Unit Code', 'Unit']}

Columns to drop from the raw data.

item.historical.scripts.T001.COMMON_DIMS = {'fuel': 'All', 'mode': 'Shipping', 'service': 'Freight', 'source': 'International Transport Forum', 'technology': 'All', 'variable': 'Freight Activity', 'vehicle_type': 'Coastal'}

Dimensions and attributes which do not vary across this data set.

item.historical.scripts.T001.check(df)

Check data set T001.

item.historical.scripts.T001.process(df)

Process data set T001.

  • Drop null values.

  • Convert from Mt km / year to Gt km / year.
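Both steps can be sketched on plain dict rows. Since 1 Gt equals 1000 Mt, the conversion divides by 1000. The "Value" and "Unit" keys, the helper name and the sample values are assumptions; the real function operates on a DataFrame.

```python
def drop_nulls_and_convert(rows):
    """Drop rows without a value, then convert Mt km / year to Gt km / year."""
    return [
        {**r, "Value": r["Value"] / 1000, "Unit": "Gt km / year"}
        for r in rows
        if r["Value"] is not None
    ]


# Hypothetical observations in Mt km / year
converted = drop_nulls_and_convert([{"Value": 1500.0}, {"Value": None}])
```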

T003

Data cleaning code and configuration for T003.

The input data contains the variable names in VARIABLE_MAP. A new series with mode="Inland ex. pipeline" is computed as the sum of the variables in PARTIAL, i.e. excluding “Pipelines transport”.

T003:
  name: Inland Freight Transport
  fetch:
    type: SDMX
    source: OECD
    resource_id: ITF_GOODS_TRANSPORT
    key: .T-GOODS-TOT-INLD+T-GOODS-RL-TOT+T-GOODS-RD-TOT+T-GOODS-RD-REW+T-GOODS-RD-OWN+T-GOODS-IW-TOT+T-GOODS-PP-TOT
    validate: false
  active: true
item.historical.scripts.T003.COLUMNS = {'drop': ['COUNTRY', 'VARIABLE', 'YEAR', 'Flag Codes', 'Flags', 'PowerCode', 'PowerCode Code', 'Reference Period Code', 'Reference Period', 'Unit Code', 'Unit']}

Columns to drop from the raw data.

item.historical.scripts.T003.COMMON_DIMS = {'fuel': 'All', 'source': 'International Transport Forum', 'technology': 'All', 'unit': 'Gt km / year'}

Dimensions and attributes which do not vary across this data set.

item.historical.scripts.T003.PARTIAL = ['Rail freight transport', 'Road freight transport', 'Inland waterways freight transport']

Variables to include in a partial sum.

item.historical.scripts.T003.VARIABLE_MAP = {'Inland waterways freight transport': {'mode': 'Shipping', 'vehicle_type': 'Inland'}, 'Pipelines transport': {'mode': 'Pipeline', 'vehicle_type': 'Pipeline'}, 'Rail freight transport': {'mode': 'Rail', 'vehicle_type': 'All'}, 'Road freight transport': {'mode': 'Road', 'vehicle_type': 'All'}, 'Road freight transport for hire and reward': {'mode': 'Road', 'vehicle_type': 'For Hire and Reward'}, 'Road freight transport on own account': {'mode': 'Road', 'vehicle_type': 'For Own Account'}, 'Total inland freight transport': {'mode': 'Inland', 'vehicle_type': 'All'}}

Mapping from Variable to mode and vehicle_type dimensions.
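The partial sum over the variables in PARTIAL can be sketched as follows. Grouping by country and year, the dict-based rows, the helper name partial_sum and the sample values are assumptions made for illustration.

```python
from collections import defaultdict

# Copied from PARTIAL above
PARTIAL = [
    "Rail freight transport",
    "Road freight transport",
    "Inland waterways freight transport",
]


def partial_sum(rows):
    """Sum the PARTIAL variables for each (country, year) pair."""
    totals = defaultdict(float)
    for r in rows:
        if r["Variable"] in PARTIAL:
            totals[(r["Country"], r["Year"])] += r["Value"]
    return [
        {"Country": c, "Year": y, "Mode": "Inland ex. pipeline", "Value": v}
        for (c, y), v in sorted(totals.items())
    ]


# Hypothetical observations for one country and year
rows = [
    {"Country": "Canada", "Year": 2010, "Variable": "Rail freight transport", "Value": 10.0},
    {"Country": "Canada", "Year": 2010, "Variable": "Road freight transport", "Value": 20.0},
    {"Country": "Canada", "Year": 2010, "Variable": "Inland waterways freight transport", "Value": 5.0},
    {"Country": "Canada", "Year": 2010, "Variable": "Pipelines transport", "Value": 7.0},
]
sums = partial_sum(rows)
```

Road sub-categories such as "for hire and reward" are not in PARTIAL, so they are not double-counted alongside the road total.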

T004

Data cleaning code and configuration for T004.

Notes:

  • The input data do not express the units, which are single vehicles.

Todo

  • The input data have labels like “- LPG” in the “Fuel type” column, with the hyphen possibly indicating a hierarchical code list. Find a reference to this code list.

  • The code currently uses some inconsistent labels, such as:

    • “Liquid-Bio” (no spaces) vs. “Liquid - Fossil” (spaces).

    • “Natural Gas Vehicle” vs. “Conventional” (word “Vehicle” is omitted).

    Fix these after PR #62 is merged by using code lists for these dimensions.

  • Add code to fetch this source automatically. It does not have a clearly-defined API.

  • Capture and preserve the metadata provided by the UNECE data interface.

T004:
  url: https://w3.unece.org/PXWeb2015/pxweb/en/STAT/STAT__40-TRTRANS__03-TRRoadFleet/08_en_TRRoadNewVehF_r.px/?rxid=674effaa-3926-4d2e-9d6d-abfd7dd196b8
  name: New Road Vehicle Registrations by Vehicle Category and Fuel Type
  active: true
item.historical.scripts.T004.COLUMNS = {'drop': ['Frequency']}

Columns to drop from the raw data.

item.historical.scripts.T004.COMMON_DIMS = {'mode': 'Road', 'source': 'UNECE', 'unit': 'vehicle', 'variable': 'Sales (New Vehicles)'}

Dimensions and attributes which do not vary across this data set.

item.historical.scripts.T004.CSV_SEP = ';'

Separator character for pandas.read_csv().

item.historical.scripts.T004.MAP = {'Fuel type': {'- Bi-fuel vehicles': ('Conventional', 'Liquid-Bio'), '- Biodiesel': ('Conventional', 'Liquid - Fossil'), '- Bioethanol': ('Conventional', 'Liquid-Bio'), '- Compressed natural gas (CNG)': ('Natural Gas Vehicle', 'Natural Gas'), '- Diesel (excluding hybrids)': ('Conventional', 'Liquid - Fossil'), '- Electricity': ('BEV', 'Electricity'), '- Hybrid electric-diesel': ('Conventional', 'Liquid - Fossil'), '- Hybrid electric-petrol': ('Conventional', 'Liquid - Fossil'), '- Hydrogen and fuel cells': ('Fuel Cell', 'Hydrogen'), '- LPG': ('Natural Gas Vehicle', 'Natural Gas'), '- Liquefied natural gas (LNG)': ('Natural Gas Vehicle', 'Natural Gas'), '- Petrol (excluding hybrids)': ('Conventional', 'Liquid - Fossil'), '- Plug-in hybrid diesel-electric': ('PHEV', 'Electricity'), '- Plug-in hybrid petrol-electric': ('PHEV', 'Electricity'), 'Alternative (total)': ('Alternative', 'Alternative'), 'Diesel': ('Conventional', 'Liquid - Fossil'), 'Petrol': ('Conventional', 'Liquid - Fossil'), 'Total': ('All', 'All'), '_dims': ('TECHNOLOGY', 'FUEL')}, 'Type of vehicle': {'New light goods vehicles': ('Freight', 'Light Truck'), 'New lorries (vehicle wt over 3500 kg)': ('Freight', 'Heavy Truck'), 'New motor coaches, buses and trolley buses': ('Freight', 'Bus'), 'New passenger cars': ('Passenger', 'LDV'), 'New road tractors': ('Freight', 'Medium Truck'), '_dims': ('SERVICE', 'VEHICLE')}}

Mapping between existing values and values to be assigned.

item.historical.scripts.T004.map_column(value, column)

Apply mapping to value in column.
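One way to read the MAP structure is that each column's ‘_dims’ entry names the target dimensions and every other entry supplies the values for them. The sketch below follows that reading with a subset of MAP; the dict return type and the name map_column_sketch are assumptions, as the real function's return value is not documented on this page.

```python
# Subset of MAP, copied from above
MAP = {
    "Type of vehicle": {
        "New passenger cars": ("Passenger", "LDV"),
        "New light goods vehicles": ("Freight", "Light Truck"),
        "_dims": ("SERVICE", "VEHICLE"),
    },
}


def map_column_sketch(value, column):
    """Pair the dimension names for `column` with the values mapped from `value`."""
    mapping = MAP[column]
    return dict(zip(mapping["_dims"], mapping[value]))


mapped = map_column_sketch("New passenger cars", "Type of vehicle")
```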

T009

Data cleaning code and configuration for T009.

T009:
  name: Road Vehicle Fleet by Vehicle Category and Fuel Type
  fetch:
    type: OpenKAPSARC
    dataset_id: road-vehicle-fleet-by-vehicle-category-and-fuel-type
  url: https://datasource.kapsarc.org/explore/dataset/road-vehicle-fleet-by-vehicle-category-and-fuel-type/
item.historical.scripts.T009.FETCH = True

Fetch directly from the source, or cache.

item.historical.scripts.T009.service(value)

Determine ‘service’ dimension based on a vehicle type.

Quality diagnostics

A003

item.historical.diagnostic.A003.compute(activity, stock)

Quality diagnostic for freight load factor.

Returns the ratio of road freight traffic from T003 to the total number of freight vehicles from T009.
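The ratio can be sketched with dicts keyed by (country, year). The input types, the helper name load_factor_sketch and the sample numbers are hypothetical; how the real function aligns its two inputs is not described on this page.

```python
def load_factor_sketch(activity, stock):
    """Divide activity by stock for keys present in both inputs."""
    return {
        key: activity[key] / stock[key]
        for key in activity.keys() & stock.keys()
        if stock[key]  # skip zero stock to avoid division by zero
    }


# Hypothetical inputs: freight activity and vehicle counts
activity = {("CAN", 2010): 200.0, ("MEX", 2010): 90.0}
stock = {("CAN", 2010): 4.0}
ratios = load_factor_sketch(activity, stock)
```

Only keys found in both inputs produce a ratio, so a gap in either data set yields a missing value rather than a misleading one.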

Parameters