Historical & statistical data (item.historical)¶
This module contains the code that implements the iTEM Open Data project, the broader aims of which are described on the main iTEM website.
See also the Glossary, which gives precise terminology used on this page.
Sources¶
These are listed in sources.yaml, loaded as SOURCES, from the iTEM metadata repository.
Input data is retrieved via the OpenKAPSARC and SDMX APIs, according to the type supported by each data source. See item.remote.
Processing¶
The general function process() applies common cleaning steps to each dataset, while loading and making use of dataset-specific checks, processing steps, and configuration from a submodule like T001, as listed in MODULES.
See the documentation of process() for a detailed description of the steps.
Previously, input data sets were cleaned and transformed by IPython notebooks in the item/historical/scripts directory, as listed in SCRIPTS.
The notebook name corresponds to the input data set which it handles, e.g. T001.ipynb.
Diagnostics¶
diagnostic contains two kinds of diagnostics, which are descriptions of part or all of a data set:
- coverage concerns which areas (countries or regions), time periods, and measures are included, or not.
- quality includes sanity checks, such as computed/derived statistics for data, and their comparison to reference values.
Automated diagnostics¶
These can be run using the CLI command ixmp historical diagnostic FOLDER.
Output is produced in a new folder named FOLDER.
On our continuous integration infrastructure, for every build, these diagnostics are run automatically and uploaded to cloud storage for reference. For instance:
For GitHub pull request #23, the Travis CI service produced build number 313.
The uploaded diagnostics from this build are available at: https://storage.googleapis.com/historical-data-ci.transportenergy.org/313.1/index.html
Code reference¶
-
item.historical.
REGION
¶ Map from ISO 3166 alpha-3 code to region name.
-
item.historical.
SOURCES
← contents of sources.yaml¶ The current version of the file is always accessible at https://github.com/transportenergy/metadata/blob/master/historical/sources.yaml
-
item.historical.
COUNTRY_NAME
= {'Bosnia-Herzegovina': 'Bosnia and Herzegovina', 'Korea': 'Korea, Republic of', 'Montenegro, Republic of': 'Montenegro', 'Serbia, Republic of': 'Serbia', 'The former Yugoslav Republic of Macedonia': 'North Macedonia'}¶ Non-ISO names appearing in 1 or more data sets. These are used in
iso_and_region()
to replace names before they are looked up using pycountry.
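The replacement step can be pictured as a plain dictionary lookup that falls back to the original name. This is a minimal sketch, not the package's code: `normalize_name` is a hypothetical helper, the mapping is a two-entry subset of COUNTRY_NAME, and the subsequent pycountry lookup is left out.

```python
# Two entries copied from COUNTRY_NAME above; the real mapping is longer.
COUNTRY_NAME = {
    "Bosnia-Herzegovina": "Bosnia and Herzegovina",
    "Korea": "Korea, Republic of",
}


def normalize_name(name: str) -> str:
    """Hypothetical helper: replace a known non-ISO name, else return it unchanged."""
    return COUNTRY_NAME.get(name, name)


print(normalize_name("Korea"))   # a name that needs replacing
print(normalize_name("France"))  # a name that is passed through
```

Names that are not keys of the mapping pass through untouched, so the helper is safe to apply to every row before the lookup.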
-
item.historical.
MODULES
= {0: <module 'item.historical.scripts.T000' from '/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/historical/scripts/T000.py'>, 1: <module 'item.historical.scripts.T001' from '/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/historical/scripts/T001.py'>, 3: <module 'item.historical.scripts.T003' from '/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/historical/scripts/T003.py'>, 4: <module 'item.historical.scripts.T004' from '/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/historical/scripts/T004.py'>, 9: <module 'item.historical.scripts.T009' from '/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/historical/scripts/T009.py'>}¶ Submodules usable with
process()
.
-
item.historical.
OUTPUT_PATH
= PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/transportenergy/envs/feature-sdmx-dsd/lib/python3.7/site-packages/item/data/historical/output')¶ Path for output from
process()
.
-
item.historical.
SCRIPTS
= ['T002', 'T005', 'T006', 'T007', 'T008']¶ List of data processing Jupyter/IPython notebooks.
-
item.historical.
cache_results
(id_str, df)[source]¶ Write df to
OUTPUT_PATH
in two file formats. The files written are:
- id_str-clean.csv, in long or ‘programming-friendly’ format, i.e. with a ‘Year’ column.
- id_str-clean-wide.csv, in wide or ‘user-friendly’ format, with one column per year.
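To make the difference between the two layouts concrete, the following sketch pivots long-format rows into wide-format rows using only the standard library. The function `long_to_wide` and the sample rows are illustrative assumptions; the package itself may well rely on pandas for this.

```python
def long_to_wide(rows, index=("Country",), year="Year", value="Value"):
    """Pivot long rows (one row per year) into wide rows (one column per year)."""
    wide = {}
    for row in rows:
        key = tuple(row[c] for c in index)
        # Start each output row with its index columns, then add one column per year.
        out = wide.setdefault(key, dict(zip(index, key)))
        out[str(row[year])] = row[value]
    return list(wide.values())


# Made-up sample data in long format.
long_rows = [
    {"Country": "France", "Year": 2000, "Value": 1.0},
    {"Country": "France", "Year": 2001, "Value": 1.5},
    {"Country": "Japan", "Year": 2000, "Value": 2.0},
]
wide_rows = long_to_wide(long_rows)
print(wide_rows)
```

In the wide result, a country with no observation for a given year simply lacks that column; a real writer would need to decide how to fill such gaps.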
-
item.historical.
fetch_source
(id, use_cache=True)[source]¶ Fetch data from source id.
The remote data is fetched using the API for the particular source. A network connection is required.
- Parameters
use_cache (bool, optional) – If True (the default), use a cached local file, if available. No check of cache validity is performed.
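The caching behaviour described for use_cache can be sketched as follows. Everything here is a stand-in: `fetch_cached` and `fake_remote` are hypothetical names, the file naming is assumed, and no real API is contacted.

```python
from pathlib import Path
import tempfile

calls = []


def fake_remote(id):
    """Stand-in for a real API request; records each call."""
    calls.append(id)
    return "Country,Year,Value\nFrance,2000,1.0\n"


def fetch_cached(id, cache_dir, use_cache=True):
    """Return the path of the local file for source `id`, fetching if needed."""
    path = Path(cache_dir) / f"T{id:03d}.csv"  # assumed naming scheme
    if use_cache and path.exists():
        # The cached file is returned as-is; its validity is not checked.
        return path
    path.write_text(fake_remote(id))
    return path


with tempfile.TemporaryDirectory() as tmp:
    fetch_cached(1, tmp)                   # cache miss: remote call
    fetch_cached(1, tmp)                   # cache hit: no remote call
    fetch_cached(1, tmp, use_cache=False)  # forced refresh: remote call
    n_remote_calls = len(calls)

print(n_remote_calls)
```

Because validity is never checked, a stale cached file is only replaced when the caller passes use_cache=False.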
-
item.historical.
input_file
(id: int)[source]¶ Return the path to a cached, raw input data file for data source id.
CSV files are located in the ‘historical input’ data path. If more than one file has a name beginning with “T{id}”, the last sorted file is returned.
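The "last sorted file" rule can be illustrated with pathlib. This is a sketch under assumptions: `latest_input_file` is a hypothetical name, and the zero-padded prefix follows names such as T001 rather than anything stated about the lookup itself.

```python
from pathlib import Path
import tempfile


def latest_input_file(directory: Path, id: int) -> Path:
    """Return the last file, in sorted order, whose name begins with 'T{id:03d}'."""
    prefix = f"T{id:03d}"
    candidates = sorted(directory.glob(f"{prefix}*.csv"))
    if not candidates:
        raise FileNotFoundError(f"no input file for {prefix} in {directory}")
    return candidates[-1]


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # Two candidate files for source 1, plus an unrelated file.
    for name in ("T001.csv", "T001_2020.csv", "T002.csv"):
        (base / name).touch()
    chosen = latest_input_file(base, 1).name

print(chosen)
```

Sorting is lexical, so a date suffix only selects the newest file if it is written in a sortable form such as a four-digit year.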
-
item.historical.
iso_and_region
(name)[source]¶ Return (ISO 3166 alpha-3 code, iTEM region) for a country name.
- Parameters
name (str) – Country name. This is looked up in the pycountry ‘name’, ‘official_name’, or ‘common_name’ field. Replacements from
COUNTRY_NAME
are applied.
-
item.historical.
process
(id)[source]¶ Process a data set given its id.
Performs the following common processing steps:
1. Load the data from cache.
2. Load a module defining dataset-specific processing steps. This module is in a file named e.g. T001.py.
3. Call the dataset’s (optional) check() method. This method receives the input data frame as an argument, and can make one or more assertions to ensure the data is in the expected format.
4. Drop columns in the dataset’s (optional) COLUMNS['drop'] list.
5. Call the dataset-specific (required) process() method. This method receives the data frame from step (4), and performs any additional processing.
6. Assign ISO 3166 alpha-3 codes and the iTEM region based on a column containing country names; either COLUMNS['country_name'] or the default, ‘Country’. See iso_and_region().
7. Assign common dimensions from the dataset’s (optional) COMMON_DIMS dict.
8. Order columns according to ColumnName.
9. Output data to two files. See cache_results().
- Parameters
id (int) – Data source id, as listed in MODULES. E.g. 0 imports data from file T000.csv.
- Returns
DataFrame – In addition to writing the files described under cache_results(), returns the data set as a DataFrame.
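The middle of the pipeline (checking, dropping columns, dataset-specific processing, ISO codes, common dimensions, column order) can be sketched with plain dictionaries in place of a data frame. This is an illustration only: `run_steps`, `DIM_TO_COLUMN`, the `iso_lookup` argument, and the demo module are all assumptions, region assignment is omitted, and loading and writing files are left out.

```python
from types import SimpleNamespace

# Standard column order; a subset of ColumnName for brevity.
COLUMN_ORDER = ["Source", "Country", "ISO Code", "Variable", "Value", "Year"]
# Assumed mapping from COMMON_DIMS keys to column names.
DIM_TO_COLUMN = {"source": "Source", "variable": "Variable"}


def run_steps(rows, module, iso_lookup):
    """Apply the check, drop, process, ISO, common-dimension and ordering steps."""
    columns = getattr(module, "COLUMNS", {})
    # Optional dataset-specific check, which may raise AssertionError.
    if hasattr(module, "check"):
        module.check(rows)
    # Drop unwanted columns.
    drop = set(columns.get("drop", []))
    rows = [{k: v for k, v in r.items() if k not in drop} for r in rows]
    # Required dataset-specific processing.
    rows = module.process(rows)
    # Assign ISO codes from the country-name column.
    name_col = columns.get("country_name", "Country")
    for r in rows:
        r["ISO Code"] = iso_lookup[r[name_col]]
        # Assign dimensions that are constant across the data set.
        for dim, value in getattr(module, "COMMON_DIMS", {}).items():
            r[DIM_TO_COLUMN[dim]] = value
    # Put columns in the standard order.
    return [{c: r[c] for c in COLUMN_ORDER if c in r} for r in rows]


# Hypothetical dataset module with no check() and a pass-through process().
demo = SimpleNamespace(
    COLUMNS={"drop": ["Flags"]},
    COMMON_DIMS={"source": "Example Source", "variable": "Passenger Activity"},
    process=lambda rows: rows,
)
raw = [{"Country": "France", "Year": 2000, "Value": 1.0, "Flags": ""}]
result = run_steps(raw, demo, {"France": "FRA"})
print(result)
```

Keeping the optional pieces behind hasattr/getattr mirrors the description above, in which only process() is required of a dataset module.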
-
item.historical.
source_str
(id)[source]¶ Return the canonical string name (e.g. ‘T001’) for a data source.
Diagnostics for historical data sets.
-
item.historical.diagnostic.
coverage
(df, area='COUNTRY', measure='VARIABLE', period='TIME_PERIOD')[source]¶ Return information about the coverage of a data set.
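What coverage() actually returns is not specified here. As a rough idea of the kind of summary a coverage diagnostic might produce, the sketch below collects the distinct areas and measures and the span of periods from a list of rows. The function name `coverage_summary`, the output keys, and the sample rows are all assumptions; only the default column names are taken from the signature above.

```python
def coverage_summary(rows, area="COUNTRY", measure="VARIABLE", period="TIME_PERIOD"):
    """Summarize which areas, measures and periods appear in `rows`."""
    periods = sorted({r[period] for r in rows})
    return {
        "areas": sorted({r[area] for r in rows}),
        "measures": sorted({r[measure] for r in rows}),
        "first_period": periods[0],
        "last_period": periods[-1],
    }


# Made-up observations.
rows = [
    {"COUNTRY": "FRA", "VARIABLE": "T-PASS-RL-TOT", "TIME_PERIOD": 1995},
    {"COUNTRY": "JPN", "VARIABLE": "T-PASS-RL-TOT", "TIME_PERIOD": 2010},
    {"COUNTRY": "FRA", "VARIABLE": "T-PASS-RD-CAR", "TIME_PERIOD": 2000},
]
summary = coverage_summary(rows)
print(summary)
```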
-
class
item.historical.scripts.util.managers.dataframe.
ColumnName
(value)[source]¶ Bases:
enum.Enum
Column names for processed historical data.
The order of definition below is the standard order.
-
SOURCE
= 'Source'¶
-
COUNTRY
= 'Country'¶
-
ISO_CODE
= 'ISO Code'¶
-
ITEM_REGION
= 'Region'¶
-
VARIABLE
= 'Variable'¶
-
UNIT
= 'Unit'¶
-
SERVICE
= 'Service'¶
-
MODE
= 'Mode'¶
-
VEHICLE_TYPE
= 'Vehicle Type'¶
-
TECHNOLOGY
= 'Technology'¶
-
FUEL
= 'Fuel'¶
-
VALUE
= 'Value'¶
-
YEAR
= 'Year'¶
-
ID
= 'ID'¶
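Because an Enum iterates in definition order, the class can double as the standard column order. The sketch below uses an abbreviated copy of ColumnName and a hypothetical `order_columns` helper that works on a dictionary row instead of a data frame.

```python
from enum import Enum


class ColumnName(Enum):
    """Abbreviated copy of the enumeration above, for illustration."""

    SOURCE = "Source"
    COUNTRY = "Country"
    VALUE = "Value"
    YEAR = "Year"


def order_columns(row: dict) -> dict:
    """Return `row` with its known columns in the order ColumnName defines."""
    ordered = [c.value for c in ColumnName if c.value in row]
    return {name: row[name] for name in ordered}


row = order_columns({"Year": 2000, "Country": "France", "Value": 1.0})
print(list(row))
```

This helper drops columns that are not members of the enumeration; whether the real code keeps or drops them is not stated here.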
Specific data sets¶
T000¶
Data cleaning code and configuration for T000.
T000:
url: https://stats.oecd.org/index.aspx?queryid=79863
name: "Passenger transport: Inland passenger transport"
fetch:
type: SDMX
source: OECD
resource_id: ITF_PASSENGER_TRANSPORT
key: .T-PASS-TOT-INLD+T-PASS-RL-TOT+T-PASS-RD-TOT+T-PASS-RD-CAR+T-PASS-RD-BUS
validate: false
active: true
-
item.historical.scripts.T000.
COLUMNS
= {'drop': ['COUNTRY', 'VARIABLE', 'YEAR', 'Unit', 'Unit Code', 'PowerCode Code', 'PowerCode', 'Reference Period Code', 'Reference Period', 'Flag Codes', 'Flags']}¶ Columns to drop from the raw data.
-
item.historical.scripts.T000.
COMMON_DIMS
= {'fuel': 'All', 'service': 'Passenger', 'source': 'International Transport Forum', 'technology': 'All', 'unit': '10^9 passenger-km / yr', 'variable': 'Passenger Activity'}¶ Dimensions and attributes which do not vary across this data set.
-
item.historical.scripts.T000.
mode_and_vehicle_type
(variable_name)[source]¶ Determine ‘mode’ and ‘vehicle type’ from ‘variable’.
The rules implemented are:
| Variable | Mode | Vehicle type |
| --- | --- | --- |
| Rail passenger transport | Rail | All |
| Road passenger transport by buses and coaches | Road | Bus |
| Road passenger transport by passenger cars | Road | LDV |
| Total inland passenger transport | All | All |
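The rules listed for mode_and_vehicle_type() amount to a lookup table. A minimal sketch, assuming a dictionary-based implementation (the actual function may be written differently) and using the hypothetical names `RULES` and `lookup_mode_and_vehicle_type`:

```python
# (mode, vehicle type) for each value of 'variable', as listed above.
RULES = {
    "Rail passenger transport": ("Rail", "All"),
    "Road passenger transport by buses and coaches": ("Road", "Bus"),
    "Road passenger transport by passenger cars": ("Road", "LDV"),
    "Total inland passenger transport": ("All", "All"),
}


def lookup_mode_and_vehicle_type(variable_name: str) -> tuple:
    """Return (mode, vehicle type); raises KeyError for an unknown variable."""
    return RULES[variable_name]


mode, vehicle_type = lookup_mode_and_vehicle_type(
    "Road passenger transport by passenger cars"
)
print(mode, vehicle_type)
```

Raising KeyError on an unlisted variable makes it obvious when the upstream source adds a new category.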
T001¶
Data cleaning code and configuration for T001.
This module:
Detects and corrects #32, a data error in the upstream source where China observation values for years 1990 to 2001 inclusive are too low by 2 orders of magnitude.
T001:
name: Coastal Transport
fetch:
type: SDMX
source: OECD
resource_id: ITF_GOODS_TRANSPORT
key: .T-SEA-CAB
validate: false
active: true
-
item.historical.scripts.T001.
COLUMNS
= {'drop': ['COUNTRY', 'VARIABLE', 'YEAR', 'Flag Codes', 'Flags', 'PowerCode Code', 'PowerCode', 'Reference Period Code', 'Reference Period', 'Unit Code', 'Unit']}¶ Columns to drop from the raw data.
-
item.historical.scripts.T001.
COMMON_DIMS
= {'fuel': 'All', 'mode': 'Shipping', 'service': 'Freight', 'source': 'International Transport Forum', 'technology': 'All', 'variable': 'Freight Activity', 'vehicle_type': 'Coastal'}¶ Dimensions and attributes which do not vary across this data set.
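The correction described for T001 (China values for 1990 to 2001 being too low by two orders of magnitude) could look like the sketch below. It covers only the scaling, not the detection, and the function name, column names and sample values are assumptions.

```python
def correct_china(rows, factor=100.0):
    """Scale China observations for 1990-2001 inclusive by `factor`."""
    corrected = []
    for r in rows:
        if r["ISO Code"] == "CHN" and 1990 <= r["Year"] <= 2001:
            # Two orders of magnitude corresponds to a factor of 100.
            r = {**r, "Value": r["Value"] * factor}
        corrected.append(r)
    return corrected


# Made-up observations: only the first row is affected.
rows = [
    {"ISO Code": "CHN", "Year": 1995, "Value": 2.0},
    {"ISO Code": "CHN", "Year": 2002, "Value": 300.0},
    {"ISO Code": "FRA", "Year": 1995, "Value": 5.0},
]
fixed = correct_china(rows)
print([r["Value"] for r in fixed])
```

A real implementation would first confirm that the error is still present upstream, so that corrected source data are not scaled a second time.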
T003¶
Data cleaning code and configuration for T003.
The input data contains the variable names in VARIABLE_MAP. A new sum is computed, with mode="Inland ex. pipeline", that is the sum of the variables in PARTIAL, i.e. excluding “Pipelines transport”.
T003:
name: Inland Freight Transport
fetch:
type: SDMX
source: OECD
resource_id: ITF_GOODS_TRANSPORT
key: .T-GOODS-TOT-INLD+T-GOODS-RL-TOT+T-GOODS-RD-TOT+T-GOODS-RD-REW+T-GOODS-RD-OWN+T-GOODS-IW-TOT+T-GOODS-PP-TOT
validate: false
active: true
-
item.historical.scripts.T003.
COLUMNS
= {'drop': ['COUNTRY', 'VARIABLE', 'YEAR', 'Flag Codes', 'Flags', 'PowerCode', 'PowerCode Code', 'Reference Period Code', 'Reference Period', 'Unit Code', 'Unit']}¶ Columns to drop from the raw data.
-
item.historical.scripts.T003.
COMMON_DIMS
= {'fuel': 'All', 'source': 'International Transport Forum', 'technology': 'All', 'unit': 'Gt km / year'}¶ Dimensions and attributes which do not vary across this data set.
-
item.historical.scripts.T003.
PARTIAL
= ['Rail freight transport', 'Road freight transport', 'Inland waterways freight transport']¶ Variables to include in a partial sum.
-
item.historical.scripts.T003.
VARIABLE_MAP
= {'Inland waterways freight transport': {'mode': 'Shipping', 'vehicle_type': 'Inland'}, 'Pipelines transport': {'mode': 'Pipeline', 'vehicle_type': 'Pipeline'}, 'Rail freight transport': {'mode': 'Rail', 'vehicle_type': 'All'}, 'Road freight transport': {'mode': 'Road', 'vehicle_type': 'All'}, 'Road freight transport for hire and reward': {'mode': 'Road', 'vehicle_type': 'For Hire and Reward'}, 'Road freight transport on own account': {'mode': 'Road', 'vehicle_type': 'For Own Account'}, 'Total inland freight transport': {'mode': 'Inland', 'vehicle_type': 'All'}}¶ Mapping from Variable to mode and vehicle_type dimensions.
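The partial sum over PARTIAL can be sketched as a grouped total per country and year. The helper `partial_sum`, the column names and the sample values are assumptions for illustration; the module presumably does this on a data frame.

```python
from collections import defaultdict

# Copied from PARTIAL above.
PARTIAL = [
    "Rail freight transport",
    "Road freight transport",
    "Inland waterways freight transport",
]


def partial_sum(rows):
    """Sum the PARTIAL variables for each (country, year) pair."""
    totals = defaultdict(float)
    for r in rows:
        if r["Variable"] in PARTIAL:
            totals[(r["Country"], r["Year"])] += r["Value"]
    return [
        {"Country": c, "Year": y, "Mode": "Inland ex. pipeline", "Value": v}
        for (c, y), v in totals.items()
    ]


# Made-up values; the pipeline row must not be counted.
rows = [
    {"Country": "France", "Year": 2000, "Variable": "Rail freight transport", "Value": 1.0},
    {"Country": "France", "Year": 2000, "Variable": "Road freight transport", "Value": 2.0},
    {"Country": "France", "Year": 2000, "Variable": "Inland waterways freight transport", "Value": 4.0},
    {"Country": "France", "Year": 2000, "Variable": "Pipelines transport", "Value": 8.0},
]
sums = partial_sum(rows)
print(sums)
```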
T004¶
Data cleaning code and configuration for T004.
Notes:
The input data do not express the units, which are single vehicles.
Todo
The input data have labels like “- LPG” in the “Fuel type” column, with the hyphen possibly indicating a hierarchical code list. Find a reference to this code list.
The code currently uses some inconsistent labels, such as:
“Liquid-Bio” (no spaces) vs. “Liquid - Fossil” (spaces).
“Natural Gas Vehicle” vs. “Conventional” (word “Vehicle” is omitted).
Fix these after PR #62 is merged by using code lists for these dimensions.
Add code to fetch this source automatically. It does not have a clearly-defined API.
Capture and preserve the metadata provided by the UNECE data interface.
T004:
url: https://w3.unece.org/PXWeb2015/pxweb/en/STAT/STAT__40-TRTRANS__03-TRRoadFleet/08_en_TRRoadNewVehF_r.px/?rxid=674effaa-3926-4d2e-9d6d-abfd7dd196b8
name: New Road Vehicle Registrations by Vehicle Category and Fuel Type
active: true
-
item.historical.scripts.T004.
COLUMNS
= {'drop': ['Frequency']}¶ Columns to drop from the raw data.
-
item.historical.scripts.T004.
COMMON_DIMS
= {'mode': 'Road', 'source': 'UNECE', 'unit': 'vehicle', 'variable': 'Sales (New Vehicles)'}¶ Dimensions and attributes which do not vary across this data set.
-
item.historical.scripts.T004.
CSV_SEP
= ';'¶ Separator character for
pandas.read_csv()
.
-
item.historical.scripts.T004.
MAP
= {'Fuel type': {'- Bi-fuel vehicles': ('Conventional', 'Liquid-Bio'), '- Biodiesel': ('Conventional', 'Liquid - Fossil'), '- Bioethanol': ('Conventional', 'Liquid-Bio'), '- Compressed natural gas (CNG)': ('Natural Gas Vehicle', 'Natural Gas'), '- Diesel (excluding hybrids)': ('Conventional', 'Liquid - Fossil'), '- Electricity': ('BEV', 'Electricity'), '- Hybrid electric-diesel': ('Conventional', 'Liquid - Fossil'), '- Hybrid electric-petrol': ('Conventional', 'Liquid - Fossil'), '- Hydrogen and fuel cells': ('Fuel Cell', 'Hydrogen'), '- LPG': ('Natural Gas Vehicle', 'Natural Gas'), '- Liquefied natural gas (LNG)': ('Natural Gas Vehicle', 'Natural Gas'), '- Petrol (excluding hybrids)': ('Conventional', 'Liquid - Fossil'), '- Plug-in hybrid diesel-electric': ('PHEV', 'Electricity'), '- Plug-in hybrid petrol-electric': ('PHEV', 'Electricity'), 'Alternative (total)': ('Alternative', 'Alternative'), 'Diesel': ('Conventional', 'Liquid - Fossil'), 'Petrol': ('Conventional', 'Liquid - Fossil'), 'Total': ('All', 'All'), '_dims': ('TECHNOLOGY', 'FUEL')}, 'Type of vehicle': {'New light goods vehicles': ('Freight', 'Light Truck'), 'New lorries (vehicle wt over 3500 kg)': ('Freight', 'Heavy Truck'), 'New motor coaches, buses and trolley buses': ('Freight', 'Bus'), 'New passenger cars': ('Passenger', 'LDV'), 'New road tractors': ('Freight', 'Medium Truck'), '_dims': ('SERVICE', 'VEHICLE')}}¶ Mapping between existing values and values to be assigned.
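One way to read MAP is that each source column is replaced by the dimensions named in its '_dims' entry. The sketch below applies that reading to a single row; `apply_map` is a hypothetical helper and the mapping is a three-entry excerpt, so the real code may differ.

```python
# Excerpt of MAP above.
MAP = {
    "Fuel type": {
        "- Electricity": ("BEV", "Electricity"),
        "Total": ("All", "All"),
        "_dims": ("TECHNOLOGY", "FUEL"),
    },
}


def apply_map(row: dict, mapping: dict = MAP) -> dict:
    """Replace each mapped column with the dimensions named in its '_dims'."""
    out = dict(row)
    for column, values in mapping.items():
        dims = values["_dims"]
        # Look up the existing label, e.g. '- Electricity' -> ('BEV', 'Electricity').
        new_values = values[out.pop(column)]
        out.update(zip(dims, new_values))
    return out


mapped = apply_map({"Fuel type": "- Electricity", "Value": 10})
print(mapped)
```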
T009¶
Data cleaning code and configuration for T009.
T009:
name: Road Vehicle Fleet by Vehicle Category and Fuel Type
fetch:
type: OpenKAPSARC
dataset_id: road-vehicle-fleet-by-vehicle-category-and-fuel-type
url: https://datasource.kapsarc.org/explore/dataset/road-vehicle-fleet-by-vehicle-category-and-fuel-type/
-
item.historical.scripts.T009.
FETCH
= True¶ Fetch directly from the source, or cache.
Quality diagnostics¶
A003¶
-
item.historical.diagnostic.A003.
compute
(activity, stock)[source]¶ Quality diagnostic for freight load factor.
Returns the ratio of road freight traffic from T003 to the total number of freight vehicles from T009.
- Parameters
activity (pandas.DataFrame) – From T003.
stock (pandas.DataFrame) – From T009.
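The ratio behind this diagnostic can be sketched with dictionaries keyed by (country, year). The name `load_factor_ratio`, the keys and the numbers are assumptions; units and the alignment of the two data frames in the real compute() are not shown here.

```python
def load_factor_ratio(activity: dict, stock: dict) -> dict:
    """Divide activity by stock for every key present in both inputs."""
    return {
        key: activity[key] / stock[key]
        for key in activity.keys() & stock.keys()
        if stock[key]  # skip zero stock to avoid division by zero
    }


# Made-up values: freight activity and number of freight vehicles.
activity = {("FRA", 2000): 200.0}
stock = {("FRA", 2000): 50.0, ("JPN", 2000): 10.0}
ratio = load_factor_ratio(activity, stock)
print(ratio)
```

Only keys present in both inputs produce a value, so gaps in either data set show up as missing entries rather than errors.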