
Most greenhouse gas data stops at the national or company level. If you want to ask how much CO2 a specific power station emits, and what climate it sits in, for thousands of plants in one consistent schema, you have to build it yourself. So we did, and published it openly with a DOI. Here is the pipeline, the honest limitations, and the gotchas.

The one number that reframes the whole thing: of 4,551 power stations, only ~15% have measured emissions (US EPA + EU ETS). The other ~85% are modelled estimates (Climate TRACE). Any per-plant CO2 product that hides that distinction is selling false precision, so we store the source on every single row.

What we built

A single open dataset of 4,551 power stations across the US, China, India, Japan, Australia, Saudi Arabia, Germany, Turkey and dozens more countries. Each row carries location, primary fuel, capacity (MW), commissioning year, operating status, owner, annual CO2 (t/yr), the source of that CO2 figure, CO2 intensity (t/MWh), and the plant’s Köppen-Geiger climate zone, so the data can be sliced by environment, not just geography.

The honest part: measured vs estimated

Per-plant CO2 is mostly not measured. In our dataset:

- 616 plants carry measured emissions from the US EPA GHGRP
- 47 plants carry measured emissions from the EU ETS
- 3,888 plants carry modelled estimates from Climate TRACE (satellite + ML)

So only 663 of 4,551 plants (~15%) are independently measured. Rather than hide it, we store a co2_source on every row, so any user can filter to measured-only or treat estimates with caution. If you build emissions data and do not expose provenance per row, you are shipping a liability.

The pipeline

The build is a plain Python ETL, no orchestration framework:

1. Backbone: start from an open plant inventory (WRI Global Power Plant Database + Global Energy Monitor). This gives the canonical list, capacity, fuel and geo.
2. Join CO2: merge Climate TRACE asset-level emissions, then overwrite with EPA GHGRP and EU ETS where a measured match exists (measured always wins).
3. Match: the hard 80%. Plants do not share IDs across sources, so we match on geo proximity + fuzzy name.
4. Climate: look up each plant’s Köppen zone from an offline 0.5-degree grid by coordinate.
5. QA: dedupe, normalise owners, harmonise units, sanity-check intensities.

A simplified version of the match step:

    from math import radians, cos, sin, asin, sqrt

    from rapidfuzz import fuzz

    def haversine_km(a, b):
        # a and b are (lat, lon) pairs in decimal degrees
        lat1, lon1, lat2, lon2 = map(radians, [*a, *b])
        d = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(d))  # 6371 km: mean Earth radius

    def match(plant, candidates):
        # Geo filter first: keep candidates within 10 km, each scored by fuzzy name similarity
        near = [(c, fuzz.token_set_ratio(plant['name'], c['name']))
                for c in candidates
                if haversine_km(plant['coord'], c['coord']) < 10]
        # Name is the tiebreaker: return the best-scoring nearby candidate, or None
        return max(near, key=lambda x: x[1])[0] if near else None

Gotchas that cost us the most time

- Name matching: “Saint-Egreve”, “St Egreve CCGT” and “ST. EGREVE” are the same plant in three sources. Geo-distance as the primary key with name as the tiebreaker beat name-first matching every time.
- Owner normalisation: parent vs subsidiary vs joint venture. We collapsed to a canonical owner so portfolio rollups are honest.
- Unit harmonisation: mixed MW/MWe and t/yr vs kt/yr. One silent unit bug poisons every intensity figure.
- Double counting: co-located units reported once at site level in one source and per-unit in another.
- Coastal sites: these also need a nearest-land fallback for the climate zone.

Why climate per plant

Environment drives real engineering outcomes: atmospheric corrosivity on outdoor equipment, cooling and efficiency penalties in hot zones, heat-loss behaviour. Tagging Köppen zones lets you ask what the coal fleet in arid zones looks like without a separate GIS step.

Open data

Check it and find the gaps!
The full dataset is on Zenodo with a DOI (https://doi.org/10.5281/zenodo.20723334) under CC BY 4.0, and an interactive per-plant view lives at PowerAtlas (https://inzonex.co.uk/poweratlas/). It is a starting point, not gospel: the estimated share is large and coverage is incomplete. Corrections and pull requests are welcome; that is the point of publishing provenance per row.
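
To make the per-row provenance idea concrete, here is a minimal sketch of the "measured always wins" overwrite and a measured-only filter. It is an illustration under assumptions, not the project's actual code: only the co2_source field name comes from the dataset description, while the function names, the co2_t_yr field, the source labels and the sample numbers are all hypothetical.

```python
# Hypothetical provenance labels; the real values of co2_source may differ.
MEASURED = {"epa_ghgrp", "eu_ets"}


def attach_co2(plants, estimates, measurements):
    """Return new plant rows with co2_t_yr and co2_source filled in.

    plants:       list of dicts, each with a unique 'id'
    estimates:    {plant_id: tonnes_per_year} from the modelled source
    measurements: {plant_id: (tonnes_per_year, source_label)}
    """
    rows = []
    for plant in plants:
        row = dict(plant)  # copy so the backbone rows are not mutated
        pid = row["id"]
        if pid in measurements:
            # Measured always wins, even when an estimate also exists.
            row["co2_t_yr"], row["co2_source"] = measurements[pid]
        elif pid in estimates:
            row["co2_t_yr"], row["co2_source"] = estimates[pid], "climate_trace"
        else:
            # No figure at all: keep the row, but say so explicitly.
            row["co2_t_yr"], row["co2_source"] = None, None
        rows.append(row)
    return rows


def measured_only(rows):
    """Keep only rows whose CO2 figure comes from a measured source."""
    return [r for r in rows if r["co2_source"] in MEASURED]


# Made-up sample data, for illustration only.
plants = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
estimates = {"A": 1_200_000.0, "B": 800_000.0}
measurements = {"A": (1_050_000.0, "epa_ghgrp")}

rows = attach_co2(plants, estimates, measurements)
```

Writing the source label in the same assignment as the value is the design choice that matters here: a row can never carry a CO2 figure without also saying where it came from, so a downstream user can call measured_only(rows) and trust the result.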
View original source — Hacker Noon ↗
