The Data-Driven Lab (DDL) uses cutting-edge data analytics to understand the role of subnational and non-state actors in addressing the climate challenge. In compiling and accounting for global climate action pledges and progress by non-state actors, DDL faces several issues with both data processing and data sharing. These issues fall into two main data categories:
1) Data that DDL has collected for our analysis and global aggregation reports
Costs associated with climate emissions data: Global climate disclosure initiatives like CDP spend significant resources collecting, processing, cleaning and verifying data. They currently charge for access to their full dataset, and that charge is built into their business model. They do provide some datasets for free on their open data portal (data.cdp.net), particularly on cities, states and regions.
Data sharing/reposting permissions: The data that DDL has received directly from data providers (e.g., Global Covenant of Mayors, Carbonn, etc.) cannot currently be shared. The data providers would likely object strongly to any sharing they have not approved, so we would need to get their permission once it is clearly defined what we are doing with the data.
Data updating mechanisms: Many of these organizations do not use APIs to store or display data on websites. As a result, our process of getting the data from providers usually involves: 1) them sending us a spreadsheet, often including data that is not publicly available on the website; 2) us scraping data from their websites and recompiling it using our R package and other cleaning procedures. This cumbersome process is currently done manually, and its cost is covered by research funds rather than shared by the global climate action tracking community that ultimately benefits from it.
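To make the two-source process above concrete, here is a minimal Python sketch of recompiling a provider spreadsheet with scraped website records. It is not DDL's actual pipeline (which uses an R package). The column names (actor_name, target_pct, baseline_emissions), the name-normalization rule, and the choice to let spreadsheet values override scraped ones are all assumptions made for illustration.

```python
import csv
import io
from typing import Dict, Iterable, Mapping


def normalize_name(name: str) -> str:
    """Collapse case and whitespace so the same actor matches across sources."""
    return " ".join(name.strip().lower().split())


def merge_sources(
    spreadsheet_rows: Iterable[Mapping[str, str]],
    scraped_rows: Iterable[Mapping[str, str]],
    key: str = "actor_name",  # hypothetical shared column
) -> Dict[str, Dict[str, str]]:
    """Merge two record sets per actor, recording where each value came from.

    Scraped rows are applied first and spreadsheet rows second, so a
    non-empty spreadsheet value overrides a scraped one (an assumption).
    """
    merged: Dict[str, Dict[str, str]] = {}
    for source, rows in (("scraped", scraped_rows), ("spreadsheet", spreadsheet_rows)):
        for row in rows:
            actor = merged.setdefault(normalize_name(row[key]), {})
            for field, value in row.items():
                if value not in (None, ""):
                    actor[field] = value
                    actor[f"{field}_source"] = source  # keep provenance
    return merged


# Usage with made-up data standing in for a scraped page and a provider file.
scraped_csv = "actor_name,target_pct\nExample City,30\n"
sheet_csv = "actor_name,target_pct,baseline_emissions\n example  city ,40,1200\n"
merged = merge_sources(
    csv.DictReader(io.StringIO(sheet_csv)),
    csv.DictReader(io.StringIO(scraped_csv)),
)
```

Keeping a source tag next to each value is one way to trace a number back to the provider file or the website when the two disagree.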
Reconciling diverse datasets: There are often errors and inconsistencies in the data that we have to spend copious amounts of time manually checking and verifying, including:
Erroneous baseline or inventory emissions data
Inconsistent reduction targets (e.g., an actor has updated their targets but these updates have not yet been reported to their network or data provider)
Incorrect demographic information
Inconsistent targets - at present we have only cleaned and modeled economy-wide emission reduction targets. In reality, actors set a wide range of other targets covering many sectors, and they often report these targets without the accompanying information needed to quantify their impact (e.g., quantifying a renewable electricity generation target requires the current energy mix breakdown, which is very rarely reported alongside the target)
Inconsistent units, particularly for intensity-based targets
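Some of the checks listed above could be partly automated before manual review. The following Python sketch flags suspicious records. The record layout, the unit conversion table, and the per-capita plausibility threshold of 100 tCO2e are illustrative assumptions, not values taken from DDL or any data provider.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed conversion factors to tonnes of CO2e (illustrative only).
UNIT_TO_TONNES = {"tCO2e": 1.0, "ktCO2e": 1_000.0, "MtCO2e": 1_000_000.0}


@dataclass
class ActorRecord:
    """Hypothetical record layout; real provider fields will differ."""
    name: str
    population: Optional[int]
    baseline_emissions: Optional[float]
    emissions_unit: str
    baseline_year: Optional[int]
    target_year: Optional[int]
    target_reduction_pct: Optional[float]


def to_tonnes(value: float, unit: str) -> float:
    """Convert an emissions value to tonnes of CO2e, rejecting unknown units."""
    if unit not in UNIT_TO_TONNES:
        raise ValueError(f"unknown emissions unit: {unit!r}")
    return value * UNIT_TO_TONNES[unit]


def flag_issues(record: ActorRecord, max_per_capita: float = 100.0) -> List[str]:
    """Return human-readable flags; an empty list means no check fired."""
    issues: List[str] = []
    baseline_ok = record.baseline_emissions is not None and record.baseline_emissions > 0
    unit_ok = record.emissions_unit in UNIT_TO_TONNES
    population_ok = record.population is not None and record.population > 0

    if not baseline_ok:
        issues.append("missing or non-positive baseline emissions")
    if not unit_ok:
        issues.append("unknown emissions unit")
    if not population_ok:
        issues.append("missing or invalid population")

    pct = record.target_reduction_pct
    if pct is not None and not 0 < pct <= 100:
        issues.append("reduction target outside 0-100%")

    if (record.baseline_year is not None and record.target_year is not None
            and record.target_year <= record.baseline_year):
        issues.append("target year not after baseline year")

    # The per-capita check only makes sense when all three inputs are usable.
    if baseline_ok and unit_ok and population_ok:
        per_capita = to_tonnes(record.baseline_emissions, record.emissions_unit) / record.population
        if per_capita > max_per_capita:
            issues.append("implausible per-capita emissions")
    return issues


# Made-up records for illustration.
good = ActorRecord("Example City", 500_000, 4.2, "MtCO2e", 2010, 2030, 40.0)
bad = ActorRecord("Example Town", 1_000, 5.0, "MtCO2e", 2030, 2020, 140.0)
good_issues = flag_issues(good)
bad_issues = flag_issues(bad)
```

Flags like these only narrow down what needs human attention. A record that passes every check can still be wrong, and a flagged record may simply be unusual.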
For further context, here is an explanation of what CDP does to clean/make accessible their investor dataset: https://www.cdp.net/en/investor/ghg-emissions-dataset
2) Data that actors themselves (e.g., businesses, cities, etc.) possess and could report
High cost: It is very time-consuming and expensive to develop an emissions inventory.
Reporting fatigue: Actors do not want to report to yet another platform because of the associated costs and labor. Even reporting on an annual cycle is cumbersome.
Privacy concerns: Some emissions data may be considered sensitive and could reveal proprietary secrets.
Data inequity: Some actors, particularly in the Global South, lack the capacity to develop their own inventories.
Security concerns: The data files might contain viruses or other security threats that could compromise the entire ecosystem.
Sample data from CDP (open-sourced).