A framework to incentivise accurate reporting of climate action data and tracking of progress towards goals. (Tools used: blockchain, data processing, sentiment analysis)
Team Name: Climate COPs
Participants: Barnabé Monnot, Brendan Graetz, Debapratim Jana, Hum Qing Ze, Kenta Iwasaki, Koh Wei Jie, Lai Ying Tong, Lu Shengliang, Ryan Swart, Sofija Stefanovic, Yogendrasingh Pawar
Mentors: Prof Angel Hsu, Dorjee Sun, Nihit Goyal, Willie Khoo
Node: Singapore (Yale-NUS College)
- 1. Solar farm is reporting energy production
- 2. First report is truthful but invalid, leads to opening a validator-challenge
- 3. Some report is falsified, leads to a whistleblower challenge
- 4. Solar farm releases a document specifying their commitments re: emissions reduction. NLP extracts data from the plan and checks validity of the commitment
- 5. Solar farm publishes new data, this is checked privately using homomorphic encryption against their commitments.
- data analysis smart contract
- encrypted emissions data (PoC: assume numerical and standard data format)
- goal fulfilment smart contract
- publicly declared reductions goals
- stake deposit
- INPUT: encrypted data, ML weights
- COMPUTE: sentiment analysis for bad actor prediction, data cleaning, outlier detection
- OUTPUT: cleaned encrypted data, bad actors, updated ML weights
- INPUT: cleaned encrypted data
- COMPUTE: aggregate statistics on cleaned encrypted data
- OUTPUT: decrypted aggregate result, proof of decryption
- Reporting accurate data
- Making reduction commitments
- Fulfilling reduction commitments
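The aggregate step listed above (statistics computed on encrypted data, with only the aggregate result decrypted) can be illustrated with an additively homomorphic scheme. The following is a minimal sketch, assuming the Paillier cryptosystem; the write-up does not name a specific scheme, so this choice, the toy primes, the sample readings and all function names are assumptions made for illustration. The primes are far too small to be secure, and the "proof of decryption" is not modelled.

```python
import math
import random


def generate_keypair(p, q):
    """Build a toy Paillier key pair from two primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+), valid for g = n + 1
    return n, (lam, mu)


def encrypt(n, message):
    """Encrypt an integer 0 <= message < n under public key n."""
    n_squared = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    # With generator g = n + 1, g**m mod n^2 simplifies to 1 + m*n.
    return (1 + message * n) * pow(r, n, n_squared) % n_squared


def decrypt(n, private_key, ciphertext):
    """Recover the plaintext from a ciphertext."""
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return (x - 1) // n * mu % n


def add_encrypted(n, ciphertexts):
    """Multiply ciphertexts, which adds the underlying plaintexts mod n."""
    total = 1
    for c in ciphertexts:
        total = total * c % (n * n)
    return total


# Hypothetical daily readings; the sum must stay below n.
public_n, private_key = generate_keypair(1009, 1013)
readings = [120, 135, 128, 140, 131]
encrypted_readings = [encrypt(public_n, value) for value in readings]

encrypted_sum = add_encrypted(public_n, encrypted_readings)
decrypted_sum = decrypt(public_n, private_key, encrypted_sum)
mean_reading = decrypted_sum / len(readings)
print(decrypted_sum, mean_reading)
```

Only the key holder can decrypt, so whoever computes `encrypted_sum` never sees an individual reading. A production system would rely on an audited library and realistic key sizes.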
GitHub - openclimate-sg/data_verification: simple python script to verify climate emissions data
Main repository for data verifier
GitHub - openclimate-sg/data_analysis: Work done for the Yale Open Climate Collabathon
Main repository for data analysis
GitHub - openclimate-sg/datawhistleblowing: Work done for the Yale Open Climate Collabathon
Main repository for data whistleblowing
GitHub - openclimate-sg/datawhistleblowing_ui: UI for datawhistleblowing
Main repository for data whistleblowing app UI
DataReporting.sol smart contract on Ethereum testnet
Sometimes emissions data can be misreported. This script does some simple checks to ensure that the numbers reported are not too ridiculous.
0. Ensure that your file headers are of a similar format
"population", "baseline_emissions", "baseline_year", "total_co2_emissions", "total_co2_emissions_year"
- 1. Run the script on your .csv file
- 2. It checks, for each line in your file, whether it passes the following:
- per capita co2 emissions
- per capita baseline emissions
- compound annual growth rate of emissions
- co2 vs baseline emissions
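The four checks above could be sketched roughly as follows. This is not the code from the data_verification repository: the threshold values, the assumed unit (tonnes of CO2), and the names `check_row` and `verify_csv` are hypothetical and chosen only to illustrate the idea. Only the column headers come from the text.

```python
import csv
import io

# Hypothetical plausibility bounds, chosen only for illustration.
MAX_PER_CAPITA_TONNES = 100.0   # assumed unit: tonnes of CO2 per person per year
MAX_ABS_CAGR = 0.25             # at most a 25% change per year
MAX_RATIO_TO_BASELINE = 5.0     # current emissions at most 5x the baseline


def check_row(row):
    """Return {check name: passed} for one parsed CSV row."""
    population = float(row["population"])
    baseline = float(row["baseline_emissions"])
    current = float(row["total_co2_emissions"])
    years = int(row["total_co2_emissions_year"]) - int(row["baseline_year"])

    # The growth rate is only defined for a positive baseline and time span.
    if years > 0 and baseline > 0 and current >= 0:
        cagr = (current / baseline) ** (1.0 / years) - 1.0
        cagr_ok = abs(cagr) <= MAX_ABS_CAGR
    else:
        cagr_ok = False

    return {
        "per_capita_co2": population > 0
        and 0 <= current / population <= MAX_PER_CAPITA_TONNES,
        "per_capita_baseline": population > 0
        and 0 <= baseline / population <= MAX_PER_CAPITA_TONNES,
        "cagr": cagr_ok,
        "co2_vs_baseline": baseline > 0
        and 0 <= current / baseline <= MAX_RATIO_TO_BASELINE,
    }


def verify_csv(text):
    """Check every data row of a CSV document given as a string."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [check_row(row) for row in reader]


# Made-up sample data: the second row reports implausibly high emissions.
SAMPLE = """population,baseline_emissions,baseline_year,total_co2_emissions,total_co2_emissions_year
1000000,5000000,2005,4000000,2015
1000,5000,2005,900000,2015
"""

results = verify_csv(SAMPLE)
for line_number, checks in enumerate(results, start=2):
    failed = [name for name, ok in checks.items() if not ok]
    print(line_number, "OK" if not failed else "FAILED: " + ", ".join(failed))
```

Reporting each check separately, rather than a single pass/fail flag, makes it easier to see which figure in a row looks suspicious.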
We analyse different aspects of the texts submitted by actors using NLP tools which could be used in the efforts to automate the action tracking given large and inconsistent datasets.
We explore what we can get from the semantics of the texts submitted by actors
For demonstration purposes, we look at the semantic similarity between several commitments put forward in the action plans of the Covenant of Mayors (CoM) members and some of the actions identified as most urgent and effective in the World Scientists’ Warning of a Climate Emergency (Ripple et al., 2019). To achieve this, we make use of Google’s pre-trained Universal Sentence Encoder (USE) model (Cer et al., 2018), which encodes text as high-dimensional vectors. A possible set of sentences made up of calls for urgent action (1-10 below), an extract from a CoM commitment (11) and a control sentence (12) could look like:
['Implement massive energy efficiency practices', 'Replace fossil fuels with low-carbon renewables', 'Apply strong carbon taxes to cut fossil fuel use', 'We should leave remaining stocks of fossil fuels in the ground ', 'We must swiftly eliminate subsidies for fossil fuels', 'Eat mostly plant-based foods while reducing the global consumption of animal products', 'Encourage vegeterian and vegan diets', 'Free up croplands for growing human plant food instead of livestock feed', 'Release some grazing land to support natural climate solution', 'Increase use of renewable energy sources for housing and public amenities', 'Fossil free energy production', 'Word and sentence embeddings have become an essential part of any Deep-Learning-based natural language processing systems.']
Comparing the embeddings generated with the Universal Sentence Encoder yields the results below for semantic similarity:
Using the model on our example sentences, we see that it correctly groups the energy-related statements and diet-related statements. We also notice that there is no clear separation into different groups, which makes sense given that these issues are interrelated and likely show up in similar contexts in the datasets that the model has been trained on. The example extracted from the CoM dataset "fossil free energy production" is found to be similar to the energy-related statements, the highest overlap being with "Replace fossil fuels with low-carbon renewables". Finally, the control sentence is correctly found to be unrelated to the rest.
Since USE generates embeddings for sentences rather than words, it is more successful at recognising the similarity between statements that show up in similar contexts even if they do not share the same vocabulary.
Already the simple semantic similarity comparison with USE can give us some measure of how well the actors are aligned with the actions we are interested in, but it is also possible to tweak the pre-trained model further to increase precision for a specific task.
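The comparison step itself is a cosine similarity between embedding vectors. The sketch below shows only that step, in plain Python. The three-dimensional vectors are made-up stand-ins, not real Universal Sentence Encoder outputs, and loading the model is deliberately left out. The helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query_vector, candidates):
    """Return (sentence, score) for the candidate closest to the query."""
    scored = (
        (sentence, cosine_similarity(query_vector, vector))
        for sentence, vector in candidates.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Toy embeddings (assumed values); a real encoder returns far more dimensions.
candidates = {
    "Replace fossil fuels with low-carbon renewables": [0.9, 0.1, 0.0],
    "Encourage vegeterian and vegan diets": [0.1, 0.9, 0.0],
    "Word and sentence embeddings have become an essential part of any "
    "Deep-Learning-based natural language processing systems.": [0.0, 0.0, 1.0],
}
query = [0.8, 0.2, 0.1]  # stand-in for "Fossil free energy production"

best_sentence, best_score = most_similar(query, candidates)
print(best_sentence, round(best_score, 3))
```

With real embeddings, the same two functions would be applied to the vectors returned by the encoder for each sentence.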
To assess the urgency with which the actors report their commitments and progress, we started building a classifier that can be used to detect urgency in any climate-change-related text. We define a text as urgent if the language used:
- intentionally uses phrases that are meant to convey urgency (e.g. climate crisis, breakdown, emergency)
- stresses how devastating the effects of climate change are and will be
- points out there is little time left to act/no time to waste
- calls for radical change/action
Our goal is to make use of pre-trained deep learning models and weak supervision to overcome the lack of large labeled datasets relevant for climate action tracking. We draw on the approach presented [here](https://medium.com/sculpt/a-technique-for-building-nlp-classifiers-efficiently-with-transfer-learning-and-weak-supervision-a8e2f21ca9c8) to build a powerful classifier for the task. Specifically, we intend to use Snorkel to generate a large training dataset for the classifier, relying on labeling functions to probabilistically label large unlabeled datasets.
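To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It is a stand-in, not Snorkel's API: the functions, their keyword patterns and the majority vote are hypothetical. Snorkel itself would fit a probabilistic label model over the votes rather than count them. The sketch covers only 'urgent' and 'not urgent' and lets a function abstain when it has no opinion.

```python
import re

URGENT, NOT_URGENT, ABSTAIN = 1, 0, -1


def lf_crisis_vocabulary(text):
    """Vote URGENT on phrases meant to convey urgency."""
    pattern = r"\bclimate (crisis|breakdown|emergency)\b"
    return URGENT if re.search(pattern, text, re.I) else ABSTAIN


def lf_little_time(text):
    """Vote URGENT when the text stresses that time is running out."""
    pattern = r"\b(no time to waste|little time left|running out of time)\b"
    return URGENT if re.search(pattern, text, re.I) else ABSTAIN


def lf_radical_change(text):
    """Vote URGENT on calls for radical change or action."""
    pattern = r"\b(radical|drastic|immediate) (change|action)\b"
    return URGENT if re.search(pattern, text, re.I) else ABSTAIN


def lf_gradual_language(text):
    """Hypothetical heuristic: gradual framing suggests NOT_URGENT."""
    pattern = r"\b(in the long term|gradual(ly)?|eventually)\b"
    return NOT_URGENT if re.search(pattern, text, re.I) else ABSTAIN


LABELING_FUNCTIONS = [
    lf_crisis_vocabulary,
    lf_little_time,
    lf_radical_change,
    lf_gradual_language,
]


def weak_label(text):
    """Combine the votes by simple majority; ties and silence abstain."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    urgent = votes.count(URGENT)
    not_urgent = votes.count(NOT_URGENT)
    if urgent == not_urgent:
        return ABSTAIN
    return URGENT if urgent > not_urgent else NOT_URGENT


print(weak_label("We face a climate emergency and there is no time to waste."))
```

Each function encodes one of the criteria listed above, so a manually labeled sample can be used to estimate how often each heuristic is right before trusting it on unlabeled text.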
Part completed for the hackathon:
- 1. Created a dataset of 20000 Guardian articles about climate change accessed through the Guardian API
- 2. Started reviewing and labeling a randomly chosen sample of 700 data points as 'urgent', 'not urgent' or 'neutral'
To be completed:
- 1. Building the training set with Snorkel
- 2. Building the classification model
We perform a sentiment analysis of the text submitted by actors in a climate action dataset.
For our demo, we perform this analysis on a dataset of 1162 cities that are members of the EU Covenant of Mayors (CoM). The sentiment analysis is performed using the AllenNLP package.
The aim of sentiment analysis is to classify the "sentiment" or polarity of a given text at the document, sentence, or feature/aspect level: whether the opinion expressed in a document, a sentence or an entity feature/aspect is positive, negative, or neutral.
Using AllenNLP, we first train the model on the Stanford Sentiment Treebank, and then use the model to predict the sentiment of climate action plan texts. The goal is to explore whether sentiment can serve as a "marker" that reveals some information about the nature of actors.
From the EU CoM dataset, we extract what we call the "consistency score" to determine whether actors are consistent with their proposed action plans to reduce/offset emissions.
The actors in the dataset promise a certain target and a target year, and report their current total emissions. From these we calculate the rate of CO2 emission reduction promised by each actor as well as their current rate of emission reduction. All reduction rates are calculated with respect to the baseline year reported by the individual actor.
If the current rate is greater than or equal to the promised rate, the actor is awarded a consistency score of 1; otherwise they are inconsistent and given a score of 0.
We find that the average current rate / promised rate is around 1.07 for our dataset. The overall framework of our analysis model is to use different markers: semantic similarity, sentiment and urgency to see if our model can learn to predict whether the actors will be "good" (consistent) or "bad" (failing to meet their promise).
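The consistency score can be sketched as below. The write-up does not state the exact rate formula, so this sketch assumes a linear, per-year fractional reduction relative to the baseline. The function names and the example figures are hypothetical.

```python
def reduction_rate(baseline_emissions, emissions, baseline_year, year):
    """Fraction of baseline emissions cut per year since the baseline year.

    Assumption: a simple linear rate; the original analysis may differ.
    """
    years = year - baseline_year
    if years <= 0 or baseline_emissions <= 0:
        raise ValueError("need a positive baseline and a later year")
    return (baseline_emissions - emissions) / baseline_emissions / years


def consistency_score(baseline_emissions, baseline_year,
                      target_emissions, target_year,
                      current_emissions, current_year):
    """Return 1 if the current rate meets the promised rate, else 0."""
    promised = reduction_rate(
        baseline_emissions, target_emissions, baseline_year, target_year)
    current = reduction_rate(
        baseline_emissions, current_emissions, baseline_year, current_year)
    return 1 if current >= promised else 0


# Made-up actor: baseline of 100 units in 2005, promising 60 units by 2025.
on_track = consistency_score(100.0, 2005, 60.0, 2025, 75.0, 2015)
behind = consistency_score(100.0, 2005, 60.0, 2025, 90.0, 2015)
print(on_track, behind)
```

In this example the promised rate is 2% of the baseline per year. An actor at 75 units in 2015 is cutting 2.5% per year and scores 1, while one at 90 units is cutting 1% per year and scores 0.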
Most of the documents and proposals pertaining to climate change and climate action are in the PDF file format. With a growing number of actors and agencies working on climate action, it is not possible to read through every proposal manually and extract the relevant data. There is a need to analyze these proposals and quantify the data present in them using programs that can read and understand PDF files. PDF parsers combined with NLP can be used to achieve this.
To demonstrate this, we have used PyPDF2 to parse the PDFs and the spaCy library for NLP-based information extraction. We use the default NER (Named Entity Recognition) models available with spaCy to identify various actors. Finally, the data is extracted into a Pandas DataFrame using simple rule-based matching combined with part-of-speech tagging. For demonstration purposes, this extraction was carried out on some manufactured sentences.
The PDF parser is still a work in progress and needs much more robust and context-specific NER models. There is also a need to develop methodologies for dealing with non-textual content such as images and graphs, which can contain a lot of relevant information.
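The rule-based extraction step can be illustrated on manufactured sentences. The sketch below swaps in a standard-library regular expression for spaCy's NER and part-of-speech tagging, so it is a much cruder technique than the one described: a run of capitalised words stands in for a named entity. The pattern, field names and example sentences are all hypothetical.

```python
import re

# A run of capitalised words stands in for NER; the verbs are assumed phrasings.
COMMITMENT_PATTERN = re.compile(
    r"(?P<actor>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
    r" (?:will|plans to|commits to) (?:reduce|cut)"
    r" (?:its )?(?:CO2 )?emissions"
    r" by (?P<percent>\d+(?:\.\d+)?) ?%"
    r" by (?P<year>\d{4})"
)


def extract_commitments(text):
    """Return one record per 'actor reduces emissions by X% by YEAR' match."""
    records = []
    for match in COMMITMENT_PATTERN.finditer(text):
        records.append({
            "actor": match.group("actor"),
            "reduction_percent": float(match.group("percent")),
            "target_year": int(match.group("year")),
        })
    return records


# Manufactured sentences, as in the demonstration described above.
TEXT = (
    "Copenhagen will reduce CO2 emissions by 40% by 2030. "
    "The report was published in May. "
    "New York plans to cut its emissions by 30 % by 2035."
)

commitments = extract_commitments(TEXT)
for record in commitments:
    print(record)
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(commitments)` to obtain the tabular form mentioned above.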
For the purposes of this hackathon, we have implemented a simple proof of concept. We use as an example a solar energy farm which is required to report its daily power production. We also assume that this farm is a corporation with 5 executives.
Each executive registers their cryptographic identity into an Ethereum smart contract (based on Semaphore (https://weijiekoh.github.io/semaphore-ui/, https://github.com/kobigurk/semaphore/), a zero-knowledge signalling gadget), so that anyone can anonymously prove their membership in the set and broadcast a whistleblowing signal.
We then simulate the following process: the company reports data, along with a deposit, for five days in a row, and an executive anonymously blows the whistle on the data reported on the fifth day. This locks up part of the total deposit. After an investigation (outside the system), an investigator seizes part of the total amount deposited and sends part of the seized funds as a reward to a separate address that the whistleblower specified when she blew the whistle.
- 1. On day 1, the solar farm publishes their true power readings on a smart contract and deposits 0.1 ETH along with the data.
- 2. The solar farm does the same for days 2, 3, and 4.
- 3. On day 5, however, the solar farm reports false power readings.
- 4. Alice, an executive in the corporation, decides to blow the whistle on this false reading. She produces a zero-knowledge proof of her membership in the set of executives, states that the readings of day 5 are fraudulent, and publishes it. Most importantly, the proof does not reveal Alice’s identity.
- 5. The smart contract locks up 0.2 ETH of deposits pending the results of an external investigation.
- 6. We assume that the investigator is a trusted third party. They hold the administrative private key with which they can unlock the farm’s deposit, or trigger the confiscation of said funds. Alice is rewarded a portion of the deposit for correctly whistleblowing, with this portion determined by the rules agreed upon, and saved in the smart contract. In this demo, she is rewarded 0.1 ETH. For the sake of anonymity, we assume that her payout address, specified along with the zero-knowledge proof, is unlinked to the address used to register her identity.
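The deposit accounting in the six steps above can be traced with a small simulation. This is a plain Python sketch of the bookkeeping only, not the DataReporting.sol contract. The zero-knowledge membership proof is not modelled, and the class, its method names, the placeholder readings and the payout address are hypothetical. It also assumes that the whole locked amount is seized once fraud is confirmed.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

WEI_PER_ETH = 10**18
DAILY_DEPOSIT = WEI_PER_ETH // 10      # 0.1 ETH paid with each report
LOCK_AMOUNT = 2 * DAILY_DEPOSIT        # 0.2 ETH locked by a challenge
WHISTLEBLOWER_REWARD = DAILY_DEPOSIT   # 0.1 ETH reward in the demo


@dataclass
class ReportingLedger:
    free: int = 0      # deposits not under challenge, in wei
    locked: int = 0    # deposits frozen pending investigation
    seized: int = 0    # confiscated funds kept after paying the reward
    reports: list = field(default_factory=list)
    payouts: dict = field(default_factory=dict)
    challenge: Optional[Tuple[int, str]] = None

    def report(self, day, reading):
        """Record a reading and take the daily deposit."""
        self.reports.append((day, reading))
        self.free += DAILY_DEPOSIT

    def blow_whistle(self, day, payout_address):
        """Lock funds; proof of membership is assumed verified elsewhere."""
        if self.challenge is not None or self.free < LOCK_AMOUNT:
            raise ValueError("cannot open a challenge")
        self.free -= LOCK_AMOUNT
        self.locked += LOCK_AMOUNT
        self.challenge = (day, payout_address)

    def resolve(self, fraud_confirmed):
        """Investigator's decision: seize and reward, or unlock."""
        if self.challenge is None:
            raise ValueError("no open challenge")
        _, payout_address = self.challenge
        if fraud_confirmed:
            previous = self.payouts.get(payout_address, 0)
            self.payouts[payout_address] = previous + WHISTLEBLOWER_REWARD
            self.seized += self.locked - WHISTLEBLOWER_REWARD
        else:
            self.free += self.locked
        self.locked = 0
        self.challenge = None


ledger = ReportingLedger()
for day in range(1, 6):
    ledger.report(day, 100.0)          # placeholder readings
ledger.blow_whistle(5, "0xPAYOUT")     # hypothetical unlinked payout address
ledger.resolve(fraud_confirmed=True)
print(ledger.free, ledger.seized, ledger.payouts)
```

Amounts are kept as integers in wei, as on Ethereum, to avoid floating-point rounding. After the run, 0.3 ETH of deposits remain free, 0.1 ETH has gone to the payout address and 0.1 ETH remains seized.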