mtyrrell committed · Commit d3d9b5e · 1 Parent(s): 6c86e51

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -46,8 +46,8 @@ Due to inconsistencies in the training data, the classifier performance leaves r
  ## Training and evaluation data

  The training dataset is comprised of labelled passages from 2 sources:
- - [ClimateWatch NDC Sector data](https://www.climatewatchdata.org/data-explorer/historical-emissions?historical-emissions-data-sources=climate-watch&historical-emissions-gases=all-ghg&historical-emissions-regions=All%20Selected&historical-emissions-sectors=total-including-lucf%2Ctotal-including-lucf&page=1)
- - [IKI TraCS Climate Strategies for Transport Tracker](https://changing-transport.org/wp-content/uploads/20220722_Tracker_Database.xlsx) implemented by GIZ and funded by theInternational Climate Initiative (IKI) of the German Federal Ministry for Economic Affairs and Climate Action (BMWK). Here we utilized the QA dataset (CW_NDC_data_Sector).
+ - [ClimateWatch NDC Sector data](https://www.climatewatchdata.org/data-explorer/historical-emissions?historical-emissions-data-sources=climate-watch&historical-emissions-gases=all-ghg&historical-emissions-regions=All%20Selected&historical-emissions-sectors=total-including-lucf%2Ctotal-including-lucf&page=1). Here we utilized the QA dataset (CW_NDC_data_Sector).
+ - [IKI TraCS Climate Strategies for Transport Tracker](https://changing-transport.org/wp-content/uploads/20220722_Tracker_Database.xlsx) implemented by GIZ and funded by theInternational Climate Initiative (IKI) of the German Federal Ministry for Economic Affairs and Climate Action (BMWK).

  The combined dataset[GIZ/policy_qa_v0_1](https://huggingface.co/datasets/GIZ/policy_qa_v0_1) contains ~85k rows. Each row is duplicated twice, to provide varying sequence lengths (denoted by the values 'small', 'medium', and 'large', which correspond to sequence lengths of 60, 85, and 150 respectively - indicated in the 'strategy' column). This effectively means the dataset is reduced by 1/3 in useful size, and the 'strategy' value should be selected based on the use case. For this training, we utilized the 'medium' samples Furthermore, for each row, the 'context' column contains 3 samples of varying quality. The approach used to assess quality and select samples is described below.

@@ -58,7 +58,7 @@ The pre-processing operations used to produce the final training dataset were as
  3. For ClimateWatch, the 'QuestionText' field is searched for the terms 'unconditional' or 'conditional', and labels assigned accordingly.
  3. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
  4. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
- 5. The 'match_onanswer' and 'answerWordcount' are used conditionally to select hihg quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
+ 5. The 'match_onanswer' and 'answerWordcount' are used conditionally to select high quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
  6. Data is then augmented using sentence shuffle from the ```albumentations``` library
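Reviewer note: the 'strategy' filter and pre-processing steps 4 and 5 in the README could be sketched in pandas roughly as follows. This is only an illustration under stated assumptions, not the authors' code. It assumes 'match_onanswer' is stored as a list of match shares (0 to 1) aligned with the 'context' list, and that 'answerWordcount' is a single value per row. The thresholds `HIGH_MATCH`, `LOW_MATCH`, and `LONG_ANSWER`, the function name, and the toy data are hypothetical; the README does not give the actual cut-offs.

```python
import pandas as pd

# Hypothetical thresholds; the README does not state the real values.
HIGH_MATCH = 0.8   # share of word matches treated as "high"
LOW_MATCH = 0.5    # lower share accepted only for long answers
LONG_ANSWER = 20   # word count treated as a "high" answerWordcount


def explode_and_select(df: pd.DataFrame, strategy: str = "medium") -> pd.DataFrame:
    """Keep one 'strategy', explode list columns into rows, select samples."""
    # Keep a single sequence-length variant (the README uses 'medium').
    subset = df[df["strategy"] == strategy]

    # Step 4: one row per text sample. Assumes 'context' and
    # 'match_onanswer' hold equal-length lists in every row; scalar
    # columns such as 'label' are repeated for each exploded sample.
    exploded = subset.explode(["context", "match_onanswer"], ignore_index=True)
    exploded["match_onanswer"] = exploded["match_onanswer"].astype(float)

    # Step 5: prefer a high match share, but accept a lower one
    # when the answer is long.
    high_match = exploded["match_onanswer"] >= HIGH_MATCH
    long_answer_ok = (exploded["match_onanswer"] >= LOW_MATCH) & (
        exploded["answerWordcount"] >= LONG_ANSWER
    )
    return exploded[high_match | long_answer_ok].reset_index(drop=True)


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "context": [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]],
        "match_onanswer": [[0.9, 0.6, 0.2]] * 3,
        "answerWordcount": [30, 5, 30],
        "strategy": ["medium", "medium", "small"],
        "label": ["conditional", "unconditional", "conditional"],
    }
)
selected = explode_and_select(toy)
print(selected[["context", "label"]])
```

Exploding several columns at once requires pandas 1.3 or later. If the real dataset stores 'match_onanswer' differently (for example as a single value per row), only 'context' would need to be exploded.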