adding readme

Files changed:
- README.md +36 -70
- models/audio_classification_baseline.pkl +0 -3
- tasks/audio.py +3 -4
README.md
CHANGED
@@ -1,71 +1,37 @@
Removed:
# Random Baseline Model for Climate Disinformation Classification

## Model Description

This is a random baseline model for the Frugal AI Challenge 2024, specifically for the text classification task of identifying climate disinformation. The model serves as a performance floor, randomly assigning labels to text inputs without any learning.

### Intended Use

- **Primary intended uses**: Baseline comparison for climate disinformation classification models
- **Primary intended users**: Researchers and developers participating in the Frugal AI Challenge
- **Out-of-scope use cases**: Not intended for production use or real-world classification tasks

## Training Data

The model uses the QuotaClimat/frugalaichallenge-text-train dataset:
- Size: ~6000 examples
- Split: 80% train, 20% test
- 8 categories of climate disinformation claims

### Labels

0. No relevant claim detected
1. Global warming is not happening
2. Not caused by humans
3. Not bad or beneficial
4. Solutions harmful/unnecessary
5. Science is unreliable
6. Proponents are biased
7. Fossil fuels are needed

## Performance

### Metrics

- **Accuracy**: ~12.5% (random chance with 8 classes)
- **Environmental Impact**:
  - Emissions tracked in gCO2eq
  - Energy consumption tracked in Wh

### Model Architecture

The model implements a random choice between the 8 possible labels, serving as the simplest possible baseline.

## Environmental Impact

Environmental impact is tracked using CodeCarbon, measuring:
- Carbon emissions during inference
- Energy consumption during inference

This tracking helps establish a baseline for the environmental impact of model deployment and inference.

## Limitations

- Makes completely random predictions
- No learning or pattern recognition
- No consideration of input text
- Serves only as a baseline reference
- Not suitable for any real-world applications

## Ethical Considerations

- Dataset contains sensitive topics related to climate disinformation
- Model makes random predictions and should not be used for actual classification
- Environmental impact is tracked to promote awareness of AI's carbon footprint

Added:
# Audio Classification Model for Detecting Illegal Deforestation

## General Information

The aim of this model is to detect illegal deforestation from audio clips. Our objective is to make this AI system as frugal as possible.

When you are new to AI for audio processing and look for information on the Internet, the methodology often described is: transform the audio signal into a spectrogram (a 2D image) and then have it analyzed by a CNN. That can be necessary for very precise tasks such as transcription, but it is too heavy for our task, which is simply to detect chainsaw noises. So, for our baseline, we preprocessed the data with an MFCC transform, turned the 2D output into a 1D vector by taking the mean of each feature over time, and finally applied a basic ML classification algorithm such as Random Forest.
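To make that baseline concrete, here is a minimal sketch, assuming librosa and scikit-learn; the file names, labels, and MFCC count are placeholders, not the actual challenge code:

```python
# Minimal sketch of the baseline pipeline: MFCC -> mean over time -> Random Forest.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def extract_features(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load a clip, compute MFCCs, then mean-pool over time into a 1D vector."""
    y, sr = librosa.load(path, sr=None)                      # keep native sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    return mfcc.mean(axis=1)                                 # shape: (n_mfcc,)

# Placeholder training set: 1 = chainsaw, 0 = background forest sounds.
train_paths = ["chainsaw_001.wav", "forest_001.wav"]
train_labels = [1, 0]

X_train = np.stack([extract_features(p) for p in train_paths])
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, train_labels)
```

Mean-pooling the MFCC matrix over time is what turns the 2D output into the small 1D feature vector mentioned above.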

Then, we tried to optimize the two stages separately: the data preprocessing (to make it simple and quick, and to produce the smallest possible preprocessed dataset) and the ML model itself. At this point, we noticed that preprocessing the data (the MFCC transformation) was consuming 12 times more energy than training and inference of our model, so we mostly worked on optimizing the data preprocessing. Here are our main ideas (sketched in code after the list):

1. The data preprocessing:
   * Compare different methods of audio feature extraction
   * Decrease the size of the data to analyze: lower the sampling rate; keep only a small 3 s sample of the initial audio, since we do not really need several seconds to identify a sound; and remove unnecessary data (plots show that the characteristic sounds of chainsaws are observed between 200 Hz and 1000 Hz)
   * Decrease the number of features extracted

2. The ML model:
   * Use the most lightweight model: avoid neural networks and compare basic ML classification algorithms. For example, KNN consumes 3 times less energy than Random Forest, with a non-significant loss of accuracy.
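A hedged sketch of the preprocessing ideas combined, assuming librosa; the sampling rate, clip length, MFCC count, and frequency band follow the text above, while the function name and the use of fmin/fmax (forwarded by librosa.feature.mfcc to the underlying mel spectrogram) are illustrative choices:

```python
# Sketch of the frugal preprocessing: resample, crop, and band-limit before MFCC.
import numpy as np
import librosa

def light_features(path: str, n_mfcc: int = 7, target_sr: int = 6000,
                   clip_seconds: float = 3.0) -> np.ndarray:
    # Resample at load time and keep only the first 3 s of audio.
    y, sr = librosa.load(path, sr=target_sr, duration=clip_seconds)
    # Restrict the mel filterbank to the band where chainsaws stand out.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                fmin=200.0, fmax=1000.0)
    return mfcc.mean(axis=1)
```

Resampling and truncating at load time means every downstream step runs on far fewer samples, which is where most of the preprocessing cost goes.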

## Submitted models

Model 1:
* Data preprocessing: resampling to 6000 Hz and using librosa's mfcc method to calculate 7 MFCCs
* Energy consumption for the processing of the training dataset: 0.005349 kWh
* Model: KNN
* Energy consumption for the training of the model: 0.000002 kWh

Model 2:
* Data preprocessing: resampling to 6000 Hz and using librosa's mfcc method to calculate 10 MFCCs
* Energy consumption for the processing of the training dataset: 0.005648 kWh
* Model: Random Forest
* Energy consumption for the training of the model: 0.000326 kWh

The second model has better accuracy but is less energy efficient.
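Per-stage figures like these suggest wrapping each stage in a CodeCarbon tracker. A sketch under that assumption, with fake placeholder data; the attribute used to read energy in kWh is our assumption about recent CodeCarbon releases and is worth checking against the installed version:

```python
# Sketch of per-stage energy tracking in the CodeCarbon style.
import numpy as np
from codecarbon import EmissionsTracker
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(1000, 7)           # placeholder 7-MFCC mean features
y_train = np.random.randint(0, 2, 1000)     # placeholder binary labels

tracker = EmissionsTracker()
tracker.start()
KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
emissions_kg = tracker.stop()               # kg CO2eq for this stage
energy_kwh = tracker.final_emissions_data.energy_consumed  # kWh (assumed attribute)
print(f"{emissions_kg:.6f} kg CO2eq, {energy_kwh:.6f} kWh")
```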

## Other avenues for optimization

We learned a lot about all the possible optimizations, but we unfortunately did not keep all of them for the final submission, as they led to an excessive loss of accuracy (down to ~85%). They are described below, with a sketch after the list:
* Use a less complex method of audio feature extraction (for example, the spectral centroid is 5 times faster to compute than MFCCs)
* Focus our analysis on the frequencies that really matter for chainsaws (per our plots, roughly 200 Hz to 1000 Hz, rather than the full 0 to 20,000 Hz range)
* Use less data (shorter audio), i.e. randomly keep only 0.2 s of each audio clip in the dataset before extracting features with MFCC, and likewise take only a small sample of each test clip when predicting its label
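
A sketch of the two cheapest (and ultimately discarded) ideas combined, assuming librosa; the 0.2 s crop length comes from the text, while the function name and defaults are illustrative:

```python
# Sketch of the discarded ideas: a random 0.2 s crop plus a cheaper feature
# (spectral centroid) instead of MFCCs.
import numpy as np
import librosa

def tiny_features(path: str, crop_seconds: float = 0.2,
                  target_sr: int = 6000) -> np.ndarray:
    y, sr = librosa.load(path, sr=target_sr)
    # Randomly keep only 0.2 s of the clip before extracting features.
    crop = int(crop_seconds * sr)
    start = np.random.randint(0, max(1, len(y) - crop))
    y = y[start:start + crop]
    # Spectral centroid is one value per frame, far cheaper than MFCCs.
    cent = librosa.feature.spectral_centroid(y=y, sr=sr)  # shape: (1, n_frames)
    return cent.mean(axis=1)                              # shape: (1,)
```

Mean-pooling the centroid leaves a single scalar per clip, which helps explain the accuracy drop that kept these ideas out of the final submission.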
models/audio_classification_baseline.pkl
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7a27a9a671a920660995bc08b255e17449427f018402ceec81710a0ae93cb612
size 36073945
tasks/audio.py
CHANGED

@@ -15,7 +15,7 @@ load_dotenv()
 
 router = APIRouter()
 
-DESCRIPTION = "Knn audio classification"
+DESCRIPTION = "Model 1 : Knn audio classification"
 ROUTE = "/audio"
 
 
@@ -42,9 +42,8 @@ async def evaluate_audio(request: AudioEvaluationRequest):
     dataset = load_dataset(request.dataset_name, token=os.getenv("HF_TOKEN"))
 
     # Split dataset
-    train_test = dataset["train"]
-
-    test_dataset = train_test["test"]
+    train_test = dataset["train"]
+    test_dataset = dataset["test"]
 
     # Start tracking emissions
     tracker.start()