johko committed on
Commit
bc2330d
·
1 Parent(s): dc33f2f

Update README.md

Files changed (1)
  1. README.md +19 -0
README.md CHANGED
@@ -1,3 +1,22 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
+
5
+ # CapDec - NoiseLevel: 0.015
6
+
7
+ These are model weights originally provided by the authors of the paper [Text-Only Training for Image Captioning using Noise-Injected CLIP](https://arxiv.org/pdf/2211.00575.pdf).
8
+
9
+ Their method aims to train a CLIP-based captioning model using only text samples. To do so, they inject zero-mean Gaussian noise of a given standard deviation (STD) into the text embeddings before decoding.
10
+
11
+ In their words:
12
+ *Specifically, we assume that the visual embedding corresponding to a text embedding
13
+ lies somewhere within a ball of small radius around the text embedding (see Fig. 1).
14
+ We would like all text embeddings in this ball to decode to the same caption, which should
15
+ also correspond to the visual content mapped to this ball. We implement this intuition by
16
+ adding zero-mean Gaussian noise of STD to the text embedding before decoding it.*
17
+
18
+ The "Noise Level" of 0.015 refers to the noise variance, which is the square of the STD.
19
+
20
+ The reported metrics come from a model with a noise variance of 0.016, which the authors unfortunately do not provide in their repository.
21
+ This model, with a noise variance of 0.015, is the closest available pre-trained model to their best model.
22
+
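The relationship between the noise variance and the STD described in the README can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `noise_std` and `inject_noise` and the 4-dimensional toy vector are hypothetical, and a real CLIP text embedding would be a much larger tensor. The sketch only shows that a variance of 0.015 corresponds to an STD of about 0.1225, and what adding zero-mean Gaussian noise to each embedding component looks like.

```python
import math
import random

NOISE_VARIANCE = 0.015  # the "Noise Level" named in the README


def noise_std(variance):
    """Return the standard deviation for a given noise variance."""
    return math.sqrt(variance)


def inject_noise(embedding, variance, rng=None):
    """Add independent zero-mean Gaussian noise to each embedding component.

    Hypothetical helper for illustration; not taken from the CapDec code.
    """
    if rng is None:
        rng = random.Random()
    std = noise_std(variance)
    return [x + rng.gauss(0.0, std) for x in embedding]


# Hypothetical 4-dimensional stand-in for a CLIP text embedding.
text_embedding = [0.5, -0.5, 0.5, -0.5]
noisy_embedding = inject_noise(text_embedding, NOISE_VARIANCE, random.Random(0))

print(round(noise_std(NOISE_VARIANCE), 4))  # 0.1225
```

Passing a seeded `random.Random` keeps the sketch reproducible; in an actual training loop the noise would presumably be resampled for every text sample.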