jbilcke-hf HF Staff commited on
Commit
a9df757
·
1 Parent(s): 347756a

some ai notes

Files changed (1)
  1. NOTES.md +53 -0
NOTES.md ADDED
@@ -0,0 +1,53 @@
# Video Model Training Notes

## Training Step Analysis

### What happens in a training step?

A training step processes **exactly `batch_size` samples** (not the entire dataset). Here's what happens:

**Per Training Step:**
- Processes `batch_size` videos/samples (configurable, typically 1-8)
- Uses smart batching that groups videos by resolution dimensions
- Consumes two data streams: text embeddings + video latents

**Key Points:**
- With 100 videos and `batch_size=4`: each step processes 4 videos
- Training runs for a fixed number of steps (not epochs)
- The dataset loops infinitely, so videos are reused across steps
- Uses a `ResolutionSampler` to batch videos of similar dimensions together

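The resolution-based batching above can be sketched as a simple bucketing step. Everything here is illustrative: the sample fields (`num_frames`, `height`, `width`) and the function name are assumptions, not the actual `ResolutionSampler` API.

```python
from collections import defaultdict

def bucket_by_resolution(samples):
    """Group samples so each batch shares identical (frames, height, width).
    Illustrative sketch of resolution-based batching; the real
    ResolutionSampler may differ."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[(s["num_frames"], s["height"], s["width"])].append(s)
    return buckets

samples = [
    {"num_frames": 49, "height": 480, "width": 832},
    {"num_frames": 49, "height": 480, "width": 832},
    {"num_frames": 81, "height": 720, "width": 1280},
]
buckets = bucket_by_resolution(samples)
# two buckets: the first two samples share a resolution, the third stands alone
```

Batching within a bucket guarantees every tensor in a batch has the same shape, so no padding or cropping is needed at collation time.
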
**Training Loop Structure:**
1. Load the next `batch_size` samples from the dataset
2. Group them by resolution (spatial + temporal dimensions)
3. Forward pass through the transformer (denoising)
4. Calculate the loss and update the weights
5. Increment the step counter

So if you have 100 videos and `batch_size=1`, step 1 processes video 1, step 2 processes video 2, and so on. After video 100, it loops back to video 1.

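The loop above can be sketched in a few lines. This is a toy illustration of step-based training over an infinitely cycling dataset, not the trainer's real code; the forward/backward pass is elided.

```python
import itertools

def train(dataset, batch_size, max_steps):
    """Toy step-based training loop: a fixed number of steps,
    each consuming exactly batch_size samples from a dataset
    that cycles forever (illustrative, not the real trainer)."""
    stream = itertools.cycle(dataset)  # dataset loops infinitely
    history = []
    for step in range(1, max_steps + 1):
        batch = [next(stream) for _ in range(batch_size)]  # exactly batch_size samples
        # ... group by resolution, forward pass, loss, weight update ...
        history.append((step, batch))
    return history

# 100 "videos", batch_size=1: step 101 wraps back around to video 1.
history = train(list(range(1, 101)), batch_size=1, max_steps=101)
# history[0] == (1, [1]); history[100] == (101, [1])
```

Note that "epoch" never appears: progress is measured purely in steps, which is why dataset size only matters for how often each video is repeated.
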
## Avoiding Overfitting

For video model training, a good rule of thumb is to let the model see each video **fewer than 10-50 times** during training to avoid overfitting.

**Common thresholds:**
- **Conservative**: <10 exposures per video (strong generalization)
- **Moderate**: 10-50 exposures per video (balanced)
- **Risky**: >100 exposures per video (likely overfitting)

**With low learning rates (e.g., 0.00004):**
- A lower learning rate means you can potentially show each video more times safely
- Even so, it is better to err on the side of caution

**Practical calculation** (training for 10,000 steps with `batch_size=1`, i.e. 10,000 samples seen in total):
- 100 videos = 100 exposures each (risky)
- 500 videos = 20 exposures each (moderate)
- 1,000+ videos = ≤10 exposures each (conservative)

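The arithmetic behind these numbers is just total samples seen divided by dataset size. A minimal helper (the name is illustrative, not part of any trainer) makes it easy to check other configurations:

```python
def exposures_per_video(num_steps: int, batch_size: int, num_videos: int) -> float:
    """Average number of times each video is seen during training:
    (steps * batch_size) / dataset size."""
    return num_steps * batch_size / num_videos

# The three scenarios above (10,000 steps, batch_size=1):
print(exposures_per_video(10_000, 1, 100))    # 100.0 -> risky
print(exposures_per_video(10_000, 1, 500))    # 20.0  -> moderate
print(exposures_per_video(10_000, 1, 1_000))  # 10.0  -> borderline conservative
```

Note that raising `batch_size` scales exposures up proportionally: 10,000 steps at `batch_size=4` over 500 videos is 80 exposures each, not 20.
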
**Early stopping indicators:**
- Training loss keeps decreasing while validation loss plateaus or increases
- Generated videos start looking too similar to training examples
- Loss of diversity in outputs

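The first indicator can be checked automatically. Here is a hedged sketch of a patience-based early-stopping rule; the function name and default thresholds are assumptions, not part of any specific framework:

```python
def should_stop_early(val_losses, patience=3, min_delta=1e-4):
    """Stop when validation loss hasn't improved by at least min_delta
    over the last `patience` evaluations (illustrative helper)."""
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])   # best loss before the window
    recent_best = min(val_losses[-patience:])   # best loss inside the window
    return recent_best > best_before - min_delta

# Still improving -> keep training; plateaued -> stop.
print(should_stop_early([1.0, 0.8, 0.6, 0.5]))        # False
print(should_stop_early([1.0, 0.5, 0.51, 0.52, 0.53]))  # True
```

The other two indicators (memorized-looking samples, loss of diversity) are qualitative and still need a human eyeballing the generated videos.
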
With low learning rates, staying under 20-30 exposures per video should be relatively safe, but fewer than 10 is ideal for strong generalization.