Commit a9df757 (parent: 347756a): some ai notes

NOTES.md (added)
# Video Model Training Notes

## Training Step Analysis

### What happens in a training step?

A training step processes **exactly `batch_size` samples** (not the entire dataset). Here's what happens:

**Per Training Step:**
- Processes `batch_size` videos/samples (configurable, typically 1-8)
- Uses smart batching that groups videos by resolution dimensions
- Two data streams: text embeddings + video latents

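To make the two data streams concrete, here is a rough sketch of what a single batch might contain. The key names (`prompt_embeds`, `latents`) and the tensor shapes are illustrative assumptions, not the trainer's actual schema.

```python
import torch

batch_size = 4

# Hypothetical batch layout: key names and shapes are placeholders, not the real schema.
batch = {
    # Text stream: precomputed text-encoder embeddings, shape (batch, seq_len, dim)
    "prompt_embeds": torch.randn(batch_size, 128, 2048),
    # Video stream: VAE-encoded video latents, shape (batch, channels, frames, height, width)
    "latents": torch.randn(batch_size, 16, 13, 60, 90),
}
```
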
**Key Points:**
- With 100 videos and batch_size=4: each step processes 4 videos
- Training runs for a fixed number of steps (not epochs)
- Dataset loops infinitely, so videos are reused across steps
- Uses ResolutionSampler to batch videos of similar dimensions together

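The snippet below is only a sketch of the idea behind resolution-aware batching (bucket samples by their temporal and spatial dimensions and emit a batch once a bucket fills); it is not the actual ResolutionSampler API.

```python
from collections import defaultdict

def resolution_bucketed_batches(samples, batch_size):
    """Group samples by (frames, height, width) so each batch has matching dimensions.
    Illustrative re-implementation of the idea, not the real ResolutionSampler."""
    buckets = defaultdict(list)
    for sample in samples:
        key = tuple(sample["latents"].shape[-3:])  # (frames, height, width)
        buckets[key].append(sample)
        if len(buckets[key]) == batch_size:
            yield buckets.pop(key)  # emit a full, dimension-matched batch
    # note: partially filled buckets are simply dropped in this simplified sketch
```
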
**Training Loop Structure:**
1. Load next `batch_size` samples from dataset
2. Group by resolution (spatial + temporal dimensions)
3. Forward pass through transformer (denoising)
4. Calculate loss and update weights
5. Increment step counter

So if you have 100 videos and batch_size=1, step 1 processes video 1, step 2 processes video 2, etc. When it reaches video 100, it loops back to video 1.

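A minimal sketch of that loop, assuming a PyTorch-style dataloader that already yields resolution-matched batches of precomputed embeddings and latents; the function names and the simplified noise-prediction objective are assumptions, not the trainer's exact code.

```python
import itertools
import torch
import torch.nn.functional as F

def train(dataloader, transformer, optimizer, train_steps=10_000):
    """Step-based training sketch: the dataset is cycled indefinitely and
    training stops after a fixed number of steps, not after N epochs."""
    data_iter = itertools.cycle(dataloader)      # loop over the dataset forever
    for step in range(train_steps):
        batch = next(data_iter)                  # exactly batch_size samples
        latents = batch["latents"]
        noise = torch.randn_like(latents)        # simplified: single noise level
        pred = transformer(latents + noise, batch["prompt_embeds"])
        loss = F.mse_loss(pred, noise)           # denoising objective
        loss.backward()                          # calculate gradients
        optimizer.step()                         # update weights
        optimizer.zero_grad()
        # the step counter is `step`; validation/checkpointing would hook in here
```
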
## Avoiding Overfitting

For video model training, a good rule of thumb is to limit how many times each video is seen during training (roughly 10-50 repeats at most) to avoid overfitting.

**Common thresholds:**
- **Conservative**: <10 times per video (strong generalization)
- **Moderate**: 10-50 times per video (balanced)
- **Risky**: >100 times per video (likely overfitting)

**With low learning rates (e.g., 0.00004):**
- A lower LR means each video can potentially be seen more times safely
- But it is still better to err on the side of caution

**Practical calculation:**
- If training for 10,000 steps with batch_size=1:
  - 100 videos = 100 times each (risky)
  - 500 videos = 20 times each (moderate)
  - 1,000+ videos = 10 times each or fewer (conservative)

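The repeat count follows directly from steps, batch size, and dataset size; a small helper (hypothetical, just to show the arithmetic):

```python
def repeats_per_video(train_steps: int, batch_size: int, num_videos: int) -> float:
    """Average number of times each video is seen, assuming the dataset is cycled uniformly."""
    return train_steps * batch_size / num_videos

print(repeats_per_video(10_000, 1, 100))    # 100.0 -> risky
print(repeats_per_video(10_000, 1, 500))    # 20.0  -> moderate
print(repeats_per_video(10_000, 1, 1_000))  # 10.0  -> at the conservative boundary
```
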
**Early stopping indicators:**
- Training loss continues decreasing but validation loss plateaus/increases
- Generated videos start looking too similar to training examples
- Loss of diversity in outputs

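The first indicator can be automated with a simple patience check on validation loss; the sketch below is a generic pattern with assumed thresholds, not something the training script necessarily provides.

```python
def should_stop_early(val_losses, patience=5, min_delta=1e-4):
    """Stop when validation loss has not improved by at least `min_delta`
    over the last `patience` evaluations, even if training loss keeps falling."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before - min_delta
```
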
With low learning rates, staying under 20-30 times per video should be relatively safe, but <10 times is ideal for strong generalization.