---
datasets:
- homebrewltd/instruction-speech-whispervq-v2
language:
- en
license: apache-2.0
tags:
- sound language model
---
## Caution

This is an intermediate checkpoint.

## Model Details
We have developed and released the [llama3s](https://huggingface.co/collections/homebrew-research/llama3-s-669df2139f0576abc6eb7405) model family, which natively understands both audio and text input.

We continue supervised finetuning from our last checkpoint [homebrewltd/...](...), using WhisperVQ as the audio tokenizer, on 2B tokens from the [Instruction Speech WhisperVQ v2](https://huggingface.co/datasets/homebrewltd/instruction-speech-whispervq-v2) dataset.
**Model developers** Homebrew Research.

**Input** Text and sound.

**Output** Text.

**Model Architecture** Llama-3.

**Language(s):** English.

## Intended Use
**Intended Use Cases** This family is primarily intended for research applications. This version aims to further improve the LLM's sound-understanding capabilities.

**Out-of-scope** The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.
## How to Get Started with the Model

First, we need to convert the audio file to sound tokens.
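The snippet below is a minimal sketch rather than released reference code: it assumes the WhisperVQ quantizer from [WhisperSpeech](https://github.com/collabora/WhisperSpeech) (`whisperspeech.vq_stoks.RQBottleneckTransformer`), 16 kHz input, and the `<|sound_start|>...<|sound_end|>` token template used by the dataset; verify the quantizer checkpoint name against the release.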
```python
# Minimal sketch (assumed API, not released reference code): quantize speech
# into WhisperVQ sound tokens. Checkpoint name and token template are assumptions.
import torch
import torchaudio
from whisperspeech.vq_stoks import RQBottleneckTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
vq_model = RQBottleneckTransformer.load_model(
    "whisper-vq-stoks-medium-en+pl-fixed.model"  # assumed WhisperVQ checkpoint
).to(device)
vq_model.ensure_whisper(device)  # load the underlying Whisper encoder

def audio_to_sound_tokens(audio_path: str) -> str:
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:  # WhisperVQ expects 16 kHz audio
        wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))[0].cpu().tolist()
    tokens = "".join(f"<|sound_{c:04d}|>" for c in codes)
    return f"<|sound_start|>{tokens}<|sound_end|>"
```

Then, we can run inference on the model just as with any other LLM.
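The following is likewise a hedged sketch built on the standard `transformers` generation API; the checkpoint id is taken from the citation in this card, and the chat-template prompt format is an assumption.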
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "homebrewltd/llama3.1-s-2024-08-15"  # checkpoint named in the citation below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Wrap the sound tokens from the previous snippet in the chat template
# (assumed prompt format; adjust to the released tokenizer config).
messages = [{"role": "user", "content": audio_to_sound_tokens("speech.wav")}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

## Training process

**Training Metrics Image**: Below is a snapshot of the training loss curve.

![training_loss](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/Mo_FGQvhkcHl3y1REf76f.png)

### Hardware

**GPU Configuration**: Cluster of 8x NVIDIA H100-SXM-80GB.
**GPU Usage**:
- **Continual Training**: 6 hours.

### Training Arguments

We use the [torchtune](https://github.com/pytorch/torchtune) library for its up-to-date FSDP2 training implementation.

| Parameter | Continual Training |
|----------------------------|-------------------------|
| **Epoch** | 1 |
| **Global batch size** | 128 |
| **Learning rate** | 5e-5 |
| **LR scheduler** | Cosine with warmup |
| **Optimizer** | Adam (torch fused) |
| **Warmup ratio** | 0.01 |
| **Weight decay** | 0.005 |
| **Max sequence length** | 1024 |

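The run itself used torchtune's FSDP2 recipe; purely as an illustration, the optimizer and scheduler rows above correspond to the following standard PyTorch / `transformers` calls (`total_steps` and `model` are placeholders):

```python
# Illustrative mapping of the table above onto standard APIs;
# this is not the actual torchtune recipe used for training.
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 10_000  # placeholder: one epoch's step count depends on dataset size

optimizer = torch.optim.Adam(
    model.parameters(),      # `model` as loaded in the inference snippet
    lr=5e-5,                 # "Learning rate" row
    weight_decay=0.005,      # "Weight decay" row
    fused=True,              # "Adam (torch fused)" row; CUDA only
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.01 * total_steps),  # "Warmup ratio" row
    num_training_steps=total_steps,            # cosine decay over the full run
)
```
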
## Citation Information

**BibTeX:**

```
@misc{homebrew2024llama3s,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-15}
}
```

## Acknowledgement

- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**

- **[Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)**