Ben Burtenshaw committed on
Commit
3379fc5
·
1 Parent(s): 7503ca9
pages/3_🌱 Generate Dataset.py CHANGED
@@ -30,22 +30,28 @@ hub_token = st.session_state.get("hub_token")
30
 
31
  st.divider()
32
 
33
- st.markdown("## 🧰 Pipeline Configuration")
34
 
35
- st.write(
36
- "Now we need to define the configuration for the pipeline that will generate the synthetic data."
37
- )
38
- st.write(
39
- "⚠️ Model and parameter choices significantly affect the quality of the generated data. \
40
- We reccomend that you start with generating a few samples and review the data. Then scale up from there. \
41
- You can run the pipeline multiple times with different configurations and append it to the same Argilla dataset."
42
  )
43
 
 
 
 
44
 
45
  st.markdown("#### 🤖 Inference configuration")
46
 
47
  st.write(
48
- "Add the url of the Huggingface inference API or endpoint that your pipeline should use. You can find compatible models here:"
 
 
 
49
  )
50
 
51
  with st.expander("🤗 Recommended Models"):
@@ -85,27 +91,57 @@ domain_expert_base_url = st.text_input(
85
  value="https://api-inference.huggingface.co/models/microsoft/Phi-3-mini-4k-instruct",
86
  )
87
 
 
 
 
 
88
  st.divider()
89
  st.markdown("#### 🧮 Parameters configuration")
90
 
 
 
 
 
 
 
 
 
 
 
 
91
  self_intruct_num_generations = st.slider(
92
  "Number of generations for self-instruction", 1, 10, 2
93
  )
94
  domain_expert_num_generations = st.slider(
95
  "Number of generations for domain expert response", 1, 10, 2
96
  )
 
 
 
 
 
 
 
97
  self_instruct_temperature = st.slider("Temperature for self-instruction", 0.1, 1.0, 0.9)
98
  domain_expert_temperature = st.slider("Temperature for domain expert", 0.1, 1.0, 0.9)
99
 
 
 
 
 
100
  st.divider()
101
  st.markdown("#### 🔬 Argilla API details to push the generated dataset")
 
 
 
 
102
  argilla_url = st.text_input("Argilla API URL", ARGILLA_URL)
103
  argilla_api_key = st.text_input("Argilla API Key", "owner.apikey")
104
  argilla_dataset_name = st.text_input("Argilla Dataset Name", project_name)
105
  st.divider()
106
 
107
  ###############################################################
108
- # LOCAL
109
  ###############################################################
110
 
111
  st.markdown("## Run the pipeline")
@@ -154,37 +190,40 @@ if all(
154
  )
155
 
156
  st.markdown(
157
- "To run the pipeline locally, you need to have the `distilabel` library installed. You can install it using the following command:"
 
158
  )
159
 
160
  st.code(
161
- f"""
162
-
163
  # Install the distilabel library
164
  pip install distilabel
165
- """
 
166
  )
167
 
168
- st.markdown("Next, you'll need to clone your dataset repo and run the pipeline:")
169
 
170
  st.code(
171
- f"""
172
  git clone https://github.com/huggingface/data-is-better-together
173
  cd data-is-better-together/domain-specific-datasets/pipelines
174
  pip install -r requirements.txt
175
- """
 
 
176
  )
177
 
178
  st.markdown("Finally, you can run the pipeline using the following command:")
179
 
180
  st.code(
181
  f"""
182
- huggingface-cli login
183
  python domain_expert_pipeline.py {hub_username}/{project_name}""",
184
  language="bash",
185
  )
186
  st.markdown(
187
- "👩‍🚀 If you want to customise the pipeline take a look in `pipeline.py` and teh [distilabel docs](https://distilabel.argilla.io/)"
 
188
  )
189
 
190
  st.markdown(
 
30
 
31
  st.divider()
32
 
33
+ st.markdown("## 🧰 Data Generation Pipeline")
34
 
35
+ st.markdown(
36
+ """
37
+ Now we need to define the configuration for the pipeline that will generate the synthetic data.
38
+ The pipeline will generate synthetic data by combining self-instruction and domain expert responses.
39
+ The self-instruction step generates instructions based on seed terms, and the domain expert step generates \
40
+ responses to those instructions. Take a look at the [distilabel docs](https://distilabel.argilla.io/latest/sections/learn/tasks/text_generation/#self-instruct) for more information.
41
+ """
42
  )
43
 
44
+ ###############################################################
45
+ # INFERENCE
46
+ ###############################################################
47
 
48
  st.markdown("#### 🤖 Inference configuration")
49
 
50
  st.write(
51
+ """Add the url of the Huggingface inference API or endpoint that your pipeline should use to generate instruction and response pairs. \
52
+ Some domain tasks may be challenging for smaller models, so you may need to iterate over your task definition and model selection. \
53
+ This is a part of the process of generating high-quality synthetic data, human feedback is key to this process. \
54
+ You can find compatible models here:"""
55
  )
56
 
57
  with st.expander("🤗 Recommended Models"):
 
91
  value="https://api-inference.huggingface.co/models/microsoft/Phi-3-mini-4k-instruct",
92
  )
93
 
94
+ ###############################################################
95
+ # PARAMETERS
96
+ ###############################################################
97
+
98
  st.divider()
99
  st.markdown("#### 🧮 Parameters configuration")
100
 
101
+ st.write(
102
+ "⚠️ Model and parameter choices significantly affect the quality of the generated data. \
103
+ We reccomend that you start with generating a few samples and review the data. Then scale up from there. \
104
+ You can run the pipeline multiple times with different configurations and append it to the same Argilla dataset."
105
+ )
106
+
107
+ st.markdown(
108
+ "Number of generations are the samples that each model will generate for each seed term, \
109
+ so if you have 10 seed terms, 2 instruction generations, and 2 response generations, you will have 40 samples in total."
110
+ )
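Note on the sample count described in the added text: the total is the product of the three quantities, since each seed term yields several instructions and each instruction receives several responses. The sketch below is only an illustration of that arithmetic; the function name `total_samples` and its parameter names are hypothetical and do not appear in the app.

```python
def total_samples(
    num_seed_terms: int,
    instruction_generations: int,
    response_generations: int,
) -> int:
    """Estimate how many samples one pipeline run produces.

    Each seed term yields `instruction_generations` instructions, and each
    instruction receives `response_generations` domain expert responses.
    """
    if min(num_seed_terms, instruction_generations, response_generations) < 0:
        raise ValueError("counts must be non-negative")
    return num_seed_terms * instruction_generations * response_generations


# The example from the help text: 10 seed terms, 2 instruction generations,
# and 2 response generations.
print(total_samples(10, 2, 2))  # 40
```

Because both sliders range from 1 to 10, the count grows quickly with the slider values, which fits the advice to start with a few samples and review them before scaling up.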
111
+
112
  self_intruct_num_generations = st.slider(
113
  "Number of generations for self-instruction", 1, 10, 2
114
  )
115
  domain_expert_num_generations = st.slider(
116
  "Number of generations for domain expert response", 1, 10, 2
117
  )
118
+
119
+ st.markdown(
120
+ "Temperature is a hyperparameter that controls the randomness of the generated text. \
121
+ Lower temperatures will generate more deterministic text, while higher temperatures \
122
+ will add more variation to generations."
123
+ )
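Note on the temperature help text: the effect can be seen in a generic temperature-scaled softmax. This is a minimal sketch of the general technique, not the code that the inference endpoint or distilabel runs; the function name and the logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]

# 0.1 and 0.9 are the slider minimum and the slider default in the app.
for temperature in (0.1, 0.9):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lower temperature almost all probability mass lands on the highest-scoring token, so sampling is close to deterministic. At the higher temperature the mass spreads across the candidates, which adds variation to the generations.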
124
+
125
  self_instruct_temperature = st.slider("Temperature for self-instruction", 0.1, 1.0, 0.9)
126
  domain_expert_temperature = st.slider("Temperature for domain expert", 0.1, 1.0, 0.9)
127
 
128
+ ###############################################################
129
+ # ARGILLA API
130
+ ###############################################################
131
+
132
  st.divider()
133
  st.markdown("#### 🔬 Argilla API details to push the generated dataset")
134
+ st.markdown(
135
+ "Here you can define the Argilla API details to push the generated dataset to your Argilla space. \
136
+ These are the defaults that were set up for the project. You can change them if needed."
137
+ )
138
  argilla_url = st.text_input("Argilla API URL", ARGILLA_URL)
139
  argilla_api_key = st.text_input("Argilla API Key", "owner.apikey")
140
  argilla_dataset_name = st.text_input("Argilla Dataset Name", project_name)
141
  st.divider()
142
 
143
  ###############################################################
144
+ # Pipeline Run
145
  ###############################################################
146
 
147
  st.markdown("## Run the pipeline")
 
190
  )
191
 
192
  st.markdown(
193
+ "To run the pipeline locally, you need to have the `distilabel` library installed. \
194
+ You can install it using the following command:"
195
  )
196
 
197
  st.code(
198
+ body="""
 
199
  # Install the distilabel library
200
  pip install distilabel
201
+ """,
202
+ language="bash",
203
  )
204
 
205
+ st.markdown("Next, you'll need to clone the pipeline code and install dependencies:")
206
 
207
  st.code(
208
+ """
209
  git clone https://github.com/huggingface/data-is-better-together
210
  cd data-is-better-together/domain-specific-datasets/pipelines
211
  pip install -r requirements.txt
212
+ huggingface-cli login
213
+ """,
214
+ language="bash",
215
  )
216
 
217
  st.markdown("Finally, you can run the pipeline using the following command:")
218
 
219
  st.code(
220
  f"""
 
221
  python domain_expert_pipeline.py {hub_username}/{project_name}""",
222
  language="bash",
223
  )
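Note on the run command: the f-string above interpolates `hub_username` and `project_name` directly into the displayed shell command. If those values could ever contain spaces or shell metacharacters, one option would be to quote the arguments first. The helper below is a hypothetical sketch and not part of this commit; `build_run_command` is an invented name.

```python
import shlex


def build_run_command(hub_username: str, project_name: str) -> str:
    """Build the shell command shown to the user, quoting the repo id if needed."""
    repo_id = f"{hub_username}/{project_name}"
    # shlex.join quotes any argument that contains characters unsafe for a shell.
    return shlex.join(["python", "domain_expert_pipeline.py", repo_id])


# Example values are placeholders, not real accounts or projects.
print(build_run_command("alice", "farming"))
```

For ordinary Hub usernames and project names the output matches the f-string version exactly; the quoting only changes the result when an argument needs it.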
224
  st.markdown(
225
+ "👩‍🚀 If you want to customise the pipeline take a look in `domain_expert_pipeline.py` \
226
+ and the [distilabel docs](https://distilabel.argilla.io/)"
227
  )
228
 
229
  st.markdown(