Update app.py
app.py
CHANGED
```diff
@@ -171,18 +171,18 @@ with tab1:
     # st.markdown('<div class="title">Leaderboard</div>', unsafe_allow_html=True)
     st.markdown('<div class="tab-content">', unsafe_allow_html=True)
 
-    st.markdown('Metrics Explanation')
-    st.markdown(
+    st.markdown('# Metrics Explanation')
+    st.markdown("""
     <div class="metric">
     <br/>
     <p style="font-size:16px;">
-    <strong> Factual Precision </strong> measures the
+    <strong> Factual Precision </strong> measures the ratio of supported units divided by all units averaged over model responses. <strong> Hallucination Score </strong> quantifies the incorrect or inconclusive contents within a model response, as described in the paper. We also provide statistics on the average number of units labelled as unsupported (<strong>Avg. # Unsupported</strong>), the average number of units labelled as undecidable (<strong>Avg. # Undecided</strong>), the average length of the response in terms of the number of tokens, and the average verifiable units existing in the model responses (<strong>Avg. # Units</strong>).
     </p>
     <p style="font-size:16px;">
     π for closed LLMs; π for open-weights LLMs; π¨ for newly added models"
     </p>
     </div>
-
+    """,
     unsafe_allow_html=True
     )
 
```
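For reference, the changed call is just Streamlit's `st.markdown` given an HTML string plus `unsafe_allow_html=True`. Below is a minimal, self-contained sketch of that pattern; it assumes a plain script outside the app's `with tab1:` block, drops the app's CSS classes, and shortens the HTML body, so it is an illustration rather than the exact code in `app.py`:

```python
# Minimal sketch of the pattern introduced by this change:
# a markdown heading followed by an HTML block rendered with
# unsafe_allow_html=True. Run with: streamlit run sketch.py
import streamlit as st

st.markdown('# Metrics Explanation')
st.markdown(
    """
    <div class="metric">
    <br/>
    <p style="font-size:16px;">
    <strong>Factual Precision</strong> measures the ratio of supported units
    divided by all units, averaged over model responses.
    </p>
    </div>
    """,
    unsafe_allow_html=True,
)
```

Using a triple-quoted string here (as the new lines 175 and 185 do) lets the multi-line HTML live inside a single `st.markdown` call instead of the unterminated call the old lines 174-175 left behind.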