import streamlit as st
import pandas as pd
from PIL import Image
import base64
from io import BytesIO
# Set up page config
st.set_page_config(
page_title="FactBench Leaderboard",
layout="wide"
)
# load header
with open("_header.md", "r") as f:
HEADER_MD = f.read()
# Load the image
image = Image.open("factEvalSteps.png")
logo_image = Image.open("Factbench_logo.png")
# Custom CSS for the page
st.markdown(
"""
""",
unsafe_allow_html=True
)
# Display title and description
st.markdown(
    '<h1 style="text-align: center;">FactBench Leaderboard</h1>'
    '<p style="text-align: center;">Benchmark for LM Factuality Evaluation</p>',
    unsafe_allow_html=True
)
# st.image(logo_image, output_format="PNG", width=200)
# Convert the image to base64
buffered = BytesIO()
logo_image.save(buffered, format="PNG")
img_data = base64.b64encode(buffered.getvalue()).decode("utf-8")
# Embed the logo inline as a base64 data URI (replaces the commented-out st.image call above)
st.markdown(
    f"""
    <div style="text-align: center;">
        <img src="data:image/png;base64,{img_data}" alt="FactBench logo" width="200">
    </div>
    """,
    unsafe_allow_html=True
)
# header_md_text = HEADER_MD # make some parameters later
# gr.Markdown(header_md_text, elem_classes="markdown-text")
st.markdown(
'''
''',
unsafe_allow_html=True
)
# st.markdown('FactBench Leaderboard',
#             unsafe_allow_html=True)
# st.markdown('Benchmark for LM Factuality Evaluation', unsafe_allow_html=True)
st.markdown('', unsafe_allow_html=True)
# Load the data
data_path = "tiered_models_data.csv"
df = pd.read_csv(data_path)
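# Columns used below: tier, model, factuality_score, hallucination_score,
# avg_tokens, avg_factual_units, avg_undecidable_units, avg_unsupported_units.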
# Assign ranks within each tier based on factuality_score
df['rank'] = df.groupby('tier')['factuality_score'].rank(
ascending=False, method='min').astype(int)
# Replace NaN values with '-'
df.fillna('-', inplace=True)
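# Remember each model's position within its tier so the default (unsorted)
# view keeps the row order from the CSV.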
df['original_order'] = df.groupby('tier').cumcount()
# Create tabs
st.markdown("""
""", unsafe_allow_html=True)
tab1, tab2, tab3 = st.tabs(["Leaderboard", "Benchmark Details", "Submit your models"])
# Tab 1: Leaderboard
with tab1:
# df['original_order'] = df.groupby('tier').cumcount()
# print(df['original_order'])
# st.markdown('', unsafe_allow_html=True)
st.markdown('# Metrics Explanation')
st.markdown("""
    Factual Precision is the fraction of supported units among all verifiable units in a response, averaged over model responses. Hallucination Score quantifies the incorrect or inconclusive content within a model response, as described in the paper. We also report the average number of units labelled as unsupported (Avg. # Unsupported), the average number of units labelled as undecidable (Avg. # Undecidable), the average response length in tokens (Avg. # Tokens), and the average number of verifiable units per response (Avg. # Units).
    🔒 for closed LLMs; 🌐 for open-weights LLMs; 🚨 for newly added models
""",
unsafe_allow_html=True
)
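    # Illustrative sketch (assumption: not part of this app) of the Factual
    # Precision definition above: for each response, the fraction of its units
    # labelled 'supported', averaged over responses. Per-unit labels are not in
    # tiered_models_data.csv, which only ships aggregated scores, so this
    # helper is never called here.
    def factual_precision(unit_labels_per_response):
        """unit_labels_per_response: list of per-response label lists, each
        label being 'supported', 'unsupported', or 'undecidable'."""
        per_response = [
            sum(label == 'supported' for label in labels) / len(labels)
            for labels in unit_labels_per_response
            if labels
        ]
        return sum(per_response) / len(per_response) if per_response else 0.0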
    # TODO (Farima): populate this section.
st.markdown("""
""", unsafe_allow_html=True)
# Dropdown menu to filter tiers
tiers = ['All Tiers', 'Tier 1: Hard', 'Tier 2: Moderate', 'Tier 3: Easy']
selected_tier = st.selectbox('Select Tier:', tiers)
# Filter the data based on the selected tier
if selected_tier != 'All Tiers':
filtered_df = df[df['tier'] == selected_tier]
else:
filtered_df = df
sort_by_factuality = st.checkbox('Sort by Factuality Score')
# Sort the dataframe based on Factuality Score if the checkbox is selected
if sort_by_factuality:
updated_filtered_df = filtered_df.sort_values(
by=['tier', 'factuality_score'], ascending=[True, False]
)
else:
updated_filtered_df = filtered_df.sort_values(
by=['tier', 'original_order']
)
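    # Either way the rows remain grouped by tier ('Tier 1: Hard' < 'Tier 2:
    # Moderate' < 'Tier 3: Easy' sorts in that order); the checkbox only
    # reorders models within each tier.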
# Create HTML for the table
if selected_tier == 'All Tiers':
        html = '''
        <table>
            <tr>
                <th>Tier</th>
                <th>Rank</th>
                <th>Model</th>
                <th>Factual Precision</th>
                <th>Hallucination Score</th>
                <th>Avg. # Tokens</th>
                <th>Avg. # Units</th>
                <th>Avg. # Undecidable</th>
                <th>Avg. # Unsupported</th>
            </tr>
        '''
else:
        html = '''
        <table>
            <tr>
                <th>Rank</th>
                <th>Model</th>
                <th>Factual Precision</th>
                <th>Hallucination Score</th>
                <th>Avg. # Tokens</th>
                <th>Avg. # Units</th>
                <th>Avg. # Undecidable</th>
                <th>Avg. # Unsupported</th>
            </tr>
        '''
# Generate the rows of the table
current_tier = None
for i, row in updated_filtered_df.iterrows():
html += ''
# Only display the 'Tier' column if 'All Tiers' is selected
if selected_tier == 'All Tiers':
if row['tier'] != current_tier:
current_tier = row['tier']
html += f'{current_tier} | '
# Fill in model and scores
html += f'''
{row['rank']} |
{row['model']} |
{row['factuality_score']} |
{row['hallucination_score']} |
{row['avg_tokens']} |
{row['avg_factual_units']} |
{row['avg_undecidable_units']:.2f} |
{row['avg_unsupported_units']:.2f} |
'''
# Close the table
html += '''
'''
# Display the table
st.markdown(html, unsafe_allow_html=True)
st.markdown('', unsafe_allow_html=True)
# Tab 2: Details
with tab2:
st.markdown('', unsafe_allow_html=True)
    st.markdown('<h2>Benchmark Details</h2>',
                unsafe_allow_html=True)
st.image(image, use_column_width=True)
st.markdown('### VERIFY: A Pipeline for Factuality Evaluation')
st.write(
"Language models (LMs) are widely used by an increasing number of users, "
"underscoring the challenge of maintaining factual accuracy across a broad range of topics. "
"We present VERIFY (Verification and Evidence Retrieval for Factuality evaluation), "
"a pipeline to evaluate LMs' factual accuracy in real-world user interactions."
)
st.markdown('### Content Categorization')
st.write(
"VERIFY considers the verifiability of LM-generated content and categorizes content units as "
"`supported`, `unsupported`, or `undecidable` based on the retrieved web evidence. "
"Importantly, VERIFY's factuality judgments correlate better with human evaluations than existing methods."
)
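    # Minimal sketch, under the labelling described above, of how the
    # leaderboard's Avg. # Unsupported and Avg. # Undecidable columns could be
    # derived from per-unit labels. The label names come from the text; the
    # actual VERIFY pipeline lives in its own repository, so this helper is
    # only illustrative and unused by the app.
    def count_problematic_units(unit_labels):
        """unit_labels: list of 'supported' / 'unsupported' / 'undecidable'."""
        unsupported = sum(label == 'unsupported' for label in unit_labels)
        undecidable = sum(label == 'undecidable' for label in unit_labels)
        return unsupported, undecidable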
st.markdown('### Hallucination Prompts & FactBench Dataset')
st.write(
"Using VERIFY, we identify 'hallucination prompts' across diverse topicsβthose eliciting the highest rates of "
"incorrect or unverifiable LM responses. These prompts form FactBench, a dataset of 985 prompts across 213 "
"fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and is "
"regularly updated with new prompts."
)
    st.markdown('', unsafe_allow_html=True)
# Tab 3: Links
with tab3:
st.markdown('', unsafe_allow_html=True)
    st.markdown('<h2>Submit your model information on our GitHub</h2>',
                unsafe_allow_html=True)
st.markdown(
'[Test your model locally!](https://github.com/FarimaFatahi/FactEval)')
st.markdown(
'[Submit results or issues!](https://github.com/FarimaFatahi/FactEval/issues/new)')
    st.markdown('', unsafe_allow_html=True)