Jellyfish042 committed
Commit 55fafcc · 1 parent: cb9e32d

2025-04 30B+

app.py CHANGED

@@ -64,43 +64,9 @@ model_size_to_file_name = {
     "Other": "other",
 }
 
-about_md = """
-# Uncheatable Eval
-
-GitHub page: [https://github.com/Jellyfish042/uncheatable_eval](https://github.com/Jellyfish042/uncheatable_eval)
-
-## Introduction
-Traditional LLM benchmarks are easily compromised by unintentional or intentional data leakage, making many benchmarks unreliable and unable to truly reflect the capabilities of LLMs.
-
-Uncheatable Eval addresses this issue by testing LLMs on real-time, newly generated data from the internet,
-ensuring that the evaluation is immune to data leaks and cannot be gamed.
-
-## How?
-Uncheatable Eval assesses the language modeling capabilities of LLMs on new data from various sources such as recent papers on arXiv, new projects on GitHub, news articles, and more. Since this data is brand new (e.g., from the past 1-2 weeks), it is impossible for these data to be included in the training sets of publicly released models, thus avoiding the impact of unintentional or intentional data leaks.
-
-Specifically, we calculate the sum of negative log probabilities of the models on these texts. In other words, models that are more likely to generate these texts are considered better.
-
-*Note* : Uncheatable Eval only tests base models.
-
-## Q&A
-### Why Calculate the Sum of Negative Log Probabilities?
-First, the goal of language models, at least today's language models, is to generate text that is as realistic as possible, maximizing the probability of real text. They are trained and designed to do exactly this. Calculating the sum of negative log probabilities on real text is the most direct way to test this capability.
-
-Second, from the perspective of "compression is intelligence," a good way to test a language model would be to use the model with an entropy coding algorithm for compression and test the model's compression rate [[1]](https://arxiv.org/abs/2309.10668)[[2]](https://arxiv.org/abs/2402.00861). A model with a lower compression rate is considered better. Using a language model + arithmetic coding as an example, it is easy to prove that a model's ability to compress a piece of text is proportional to the sum of its negative log probabilities on that text (see [proof](#proof-of-the-equivalence-between-compression-capability-and-negative-log-probability-sum)).
-Therefore, the compression rate of a model can be directly calculated through the sum of negative log probabilities, and the method for this has been provided in `show_results_v2.ipynb`.
-### Can Models Using Different Tokenizers Be Directly Compared?
-Yes. When calculating the sum of negative log probabilities, we essentially treat the model + tokenizer as a single entity or system. As long as this system has a high probability of generating real text, we consider it better. From the perspective of compression, you can choose any tokenizer. From the compression rate perspective, we don't care; we only care about whether your system can compress the text more effectively.
-
-### Is It Really Uncheatable? Can't I train my model on a large number of arXiv papers to improve its test performance on arXiv papers?
-Uncheatable Eval's data sources currently include new arXiv papers, new GitHub projects, BBC news, AO3 fanfictions, and new Wikipedia entries, with more sources to be added in the future. If you genuinely achieve excellent results across these data by training extensively on these sources, I would consider you to have developed a genuinely good language model rather than cheating.
-
-From my test results, accurately modeling these data is very challenging. I believe Uncheatable Eval more accurately reflects the value of every bit of data and computing you invest compared to other benchmarks. Models trained with more data and computing are almost always better, and there are no shortcuts. This is a key strength of Uncheatable Eval.
-
-### Is This Too "Random"? Why Consider Random Texts from the Internet as Ground Truth?
-This is why we choose rigorous and verified texts such as arXiv papers and news reports, which typically have better quality. Additionally, a round of Uncheatable Eval evaluates a model over millions of tokens, increasing the reliability of the results.
-
-In fact, the model rankings obtained through Uncheatable Eval are very stable. For instance, the model ranked first in January's data is highly likely to remain first in February, March, April, May, and June, indicating that the data obtained through this method is sufficiently representative.
-"""
+def read_about_md():
+    with open('about.md', 'r', encoding='utf-8') as f:
+        return f.read()
 
 
 def rename_columns(df):

@@ -124,16 +90,6 @@ def get_folders_matching_format(directory):
 
 
 def get_unique_column_names(all_data):
-    # column_names = {}
-    #
-    # for folder_name, files in all_data.items():
-    #     for file_name, sheets in files.items():
-    #         for sheet_name, dataframe in sheets.items():
-    #             for column in dataframe.columns:
-    #                 if column not in ['Name', 'Average (The lower the better)', 'Parameters Count (B)']:
-    #                     column_names[column] = None
-    #
-    # return list(column_names.keys())
 
     return [
         "ao3_\u200benglish",

@@ -143,7 +99,6 @@ def get_unique_column_names(all_data):
         "arxiv_\u200bphysics",
         "github_\u200bcpp",
         "github_\u200bpython",
-        # "ao3_\u200bchinese",
     ]
 
 

@@ -495,12 +450,10 @@ def create_scaling_plot(all_data, period):
     y_values = new_df["Average (The lower the better)"].astype(float).tolist()
     names = new_df["Name"].tolist()
 
-    # compute the axis ranges in log space
     x_min, x_max = np.log10(min(x_values)), np.log10(max(x_values))
     y_min, y_max = np.log10(min(y_values)), np.log10(max(y_values))
 
-    # compute a suitable tick interval
-    x_dtick = (x_max - x_min) / 4  # divide into 5 ticks
+    x_dtick = (x_max - x_min) / 4
     y_dtick = (y_max - y_min) / 4
 
     fig = go.Figure()

@@ -531,7 +484,7 @@ def create_scaling_plot(all_data, period):
             zeroline=False,
             type="log",
             dtick=x_dtick,
-            tickformat=".2f",  # keep two decimal places
+            tickformat=".2f",
             range=[x_min - 0.1, x_max + 0.1],
         ),
         yaxis=dict(

@@ -540,7 +493,7 @@ def create_scaling_plot(all_data, period):
             zeroline=False,
             type="log",
             dtick=y_dtick,
-            tickformat=".2f",  # keep two decimal places
+            tickformat=".2f",
             range=[y_min - 0.1, y_max + 0.1],
             autorange="reversed",
         ),

@@ -590,9 +543,8 @@ last_period = time_list[-1]
 initial_fig = create_scaling_plot(all_data, last_period)
 initial_metric = metric_list[0]
 initial_columns = get_unique_column_names(all_data)
-# initial_columns = initial_columns[:-1]
 initial_colors = ["Average", "Individual Tests"]
-initial_size_range = [0, 15]
+initial_size_range = [0, 40]
 initial_data = update_table(last_period, model_size_list, initial_metric, initial_columns, initial_colors, initial_size_range)
 
 css = """

@@ -615,7 +567,7 @@ table {
 """
 
 TITLE_HTML = '<h1 style="text-align:center"><span style="font-size:1.3em">🏆 LLM Compression Leaderboard</span></h1>'
-SUBTITLE_HTML = "<h1 style='text-align:center'><span style='font-size:0.8em'>Welcome to Uncheatable Eval LLM Compression Leaderboard, where fancy fine-tuning and cheating wont work 🚫; only compute 💻, data 📊, and real innovation 🔥 can prevail!</span></h1>"
+SUBTITLE_HTML = "<h1 style='text-align:center'><span style='font-size:0.8em'>Welcome to Uncheatable Eval LLM Compression Leaderboard, where fancy fine-tuning and cheating won't work 🚫; only compute 💻, data 📊, and real innovation 🔥 can prevail!</span></h1>"
 
 with gr.Blocks(css=css) as demo:
     gr.HTML(TITLE_HTML)

@@ -633,12 +585,6 @@ with gr.Blocks(css=css) as demo:
         color_selector = gr.CheckboxGroup(label="Colored Columns", choices=["Average", "Individual Tests"], value=initial_colors)
         colfilter = gr.CheckboxGroup(label="Data Source", choices=get_unique_column_names(all_data), value=initial_columns)
 
-        # table = gr.Dataframe(
-        #     initial_data,
-        #     column_widths=[130, 50, 50, 35, 35, 35, 35, 35, 35, 35, 35],
-        #     wrap=True,
-        #     max_height=800,
-        # )
         table = gr.HTML(initial_data)
 
         period_selector.change(

@@ -692,7 +638,7 @@ with gr.Blocks(css=css) as demo:
     period_selector_2.change(update_plot, inputs=period_selector_2, outputs=plot)
 
     with gr.Tab("ℹ️ About"):
-        gr.Markdown(about_md)
+        gr.Markdown(read_about_md())
 
     with gr.Tab("🚀 Submit"):
         with gr.Group():
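The relocated About text describes the leaderboard's core metric: the sum of negative log probabilities a model assigns to freshly collected text. As a minimal sketch of that computation, assuming a Hugging Face causal LM (the `neg_log_prob_sum` helper and the `gpt2` model choice are illustrative, not the repository's own scoring code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def neg_log_prob_sum(model, tokenizer, text: str) -> float:
    """Sum of negative log probabilities (in nats) that `model` assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, HF causal LMs return the *mean* cross-entropy
        # over the seq_len - 1 predicted positions.
        mean_nll = model(ids, labels=ids).loss.item()
    return mean_nll * (ids.shape[1] - 1)  # mean -> sum

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative base model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
print(neg_log_prob_sum(lm, tok, "Brand-new text scraped after the training cutoff."))
```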
 
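The About text's compression argument can also be made concrete: an ideal arithmetic coder driven by the model's next-token distribution needs roughly `nll_sum / ln 2` bits for the whole text, so a compression rate follows directly from the same number. A back-of-the-envelope sketch (hypothetical helper, not the exact method provided in `show_results_v2.ipynb`):

```python
import math

def bits_per_byte(nll_sum_nats: float, text: str) -> float:
    # An arithmetic coder guided by the model needs about nll_sum / ln(2)
    # bits for the whole text; normalize by the UTF-8 byte count.
    total_bits = nll_sum_nats / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# e.g. bits_per_byte(neg_log_prob_sum(lm, tok, text), text)
```

Lower is better on both scales, which matches the table's "Average (The lower the better)" column.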
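A side note on the `create_scaling_plot` hunks above: on a Plotly log axis, `dtick` is interpreted in log10 units, so dividing the log-space span by 4 places five evenly spaced ticks across the data, and `tickformat=".2f"` prints the resulting fractional-power tick values with two decimals. A quick sketch of the tick positions this rule produces (the 0.4B-40B parameter span is hypothetical):

```python
import numpy as np

# Hypothetical span: models from 0.4B to 40B parameters.
x_min, x_max = np.log10(0.4), np.log10(40.0)
x_dtick = (x_max - x_min) / 4             # the rule used in the diff above
ticks = x_min + x_dtick * np.arange(5)    # five tick positions in log10 space
print([f"{10 ** t:.2f}" for t in ticks])  # labels as tickformat=".2f" renders them
```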
data/2025-04/14b.xlsx ADDED
Binary file (11.7 kB)

data/2025-04/1b5.xlsx ADDED
Binary file (11.7 kB)

data/2025-04/3b.xlsx ADDED
Binary file (11.7 kB)

data/2025-04/7b.xlsx ADDED
Binary file (11.7 kB)

data/2025-04/9b.xlsx ADDED
Binary file (11.7 kB)

data/2025-04/other.xlsx ADDED
Binary file (12.8 kB)