Commit c1e5656 by CoralLeiCN
1 parent: dfd6d8a
Update README.md: enhance experiments section with detailed scoring and tool descriptions
README.md CHANGED
@@ -48,18 +48,22 @@ uv run gradio app.py
 ```
 
 # experiments
+The score is max score after a few runs
 | Features | Model | Score (out of 20) | Notes |
 | --- | --- | --- | --- |
-
-| + Structured Output
-| +
-| + Agentic LLM with Smolagent & Add “Understand video” tool using gemini-flash. | gemini-flash 2.0 |
-| +
-| + Change
-| + VisitWebpageTool | gemini-flash 2.5 | 10-12 |
+| Prompted LLM | gemini-flash 2.0 | 1 | Format could be wrong |
+| + Structured Output | gemini-flash 2.0 | 3 | |
+| + System prompt from GAIA, slightly modified | gemini-flash 2.0 | 3 | |
+| + Agentic LLM with Smolagent & Add “Understand video” tool using gemini-flash. | gemini-flash 2.0 | 4 | Results varied due to `Codeagent` output formatting issues. |
+| + WebSearchTool | gemini-flash 2.0 | 7 | Greatly improved, still has formatting issues |
+| + Change from flash 2.0 to flash 2.5 | gemini-flash 2.5 | 7 | |
+| + VisitWebpageTool | gemini-flash 2.5 | 10-12 | Cannot see image, mp3, excel file. |
 | + DownloadFile & ReadExcelFileBytes | gemini-flash 2.5 | 13 | |
-
-
+| + TranscribeAudioBytes | gemini-flash 2.5 | 14 | |
+| + CodeExecutionTool | gemini-flash 2.5 | 14 | |
+| + WikipediaSearchTool (customised using wikipedia + markdownify) | gemini-flash 2.5 | 15 | Tested both WikipediaRetriever and WikipediaLoader from LangChain, but their performance was limited — those omit table data, which can include key information. |
+| + Increase thinking budget on flash2.5 for performance & Edited system prompt | gemini-flash 2.5, temperature = 0 for consistency. | 15-17 | Stability is the key challenge. Can achieve 17 with GAIA score function, but this assignment is using exact match, so formatter needs improvement. |
+| + chess best move tool & Change Image understanding model to gemini-pro 2.5 | gemini-flash 2.5 & gemini-pro 2.5 | 18 | conversion of digital chessboard images to FEN strings is unstable. |
 
 # Developer
 ## Install pre-commit