CoralLeiCN committed on
Commit c1e5656 · 1 Parent(s): dfd6d8a

Update README.md: enhance experiments section with detailed scoring and tool descriptions

Files changed (1)
  1. README.md +13 -9
README.md CHANGED
@@ -48,18 +48,22 @@ uv run gradio app.py
48
  ```
49
 
50
  # experiments
 
51
  | Features | Model | Score (out of 20) | Notes |
52
  | --- | --- | --- | --- |
53
- | Directly call LLM with questions | gemini-flash 2.0 | 1 | direct call |
54
- | + Structured Output: Response Format | gemini-flash 2.0 | 3 | |
55
- | + system prompt from GAIA, slightly modified | gemini-flash 2.0 | 3 | |
56
- | + Agentic LLM with Smolagent & Add “Understand video” tool using gemini-flash. | gemini-flash 2.0 | 1-4 | Results varied due to `Codeagent` output formatting issues. |
57
- | + WebSearch tool | gemini-flash 2.0 | 7 | Greatly improved, still has formatting issues |
58
- | + Change model | gemini-flash 2.5 | 7 | |
59
- | + VisitWebpageTool | gemini-flash 2.5 | 10-12 | cannot see image, mp3, excel file. |
60
  | + DownloadFile & ReadExcelFileBytes | gemini-flash 2.5 | 13 | |
61
- | | | | |
62
-
 
 
 
63
 
64
  # Developer
65
  ## Install pre-commit
 
48
  ```
49
 
50
  # experiments
51
+ Each score is the maximum over a few runs.
52
  | Features | Model | Score (out of 20) | Notes |
53
  | --- | --- | --- | --- |
54
+ | Prompted LLM | gemini-flash 2.0 | 1 | Answer format may be wrong |
55
+ | + Structured Output | gemini-flash 2.0 | 3 | |
56
+ | + System prompt from GAIA, slightly modified | gemini-flash 2.0 | 3 | |
57
+ | + Agentic LLM with smolagents & add “Understand video” tool using gemini-flash | gemini-flash 2.0 | 4 | Results varied due to `CodeAgent` output formatting issues. |
58
+ | + WebSearchTool | gemini-flash 2.0 | 7 | Greatly improved, still has formatting issues |
59
+ | + Change from flash 2.0 to flash 2.5 | gemini-flash 2.5 | 7 | |
60
+ | + VisitWebpageTool | gemini-flash 2.5 | 10-12 | Cannot read image, MP3, or Excel files. |
61
  | + DownloadFile & ReadExcelFileBytes | gemini-flash 2.5 | 13 | |
62
+ | + TranscribeAudioBytes | gemini-flash 2.5 | 14 | |
63
+ | + CodeExecutionTool | gemini-flash 2.5 | 14 | |
64
+ | + WikipediaSearchTool (customised using wikipedia + markdownify) | gemini-flash 2.5 | 15 | Tested both WikipediaRetriever and WikipediaLoader from LangChain, but both performed poorly because they omit table data, which can contain key information. |
65
+ | + Increase thinking budget on flash 2.5 for performance & edit system prompt | gemini-flash 2.5, temperature = 0 for consistency | 15-17 | Stability is the key challenge. Reaches 17 with the GAIA score function, but this assignment uses exact match, so the formatter needs improvement. |
66
+ | + Chess best-move tool & change image-understanding model to gemini-pro 2.5 | gemini-flash 2.5 & gemini-pro 2.5 | 18 | Conversion of digital chessboard images to FEN strings is unstable. |
67
 
68
  # Developer
69
  ## Install pre-commit
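
Note on the scoring rows above: the table says the agent reaches 17 under the GAIA score function but fewer under exact match. The sketch below illustrates why the two can disagree. It assumes a simplified normalizing comparison; the function names are hypothetical, and this is neither the official GAIA scorer nor the assignment's grader.

```python
import re
import string
from typing import Optional


def exact_match(prediction: str, truth: str) -> bool:
    """Strict comparison: any difference in case, spacing, or punctuation fails."""
    return prediction == truth


def _normalize_text(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and remove all whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", "", text)


def _to_number(text: str) -> Optional[float]:
    """Parse a number, ignoring '$', '%', thousands separators, and spaces."""
    cleaned = re.sub(r"[$%,\s]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def lenient_match(prediction: str, truth: str) -> bool:
    """Hypothetical normalizing comparison (an assumption, not the GAIA scorer).

    Numeric ground truths are compared as numbers; everything else is
    compared after text normalization.
    """
    truth_number = _to_number(truth)
    if truth_number is not None:
        return _to_number(prediction) == truth_number
    return _normalize_text(prediction) == _normalize_text(truth)
```

Under these assumptions, an answer such as `"paris."` against a ground truth of `"Paris"` passes `lenient_match` but fails `exact_match`, which is the kind of gap an output formatter would need to close.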