AI & ML interests

retrieval augmented generation, grounded generation, large language models, LLMs, question answering, chatbot

Recent Activity

ofermend updated a Space 4 days ago
vectara/cfpb-assistant
ofermend updated a Space 4 days ago
vectara/ev-assistant
ofermend updated a Space 4 days ago
vectara/finance-assistant

vectara's activity

ofermend
posted an update 8 months ago
If you are a debate fan, or debated as an extracurricular activity as a kid, you might have fun with this demo, debate-bot. Debate against an AI backed by RAG:

vectara/debate-bot
nthakur
posted an update 8 months ago
🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥

The SWIM-IR dataset contains three subsets:
- Cross-lingual: nthakur/swim-ir-cross-lingual
- Monolingual: nthakur/swim-ir-monolingual
- Indic Cross-lingual: nthakur/indic-swim-ir-cross-lingual

Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7
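To poke at the data, a minimal sketch with the 🤗 datasets library (the "de" config name is an assumption; check each dataset card for the actual subset and config names):

```python
from datasets import load_dataset

# Monolingual subset: query and passage are in the same language.
# "de" is an assumed config name; see the dataset card for the real list.
mono = load_dataset("nthakur/swim-ir-monolingual", "de", split="train")

# Cross-lingual subset: queries in the target language, passages in English.
cross = load_dataset("nthakur/swim-ir-cross-lingual", "de", split="train")

print(mono[0])  # one synthetic (query, passage) training pair
```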
clefourrier
posted an update 8 months ago
In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark medical LLMs and follow advances in the field before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini!
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm
clefourrier
posted an update 8 months ago
Contamination-free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means you can average model scores over only those problems published after a model's training cutoff, i.e., problems outside its training data. This means... contamination-free code evals! 🚀
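The mechanism is simple enough to sketch: keep each problem's publication date alongside its result, and average only over problems newer than a given cutoff. A hypothetical illustration in Python (made-up records, not LiveCodeBench's actual code):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProblemResult:
    problem_id: str
    release_date: date   # when the problem was published
    passed: bool         # did the model's solution pass all tests?

def pass_rate_after(results: list[ProblemResult], cutoff: date) -> float:
    """Average pass rate over problems published after `cutoff`, i.e.,
    problems that cannot appear in the training data of a model trained earlier."""
    recent = [r for r in results if r.release_date > cutoff]
    if not recent:
        raise ValueError("no problems published after the cutoff")
    return sum(r.passed for r in recent) / len(recent)

# Hypothetical results for a model whose training data ends September 2023:
results = [
    ProblemResult("two-sum-variant", date(2023, 5, 1), True),   # excluded
    ProblemResult("graph-paths", date(2023, 11, 12), False),    # counted
    ProblemResult("interval-merge", date(2024, 2, 3), True),    # counted
]
print(pass_rate_after(results, date(2023, 9, 30)))  # 0.5
```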

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!
clefourrier
posted an update 8 months ago
🆕 Evaluate your RL agents - who's best at Atari? 🏆

The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations 🚶 and more)!

When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨
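Under the hood, this kind of evaluation boils down to rolling the agent out and averaging episodic returns. A minimal sketch with Gymnasium and a random policy standing in for a submitted agent (the leaderboard's actual harness, with video capture, is more involved):

```python
import gymnasium as gym
import numpy as np

def evaluate(env_id: str, policy, n_episodes: int = 10, seed: int = 0) -> float:
    """Roll out `policy` for n_episodes and return the mean episodic return."""
    env = gym.make(env_id)
    returns = []
    for ep in range(n_episodes):
        obs, info = env.reset(seed=seed + ep)
        done, total = False, 0.0
        while not done:
            action = policy(obs)  # the agent picks an action from the observation
            obs, reward, terminated, truncated, info = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    env.close()
    return float(np.mean(returns))

# Random policy as a stand-in for a trained agent:
env = gym.make("CartPole-v1")
random_policy = lambda obs: env.action_space.sample()
print(evaluate("CartPole-v1", random_policy))
```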

Kudos to @qgallouedec for creating and maintaining the leaderboard!
Let's find out which agent is the best at games! 🚀

open-rl-leaderboard/leaderboard
clefourrier
posted an update 9 months ago
Fun fact about evaluation, part 2!

How much do scores change depending on prompt format choice?

Using different prompts (all present in the literature, from the bare `Prompt question?` to the full template `Question: prompt question?\nChoices: <enumeration of all choices>\nAnswer:`), we get a score range of...

10 points for a single model!
Keep in mind that we only changed the prompt, not the evaluation subsets, etc.
Again, this confirms that evaluation results reported without their details are basically bullshit.

(In the attached figure: prompt format on the x axis. All these evals look at the logprob of either the full choice text, "choice A"/"choice B"/..., or just the letter, "A"/"B"/....)

Incidentally, it also changes model rankings - so a "best" model might only be best on one type of prompt...
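For concreteness, here is a minimal sketch of that logprob-based scoring with two prompt formats side by side. gpt2 is a small stand-in model, and real eval harnesses are more careful about tokenization at the prompt/continuation boundary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the study used larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of token logprobs of `continuation` given `prompt`.
    Assumes the prompt tokenizes to a prefix of prompt+continuation."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # logits at position pos-1 predict the token at position pos
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total

question = "What is the capital of France?"
choices = ["Paris", "London"]

# Two formats from the literature: bare question vs. full template.
formats = {
    "bare": f"{question}\n",
    "templated": f"Question: {question}\nChoices: {', '.join(choices)}\nAnswer: ",
}
for name, prompt in formats.items():
    scores = {c: continuation_logprob(prompt, c) for c in choices}
    print(name, max(scores, key=scores.get), scores)
```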