lechmazur / Starred · GitHub
Benchmark that evaluates LLMs using 651 NYT Connections puzzles extended with extra trick words

Python · 101 stars · 5 forks · Updated Jun 11, 2025
HTML · 1 star · Updated Apr 4, 2025

Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies among Large Language Models (LLMs) in a resource-sharing econ…

36 stars · 2 forks · Updated Apr 10, 2025

A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other

277 stars · 9 forks · Updated Jun 10, 2025

An LLM public goods game

8 stars · Updated Feb 22, 2025

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a…

54 stars · 2 forks · Updated Jun 6, 2025

Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which ite…

59 stars · 2 forks · Updated Jun 11, 2025

This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, motivations, etc.) into a short creative story

Batchfile · 240 stars · 6 forks · Updated Jun 11, 2025

LLM Divergent Thinking Creativity Benchmark: LLMs generate 25 unique words that start with a given letter, with no connections to each other or to 50 initial random words.

31 stars · 1 fork · Updated Mar 20, 2025

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation …

27 stars · 2 forks · Updated Mar 20, 2025

Hallucinations (Confabulations) Document-Based Benchmark for RAG, including human-verified questions and answers.

HTML · 175 stars · 5 forks · Updated Jun 11, 2025

Estimates the number of legal chess positions

C++ · 12 stars · 1 fork · Updated Jan 14, 2021