Search: [llm-evaluation] - Biapy Web Directory

Factorio Learning Environment https://jackhopkins.github.io/factorio-learning-environment/

Sun Mar 16 16:43:29 2025

A non-saturating, open-ended environment for evaluating LLMs in Factorio.

Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, that tests agents in long-term planning, program synthesis, and resource optimization.

Factorio Learning Environment @ GitHub.

Chatbot Arena https://lmarena.ai/

Tue Mar 4 14:03:00 2025

Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots.

Related contents:

Evals are not all you need @ Andrew Marble.

DeepEval https://docs.confident-ai.com/

Fri Feb 21 13:45:40 2025

The Open-Source LLM Evaluation Framework.

DeepEval is a simple-to-use, open-source LLM evaluation framework for evaluating and testing large language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, which use LLMs and various other NLP models that run locally on your machine for evaluation.
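To illustrate what "Pytest-style unit testing of LLM outputs" can look like, here is a minimal sketch in plain Python. It deliberately does not use the DeepEval API: `LLMCase`, `keyword_relevancy`, and `assert_relevant` are hypothetical names invented for this example, and the keyword-overlap score is a crude stand-in for a real answer-relevancy metric, which would typically call an LLM or another NLP model.

```python
from dataclasses import dataclass


@dataclass
class LLMCase:
    """Hypothetical test case: a prompt and the model output under test."""
    question: str
    answer: str


def _words(text: str) -> set[str]:
    # Lowercase the text and strip basic punctuation from each token.
    return {w.strip(".,?!").lower() for w in text.split()}


def keyword_relevancy(case: LLMCase) -> float:
    """Toy metric (an assumption, not DeepEval's): the fraction of
    question words longer than 3 characters that reappear in the answer."""
    keywords = {w for w in _words(case.question) if len(w) > 3}
    if not keywords:
        return 0.0
    return len(keywords & _words(case.answer)) / len(keywords)


def assert_relevant(case: LLMCase, threshold: float = 0.5) -> None:
    """Fail like a unit test when the score falls below the threshold."""
    score = keyword_relevancy(case)
    assert score >= threshold, f"relevancy {score:.2f} < {threshold}"


def test_capital_answer() -> None:
    # Pytest would collect this function because of the test_ prefix.
    case = LLMCase(
        question="What is the capital of France?",
        answer="The capital of France is Paris.",
    )
    assert_relevant(case)
```

Saved in a file such as `test_llm.py`, this would run under `pytest` like any other test. A framework like DeepEval replaces the toy metric with research-backed ones and adds reporting on top of the same test-case-plus-threshold pattern.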

DeepEval @ GitHub.
