<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>llm-evaluation</title>
    <link rel="self" type="application/atom+xml" href="https://links.biapy.com/guest/tags/1116/feed"/>
    <updated>2026-06-28T02:32:33+00:00</updated>
    <id>https://links.biapy.com/guest/tags/1116/feed</id>
            <entry>
            <id>https://links.biapy.com/links/1288</id>
            <title type="text"><![CDATA[Factorio Learning Environment]]></title>
            <link rel="alternate" href="https://jackhopkins.github.io/factorio-learning-environment/" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/1288"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[A non-saturating, open-ended environment for evaluating LLMs in Factorio.

Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, that tests agents in long-term planning, program synthesis, and resource optimization.

- [Factorio Learning Environment @ GitHub](https://github.com/JackHopkins/factorio-learning-environment).]]>
            </summary>
            <updated>2025-08-28T19:31:06+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/1371</id>
            <title type="text"><![CDATA[⚔️ Chatbot Arena]]></title>
            <link rel="alternate" href="https://lmarena.ai/" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/1371"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[⚔️ Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots.

Related contents:

- [Evals are not all you need @ Andrew Marble](https://www.marble.onl/posts/evals_are_not_all_you_need.html).]]>
            </summary>
            <updated>2025-08-28T19:45:08+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/1505</id>
            <title type="text"><![CDATA[DeepEval]]></title>
            <link rel="alternate" href="https://docs.confident-ai.com/" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/1505"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[The Open-Source LLM Evaluation Framework.

DeepEval is a simple-to-use, open-source LLM evaluation framework for evaluating and testing large language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, which use LLMs and various other NLP models that run locally on your machine for evaluation.

- [DeepEval @ GitHub](https://github.com/confident-ai/deepeval).]]>
            </summary>
            <updated>2025-08-28T20:07:27+00:00</updated>
        </entry>
    </feed>
