llm-evaluation

Here are 46 public repositories matching this topic...

langfuse / langfuse

🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai gpt observability large-language-models llm prompt-engineering langchain llmops llama-index prompt-management evals llm-evaluation

Updated Apr 29, 2024
TypeScript

Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing framework for LLMs and ML models

Updated Apr 29, 2024
Python

promptfoo / promptfoo

Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.

testing ci evaluation ci-cd cicd prompts evaluation-framework rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Apr 30, 2024
TypeScript

confident-ai / deepeval

The LLM Evaluation Framework

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Apr 30, 2024
Python

Agenta-AI / agenta

The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment all in one place.

prompt-toolkit rag human-annotation large-language-models llm prompt-engineering llms langchain llmops llama-index prompt-management llm-tools llm-framework llm-evaluation rag-evaluation

Updated Apr 29, 2024
Python

relari-ai / continuous-eval

Open-Source Evaluation for GenAI Application Pipelines

information-retrieval evaluation-metrics evaluation-framework rag llmops retrieval-augmented-generation llm-evaluation

Updated Apr 25, 2024
Python

onejune2018 / Awesome-LLM-Eval

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of LLMs (for example, ChatGPT, LLaMA, GLM, Baichuan).

nlp benchmark machine-learning leaderboard evaluation dataset openai llama bert rag awsome-list gpt3 llm awsome-lists chatgpt large-language-model chatglm qwen llm-evaluation

Updated Apr 26, 2024

henry-yeh / Awesome-LLM-in-Social-Science

Awesome papers involving LLMs in Social Science.

social-network simulation-environment policy economics psychology alignment social-science large-language-models llms llm-agent llm-evaluation

Updated Apr 28, 2024

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-metrics evaluation-framework llmops llm-eval llm-ops llm-evaluation llm-evaluation-toolkit

Updated Apr 19, 2024
Python

villagecomputing / superpipe

Superpipe - optimized LLM pipelines for structured data

classification data-extraction structured-data data-labeling llm llm-evaluation llm-optimization

Updated Apr 24, 2024
Python

allenai / CommonGen-Eval

Evaluating LLMs with CommonGen-Lite

evaluation text-generation llm chatgpt gpt-evaluation llama2 llm-evaluation

Updated Mar 21, 2024
Python

raga-ai-hub / raga-llm-hub

Framework for LLM evaluation, guardrails and security

guardrails llmops llm-security llm-evaluation

Updated Mar 10, 2024
Python

Re-Align / just-eval

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

evaluation gpt4 llm llm-eval llm-evaluation llm-evaluation-toolkit

Updated Jan 29, 2024
Python

PetroIvaniuk / llms-tools

A list of LLM Tools & Projects

data-science machine-learning ai chatbots chat-bot llm chatgpt open-source-llm llm-evaluation

Updated Apr 21, 2024

rungalileo / hallucination-index

Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate.

openai rag hallucinations large-language-models llm retrieval-augmented-generation llm-evaluation

Updated Nov 15, 2023

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

metrics good-first-issue llm prompt-engineering generative-ai llmops llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Apr 30, 2024
Python

ChanLiang / CONNER

The implementation for the EMNLP 2023 paper “Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators”

llama factuality hallucinations large-language-models nlg-evaluation chatgpt llm-evaluation emnlp2023

Updated Jan 22, 2024
Python

AntonioGr7 / pratical-llms

A collection of hands-on notebooks for LLM practitioners

quantization llm llm-serving genai llm-training llm-inference llm-evaluation

Updated Apr 22, 2024
Jupyter Notebook

LLM-Evaluation-s-Always-Fatiguing / leaf-playground

A framework for building scenario simulation projects in which human and LLM-based agents can participate, with a user-friendly web UI to visualize the simulation and support for automatic evaluation at the agent action level.

agent automation evaluations agents agent-based-simulation chatgpt llm-evaluation

Updated Feb 22, 2024
Python

intuit-ai-research / DCR-consistency

DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models

consistency summarization blackbox divide-and-conquer-approach hallucinations large-language-models llm llm-evaluation

Updated Jan 11, 2024
Python

