RawBench - LLM Prompt Evaluation Framework | Open Source
RawBench is a minimal, powerful framework for Large Language Model (LLM) prompt evaluation, with YAML-first configuration, tool execution support, and comprehensive result tracking. It is built for developers who need systematic, reproducible prompt evaluation without heavy setup.
Key features include multi-model testing with simultaneous comparisons, tool-call mocking with recursive support, dynamic variable injection for flexible prompt customization, and an interactive React dashboard for visualizing results. The framework also tracks metrics such as latency, token usage, and overall performance.
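To give a sense of the YAML-first workflow, a configuration might look roughly like the sketch below: a prompt template, variables injected into it, and the models to compare. The field names (`models`, `prompts`, `variables`) and the model identifiers are illustrative assumptions, not RawBench's documented schema.

```yaml
# Hypothetical RawBench config sketch -- field names and model IDs are
# illustrative assumptions, not the framework's actual schema.
models:
  - openai/gpt-4o-mini
  - anthropic/claude-3-haiku

prompts:
  - id: support-reply
    template: |
      You are a support agent for {{product}}.
      Answer the customer's question: {{question}}

variables:
  product: "Acme CRM"
  question: "How do I reset my password?"
```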
- YAML-based configuration for declarative setup
- CLI-native workflow with Python API support
- No complex setup, with extensible tool mocking (see the sketch after this list)
- Real-time dashboard with side-by-side model comparisons
- MIT licensed with Python 3.8+ support
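Tool mocking could be configured along the lines of the following sketch, again with invented keys (`tools`, `mock`, `then`) that are assumptions rather than RawBench's documented schema: a mocked `search_docs` tool returns a canned payload, and a nested entry covers a recursive follow-up call the model might make.

```yaml
# Hypothetical tool-mock sketch -- the keys below are assumptions,
# not RawBench's documented schema.
tools:
  - name: search_docs
    mock:
      response: '{"results": ["Password reset is under Settings > Security."]}'
      # Nested mock for a follow-up (recursive) tool call the model may issue
      then:
        - name: open_page
          mock:
            response: '{"content": "Step-by-step reset instructions..."}'
```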
Well suited to prompt optimization, A/B testing, model performance comparison, and other LLM evaluation workflows across different models and configurations.