RawBench - LLM Prompt Evaluation Framework | Open Source
RawBench is a minimal, powerful framework for Large Language Model (LLM) prompt evaluation, with YAML-first configuration, tool execution support, and comprehensive result tracking. It is built for developers who need systematic, reproducible prompt evaluation without heavy setup.
Key features include multi-model testing with simultaneous comparisons, tool-call mocking with recursive support, dynamic variable injection for flexible prompt customization, and an interactive React dashboard for visualizing results. The framework also tracks metrics such as latency, token usage, and overall performance.
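To give a sense of the YAML-first workflow, a configuration might look roughly like the sketch below: a prompt template, variables injected into it, and the models to compare. The field names (`models`, `prompts`, `variables`) and the model identifiers are illustrative assumptions, not RawBench's documented schema.

```yaml
# Hypothetical RawBench config sketch -- field names and model IDs are
# illustrative assumptions, not the framework's actual schema.
models:
  - openai/gpt-4o-mini
  - anthropic/claude-3-haiku

prompts:
  - id: support-reply
    template: |
      You are a support agent for {{product}}.
      Answer the customer's question: {{question}}

variables:
  product: "Acme CRM"
  question: "How do I reset my password?"
```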
- YAML-based configuration for declarative setup
- CLI-native workflow with Python API support
- No complex setup, with extensible tool mocking (see the sketch after this list)
- Real-time dashboard with side-by-side model comparisons
- MIT licensed with Python 3.8+ support
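Tool mocking could be configured along the lines of the following sketch, again with invented keys (`tools`, `mock`, `then`) that are assumptions rather than RawBench's documented schema: a mocked `search_docs` tool returns a canned payload, and a nested entry covers a recursive follow-up call the model might make.

```yaml
# Hypothetical tool-mock sketch -- the keys below are assumptions,
# not RawBench's documented schema.
tools:
  - name: search_docs
    mock:
      response: '{"results": ["Password reset is under Settings > Security."]}'
      # Nested mock for a follow-up (recursive) tool call the model may issue
      then:
        - name: open_page
          mock:
            response: '{"content": "Step-by-step reset instructions..."}'
```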
Well suited to prompt optimization, A/B testing, model performance comparison, and other LLM evaluation workflows across different models and configurations.