by patronus-ai
Provides a standardized interface for running powerful LLM system optimizations, evaluations, and experiments via the Model Context Protocol.
Patronus MCP Server delivers a ready‑to‑run API layer that lets developers initialize a Patronus project, execute single or batch evaluations, run dataset‑driven experiments, and query available evaluator information. It abstracts away the underlying LLM calls and evaluation logic, exposing a clean RPC‑style tool set.
Key points:
- API key: supply it with the --api-key flag or the PATRONUS_API_KEY environment variable.
- Tools: initialize, evaluate, batch_evaluate, run_experiment, list_evaluator_info, create_criteria, and custom_evaluate are invoked via the MCP client, passing a serialized Request model.
- Interactive testing: the tests/test_live.py script offers menu-driven execution of single- and batch-evaluation scenarios.
- Custom evaluators: decorate a function with @evaluator and reference it through CustomEvaluatorConfig.
- Asynchronous experiments: the run_experiment call returns a status field and a placeholder string; results become available after the asynchronous processing completes.
- Python version: follow the pyproject.toml configuration; using a recent Python 3.9+ version is recommended.

An MCP server implementation for the Patronus SDK, providing a standardized interface for running powerful LLM system optimizations, evaluations, and experiments.
git clone https://github.com/yourusername/patronus-mcp-server.git
cd patronus-mcp-server
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e .
uv pip install -e ".[dev]"
The server can be run with an API key provided in two ways:
python src/patronus_mcp/server.py --api-key your_api_key_here
export PATRONUS_API_KEY=your_api_key_here
python src/patronus_mcp/server.py
The test script (tests/test_live.py
) provides an interactive way to test different evaluation endpoints. You can run it in several ways:
python -m tests.test_live src/patronus_mcp/server.py --api-key your_api_key_here
export PATRONUS_API_KEY=your_api_key_here
python -m tests.test_live src/patronus_mcp/server.py
The test script provides three test options:
Each test will display the results in a nicely formatted JSON output.
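For reference, the test script drives the server through the standard MCP Python client over stdio. A minimal sketch of that pattern (using the mcp package's client API; the server path and API key flag mirror the commands above and should be adjusted to your environment):

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server as a subprocess and connect over stdio.
    server = StdioServerParameters(
        command="python",
        args=["src/patronus_mcp/server.py", "--api-key", "your_api_key_here"],
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # List the tools exposed by the Patronus MCP server.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())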
from patronus_mcp.server import mcp, Request, InitRequest
request = Request(data=InitRequest(
    project_name="MyProject",
    api_key="your-api-key",
    app="my-app"
))
response = await mcp.call_tool("initialize", {"request": request.model_dump()})
from patronus_mcp.server import Request, EvaluationRequest, RemoteEvaluatorConfig
request = Request(data=EvaluationRequest(
    evaluator=RemoteEvaluatorConfig(
        name="lynx",
        criteria="patronus:hallucination",
        explain_strategy="always"
    ),
    task_input="What is the capital of France?",
    task_output="Paris is the capital of France.",
    task_context=["The capital of France is Paris."],
))
response = await mcp.call_tool("evaluate", {"request": request.model_dump()})
from patronus_mcp.server import Request, BatchEvaluationRequest, AsyncRemoteEvaluatorConfig
request = Request(data=BatchEvaluationRequest(
    evaluators=[
        AsyncRemoteEvaluatorConfig(
            name="lynx",
            criteria="patronus:hallucination",
            explain_strategy="always"
        ),
        AsyncRemoteEvaluatorConfig(
            name="judge",
            criteria="patronus:is-concise",
            explain_strategy="always"
        )
    ],
    task_input="What is the capital of France?",
    task_output="Paris is the capital of France.",
    task_context=["The capital of France is Paris."],
))
response = await mcp.call_tool("batch_evaluate", {"request": request.model_dump()})
import json

from patronus import evaluator
from patronus.experiments import FuncEvaluatorAdapter  # adapter import path may differ by Patronus SDK version
from patronus_mcp import Request, ExperimentRequest, RemoteEvaluatorConfig, CustomEvaluatorConfig
# Create a custom evaluator function
@evaluator()
def exact_match(expected: str, actual: str, case_sensitive: bool = False) -> bool:
    if not case_sensitive:
        return expected.lower() == actual.lower()
    return expected == actual

# Create a custom adapter class
class ExactMatchAdapter(FuncEvaluatorAdapter):
    def __init__(self, case_sensitive: bool = False):
        super().__init__(exact_match)
        self.case_sensitive = case_sensitive

    def transform(self, row, task_result, parent, **kwargs):
        args = []
        evaluator_kwargs = {
            "expected": row.gold_answer,
            "actual": task_result.output if task_result else "",
            "case_sensitive": self.case_sensitive
        }
        return args, evaluator_kwargs
# Create experiment request
request = Request(data=ExperimentRequest(
    project_name="my_project",
    experiment_name="my_experiment",
    dataset=[{
        "input": "What is 2+2?",
        "output": "4",
        "gold_answer": "4"
    }],
    evaluators=[
        # Remote evaluator
        RemoteEvaluatorConfig(
            name="judge",
            criteria="patronus:is-concise"
        ),
        # Custom evaluator
        CustomEvaluatorConfig(
            adapter_class="my_module.ExactMatchAdapter",
            adapter_kwargs={"case_sensitive": False}
        )
    ]
))
# Run the experiment
response = await mcp.call_tool("run_experiment", {"request": request.model_dump()})
response_data = json.loads(response[0].text)
# The experiment runs asynchronously, so results will be pending initially
assert response_data["status"] == "success"
assert "results" in response_data
assert isinstance(response_data["results"], str) # Results will be a string (pending)
Get a comprehensive view of all available evaluators and their associated criteria:
# No request body needed
response = await mcp.call_tool("list_evaluator_info", {})
# Response structure:
{
    "status": "success",
    "result": {
        "evaluator_family_name": {
            "evaluator": {
                # evaluator configuration and metadata
            },
            "criteria": [
                # list of available criteria for this evaluator
            ]
        }
    }
}
This endpoint combines information about evaluators and their associated criteria into a single, organized response. The results are grouped by evaluator family, with each family containing its evaluator configuration and a list of available criteria.
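As an illustration, the grouped response can be walked to see which criteria each evaluator family offers (a sketch based on the structure shown above; each criterion is printed as-is since its exact shape is not documented here):

import json

# Unpack the grouped evaluator/criteria listing from the tool response.
info = json.loads(response[0].text)["result"]
for family, entry in info.items():
    print(f"{family}: {len(entry['criteria'])} criteria")
    for criterion in entry["criteria"]:
        print(f"  {criterion}")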
Creates a new evaluator criteria in the Patronus API.
{
    "request": {
        "data": {
            "name": "my-criteria",
            "evaluator_family": "Judge",
            "config": {
                "pass_criteria": "The MODEL_OUTPUT should contain all the details needed from RETRIEVED CONTEXT to answer USER INPUT.",
                "active_learning_enabled": false,
                "active_learning_negative_samples": null,
                "active_learning_positive_samples": null
            }
        }
    }
}
Parameters:
- name (str): Unique name for the criteria
- evaluator_family (str): Family of the evaluator (e.g., "Judge", "Answer Relevance")
- config (dict): Configuration for the criteria
  - pass_criteria (str): The criteria that must be met for a pass
  - active_learning_enabled (bool, optional): Whether active learning is enabled
  - active_learning_negative_samples (int, optional): Number of negative samples for active learning
  - active_learning_positive_samples (int, optional): Number of positive samples for active learning

Returns:
{
    "status": "success",
    "result": {
        "name": "my-criteria",
        "evaluator_family": "Judge",
        "config": {
            "pass_criteria": "The MODEL_OUTPUT should contain all the details needed from RETRIEVED CONTEXT to answer USER INPUT.",
            "active_learning_enabled": false,
            "active_learning_negative_samples": null,
            "active_learning_positive_samples": null
        }
    }
}
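The tool is called through the same MCP interface as the others; a sketch of the call, passing the documented request shape as a raw dict (the optional active-learning fields are omitted here, and the exact request model name is not shown in this README):

# Send the documented request body as a raw dict.
response = await mcp.call_tool("create_criteria", {
    "request": {
        "data": {
            "name": "my-criteria",
            "evaluator_family": "Judge",
            "config": {
                "pass_criteria": "The MODEL_OUTPUT should contain all the details needed from RETRIEVED CONTEXT to answer USER INPUT.",
                "active_learning_enabled": False,
            },
        }
    }
})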
Evaluates a task output using a custom evaluator function decorated with @evaluator.
{
    "request": {
        "data": {
            "task_input": "What is the capital of France?",
            "task_context": ["The capital of France is Paris."],
            "task_output": "Paris is the capital of France.",
            "evaluator_function": "is_concise",
            "evaluator_args": {
                "threshold": 0.7
            }
        }
    }
}
Parameters:
- task_input (str): The input prompt
- task_context (List[str], optional): Context information for the evaluation
- task_output (str): The output to evaluate
- evaluator_function (str): Name of the evaluator function to use (must be decorated with @evaluator)
- evaluator_args (Dict[str, Any], optional): Additional arguments for the evaluator function

The evaluator function can return:
- bool: Simple pass/fail result
- int or float: Numeric score (pass threshold is 0.7)
- str: Text output
- EvaluationResult: Full evaluation result with score, pass status, explanation, etc.

Returns:
{
    "status": "success",
    "result": {
        "score": 0.8,
        "pass_": true,
        "text_output": "Good match",
        "explanation": "Output matches context well",
        "metadata": {
            "context_length": 1
        },
        "tags": ["high_score"]
    }
}
Example evaluator function:
from typing import List

from patronus import evaluator, EvaluationResult

@evaluator
def is_concise(output: str) -> bool:
    """Simple evaluator that checks if the output is concise"""
    return len(output.split()) < 10

@evaluator
def has_score(output: str, context: List[str]) -> EvaluationResult:
    """Evaluator that returns a score based on context"""
    return EvaluationResult(
        score=0.8,
        pass_=True,
        text_output="Good match",
        explanation="Output matches context well",
        metadata={"context_length": len(context)},
        tags=["high_score"]
    )
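Assuming is_concise above is registered with the server, a custom_evaluate call then follows the documented request shape (a sketch using a raw dict; evaluator_args is omitted because is_concise takes no extra arguments):

# Invoke the registered custom evaluator by name.
response = await mcp.call_tool("custom_evaluate", {
    "request": {
        "data": {
            "task_input": "What is the capital of France?",
            "task_context": ["The capital of France is Paris."],
            "task_output": "Paris is the capital of France.",
            "evaluator_function": "is_concise",
        }
    }
})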
patronus-mcp-server/
├── src/
│   └── patronus_mcp/
│       ├── __init__.py
│       └── server.py
├── tests/
│   ├── test_server.py
│   └── test_live.py
├── pyproject.toml
└── README.md
Define new request models in server.py
:
class NewFeatureRequest(BaseModel):
    # Define your request fields here
    field1: str
    field2: Optional[int] = None
Implement new tool functions with the @mcp.tool()
decorator:
@mcp.tool()
def new_feature(request: Request[NewFeatureRequest]):
    # Implement your feature logic here
    return {"status": "success", "result": ...}
Add corresponding tests:
test_server.py
:
async def test_new_feature():
    request = Request(data=NewFeatureRequest(
        field1="test",
        field2=123
    ))
    response = await mcp.call_tool("new_feature", {"request": request.model_dump()})
    response_data = json.loads(response[0].text)
    assert response_data["status"] == "success"
test_live.py
:
async def test_new_feature(self):
    request = Request(data=NewFeatureRequest(
        field1="test",
        field2=123
    ))
    result = await self.session.call_tool("new_feature", {"request": request.model_dump()})
    await self._handle_response(result, "New feature test")
Add the new test to the interactive test menu in main().
Update the README with:
The test script uses the Model Context Protocol (MCP) client to communicate with the server. It supports:
You can also run the standard test suite:
pytest tests/
To run the server as a module:
python -m src.patronus_mcp.server
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.