Automate AI Agent Testing with Managed Evaluations and OTel Traces
AWS · New Product Launch · notable
Briefing for: Engineering
What happened
Amazon Bedrock AgentCore Evaluations is now generally available, providing a managed service for measuring agent performance across session, trace, and tool-call levels. It supports OpenTelemetry (OTel) semantic conventions, allowing for standardized instrumentation across agents built with LangGraph or Strands Agents.
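To make the tool-call level concrete, here is a minimal sketch of the attributes an instrumented agent might attach to a tool-call span under the OTel GenAI semantic conventions. The `gen_ai.operation.name` and `gen_ai.tool.name` attribute keys follow the published conventions; the span name, tool, and call ID values are illustrative, not taken from the announcement.

```python
# Sketch of a tool-call span as a plain dict, mirroring the attribute keys
# defined by the OTel GenAI semantic conventions. The concrete values
# (tool name, call id) are hypothetical examples.
tool_call_span = {
    "name": "execute_tool get_weather",
    "attributes": {
        "gen_ai.operation.name": "execute_tool",  # convention value for tool calls
        "gen_ai.tool.name": "get_weather",        # which tool the agent invoked
        "gen_ai.tool.call.id": "call-0001",       # correlates call and result
    },
}
print(tool_call_span["attributes"]["gen_ai.tool.name"])  # get_weather
```

Because the attribute keys are standardized, evaluators can locate tool selections and parameters in traces without caring whether the agent framework was LangGraph or Strands Agents.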
Why it matters
This replaces manual, non-deterministic testing with systematic measurement using LLM-as-a-Judge, ground truth, or custom Lambda-based evaluators. You can now integrate quality gating directly into CI/CD pipelines to prevent regressions before they reach production.
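The quality gate described above can be sketched as a small check that a CI/CD step runs against evaluation scores. The metric names and thresholds here are assumptions for illustration, not a documented AgentCore schema.

```python
def passes_gate(scores: dict, thresholds: dict) -> bool:
    """Return True only if every gated metric meets or exceeds its threshold.

    A missing metric counts as a failure so an incomplete evaluation run
    cannot slip through the gate.
    """
    return all(scores.get(metric, 0.0) >= minimum
               for metric, minimum in thresholds.items())

# Hypothetical metric names and thresholds for a pipeline gate.
thresholds = {"correctness": 0.85, "tool_selection": 0.90}

print(passes_gate({"correctness": 0.91, "tool_selection": 0.95}, thresholds))  # True
print(passes_gate({"correctness": 0.91, "tool_selection": 0.70}, thresholds))  # False
```

In a pipeline, a `False` result would fail the build step, blocking the regression before deployment.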
What this enables
- If you struggle with inconsistent agent behavior, use the 13 built-in evaluators to pinpoint whether failures occur in tool selection, parameter extraction, or response generation.
- If you need deterministic validation for data-heavy agents, deploy custom code-based evaluators using AWS Lambda to verify exact strings like transaction IDs or prices.
- If you run high-scale production agents, enable online evaluation to sample a percentage of live traffic and store quality scores directly in CloudWatch logs.
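A custom code-based evaluator for deterministic checks, like the exact-string validation in the second bullet, could look roughly like this Lambda-style handler. The event and response field names are assumptions for illustration; they are not the documented AgentCore evaluator contract.

```python
import re

# Hypothetical Lambda handler for a deterministic evaluator: it passes only
# if the agent's response contains exactly the expected transaction ID.
def handler(event, context=None):
    expected = event["ground_truth"]["transaction_id"]   # assumed field name
    response = event["agent_response"]                   # assumed field name
    match = re.search(r"TXN-\d{8}", response)
    passed = match is not None and match.group(0) == expected
    return {"score": 1.0 if passed else 0.0, "passed": passed}

result = handler({
    "ground_truth": {"transaction_id": "TXN-00412387"},
    "agent_response": "Your refund was processed under TXN-00412387.",
})
print(result)  # {'score': 1.0, 'passed': True}
```

Unlike an LLM-as-a-Judge score, this check is binary and reproducible, which suits data-heavy agents where an almost-correct transaction ID or price is still wrong.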
Get personalized AI briefings for your role at Changecast →