Abstract
Tool-using AI agents are increasingly deployed for coding, browser automation, research assistance, and support workflows. In practice, however, many teams still evaluate these systems through isolated prompts, one-off demos, or broad benchmark references that provide little basis for deployment decisions.
This paper presents a lightweight workflow for evaluating agent behavior: it begins with scenario design, continues through explicit definition of expected behaviors and failure modes, and ends with an operational scorecard that helps teams judge rollout readiness.