Agentic AI for Service Request Automation

Ongoing work as a Principal Data Scientist at Fidelity Investments.

Shyam Subramanian


  • Built an agentic AI workflow to automate manual service request classification, description generation, and routing
  • Improved service request quality and routing accuracy while reducing manual effort and request resolution time
  • Designed to handle broad and ambiguous request types, noisy and incomplete human inputs, and context-dependent business rules and exceptions
  • Used specialized agent design patterns and multi-agent orchestration while balancing performance and cost
  • Developed with LangGraph; deployed and monitored through LangSmith

Contact Centers across Fidelity create hundreds of thousands of service requests annually on behalf of customers. Even small inefficiencies in classification, creation, and routing compound into significant resolution delays and operational cost. When representatives create requests manually between calls, the process is inconsistent and error-prone, and cuts directly into their handle time metrics.

  • I led this project as Principal Data Scientist, contributing hands-on across subject matter expertise collection, agent development, and validation while driving the team's technical direction.
  • I also collaborated closely with the engineering team on operationalization including containerized deployment and production monitoring, and with business teams on a phased rollout staged across business groups with varying workflows.
  • Beyond this project, I helped establish the internal playbook for agentic AI: how to identify the right use cases, build reliable agents, and scale them across business groups.

The foundation of the system was historical service request data, including the types, subtypes, and descriptions originally selected by Contact Center representatives. The challenge is that these labels are frequently incorrect. To isolate bad labels, we used explicit downstream rejection flags, long resolution times as an implicit low-quality signal, and multi-LLM agreement approaches. These methods did not fully resolve the issue, because the signals are either unavailable for many requests or not highly reliable. While both good and bad labels informed the agent and prompt design, we also had subject matter experts validate the agent qualitatively and quantitatively, and we iteratively collected labeled data based on the agent's outputs.
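The multi-LLM agreement idea can be sketched as a simple vote: if several independent models reach a consensus that contradicts the historical label, the label is flagged as suspect. This is a minimal illustration; the label names and the agreement threshold are illustrative, not the production values.

```python
from collections import Counter

def flag_suspect_label(historical_label: str, model_predictions: list[str],
                       min_agreement: float = 0.5) -> bool:
    """Flag a historical label as suspect when most models disagree with it.

    `model_predictions` holds the request type predicted by each LLM for
    the same request text (names and threshold here are illustrative).
    """
    votes = Counter(model_predictions)
    consensus, count = votes.most_common(1)[0]
    agreement = count / len(model_predictions)
    # The label is suspect when the models reach a consensus among
    # themselves, but that consensus contradicts the historical label.
    return agreement >= min_agreement and consensus != historical_label

# A label that all three models contradict is flagged for review.
print(flag_suspect_label("address_change", ["beneficiary_update"] * 3))  # True
```

When the models themselves disagree, no flag is raised, which matches the caveat above: the signal is only useful where it is reliable.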

We started with a pragmatic baseline: (1) an LLM classifier over call notes and a lightweight intake questionnaire, (2) structured description templates prefilled by LLM based on the predicted type and subtype, and (3) deterministic routing to the right downstream team. It worked well for straightforward unambiguous cases, but noisy inputs, ambiguous request boundaries, and subtle errors together revealed the limits of an LLM-only workflow.
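The three baseline stages compose into a short pipeline. The sketch below stubs the LLM call and uses an illustrative two-type taxonomy and routing table, not the production ones.

```python
# Minimal sketch of the baseline: (1) LLM classification, (2) template
# prefill, (3) deterministic routing. `classify_fn` stands in for the LLM.
TEMPLATES = {
    "address_change": "Customer requests an address change.\nDetails: {notes}",
    "beneficiary_update": "Customer requests a beneficiary update.\nDetails: {notes}",
}
ROUTING = {
    "address_change": "account-maintenance-team",
    "beneficiary_update": "beneficiary-team",
}

def create_service_request(call_notes: str, intake_answers: dict,
                           classify_fn) -> dict:
    # (1) Classification over call notes plus the intake questionnaire.
    request_type = classify_fn(call_notes, intake_answers)
    # (2) Structured description prefilled from a per-type template.
    description = TEMPLATES[request_type].format(notes=call_notes)
    # (3) Deterministic routing keyed on the predicted type.
    return {"type": request_type,
            "description": description,
            "queue": ROUTING[request_type]}

request = create_service_request(
    "Caller moved and needs the mailing address updated.",
    {"account_verified": True},
    classify_fn=lambda notes, answers: "address_change",  # LLM stub
)
print(request["queue"])  # account-maintenance-team
```

The appeal of this shape is that only step (1) is probabilistic; its limits, as noted above, showed up exactly when that single prediction was wrong or under-informed.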

Rather than prompt engineering the LLMs further, we redesigned the workflow around the structure of the problem itself. To handle missing or noisy inputs, we added clarification steps that interactively probed contact center representatives for additional details. To resolve ambiguity between similar request types and subtypes, we introduced lookups of similar historical requests, including both successful and rejected examples, to ground classification decisions in precedent. Reflection and feedback loops helped catch and correct subtle errors before handoff. This evolved into a multi-agent architecture in which a high-level orchestrator coordinated specialized sub-agents for classification, description generation, and routing, with explicit stages for clarification, validation, and reflection throughout.
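The control flow of that redesign can be sketched in plain Python; the production system expresses the same flow as a LangGraph state graph, and every callable below is a stub standing in for an LLM-backed sub-agent.

```python
# Sketch of the orchestrated workflow: clarify -> classify -> describe ->
# reflect -> route. All agent callables are illustrative stubs.
def orchestrate(state: dict, classify, describe, route,
                needs_clarification, ask_rep, reflect) -> dict:
    # Clarification loop: probe the rep until the inputs are sufficient.
    while needs_clarification(state):
        state["notes"] += " " + ask_rep(state)
    # Classification; in the real system this sub-agent also retrieves
    # similar historical requests to ground its decision in precedent.
    state["type"] = classify(state)
    state["description"] = describe(state)
    # Reflection: re-check the draft and correct subtle errors pre-handoff.
    state = reflect(state)
    state["queue"] = route(state)
    return state

result = orchestrate(
    {"notes": "update my address"},
    classify=lambda s: "address_change",
    describe=lambda s: f"Address change request: {s['notes']}",
    route=lambda s: "account-maintenance-team",
    needs_clarification=lambda s: "zip" not in s["notes"],
    ask_rep=lambda s: "new zip is 02210",  # simulated rep answer
    reflect=lambda s: s,
)
print(result["queue"])  # account-maintenance-team
```

Keeping clarification, reflection, and routing as explicit stages, rather than hoping one prompt covers them, is what made the failure modes of the baseline addressable one at a time.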

On the operationalization side, the system was deployed as a containerized service and monitored in production through LangSmith, which provided step-level tracing and alerting to give the team visibility into agent behavior at each stage of the pipeline. User feedback was collected through ratings, thumbs up/down signals, bug reports, and surveys, providing a lightweight but meaningful quality signal that helped surface edge cases and inform ongoing improvements.

  • Routing accuracy: +10%
  • Rework rate: -6%
  • Creation time: -40%

Prompt design directly affects agent autonomy and prediction stability: Small changes in how instructions were framed could make an agent more or less exploratory in unpredictable ways. To stabilize critical classification decisions, we experimented with making parts of the pipeline more deterministic where it mattered most.
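One way we made critical decisions more deterministic can be sketched as self-consistency voting over a fixed taxonomy: sample the classifier several times, reject off-taxonomy outputs, and take the majority. The taxonomy, sample count, and escalation label below are illustrative.

```python
from collections import Counter

VALID_TYPES = {"address_change", "beneficiary_update", "statement_request"}

def stable_classify(sample_fn, k: int = 5) -> str:
    """Majority vote over k classification samples, restricted to the
    taxonomy. `sample_fn` stands in for one LLM classification call;
    names and k are illustrative, not production settings."""
    votes = Counter(label for label in (sample_fn() for _ in range(k))
                    if label in VALID_TYPES)
    if not votes:
        return "needs_human_review"  # nothing on-taxonomy: escalate
    return votes.most_common(1)[0][0]

samples = iter(["address_change", "address_change", "typo_label",
                "address_change", "beneficiary_update"])
print(stable_classify(lambda: next(samples)))  # address_change
```

The vote absorbs both the occasional off-taxonomy output and the run-to-run instability that small prompt changes could otherwise introduce.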

Context engineering is as important as model selection: How context was constructed and passed between steps had an outsized effect on quality. We experimented extensively with summarization, key information extraction, and retrieval to find the right level of context for each stage.
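A stage-aware context builder illustrates the point: each stage sees only what it needs, under a hard budget. The stage names, field choices, and budget are illustrative, not the production prompt layouts.

```python
def build_context(stage: str, call_notes: str, retrieved: list[str],
                  max_chars: int = 500) -> str:
    """Assemble stage-specific context (illustrative sketch).

    Classification sees retrieved precedents; description generation sees
    only the notes it must expand; routing sees a one-line summary.
    """
    if stage == "classify":
        precedents = "\n".join(f"- {r}" for r in retrieved[:3])
        body = f"Similar past requests:\n{precedents}\n\nNotes: {call_notes}"
    elif stage == "describe":
        body = f"Notes: {call_notes}"
    else:  # routing needs only the gist
        body = f"Summary: {call_notes.split('.')[0]}."
    return body[:max_chars]  # hard budget keeps prompts small and cheap
```

The experimentation mentioned above was largely about where each stage sits on this spectrum: full notes, extracted key facts, or a summary.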

Model size is a design decision, not a default: Larger models are not always better. They are slower and more expensive. We learned to match model capability to task complexity, using smaller models for well-defined steps and larger ones where reasoning over ambiguous inputs was required.
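That matching can be written down as an explicit tiering policy. The step names, threshold, and model labels below are placeholders, not the models or settings used in production.

```python
def pick_model(step: str, ambiguity: float = 0.0) -> str:
    """Small model for well-defined steps; large model when reasoning
    over ambiguous inputs is required (illustrative policy)."""
    if step in ("template_fill", "routing"):
        return "small-model"   # narrow, well-specified tasks
    if step == "classification" and ambiguity <= 0.5:
        return "small-model"   # clear-cut classification stays cheap
    return "large-model"       # ambiguous or open-ended reasoning
```

Making the policy explicit also made its cost consequences auditable: every step in a trace records which tier it used and why.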

Know when to explore and when to move on: Deciding when the agent should probe representatives for more information versus proceed to the next stage was a non-trivial design problem. Too much exploration hurt handle time; too little hurt quality. Finding that balance required careful experimentation.
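The balance we converged on can be expressed as a small gating rule: probe only while confidence is low and a question budget remains. The threshold and budget here are illustrative; in practice they were tuned against handle time and quality.

```python
def should_probe(confidence: float, questions_asked: int,
                 max_questions: int = 2, threshold: float = 0.75) -> bool:
    """Probe the rep for more detail only while classification confidence
    is below the threshold AND the question budget is not yet spent.
    Both numbers are illustrative, not production settings."""
    return confidence < threshold and questions_asked < max_questions
```

The budget caps the handle-time cost of exploration; the threshold caps the quality cost of proceeding too early.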

Agentic evaluation requires strategic thinking: We evaluated at three levels: step level to validate individual agent actions, trajectory level to assess the overall reasoning path, and response level to measure final output quality. Each revealed failure modes the others missed.
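The three levels can be sketched as checks over a recorded agent trace. The trace schema and the specific checks are illustrative, not the production evaluators.

```python
def evaluate(trace: dict) -> dict:
    """Three-level evaluation sketch over a recorded trace.

    step:       every individual action produced some output
    trajectory: stages ran in an allowed order (none out of sequence)
    response:   the final request matches an expert ("gold") reference
    """
    expected_order = ["clarify", "classify", "describe", "reflect", "route"]
    steps_ok = all(s["output"] is not None for s in trace["steps"])
    stages = [s["stage"] for s in trace["steps"]]
    # Trajectory check: stages must be a subsequence of the canonical
    # order (clarification and reflection may legitimately be skipped).
    order_iter = iter(expected_order)
    trajectory_ok = all(stage in order_iter for stage in stages)
    response_ok = trace["final"] == trace["gold"]
    return {"step": steps_ok, "trajectory": trajectory_ok,
            "response": response_ok}
```

A trace can pass the response check while failing the trajectory check (right answer by the wrong path), which is exactly the kind of failure mode only one level surfaces.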

  • Close the feedback loop into a data flywheel to continuously retrain and improve the system.
  • Fetch information from real-time call transcription and from tools/APIs that host customer information.
  • Extend intelligent routing deeper into downstream handoffs to reduce misrouting.
  • Use downstream resolution notes to assist representatives in real-time during calls.

LangGraph

Used for stateful agent orchestration, tool routing, and control-flow between sub-agents.

LangSmith

Used for deployment, tracing, and evaluation of agent behavior.

OpenSearch

Used as retrieval infrastructure for looking up similar historical requests.
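As a sketch, the precedent lookup reduces to an OpenSearch k-NN query over embedded requests. The index field name `request_embedding` and the returned fields are illustrative; the query shape follows the OpenSearch k-NN plugin's `knn` clause.

```python
def similar_requests_query(embedding: list[float], k: int = 5) -> dict:
    """Build an OpenSearch k-NN query body for precedent lookup.

    Field and source names are illustrative, not the production schema.
    """
    return {
        "size": k,
        "query": {"knn": {"request_embedding": {"vector": embedding,
                                                "k": k}}},
        # Return both accepted and rejected precedents so the classifier
        # can learn from near-misses as well as successes.
        "_source": ["type", "subtype", "description", "rejected"],
    }
```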

LLMs

Used OpenAI, Anthropic, and LLaMA models as the reasoning and generation backbone.
