
Conversational Intelligence for Contact Center Assistants

Work done during my time as Principal Data Scientist at Fidelity Investments.

Shyam Subramanian


  • Curated a high-quality dataset from ~27M calls through a multi-step stratified sampling pipeline across call, rep, and customer metadata attributes
  • Designed curriculum-style continuous pre-training and fine-tuning of a foundation model on financial knowledge, internal policies, and call data
  • Built a multi-turn Conversational RAG pipeline with script-based dialog management constructed from real call data
  • Evaluated through a 10-week expert evaluation study with 20 contact center representatives, achieving 4.6/5 human rating and 87% acceptance rate
  • Delivered two foundational enterprise assets: a Fidelity-specific domain-adapted LLM and a multi-turn conversational framework

Contact centers across Fidelity handle millions of calls annually, but the institutional knowledge embedded in those calls is largely untapped. Large language models are powerful but context-blind out of the box: they lack company-specific knowledge, customer history, and the operational expertise that experienced representatives build over years. Customer call interactions capture exactly this knowledge in raw form: how representatives navigate internal tools and systems, how they handle complex compliance scenarios, and how company policies play out in real customer interactions. That institutional knowledge lives in the calls but is never systematically extracted or used to power AI systems.

  • Established the data strategy for the project by designing the feature-engineering, sampling, filtering, and complexity-stratification pipeline across ~27M calls to curate a high-quality call dataset
  • Owned the training design for continuous pre-training and fine-tuning of foundation models on financial knowledge, the curated call dataset, and internal policies and FAQs
  • Designed and led the multi-turn Conversational RAG pipeline, including script-based dialog management, powered by the custom-tuned foundation models
  • Contributed to broader team explorations in audio-LLM research, benchmarking speech-to-text and text-to-speech models, and an end-to-end audio-enabled conversational agent framework

The first step was establishing the data foundation. Curating a high-quality dataset from ~27M calls required a multi-step stratified sampling pipeline that balanced across a rich set of call, rep, and customer metadata attributes, joined data across disparate internal tables, and aggregated interactions within each call. Statistical and LLM-generated features were engineered from transcripts to assess call quality and complexity, filtering out low-quality calls and enabling a curriculum-style training progression.
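The stratification step can be sketched as grouping calls by their metadata and sampling a fixed budget from each group, so that high-volume call types do not crowd out rare ones. This is a minimal illustration; the field names (`topic`, `complexity`) and the per-stratum budget are hypothetical, not the actual pipeline's schema.

```python
import random
from collections import defaultdict

def stratified_sample(calls, strata_keys, per_stratum, seed=42):
    """Group call records by metadata strata and sample a fixed
    number from each group, so rare call types survive sampling."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for call in calls:
        # The stratum is the tuple of this call's metadata values,
        # e.g. (topic, complexity tier).
        buckets[tuple(call[k] for k in strata_keys)].append(call)
    sample = []
    for group in buckets.values():
        k = min(per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample

# Toy corpus: a common stratum dominates in volume.
calls = [
    {"id": i, "topic": t, "complexity": c}
    for i, (t, c) in enumerate(
        [("401k", "low")] * 5 + [("401k", "high")] * 2 + [("ira", "low")] * 3
    )
]
subset = stratified_sample(calls, ["topic", "complexity"], per_stratum=2)
```

With three strata and a budget of two per stratum, the sample contains six calls regardless of the original volume skew.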

With the dataset in place, we adapted a foundation model through continuous pre-training and fine-tuning, drawing on research into effective continual pre-training strategies. Balancing new domain abilities against catastrophic forgetting, we exposed the model to financial regulatory documents, internal policies, and the curated call corpus in a structured progression, followed by task-specific fine-tuning on internal FAQ and Q&A datasets to ground the model in company-specific knowledge.
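Curriculum-style ordering can be approximated by scoring each training example for complexity and partitioning the sorted examples into phases, training on simpler material first. A rough sketch follows; the scoring function and phase count here are illustrative assumptions, not the project's internal schedule.

```python
def build_curriculum(examples, score_fn, n_phases=3):
    """Order training examples from simple to complex and split them
    into phases for a staged continual pre-training schedule."""
    ranked = sorted(examples, key=score_fn)
    phase_size = -(-len(ranked) // n_phases)  # ceiling division
    return [ranked[i:i + phase_size] for i in range(0, len(ranked), phase_size)]

# Hypothetical complexity score: longer calls with more speaker
# turns are treated as harder.
score = lambda ex: ex["num_turns"] * ex["avg_turn_len"]
examples = [
    {"id": "a", "num_turns": 4, "avg_turn_len": 10},
    {"id": "b", "num_turns": 20, "avg_turn_len": 30},
    {"id": "c", "num_turns": 8, "avg_turn_len": 12},
]
phases = build_curriculum(examples, score, n_phases=3)
```

Each phase can then be interleaved with replay of general-domain data, one common way to mitigate catastrophic forgetting during continual pre-training.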

Alongside model training, we explored retrieval-augmented generation as a complementary approach to surfacing company knowledge at inference. While effective for single-turn factual lookups, real contact center conversations are inherently multi-turn and context-dependent. Integrating the domain-trained model into the RAG pipeline improved response quality, but navigating the varied and branching nature of real conversations remained a challenge. Flat scripts extracted directly from transcripts failed to capture diverse conversational flows, leading us to construct hierarchical conversational trees from the call corpus instead. These trees mapped how conversations branch across topics and subtopics, and scripts were derived from them. The final system combined the domain-trained model, the RAG pipeline, and script-based dialog management grounded in the conversational trees, and was validated through a pilot testing phase.
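The tree-to-script idea can be sketched as a small recursive structure: each node is a topic or subtopic mined from calls, and every root-to-leaf path becomes one linear script the dialog manager can follow. The node labels below are invented for illustration.

```python
class DialogNode:
    """One node in a hierarchical conversational tree: a topic or
    subtopic observed in the call corpus."""
    def __init__(self, label):
        self.label = label
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

def derive_scripts(node, path=None):
    """Flatten the tree into scripts: each root-to-leaf path becomes
    one linear dialog script."""
    path = (path or []) + [node.label]
    if not node.children:
        return [path]
    scripts = []
    for child in node.children:
        scripts.extend(derive_scripts(child, path))
    return scripts

# Hypothetical tree: one topic branching into two subtopic flows.
root = DialogNode("retirement")
withdrawal = root.add(DialogNode("withdrawal"))
withdrawal.add(DialogNode("verify_identity"))
withdrawal.add(DialogNode("confirm_amount"))
root.add(DialogNode("rollover")).add(DialogNode("explain_options"))
scripts = derive_scripts(root)
```

A flat script corresponds to a single path; the tree makes the branch points explicit, which is what flat extraction from transcripts was missing.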

Beyond the core system, I also contributed in a limited capacity to the team's broader exploration of audio-enabled conversational AI, testing speech-to-text models including Whisper and NVIDIA Parakeet, text-to-speech systems such as MeloTTS and Fish Speech, audio-native language models including Voxtral and Qwen-Audio, and experimenting with LiveKit as an end-to-end audio agent framework.

The system was evaluated through a 10-week expert evaluation study with 20 contact center representatives. Evaluation cases were stratified across call types and topics to ensure broad coverage. For each case, representatives were shown a real call transcript alongside the system's retrieved results and generated responses, evaluating up to three turns per call. They were asked to reorder retrieved results by relevance, rate the overall response quality, and provide reasons when responses could not be used as-is. This multi-dimensional feedback design gave us signal on both retrieval quality and response acceptability simultaneously, grounded in real call contexts rather than synthetic scenarios.
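The expert reorderings directly support a mean reciprocal rank (MRR) metric: for each case, take the reciprocal of the position of the first relevant result and average across cases. A minimal sketch, with made-up relevance judgments:

```python
def mean_reciprocal_rank(rankings):
    """MRR over evaluation cases. Each ranking is the list of
    retrieved results in ranked order, with True marking a
    relevant result; the reciprocal of the first relevant
    position is averaged across cases."""
    total = 0.0
    for ranked in rankings:
        for pos, is_relevant in enumerate(ranked, start=1):
            if is_relevant:
                total += 1.0 / pos
                break
    return total / len(rankings)

# Hypothetical judgments for three cases.
cases = [
    [True, False, False],   # relevant result at rank 1 -> 1.0
    [False, True, False],   # rank 2 -> 0.5
    [False, False, True],   # rank 3 -> 1/3
]
mrr = mean_reciprocal_rank(cases)
```

An MRR of 0.70, as reported below, roughly means the first relevant result typically lands between ranks 1 and 2.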

Beyond the pilot results, this project delivered two foundational assets for the enterprise: a first-of-its-kind Fidelity-specific domain-adapted LLM, and a multi-turn conversational framework with potential to enhance RAG-style systems across the company.

Response rating: 4.6/5

Acceptance rate: 87%

Retrieval MRR: 0.70

Quality over quantity. At ~27M calls, the knowledge most useful for grounding an AI system is concentrated in a high-quality subset. Volume alone is not a proxy for value. Careful curation is what makes the data useful.

Engineered features outperform raw embeddings. Using raw transcript embeddings is computationally expensive and impractical at this scale. Moreover, similarity in embeddings of summaries does not capture similarity in conversational flow since two calls can look alike on the surface but follow entirely different paths. Specific, interpretable features are more efficient and more meaningful.
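A flavor of what "specific, interpretable features" means in practice: cheap statistics computed directly from the turn structure of a transcript, rather than an opaque embedding. The feature names below are illustrative, not the project's actual feature set.

```python
def transcript_features(turns):
    """Interpretable features over a call transcript given as a
    list of (speaker, text) turns."""
    rep_words = sum(len(t.split()) for s, t in turns if s == "rep")
    cust_words = sum(len(t.split()) for s, t in turns if s == "customer")
    # Count speaker changes between consecutive turns as a rough
    # proxy for conversational back-and-forth.
    switches = sum(1 for a, b in zip(turns, turns[1:]) if a[0] != b[0])
    return {
        "num_turns": len(turns),
        "rep_talk_ratio": rep_words / max(rep_words + cust_words, 1),
        "speaker_switches": switches,
    }

turns = [
    ("customer", "I want to roll over my 401k"),
    ("rep", "Sure, I can help with that today"),
    ("rep", "Can you verify your date of birth"),
    ("customer", "Sure"),
]
feats = transcript_features(turns)
```

Features like these are linear-time to compute at corpus scale and keep the complexity-stratification logic auditable, which embeddings of summaries do not.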

Curriculum-style training improves domain adaptation. Exposing the model to progressively more complex calls, rather than training on the full dataset uniformly, produced more stable and effective domain adaptation, with better retention of general capabilities alongside new domain knowledge.

Flat scripts cannot capture conversational complexity. Extracting scripts directly from transcripts at the topic and subtopic level failed to represent the varied branching paths within real conversations. Hierarchical conversational trees were necessary to faithfully capture how conversations actually evolve, and deriving scripts from those trees produced better dialog management.

Two foundational enterprise assets. Beyond the pilot results, the project delivered organization-wide impact: a domain-adapted LLM with Fidelity-specific knowledge that can serve as a stronger base model for future AI initiatives, and a multi-turn conversational framework with potential to enhance RAG-style systems across the company.

  • Evolve the system from a standalone conversational chatbot into a real-time call assistant that surfaces suggestions and retrieves information alongside representatives during live calls.
  • Integrate dynamic customer context through AI agents and function calling, enabling the system to pull real-time customer data.
  • Benchmark the conversational RAG pipeline against intent-based, rule-driven dialog systems that are currently the industry standard for contact center assistants.

  • Snowflake: Large-scale data extraction, filtering, and sampling across ~27M calls.
  • Dask: Distributed feature engineering and parallel processing across call transcripts.
  • Milvus: Vector store for retrieval-augmented generation.
  • LangGraph: Conversational RAG pipeline orchestration and dialog management.
  • DeepSpeed: Distributed training for continuous pre-training and fine-tuning across LLaMA-family models.
  • vLLM: High-throughput inference serving for the domain-adapted model.
