Document Extraction for 401-K Client Onboarding

Work done during my time as Senior Data Scientist at Fidelity Investments.

Shyam Subramanian

  • Built and deployed an end-to-end multi-modal document extraction pipeline achieving ~85% accuracy across 80+ fields from complex, varied 401-K business documents
  • Engineered a custom document parsing pipeline combining OCR, layout detection, and configurable heuristics to produce a hierarchical document representation
  • Developed a progressive extraction strategy ranging from lightweight NLP techniques for simple fields to fine-tuned generative AI LLMs for complex multi-section reasoning
  • Replaced a fully manual extraction process, saving ~6 FTE annually and enabling the business to scale client onboarding

Onboarding a 401-K client is a months-long process in which manually extracting information from complex business documents into internal systems is a core bottleneck. This manual process is time-consuming, error-prone, and capacity-dependent, slowing time-to-value for new clients and limiting how many clients the business can onboard at once. The documents are lengthy, spanning scanned and digital PDFs with varied layouts, complex tables, checkboxes, and hierarchical structures that are difficult to process consistently at scale. Key-information extraction systems built on classical OCR pipelines proved brittle against the length, variability, and complexity of these documents, motivating advanced NLP techniques that could tackle the problem with greater flexibility and accuracy.

  • Grew from hands-on Data Scientist into the technical lead of the project over 2-3 years, directing the overall technical direction while remaining deeply hands-on across modeling, experimentation, and system design.
  • Owned end-to-end development of the multi-modal document parsing pipeline, retrieval and ranking system, and LLM fine-tuning workstreams from research and prototyping through to production deployment.
  • Collaborated closely with a cross-functional team of data scientists, software and data engineers, business/quality analysts, while managing stakeholder expectations across a multi-year delivery timeline.

One of the most challenging aspects of the project was building a reliable training dataset. We had access to thousands of documents with labels for final extracted values, but these came with significant limitations: label histories were unavailable when values were revised after client confirmation, page location information was sparse and inconsistent, and multi-page answers had no location annotations at all. Rather than discarding this imperfect data, we treated the available labels as distant supervision signals, cleaning them through template-based heuristics and dropping unreliable ones to reduce noise. To supplement this, we developed a custom document annotation tool that allowed us to collect high-quality labeled data for both field values and granular section & span-level locations.
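The grounding idea behind the cleaning step can be sketched in a few lines. This is a minimal illustration, not the production heuristics: it keeps a label only when its value, after light normalization, can be located in the source text, and treats unmatched labels as too noisy to train on. The field names, document text, and normalization rules are all hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def clean_labels(labels: dict, doc_text: str) -> dict:
    """Keep only labels whose value can be grounded in the document text,
    a stand-in for the template-based heuristics used in practice."""
    norm_doc = normalize(doc_text)
    kept = {}
    for field, value in labels.items():
        if value and normalize(str(value)) in norm_doc:
            kept[field] = value  # grounded: usable as distant supervision
        # otherwise drop it: a revised or unverifiable label would add noise
    return kept

doc = "Plan Name: Acme Corp 401(k) Plan. Effective Date: January 1, 2020."
labels = {"plan_name": "Acme Corp 401(k) Plan",
          "effective_date": "July 1, 2021"}  # revised later, absent from doc
print(clean_labels(labels=labels, doc_text=doc))
# {'plan_name': 'Acme Corp 401(k) Plan'}
```

The real pipeline layered template-specific rules on top of this grounding check, but the principle is the same: prefer fewer, verifiable labels over more, noisier ones.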

The parsing pipeline was built from the ground up to handle the full complexity of real-world 401-K documents. For text extraction, we developed a hybrid approach combining native PDF text extraction with Tesseract OCR to maximize accuracy across varying document quality. Layout analysis models (LayoutLMv2 and YOLO) were used to detect visual objects such as titles, headings, bordered tables, and lists. Hierarchical structure parsing relied primarily on heuristics using bullets, numbering, and indentation from known templates, which covered about 70% of the documents. This approach generalized well for explicitly structured documents but was insufficient for unseen layouts and ambiguous cases, so we trained LayoutLMv2 and YOLO models specifically to overcome these challenges and integrated them alongside the heuristic pipeline.

For table extraction, bordered tables were handled using pdfplumber and Camelot, while borderless tables presented a more significant challenge: existing grid-based approaches broke down due to overlapping column content and fuzzy boundaries. To address this, we developed a configurable OpenCV-based grid detection pipeline that explicitly tolerated layout fuzziness, significantly improving extraction reliability for borderless tables.

The final output was a custom hierarchical tree representation of the document, exportable to JSON, which later evolved into a graph-based structure to capture richer cross-section relationships for downstream retrieval and reasoning.
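The intuition behind tolerant grid detection for borderless tables can be shown without OpenCV. The production pipeline operated on rendered page images; this dependency-free sketch works on word bounding boxes instead, treating any gap in horizontal coverage wider than a configurable tolerance as a column boundary. The coordinates and tolerance are illustrative.

```python
def column_boundaries(word_boxes, gap_tol=3):
    """Infer column separators for a borderless table from word bounding
    boxes (x0, x1). Gaps in horizontal coverage wider than `gap_tol` are
    treated as column boundaries; smaller gaps and overlaps are tolerated."""
    # Mark every x position covered by at least one word.
    covered = set()
    for x0, x1 in word_boxes:
        covered.update(range(int(x0), int(x1) + 1))
    xs = sorted(covered)
    # A run of uncovered positions longer than gap_tol splits columns.
    boundaries = []
    for a, b in zip(xs, xs[1:]):
        if b - a > gap_tol:
            boundaries.append((a + b) / 2)  # midpoint of the gap
    return boundaries

# Two fuzzy columns: words span roughly x in [0, 40] and x in [60, 100].
boxes = [(0, 18), (2, 40), (60, 75), (62, 100)]
print(column_boundaries(boxes))  # [50.0] -- one separator near x=50
```

Raising `gap_tol` merges near-touching columns; lowering it splits them, which is the kind of per-template configurability the real pipeline exposed.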

With the document structure parsed and represented hierarchically, the next challenge was extracting the right information for each of the 80+ fields. Not all fields required the same level of sophistication, and we took a deliberate approach of starting simple and escalating complexity only where justified. Primitive fields such as dates, numbers, and named entities were handled reliably through clustering-based template section identification, regex patterns, and lightweight NLP techniques such as named entity recognition and date extraction. These approaches remain in production today for the fields they handle well, keeping the system efficient where complexity is not needed.
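For primitive fields, the lightweight tier amounts to targeted patterns run over the relevant section. A toy sketch, with patterns deliberately simplified (the real system also used NER models and template-based section identification):

```python
import re

# Simplified patterns: long-form dates and percentages only.
DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)
PERCENT = re.compile(r"\b\d{1,3}(?:\.\d+)?%")

def extract_primitives(section_text: str) -> dict:
    """Pull dates and percentages from a section using plain regexes."""
    return {
        "dates": DATE.findall(section_text),
        "percentages": PERCENT.findall(section_text),
    }

text = "The plan is effective January 1, 2020 with a 3% automatic deferral."
print(extract_primitives(text))
# {'dates': ['January 1, 2020'], 'percentages': ['3%']}
```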

For more complex fields, we progressively escalated modeling sophistication as we encountered harder extraction challenges. We introduced a retrieval layer using Bi-Encoder and Cross-Encoder Sentence Transformer models to identify and rank relevant passages within long hierarchical documents, fine-tuning the Bi-Encoder further as retrieval complexity increased. On top of retrieval, we fine-tuned BERT and T5 for answer extraction, before evolving toward LLM prompting and ultimately fine-tuning Flan-T5-XXL with LoRA for the most challenging fields, those requiring reasoning across multiple sections, handling amendment and addendum overrides, or inferring implicit template-specific business logic that was never explicitly documented.
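The retrieve-then-rerank shape of that pipeline can be shown schematically. In production the first stage was a fine-tuned Bi-Encoder over embeddings and the second a Cross-Encoder; here, stand-in scoring functions (cheap token overlap for recall, then a stricter density-weighted score for precision) keep the sketch dependency-free while preserving the two-stage structure. The passages and query are invented examples.

```python
def bi_encoder_score(query: str, passage: str) -> float:
    """Stand-in for cosine similarity of Bi-Encoder embeddings:
    cheap token overlap, good enough to shortlist candidates."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def cross_encoder_score(query: str, passage: str) -> float:
    """Stand-in for a Cross-Encoder's joint relevance score:
    overlap weighted by passage focus, so shorter, denser passages win."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(p), 1)

def retrieve_then_rerank(query, passages, k=3):
    # Stage 1: cheap scoring over all passages, keep the top-k.
    shortlist = sorted(passages, key=lambda p: bi_encoder_score(query, p),
                       reverse=True)[:k]
    # Stage 2: expensive scoring only over the shortlist.
    return max(shortlist, key=lambda p: cross_encoder_score(query, p))

passages = [
    "Eligibility: employees may enter the plan after one year of service.",
    "The vesting schedule is six-year graded vesting.",
    "Vesting: participants vest 20% per year of service after year two.",
]
query = "what is the vesting schedule"
print(retrieve_then_rerank(query, passages))
```

Swapping the stand-ins for real models does not change this control flow, which is why the retrieval layer could be upgraded incrementally as field difficulty grew.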

  • Extraction accuracy: ~85%
  • Fields extracted: 80+
  • Annual time savings: ~6 FTE

Imperfect labels are still useful if handled carefully. Real-world labeled data is rarely clean. We treated noisy labels as distant supervision signals, combined with weak supervision techniques and a purpose-built annotation tool to collect higher-quality labels incrementally. The annotation tool in particular was essential; without it, getting reliable labeled data at scale was not feasible.

Layout models are powerful but linear. Models like LayoutLM excel at detecting visual objects on a page but are fundamentally limited in capturing deep hierarchical structures that are not visually explicit. Combining them with heuristic pipelines and task-specific trained models was necessary to handle the full complexity of real-world document structures.

Retrieval and extraction complexity scales with reasoning demands. Single-section, single-page extractions are tractable with standard retrieval and discriminative models. The challenge grows significantly when answers require reasoning across multiple distant sections and pages, or when the task involves enumerating a list of items scattered throughout the document. These cases push the limits of retrieval pipelines that optimize for single-passage relevance and discriminative models that extract spans. Generative models have become increasingly capable of handling these complex reasoning demands, and moving toward LLM prompting and fine-tuning was a direct response to this class of problems.

B2B documents are a different problem class. Most document understanding research focuses on forms, invoices, and licenses which are shorter, more structured, and more uniform documents. 401-K business documents are lengthy, deeply hierarchical, varied across clients, and contain implicit domain knowledge that never appears explicitly on the page. Techniques that work well on standard benchmarks often fail to generalize to this class of documents.

Production AI systems are living systems. Over 2-3 years, we learned that shipping a model is not the end; it is the beginning. New document types, client-specific variations, regulatory changes, and edge cases require continuous maintenance, retraining, and adaptation. We employed active learning strategies to systematically collect feedback on low-confidence predictions, prioritizing the most informative examples for human review and annotation and creating a feedback loop that continuously improved model performance over time. Building for maintainability and feedback collection from the start is as important as building for accuracy.
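The selection step of that feedback loop is simple in essence: route the least confident predictions to human review first, within a fixed budget. A minimal sketch, with field names, confidences, and thresholds all illustrative:

```python
def select_for_review(predictions, budget=2, threshold=0.8):
    """Pick the lowest-confidence predictions, up to a review budget.
    `predictions` maps field -> (value, confidence)."""
    uncertain = [(conf, field)
                 for field, (value, conf) in predictions.items()
                 if conf < threshold]
    uncertain.sort()  # least confident first
    return [field for conf, field in uncertain[:budget]]

preds = {
    "plan_name": ("Acme Corp 401(k) Plan", 0.97),
    "vesting_schedule": ("6-year graded", 0.55),
    "auto_enrollment": ("Yes", 0.72),
    "effective_date": ("2020-01-01", 0.91),
}
print(select_for_review(preds))
# ['vesting_schedule', 'auto_enrollment']
```

Annotating the selected fields and folding the corrections back into training is what closes the loop.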

  • Expand the system to handle the full diversity of client onboarding document types, including narrative documents, forms, and structured spreadsheets that still require manual extraction today.
  • Explore visually rich document understanding models such as GPT-4V, Donut, DocOwl, and Qwen-VL as end-to-end alternatives that could simplify the architecture.
  • Evolve extraction from RAG-style retrieval toward a graph-based paradigm where an AI agent navigates a structured document graph, enabling more precise and interpretable extraction for complex multi-section reasoning tasks.
  • Create a public benchmark dataset that reflects the true complexity of B2B business documents, fostering academic research beyond the simpler forms and invoices that dominate existing benchmarks.

  • pdfplumber: Native PDF text and table extraction for digital documents.
  • Tesseract OCR: Open-source OCR engine for text extraction from scanned documents.
  • OpenCV: Checkbox and table detection and extraction.
  • LayoutLM + YOLO: Layout and structure detection for visual object identification and section boundary classification.
  • Sentence Transformers: Bi-Encoder and Cross-Encoder models for passage retrieval and ranking.
  • HuggingFace + PEFT: BERT and T5 fine-tuning, and Flan-T5 parameter-efficient fine-tuning with LoRA.
