Optimizing AI in Insurance: A Deep Dive into Performance Evaluation at Newfront
By Ryan Velazquez | Published February 4, 2025
At Newfront, technology has always been part of our story. Frustrated by the outdated and overly complex insurance-buying process, our co-founders set out to modernize the industry by empowering brokers with purpose-built technology. Recognizing that insurance brokerages are, at their core, data-intensive businesses that process and analyze vast amounts of information, we started in 2017 by building an online platform where clients can see their policies, download their certificates, pay their bills, and more.
The launch of ChatGPT at the end of 2022 marked a turning point for the company. Large language models (LLMs) unlocked new possibilities for automating workflows, processing unstructured data, and building tools that enhance both client and team experiences. Inspired by these capabilities, we oriented our spring 2023 Hackathon around prototyping AI-driven solutions, setting the stage for a new era of innovation in our operations.
The Role of AI at Newfront
Since then, we’ve built and introduced an array of AI-powered products designed to streamline workflows and deliver significant value for our clients. This suite includes Benji, a benefits assistant for employees, compliance review for contracts, and quote ingestion tooling to process data faster and more accurately. We also developed an ethical framework, AI Principles, to ensure responsible development and implementation.
In this article, we’ll discuss the goals of this tooling, its components, why we chose to build it ourselves, and lessons learned along the way.
Performance Evaluation Tooling
As an agile company, we want our software engineers and data scientists to be able to contribute to projects quickly, which requires a shared understanding of how our systems are built for various use cases.
The focus of performance evaluation is to:
Build trust that the AI product is performing up to our standards
Prevent performance declines from new changes
Accelerate testing and refinement of AI models
Our AI performance evaluation tooling has focused on structured data-related products so far. We still rely heavily on manual review for chatbot applications that require evaluating open-ended generative outputs, but we are also exploring automation.
While AI performance evaluation tools exist in the marketplace, none fully met our criteria. We chose to develop our own tool to achieve a high degree of customization—both to tailor the implementation to our product needs and to modify its functionality as requirements evolve. Given the early stage of this field, we’re continuing to monitor developments and remain open to adopting a standardized tool in the future if the landscape changes.
Tooling Components
While we had some tools to help automate evaluations for our structured data projects, we wanted a unified framework to standardize and streamline the process. To do this, we built a toolset consisting of the following components:
Project Framework: A standardized structure for new AI projects that accelerates setup and ensures consistency across initiatives.
Python Library: A collection of classes and functions for performing common tasks in a consistent way. The library lets us string together evaluators that can be reused across projects, and it provides helper functions for things like calculating metrics and confidence intervals (a rough sketch of this pattern follows this list).
Evaluators: We built a set of default “evaluators” that individual projects can use as-is or extend for their particular use case. One of the defaults compares nested dictionaries, providing a sane, consistent baseline for structured outputs. The extensibility is handy if, for instance, one project needs custom fuzzy string matching.
Infrastructure Connections: The framework and library include integrations with document storage (S3 or Google Drive), evaluation persistence for reproducibility (Snowflake DWH and Google Drive), and visualization (Snowflake DWH to Hex). These connections handle the essential plumbing, letting teams focus on product-specific requirements.
Evaluation Visualization: We use a modern data visualization tool called Hex to view evaluation results, compare runs, and dive deep into the analytics. The primary goal of this project was standardization at the code level, but we’ve noticed that this visualization layer has brought the most satisfaction to internal users so far. The ability to see at a glance where one experiment improves or degrades performance is powerful!
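To make the evaluator and metrics pattern above a bit more concrete, here is a minimal sketch of how a default nested-dictionary evaluator, a project-specific extension, and a metrics helper might fit together. All names here (`DictComparisonEvaluator`, `NormalizedStringEvaluator`, `accuracy_with_ci`) are illustrative placeholders, not our actual internal API.

```python
# Illustrative sketch only: class and function names are hypothetical,
# not Newfront's internal library API.
from dataclasses import dataclass
import math


@dataclass
class FieldResult:
    field: str   # dotted path to the field, e.g. "coverage_x.line_a"
    match: bool  # whether the predicted value matched the ground truth


class DictComparisonEvaluator:
    """Default evaluator: compare a predicted nested dict to ground truth, field by field."""

    def values_match(self, predicted_value, expected_value) -> bool:
        # Leaf-level comparison; subclasses can override this hook.
        return predicted_value == expected_value

    def evaluate(self, predicted: dict, expected: dict, prefix: str = "") -> list[FieldResult]:
        results = []
        for key, expected_value in expected.items():
            path = f"{prefix}.{key}" if prefix else key
            predicted_value = predicted.get(key) if isinstance(predicted, dict) else None
            if isinstance(expected_value, dict):
                # Recurse into nested structures, carrying the dotted path along.
                results.extend(self.evaluate(predicted_value or {}, expected_value, prefix=path))
            else:
                results.append(FieldResult(path, self.values_match(predicted_value, expected_value)))
        return results


class NormalizedStringEvaluator(DictComparisonEvaluator):
    """Example of extending the default with looser, project-specific string matching."""

    def values_match(self, predicted_value, expected_value) -> bool:
        if isinstance(predicted_value, str) and isinstance(expected_value, str):
            return predicted_value.strip().lower() == expected_value.strip().lower()
        return super().values_match(predicted_value, expected_value)


def accuracy_with_ci(results: list[FieldResult], z: float = 1.96) -> tuple[float, float]:
    """Helper: overall match rate with a normal-approximation confidence interval."""
    if not results:
        return 0.0, 0.0
    p = sum(r.match for r in results) / len(results)
    margin = z * math.sqrt(p * (1 - p) / len(results))
    return p, margin
```

A project could then run something like `NormalizedStringEvaluator().evaluate(model_output, ground_truth)` on each example and feed the accumulated results into `accuracy_with_ci`, persisting both the field-level results and the summary metrics for later visualization.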
Key Insights from Building Our Evaluation Tooling
Developing our AI performance evaluation tooling has provided critical lessons that continue to shape our approach.
Granular Metrics Enable Precision: Overall performance numbers are useful, but granular metrics at the “field” and “example” levels are essential for pinpointing issues quickly. For example, they let users spot that the AI is struggling on line A of coverage X for policy examples Y and Z.
Comparisons Drive Progress: Comparing different versions of an AI model offers valuable insight into potential improvements, and we can analyze those versions at a detailed level to assess their differences. For example, when testing a new foundation model, we can see how specific fields or examples perform compared to the earlier version (a toy illustration follows this list).
Reduce Friction for Adoption: Automating processes, such as syncing data directly to dashboards, significantly increases user adoption. Early manual upload requirements were a barrier that automation eliminated effectively.
Maintain High-Quality Ground Truth Data: Structuring and managing accurate benchmarks ensures reliable evaluations and keeps continuous refinement manageable.
Augment Auto Evaluation with Manual Testing: Automated evaluations accelerate testing but cannot replace manual reviews. Business experts are critical in validating performance with real-world data before release.
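As a toy illustration of the field-level comparison described above, the snippet below contrasts two hypothetical evaluation runs with pandas. The column names and data are invented for the example and are not our actual schema.

```python
# Hypothetical illustration: compare per-field accuracy between two evaluation runs.
import pandas as pd

runs = pd.DataFrame(
    {
        "run": ["baseline", "baseline", "candidate", "candidate"],
        "field": ["coverage_x.line_a", "coverage_x.line_b"] * 2,
        "match": [True, False, True, True],
    }
)

# Per-field accuracy for each run, then the delta between candidate and baseline.
per_field = runs.groupby(["run", "field"])["match"].mean().unstack("run")
per_field["delta"] = per_field["candidate"] - per_field["baseline"]
print(per_field.sort_values("delta"))
```

In practice this kind of breakdown lives in our Hex dashboards rather than in ad hoc scripts, but the idea is the same: surface exactly which fields improved or regressed between runs.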
Looking Ahead
Our performance evaluation tooling has been instrumental in tracking and improving the performance of our AI products. With an average of 250 evaluation runs per month, these tools build trust in our AI and accelerate the pace of innovation. We’re proud of the impact AI has had on helping our colleagues and clients and are excited to continue launching thoroughly-tested, impactful AI products that enhance our team's and clients' experiences.
Ryan Velazquez
Data Scientist
As a data scientist, Ryan helps Newfront colleagues and clients tackle complex problems with data. With over a decade of experience, he’s built AI models and analytical tools that have improved decision-making across industries. He’s used AI to help the FDA crack down on fraudulent medical devices, developed a patented machine learning model to predict contaminated groundwater plumes, and built tools for epidemiologists—including the CDC—to track the spread of viruses.
Connect with Ryan on LinkedIn