Optimizing AI in Insurance: A Deep Dive into Performance Evaluation at Newfront
By Ryan Velazquez | Published February 4, 2025
At Newfront, technology has always been part of our story. Frustrated by the outdated and overly complex insurance-buying process, our co-founders set out to modernize the industry by empowering brokers with purpose-built technology. Recognizing that insurance brokerages are, at their core, data-intensive businesses that process and analyze vast amounts of information, we started in 2017 by building an online platform where clients can see their policies, download their certificates, pay their bills, and more.
The launch of ChatGPT at the end of 2022 marked a turning point for the company. Large language models (LLMs) unlocked new possibilities for automating workflows, processing unstructured data, and building tools that enhance both client and team experiences. Inspired by these capabilities, we oriented our spring 2023 Hackathon around prototyping AI-driven solutions, setting the stage for a new era of innovation in our operations.
The Role of AI at Newfront
Since then, we’ve built and introduced an array of AI-powered products designed to streamline workflows and deliver significant value for our clients. This suite includes Benji, a benefits assistant for employees, compliance review for contracts, and quote ingestion tooling to process data faster and more accurately. We also developed an ethical framework, AI Principles, to ensure responsible development and implementation.
In this article, we’ll discuss the goals of this tooling, its components, why we chose to build it ourselves, and lessons learned along the way.
Performance Evaluation Tooling
As an agile company, we want our software engineers and data scientists to be able to contribute to projects quickly, which requires a shared understanding of how our systems are built for various use cases.
The focus of performance evaluation is to:
Build trust that the AI product is performing up to our standards
Prevent performance declines from new changes
Accelerate testing and refinement of AI models
Our AI performance evaluation tooling has focused on structured data-related products so far. We still rely heavily on manual review for chatbot applications that require evaluating open-ended generative outputs, but we are also exploring automation.
While AI performance evaluation tools exist in the marketplace, none fully met our criteria. We chose to develop our own tool to achieve a high degree of customization—both to tailor the implementation to our product needs and to modify its functionality as requirements evolve. Given the early stage of this field, we’re continuing to monitor developments and remain open to adopting a standardized tool in the future if the landscape changes.
Tooling Components
While we had some tools to help automate evaluations for our structured data projects, we wanted a unified framework to standardize and streamline the process. To do this, we built a toolset consisting of the following components:
Project Framework: A standardized structure for new AI projects that accelerates setup and ensures consistency across initiatives.
Python Library: A collection of classes and functions for performing common tasks in a consistent way. The library lets us string together evaluators that can be reused across projects, and it provides helper functions for things like calculating metrics and confidence intervals (a rough sketch of this pattern follows this list).
Evaluators: We built a set of default “evaluators” that individual projects can use as-is or extend for their particular use case. One of the defaults compares nested dictionaries, providing a sane, consistent baseline for structured outputs. The extensibility is handy if, for instance, one project needs custom fuzzy string matching.
Infrastructure Connections: The framework and library include integrations with document storage (S3 or Google Drive), evaluation persistence for reproducibility (Snowflake DWH and Google Drive), and visualization (Snowflake DWH to Hex). These connections handle the essential plumbing, letting teams focus on product-specific requirements.
Evaluation Visualization: We use a modern data visualization tool called Hex to view evaluation results, compare runs, and dive deep into the analytics. The primary goal of this project was standardization at the code level, but we’ve noticed that this visualization layer has brought the most satisfaction to internal users so far. The ability to see at a glance where one experiment improves or degrades performance is powerful!
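To make the evaluator and metrics pattern above a bit more concrete, here is a minimal sketch of how a default nested-dictionary evaluator, a project-specific extension, and a metrics helper might fit together. All names here (`DictComparisonEvaluator`, `NormalizedStringEvaluator`, `accuracy_with_ci`) are illustrative placeholders, not our actual internal API.

```python
# Illustrative sketch only: class and function names are hypothetical,
# not Newfront's internal library API.
from dataclasses import dataclass
import math


@dataclass
class FieldResult:
    field: str   # dotted path to the field, e.g. "coverage_x.line_a"
    match: bool  # whether the predicted value matched the ground truth


class DictComparisonEvaluator:
    """Default evaluator: compare a predicted nested dict to ground truth, field by field."""

    def values_match(self, predicted_value, expected_value) -> bool:
        # Leaf-level comparison; subclasses can override this hook.
        return predicted_value == expected_value

    def evaluate(self, predicted: dict, expected: dict, prefix: str = "") -> list[FieldResult]:
        results = []
        for key, expected_value in expected.items():
            path = f"{prefix}.{key}" if prefix else key
            predicted_value = predicted.get(key) if isinstance(predicted, dict) else None
            if isinstance(expected_value, dict):
                # Recurse into nested structures, carrying the dotted path along.
                results.extend(self.evaluate(predicted_value or {}, expected_value, prefix=path))
            else:
                results.append(FieldResult(path, self.values_match(predicted_value, expected_value)))
        return results


class NormalizedStringEvaluator(DictComparisonEvaluator):
    """Example of extending the default with looser, project-specific string matching."""

    def values_match(self, predicted_value, expected_value) -> bool:
        if isinstance(predicted_value, str) and isinstance(expected_value, str):
            return predicted_value.strip().lower() == expected_value.strip().lower()
        return super().values_match(predicted_value, expected_value)


def accuracy_with_ci(results: list[FieldResult], z: float = 1.96) -> tuple[float, float]:
    """Helper: overall match rate with a normal-approximation confidence interval."""
    if not results:
        return 0.0, 0.0
    p = sum(r.match for r in results) / len(results)
    margin = z * math.sqrt(p * (1 - p) / len(results))
    return p, margin
```

A project could then run something like `NormalizedStringEvaluator().evaluate(model_output, ground_truth)` on each example and feed the accumulated results into `accuracy_with_ci`, persisting both the field-level results and the summary metrics for later visualization.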
Key Insights from Building Our Evaluation Tooling
Developing our AI performance evaluation tooling has provided critical lessons that continue to shape our approach.
Granular Metrics Enable Precision: Overall performance numbers are useful, but granular metrics at the “field” and “example” levels are essential for pinpointing issues quickly. For example, they let users spot that the AI is struggling on line A of coverage X for policy examples Y and Z.
Comparisons Drive Progress: Comparing different versions of an AI model offers valuable insight into potential improvements, and we can analyze those versions at a detailed level to assess their differences. For example, when testing a new foundation model, we can see how specific fields or examples perform compared to the earlier version (a toy illustration follows this list).
Reduce Friction for Adoption: Automating processes, such as syncing data directly to dashboards, significantly increases user adoption. Early manual upload requirements were a barrier that automation eliminated effectively.
Maintain High-Quality Ground Truth Data: Structuring and managing accurate benchmarks ensures reliable evaluations and keeps continuous refinement manageable.
Augment Auto Evaluation with Manual Testing: Automated evaluations accelerate testing but cannot replace manual reviews. Business experts are critical in validating performance with real-world data before release.
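As a toy illustration of the field-level comparison described above, the snippet below contrasts two hypothetical evaluation runs with pandas. The column names and data are invented for the example and are not our actual schema.

```python
# Hypothetical illustration: compare per-field accuracy between two evaluation runs.
import pandas as pd

runs = pd.DataFrame(
    {
        "run": ["baseline", "baseline", "candidate", "candidate"],
        "field": ["coverage_x.line_a", "coverage_x.line_b"] * 2,
        "match": [True, False, True, True],
    }
)

# Per-field accuracy for each run, then the delta between candidate and baseline.
per_field = runs.groupby(["run", "field"])["match"].mean().unstack("run")
per_field["delta"] = per_field["candidate"] - per_field["baseline"]
print(per_field.sort_values("delta"))
```

In practice this kind of breakdown lives in our Hex dashboards rather than in ad hoc scripts, but the idea is the same: surface exactly which fields improved or regressed between runs.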
Looking Ahead
Our performance evaluation tooling has been instrumental in tracking and improving the performance of our AI products. With an average of 250 evaluation runs per month, these tools build trust in our AI and accelerate the pace of innovation. We’re proud of the impact AI has had on helping our colleagues and clients and are excited to continue launching thoroughly-tested, impactful AI products that enhance our team's and clients' experiences.
Ryan Velazquez
Data Scientist
As a data scientist, Ryan helps Newfront colleagues and clients tackle complex problems with data. With over a decade of experience, he’s built AI models and analytical tools that have improved decision-making across industries. He’s used AI to help the FDA crack down on fraudulent medical devices, developed a patented machine learning model to predict contaminated groundwater plumes, and built tools for epidemiologists—including the CDC—to track the spread of viruses.
Connect with Ryan on LinkedIn