Guardians of healthcare calls: AI review in action

At Infinitus, our voice AI agents can call payors on behalf of healthcare providers to collect detailed patient benefits and prior authorization data. These calls are complex and lengthy (often over an hour), and require navigating convoluted IVRs, long hold times, and transfers to human representatives. Our AI agents operate on a multimodal, multi-model platform, adhering to each customer’s unique standard operating procedures (SOPs) and industry regulations. On each call, these agents can gather over 200 call outputs, including plan details, network status, prior authorization requirements, and more.

However, customer trust depends on the accuracy of the data collected. It wouldn’t make sense, for example, if collected data showed there was no individual deductible while simultaneously including an individual deductible value. Similarly, reporting that a medication is not covered while also requiring prior authorization under the pharmacy benefit would be confusing and unactionable.

Inconsistencies also create inefficiencies. Incorrect data triggers repeated calls, delays patient access, and burdens back-office staff. Given that half a billion benefits verification calls were made in 2024 in the US alone, manual human reviews cannot scale effectively. Because few experts understand the nuances of each medication and evolving payor specifications, the labor required to assess quality grows disproportionately. Human reviewers provide nuanced understanding but suffer from inconsistency, fatigue, and subjectivity, leading to potential oversights and errors. Periodic audits or random spot checks introduce delayed feedback and miss critical data patterns. And with voice AI agents driving the call, human reviewers not only need to validate the data outputs but also flag inconsistent behavior by the AI automation, so that corrective retries can be made through additional phone calls.

These challenges present the question: How do we ensure every call is up to standard?

AI review: an automated judge

At Infinitus, we have human reviewers in the loop to ensure the data we send back to customers is of high quality. But we also have AI review, a behind-the-scenes automated “judge” that evaluates call quality before it ever reaches a customer. The system identifies contradictions, missed questions, or incorrect data and flags calls for a redo when necessary. This process ensures each call aligns with formatting rules, customer-specific SOPs, and consistency standards.

How the AI review system works

The Infinitus AI review system integrates multiple models, combining shallow classifiers and large language models (LLMs). The system analyzes both call recordings and text transcripts alongside contextual information from knowledge graphs, digital sources, and SOPs. These models also consider scenarios like human representatives not answering questions clearly, contradictions in their responses, and incorrect or improperly formatted data outputs. Input sources used by this system, like knowledge graphs, are shared context across all of our AI systems, including the voice AI agents that drive the call. For instance, these agents leverage our internal knowledge graphs to collect accurate data efficiently while minimizing unnecessary questions and frustration for the human representative during the call.
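To make the kinds of contradictions the system looks for concrete, here is a minimal sketch of a rule-style consistency check over collected call outputs, using the deductible and prior authorization examples above. The field names are illustrative assumptions, not the actual Infinitus output schema, and in practice such checks are driven by customer SOPs and shared knowledge graphs rather than hard-coded rules.

```python
# A minimal sketch of a contradiction check over collected call outputs.
# Field names ("has_individual_deductible", "prior_auth_required", etc.) are
# illustrative only; the real schema and rules come from customer SOPs.

def find_contradictions(outputs: dict) -> list[str]:
    """Return human-readable flags for internally inconsistent call outputs."""
    flags = []

    # A deductible amount should not be reported for a plan that has
    # no individual deductible.
    if outputs.get("has_individual_deductible") is False and outputs.get("individual_deductible_amount"):
        flags.append("Deductible amount present although no individual deductible was reported")

    # Requiring prior authorization only makes sense if the medication is
    # covered under the pharmacy benefit.
    if outputs.get("pharmacy_benefit_covered") is False and outputs.get("prior_auth_required"):
        flags.append("Prior authorization required for a medication reported as not covered")

    return flags


if __name__ == "__main__":
    example = {
        "has_individual_deductible": False,
        "individual_deductible_amount": 1500.0,
        "pharmacy_benefit_covered": False,
        "prior_auth_required": True,
    }
    for flag in find_contradictions(example):
        print("FLAG:", flag)
```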
This system provides diligent oversight that would be impossible to achieve manually. It ensures that every single call either meets quality standards or is promptly flagged for human review, which may result in a redo of the call or in human corrections to the call’s outputs before they are submitted to the customer. Our AI review system is a multi-layered pipeline of models.

Layer 1: Deflection and rule-based guardrails

The AI system first tries to de-risk by deflecting any contextually complex scenarios to a human expert. These can be simple heuristic flags, like escalation cases (the human representative could not answer the question during the call or gave conflicting responses, so we had to transfer to another department), or complex scenarios like potential adverse events (which, for regulatory reasons, must always be reviewed by a human). This layer also ensures the system only auto-reviews supported scenarios.

Layer 2: Rule-based auto-correctors

We apply consistent formatting and contextual auto-corrections, mirroring typical corrections made by our human reviewers. Common examples include capitalization fixes and adherence to standard formats and rules. We perform these ahead of any model predictions; intuitively, if those edits were all that was required, the call should be approved. This also gives subsequent layers cleaner training data, letting them learn nuances more effectively.

Layer 3: Large language models (LLMs)

We leverage OpenAI and Google LLM APIs using both audio and text modalities. Prompting these models with formatted transcripts and segmented audio as inputs extracts structured JSON containing data points (e.g., the call outputs we want to validate, whether a question was asked, whether the representative answered it) and the associated reasoning. These structured outputs become features for downstream machine learning (ML) classifiers. Our voice AI agents capture data at different stages of the call: during IVR, on hold, and in payor agent conversations. The audio segments we pass to these models are therefore contextual, such as the call phase and the audio surrounding the moment a question was asked.

Layer 4: Feature extraction

We have multiple stages of classifiers that perform varying degrees of contextual assessment. These classifiers may use a variety of features: call context (customer name, medication discussed, insurance company, call type); features constructed from LLM prompt responses; data outputs gathered from the call and their sources; and more. We also incorporate BERT-based embeddings of data outputs as features, which may undergo dimensionality reduction before being used as final features.

Layer 5: High-level contextual classifier

A preliminary XGBoost classifier determines whether the call broadly passes quality criteria or whether it is incomplete or needs human review for compliance reasons. This layer can also detect bad-data scenarios where member information could not be validated over the call. Based on this classifier’s prediction, we proceed to the second-stage, output-level classifiers.

Layer 6: Output-level classifiers

This stage has a number of smaller, specialized XGBoost classifiers that evaluate individual data outputs. They predict the likelihood of human-required corrections based on historical human reviewer corrections. Both of these stages have classifiers that produce explainable probabilities (or confidence scores) that guide final decisions. Evaluation criteria use correctness as a hard metric to proxy answer relevance and protocol understanding, serving as a quality estimator that scores each output from 0 to 1 in a point-wise evaluation.
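To illustrate how an output-level classifier can consume LLM-derived signals, the sketch below trains an XGBoost model to predict whether a single data output will need a human correction and surfaces feature importances for explainability. The feature names, training rows, and labels are synthetic placeholders invented for this example; they are not the Infinitus feature schema or real reviewer data.

```python
# A simplified sketch of an output-level classifier (Layer 6), using
# placeholder features and labels rather than production data.

import numpy as np
from xgboost import XGBClassifier

# Per-output features, e.g. signals parsed from LLM prompt responses plus
# simple call context:
# [llm_question_asked, llm_question_answered, llm_confidence,
#  value_matches_llm_extraction, captured_during_ivr]
X_train = np.array([
    [1, 1, 0.95, 1, 0],
    [1, 0, 0.40, 0, 0],
    [0, 0, 0.10, 0, 1],
    [1, 1, 0.90, 1, 0],
    [1, 1, 0.70, 0, 0],
    [0, 1, 0.30, 0, 1],
])
# Label: 1 if a human reviewer historically corrected this output, else 0.
y_train = np.array([0, 1, 1, 0, 1, 1])

clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(X_train, y_train)

# Probability that a new output needs human correction; its complement acts
# as a confidence score when deciding whether to auto-approve.
new_output = np.array([[1, 1, 0.85, 1, 0]])
p_needs_correction = clf.predict_proba(new_output)[0, 1]
print(f"P(needs correction) = {p_needs_correction:.2f}")

# Feature importances help explain what drives the prediction.
feature_names = ["llm_question_asked", "llm_question_answered", "llm_confidence",
                 "value_matches_llm_extraction", "captured_during_ivr"]
print(dict(zip(feature_names, clf.feature_importances_.round(3))))
```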
Layer 7: ML auto-correctors

Based on the classifier predictions and confidence scores, ML auto-correctors perform nuanced contextual corrections. Unlike purely rule-based corrections, these leverage classifier insights and LLM-extracted information to edit outputs and improve accuracy. This is especially advantageous when the classifier is not confident in its labels and we want to fall back on the values extracted from the prompt responses.

Final decision aggregation

The AI review decision aggregates results from all layers, auto-approving calls if every model agrees. When this does not occur, it highlights specific issues for human intervention. This pipeline approach enables system-wide and component-level performance monitoring, allowing independent evaluation and improvement of any part. The system also determines whether the human review process can be limited to specific data outputs. By focusing on targeted data outputs (i.e., a partial review), the review process becomes an order of magnitude faster, significantly streamlining the workflow.

Why not use only LLMs?

Statistical estimators, like XGBoost classifiers, provide metrics such as feature importance. We consistently found that our LLM response-based features ranked among the top, which led us to question whether we could rely solely on LLMs. However, when we experimented by removing our statistical classifiers and using only LLMs, the system’s accuracy dropped significantly. In this section, we elaborate on the reasons for this decline and the rationale for using LLM outputs as features.

While LLMs are powerful at generating coherent text and analyzing context, their outputs can be open-ended, inconsistent, and prone to hallucination, especially without extensive domain-specific training data.

A sample LLM-generated response

Challenges with relying solely on LLMs include:

Brittle SOP maintenance: Continuously updating detailed prompts based on domain knowledge and customer requirements is complex and error-prone.

Hallucinations and fallacies: LLMs may generate incorrect reasoning due to their limited domain-specific training.

Reliability: With fine-tuned LLMs, we need to consider how to prevent one example from outweighing everything else. We observe that a fine-tuned model can hallucinate based on just a single noisy example in the training data. Preventing this requires high-quality, standardized data, which is hard to obtain from human reviewers due to subjectivity.

These factors greatly increase the time it takes to reach production. In addition, while digital data sources reduce the need for phone-call-based data collection, the AI judge must still assess overall quality, given that those data outputs are not discussed on the call.

An example output-level classifier prediction demonstrating the use of features based on LLM outputs.

Statistical models can capture these data patterns better than generative models relying on a few in-context examples. The model can learn which sets of outputs are likely to appear together (with their specific values) from the historical data. For example, if an insurance plan is terminated, we most likely (and correctly) will not have deductibles present. For fields that are spelled out on the call (like a plan name or reference number), the model may also learn to depend heavily on the LLM-derived features to determine whether the field needs correction.

Hence, we integrate LLM-generated features into the XGBoost classifiers. These classifiers leverage historical, real-world human reviewer correction data and produce reliable, explainable predictions. This statistical approach increases accuracy, scalability, and explainability.
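Tying the layers together, here is a minimal sketch of the final decision aggregation described earlier: a call is auto-approved only when every layer agrees, and otherwise the flagged outputs determine whether a focused partial review or a full human review is needed. The layer names, decision labels, and routing rule are illustrative assumptions, not the production logic.

```python
# A hedged sketch of final decision aggregation across review layers.
# Layer names, decision labels, and the partial-vs-full routing rule are
# illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class LayerVerdict:
    layer: str
    approved: bool
    flagged_outputs: list[str] = field(default_factory=list)

def aggregate(verdicts: list[LayerVerdict]) -> dict:
    """Combine per-layer verdicts into a single review decision."""
    if all(v.approved for v in verdicts):
        return {"decision": "auto_approve", "review_scope": []}

    # The union of outputs flagged by any layer drives a focused (partial)
    # human review; if nothing specific was flagged, escalate the whole call.
    flagged = sorted({o for v in verdicts if not v.approved for o in v.flagged_outputs})
    decision = "partial_human_review" if flagged else "full_human_review"
    return {"decision": decision, "review_scope": flagged}

verdicts = [
    LayerVerdict("guardrails", approved=True),
    LayerVerdict("call_level_classifier", approved=True),
    LayerVerdict("output_level_classifier", approved=False,
                 flagged_outputs=["individual_deductible_amount"]),
]
print(aggregate(verdicts))
# {'decision': 'partial_human_review', 'review_scope': ['individual_deductible_amount']}
```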
Human-in-the-loop: collaborative oversight

At Infinitus, we firmly believe AI augments human intelligence; it doesn’t replace it. AI handles the heavy lifting, while humans provide deep understanding and complex decision-making. The AI review system aggregates decisions from models across all layers. Depending on the judgment’s severity, it may flag a call for full human review or highlight specific data outputs, enabling experts to conduct a focused, time-efficient review. Experts are randomly assigned to calls to ensure unbiased evaluations. Human reviewers also flag systematic AI errors across our systems, informing iterative AI model improvements. Together, these mechanisms free human reviewers from overwhelming routine checks and let them focus their expertise on critical, complex cases. Humans in the loop, in turn, validate our AI system’s results for quality, further building trust in the system.

Benefits of AI review

Implementing AI review provides multiple advantages:

Compliance protection: AI detects and prevents compliance violations automatically.

Efficiency and scalability: AI reviews 100% of calls immediately, drastically reducing blind spots and reliance on spot checks.

Consistency and trust: AI applies consistent standards across all calls, fostering internal and external trust.

Real-time feedback: Issues are flagged instantly rather than through delayed manual audits.

Improved human productivity: Experts focus only on genuinely problematic calls, optimizing their time and expertise.

AI review: driving accuracy, compliance, and trust

At Infinitus, AI review is not just a technological choice; it’s a strategic imperative ensuring accuracy, scalability, and quality across every interaction around patient data. Our AI review system is a comprehensive, multi-layered approach designed to automatically ensure the quality and accuracy of automated healthcare phone calls. The collaboration of AI models, human expertise, and real-time oversight allows Infinitus to deliver consistent and compliant outcomes. This not only reduces errors but also boosts efficiency, enabling faster, more reliable patient care and greater customer trust.

For more information about the work we’re doing at Infinitus to create time for healthcare, reach out to us directly. We’re also actively hiring, and a complete list of open roles can be viewed here.