We’re excited to share a new model architecture that outperforms other approaches for phone call automation in our real-world healthcare setting. The model is a graph-aware language transformer that offers more flexibility than traditional model pipelines, plus greater reliability (meaning no hallucinations) and lower latency than the latest prompt models for next action prediction.

In addition, this work introduces a model evaluation framework that is more comprehensive for applied settings. We establish a paradigm for human evaluation that (1) accounts for the fact that multiple different conversational responses can lead to success, and (2) evaluates the success of models in settings of varying degrees of difficulty (e.g., background noise or uncooperative payor agents).

We’re happy to announce that this research has been accepted at NAACL 2024. You can read the full paper here, or find a summary of the work and why it matters below.

Why it matters

Before we dive into our findings, it’s important to understand the broad range of factors we consider when selecting a model for applied settings. These include:

Development time to production: An important consideration when building AI for enterprise customers is how much modeling and engineering effort is required to get an initial model into production. With model pipelines, the effort is often high because each model in the pipeline requires its own data labeling, along with hyperparameter tuning or feature engineering work. With prompt models, this effort is often lower, but there is still some prompt engineering and in-context learning work involved.

Robustness: This is how well models can navigate unseen language and varying practical settings that may come up at inference time. For example, can the model handle different wording in utterances? Can it handle background noise? Can it handle speech-to-text mistranscriptions? Robust enterprise AI solutions are as performant in the real world as they are in the lab. Larger models trained on larger datasets tend to be more robust to language variations; GenAI prompt models are generally quite good at this. How robust models are to other forms of noise depends on how they were trained or fine-tuned and how context was provided.

Flexibility: How well can the model handle conversations that fall outside the expected conversation flows? Can it handle cases where a user made a mistake and needs to correct information provided earlier? Can the model redirect a conversation with an uncooperative agent? LLMs are generally more flexible than shallow models since they are trained on more data, including data outside of the specific use case.

Reliability: This is about consistency. If the model performed well on one phone call, will it perform just as well on another? Infinitus calls over 500 payors for various use cases, and our AI agent needs to perform consistently and accurately in each of these settings. We’ve seen that with some prompt models, the same prompt can lead to different predictions and different latency at different times. Models trained in-house, including shallow models, can have high reliability.

Adaptability to changing customer requirements: If there is an update to the customer or product requirements, how easy is it to update the system? Models may need to be retrained. If so, how much data is needed to retrain the model, and how much engineering work is required? Depending on the use case, models with reusable components or zero-shot prompt models can be easier to maintain and adapt to changing requirements.

User experience: It’s important for the conversation to feel natural. When errors occur, a reasonable mistake that still moves the conversation forward is better than something completely unexpected. For phone call settings, latency is a critical part of user experience: consistently responding in less than one second keeps the conversation moving. We also want to avoid interrupting the user, talking over them, or cutting them off, all of which cause frustration.

Task success: The most critical metric is task success, or business success. For example, in our use case, were we able to collect all benefits information over the phone call? The individual model metrics of accuracy, speed, and scale are important, but they need to be viewed in the context of the entire use case.

Given all of these different industry requirements, traditional precision, recall, and F1 metrics on next action prediction don’t capture the full picture of how well a model will work in applied settings. We therefore introduce a new evaluation measure that more comprehensively captures how a model performs against these requirements.

We experimented with different model architectures, evaluating them against a model pipeline that leverages a dialog manager (DM) for conversation flow and state tracking. The input to the DM is model predictions on the utterance, including Intent Classification (IC) and Slot Filling (SF) outputs. The output is a predicted action, which corresponds to a response template. We used this system as a strong baseline, since it is an architecture that has demonstrated success in industry settings across several of the factors above.


Figure 1. Visualization of a component-based conversational AI pipeline that includes a Dialog Manager.
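For intuition, here is a minimal, self-contained sketch of such a component pipeline: IC and SF predictions feed a DM, which chooses the next action, and each action maps to a response template. Every component, template, and name below is a toy stand-in we invented for illustration, not the production system.

```python
# Toy component pipeline: IC + SF -> Dialog Manager -> action -> response template.
TASK_DATA = {"member_id": "ABC123"}  # data the agent is calling about (illustrative)
RESPONSE_TEMPLATES = {
    "provide_member_id": "Sure, the member ID is {member_id}.",
    "clarify": "Sorry, could you repeat that?",
}

def classify_intent(utterance: str) -> str:          # Intent Classification stand-in
    return "request_member_id" if "member id" in utterance.lower() else "unknown"

def fill_slots(utterance: str) -> dict:              # Slot Filling stand-in
    return {}  # e.g., a reference number the payor agent reads out

def dialog_manager(intent: str, slots: dict, state: dict) -> str:
    state.update(slots)                               # state tracking
    if intent == "request_member_id":                 # explicitly encoded flow logic
        return "provide_member_id"
    return "clarify"

def handle_turn(utterance: str, state: dict) -> str:
    action = dialog_manager(classify_intent(utterance), fill_slots(utterance), state)
    return RESPONSE_TEMPLATES[action].format(**TASK_DATA)

print(handle_turn("Can I get the member ID please?", {}))  # -> Sure, the member ID is ABC123.
```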

One key intuition we had was that our complex conversations naturally form a graph-like structure, where nodes represent the current state of the conversation and edges represent the next action or response to take. In the healthcare setting, we also have many standard operating procedures (SOPs), which are graph-like guidelines for navigating conversations. For example, an SOP may dictate that if a plan type is commercial you should follow flow A, but if it is Medicare you should follow flow B. This graph-like logic can be explicitly encoded in a component like a Dialog Manager; however, explicitly encoded logic can be brittle when handling unexpected flows, and it requires engineering effort to encode each SOP and update it if it changes. We wanted to explore whether there was a way to encode this knowledge more implicitly to make the system more flexible, robust, and adaptable.
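To make the intuition concrete, here is a minimal sketch of an SOP encoded as an explicit graph, the way a Dialog Manager would encode it. The states, conditions, and actions are invented for illustration and are not taken from a real Infinitus SOP.

```python
# Nodes are conversation states; edges are next actions, guarded by conditions.
SOP_GRAPH = {
    "await_plan_type": [
        (lambda facts: facts.get("plan_type") == "commercial", "start_flow_a"),
        (lambda facts: facts.get("plan_type") == "medicare", "start_flow_b"),
    ],
    "await_deductible": [
        (lambda facts: facts.get("deductible") is not None, "ask_out_of_pocket_max"),
    ],
}

def next_action(state: str, facts: dict) -> str | None:
    """Explicit, Dialog-Manager-style traversal of the SOP graph."""
    for condition, action in SOP_GRAPH.get(state, []):
        if condition(facts):
            return action
    return None  # unexpected flow: this is where explicit encoding becomes brittle

print(next_action("await_plan_type", {"plan_type": "commercial"}))  # -> start_flow_a
```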

Given this, we experimented with various approaches for integrating graph information, including graph neural networks (GNNs) and encoding graph context into trained language transformers and GenAI prompt models.

GaLT model

The graph-aware language transformer (GaLT) model employs a graph embedding layer that encodes past actions as nodes. The language transformer is fed the user utterance alongside the history of actions to implicitly learn the co-occurrence between user utterances and system actions. A fusion layer combines the language transformer and graph component features, and the fused representation is passed through a fully connected layer to predict the next action.

Figure 2. An illustration of the graph-aware language transformer (GaLT) model.
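The following PyTorch sketch shows the general shape of such an architecture. The backbone choice, embedding sizes, mean-pooling of action embeddings, and concatenation-based fusion are our assumptions for illustration; the paper describes the exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class GaLTSketch(nn.Module):
    """Minimal GaLT-style sketch: text encoder + action-graph embedding + fusion."""
    def __init__(self, num_actions: int, graph_dim: int = 128,
                 backbone: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)            # language transformer
        self.action_embedding = nn.Embedding(num_actions, graph_dim)  # past actions as nodes
        hidden = self.encoder.config.hidden_size
        self.fusion = nn.Linear(hidden + graph_dim, hidden)           # fuse text + graph features
        self.classifier = nn.Linear(hidden, num_actions)              # next-action head

    def forward(self, input_ids, attention_mask, action_history):
        # Encode the user utterance (with the action history serialized into the input text)
        text_feat = self.encoder(input_ids=input_ids,
                                 attention_mask=attention_mask).last_hidden_state[:, 0]
        # Pool embeddings of past actions into a single graph-context vector
        graph_feat = self.action_embedding(action_history).mean(dim=1)
        fused = torch.relu(self.fusion(torch.cat([text_feat, graph_feat], dim=-1)))
        return self.classifier(fused)  # logits over candidate next actions
```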

This model outperformed more complex GNN models that explicitly included action connections as edges. It produced higher-quality outputs and had lower training and inference time, which is important for applied settings.

The models in our research were evaluated on a human-reviewed next action prediction dataset of over 500,000 dialog turns. The proposed GaLT model outperformed other models on F1 metrics. To further evaluate on product-level metrics, we introduced the human evaluation framework.

Human evaluation framework

One key limitation of F1 accuracy metrics is that they tie success to a single correct next action, even though a desired call outcome can be achieved through various paths. To compare GaLT with the DM system, human assessors played the role of the agent receiving calls.

Three call difficulty levels were defined: easy, medium, and hard. Easy calls contained 0-1 difficult scenarios, medium calls 2-3, and hard calls 4-5. The scenarios can be agent-level or flow-level. Agent scenarios are challenges related to the human user’s situation during a call, such as mumbling, background noise, or repeated expressions. Flow scenarios are specific conditions and edge cases that increase the complexity of the conversation and of the information that must be collected to complete the task.
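As a concrete illustration, the bucketing can be expressed as simply as the sketch below; the scenario names are examples, not the full list used in the study.

```python
def difficulty_level(num_difficult_scenarios: int) -> str:
    """Map the number of difficult scenarios in a call to a difficulty level."""
    if num_difficult_scenarios <= 1:
        return "easy"    # 0-1 difficult scenarios
    if num_difficult_scenarios <= 3:
        return "medium"  # 2-3 difficult scenarios
    return "hard"        # 4-5 difficult scenarios

scenarios = ["background_noise", "mumbling", "uncooperative_agent"]  # agent-level examples
print(difficulty_level(len(scenarios)))  # -> medium
```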

For each evaluated call, the evaluation framework tracked product-level metrics and a human qualitative evaluation. The product-level metrics were how far into the conversation we progressed before the human user got frustrated and how many customer outputs the system was able to successfully collect. The qualitative evaluation was a scoring system where evaluators ranked the success of the conversation.
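For illustration, a per-call record under this framework might look like the following sketch; the field names and score scale are our assumptions, not the exact schema used in the study.

```python
from dataclasses import dataclass

@dataclass
class CallEvaluation:
    difficulty: str                 # "easy", "medium", or "hard"
    turns_before_frustration: int   # how far the conversation progressed
    outputs_collected: int          # customer outputs successfully captured
    outputs_required: int           # total outputs the call should collect
    qualitative_score: int          # evaluator's ranking of conversation success

    @property
    def collection_rate(self) -> float:
        return self.outputs_collected / self.outputs_required

example = CallEvaluation("medium", turns_before_frustration=18,
                         outputs_collected=9, outputs_required=12, qualitative_score=4)
print(f"{example.collection_rate:.0%}")  # -> 75%
```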

We found that the GaLT system outperformed the DM system on the product-level metrics, both progressing further into the conversation and allowing collection of more fields. It received higher scores on the qualitative evaluation as well, making it overall a stronger system for applied settings.

We’re happy to announce that this work has been accepted at NAACL 2024 and will be presented at the conference in Mexico City. You can read the full paper here.

Amin Hosseiny Marani, Ulie Schnaithmann, Youngseo Son, Akil Iyer, and Manas Paldhe also contributed to this research.