We often speak about the importance of humans in the loop in AI, and how Infinitus uses humans in concert with our purpose-built AI system. Our human experts serve a number of roles in our ecosystem, including stepping into live calls the Infinitus AI agent is conducting with payor representatives. This may occur, for example, if the representative provides unexpected answers, or if an error is introduced into the speech-to-text (STT) transcription the AI agent relies on. But how those humans get pulled into these situations is just as important.

To that end, we are sharing new research about our novel multimodal model, which detects dialogue breakdowns in complex phone calls, like those our AI agent conducts, by leveraging both audio and STT text. Dialogue breakdown detection is important because it allows the system to course-correct when the main conversational AI makes a mistake and to automatically flag a human operator when assistance is needed, saving time and streamlining the call automation process. In our new research, we introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model, which significantly outperforms other known best models.
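To make the hand-off idea concrete, here is a minimal sketch, not code from the paper, of how per-turn breakdown probabilities from a detector like MultConDB could trigger a human hand-off; the Turn structure and the threshold below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str           # "agent" or "payor_rep"
    stt_text: str          # speech-to-text transcript of the turn
    breakdown_prob: float  # detector's probability that this turn is a breakdown

# Hypothetical threshold; in practice it would be tuned on held-out calls.
ESCALATION_THRESHOLD = 0.8

def should_escalate(turns: list[Turn]) -> bool:
    """Flag a human operator if any turn so far looks like a dialogue breakdown."""
    return any(t.breakdown_prob >= ESCALATION_THRESHOLD for t in turns)

call_so_far = [
    Turn("agent", "Is prior authorization required for this medication?", 0.04),
    Turn("payor_rep", "Sorry, it's really loud here, can you repeat that?", 0.31),
    Turn("agent", "Great, so no prior authorization is required.", 0.93),  # likely breakdown
]

if should_escalate(call_so_far):
    print("Escalating to a human operator.")
```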

Example of the model in action:

An illustration of what dialogue breakdown detection looks like in real time.

How the model learns and detects dialogue breakdowns

MultConDB was trained on phone conversations between our conversational AI agent and users (usually payor representatives) in which human intervention was required because of a dialogue breakdown, for example, when loud noise on the payor's end caused the AI agent to mistranscribe an answer in its speech-to-text step and misclassify the intent. These breakdowns were detected using both STT and acoustic signals.
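As a rough illustration of how such turn-level training examples might be assembled (the field names and the labeling rule here are hypothetical simplifications, not our production pipeline), each turn's audio segment and STT text can be paired with a binary breakdown label derived from where a human had to step in:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallTurn:
    turn_index: int
    audio_path: str   # path to this turn's audio segment (hypothetical)
    stt_text: str     # STT output for this turn

def label_turns(turns: list[CallTurn],
                human_intervention_turn: Optional[int]) -> list[tuple[CallTurn, int]]:
    """Label a turn 1 if it falls at or just before the point a human stepped in, else 0."""
    labeled = []
    for turn in turns:
        is_breakdown = int(
            human_intervention_turn is not None
            and human_intervention_turn - 1 <= turn.turn_index <= human_intervention_turn
        )
        labeled.append((turn, is_breakdown))
    return labeled
```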

MultConDB's architecture consists of state-of-the-art multimodal transformers optimized for capturing dialogue breakdowns in phone conversations: RoBERTa processes the STT text and Wav2Vec2 processes the audio.
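The paper describes the full MultConDB architecture; the snippet below is only a minimal sketch of the general pattern it builds on, combining a RoBERTa text encoder and a Wav2Vec2 audio encoder with a simple fusion-and-classify head (the pooling strategy, checkpoints, and classifier here are illustrative assumptions, not the actual implementation).

```python
import torch
from transformers import (RobertaModel, RobertaTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

class MultimodalBreakdownClassifier(torch.nn.Module):
    """Illustrative fusion of text and audio features for per-turn breakdown detection."""

    def __init__(self):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.audio_encoder.config.hidden_size)
        self.classifier = torch.nn.Linear(fused_dim, 2)  # breakdown vs. no breakdown

    def forward(self, input_ids, attention_mask, input_values):
        # Text: use the representation of the first (<s>) token.
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        # Audio: mean-pool Wav2Vec2 frame features over time.
        audio_feat = self.audio_encoder(input_values=input_values).last_hidden_state.mean(dim=1)
        return self.classifier(torch.cat([text_feat, audio_feat], dim=-1))

# Example inputs: one turn's STT text plus one second of (random) 16 kHz audio.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
text_inputs = tokenizer("No prior authorization is required.", return_tensors="pt")
audio_inputs = feature_extractor(torch.randn(16000).numpy(),
                                 sampling_rate=16000, return_tensors="pt")

model = MultimodalBreakdownClassifier()
logits = model(text_inputs["input_ids"], text_inputs["attention_mask"],
               audio_inputs["input_values"])
```

Per its name, the actual MultConDB model also incorporates conversational context across turns, which this single-turn sketch omits.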

An illustration of the Infinitus AI system.

What is the model learning under the hood?

Dialogue breakdowns can happen for a number of reasons, from audio issues to the AI missing subtle nuances or not following our standard operating procedures in complex situations.

We investigated how MultConDB makes inferences for each underlying cause of dialogue breakdown by visualizing its outputs with 2D t-SNE. We found that MultConDB learns to inherently distinguish the underlying causes of dialogue breakdown without explicit cause labels. This suggests the MultConDB architecture is effective for learning even more fine-grained dialogue breakdown types, beyond binary classification of whether a breakdown occurred.

A map depicting the reasons for dialogue breakdown, as detected by the AI model in question.
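A visualization along these lines can be produced by projecting the model's per-turn outputs to two dimensions with t-SNE; the sketch below uses randomly generated stand-in embeddings and cause labels purely for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-ins for real data: per-turn representations from the model's penultimate
# layer and post-hoc annotations of the breakdown cause for each turn.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 768))
cause_labels = rng.integers(0, 3, size=300)  # e.g. 0=audio issue, 1=missed nuance, 2=SOP deviation

points_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(points_2d[:, 0], points_2d[:, 1], c=cause_labels, cmap="tab10", s=10)
plt.title("2D t-SNE of model outputs, colored by breakdown cause")
plt.show()
```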

Is the model effective and sustainable in the industry environment? 

Models and data change every day: Infinitus calls different payors about patients with various conditions and continually releases updated conversational AI models trained with new data. Thus, today's dialogue breakdowns also look different from yesterday's.

Also, at Infinitus, a large number of automated phone calls happen concurrently, so it is critical to bring human operators into the loop as close as possible to potential dialogue breakdowns, in addition to detecting the exact breakdown turns. Otherwise, human feedback is less useful and overall call failure rates may increase, because other phone calls with actual dialogue breakdowns may not get support from human intervention in time.
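One simple way to quantify how close a flagged turn is to the real breakdown is the minimum turn distance between the detector's predictions and the turn where a human actually had to intervene; the helper below is a hypothetical illustration of that idea, not the paper's exact evaluation metric.

```python
from typing import Optional

def min_turn_distance(flagged_turns: list[int], actual_breakdown_turn: int) -> Optional[int]:
    """Smallest distance (in turns) between any flagged turn and the actual breakdown turn."""
    if not flagged_turns:
        return None  # the breakdown was never flagged
    return min(abs(turn - actual_breakdown_turn) for turn in flagged_turns)

# Example: the detector flagged turns 12 and 19; a human actually stepped in at turn 18.
print(min_turn_distance([12, 19], 18))  # -> 1
```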

In our experiments, although MultConDB was trained on phone calls from August 2023, it successfully detected dialogue breakdowns in calls from September 2023, achieving similarly high performance on data from both. MultConDB obtained the highest performance for detecting dialogue breakdowns and showed consistent patterns of flagging potential breakdowns close to the actual breakdowns across different conversational AI model releases.

We’re happy to announce that this work has been accepted to NAACL 2024 and will be presented at the conference in Mexico City. You can read the full paper here.

Youngseo Son, Messal Miah, and Ulie Schnaithmann also contributed to the research.