Improving Open-Domain Dialogue Evaluation with a Causal Inference Model

Cat P. Le · Luke Dai · Michael Johnston · Yang Liu · Marilyn Walker · Reza Ghanadan
arXiv:2301.13372 Machine Learning (cs.LG) Preprint • v1 • Jan 31, 2023

Abstract

Effective evaluation methods remain a significant challenge for research on open-domain conversational dialogue systems. Explicit satisfaction ratings can be elicited from users, but users often do not provide ratings when asked, and those they give can be highly subjective. Post-hoc ratings by experts are an alternative, but these can be both expensive and complex to collect. Here, we explore the creation of automated methods for predicting both expert and user ratings of open-domain dialogues. We compare four different approaches. First, we train a baseline model using an end-to-end transformer to predict ratings directly from the raw dialogue text. The other three methods are variants of a two-stage approach in which we first extract interpretable features at the turn level that capture, among other aspects, user dialogue behaviors indicating contradiction, repetition, disinterest, compliments, or criticism. We project these features to the dialogue level and train a dialogue-level MLP regression model, a dialogue-level LSTM, and a novel causal inference model called counterfactual-LSTM (CF-LSTM) to predict ratings. The proposed CF-LSTM is a sequential model over turn-level features which predicts ratings using multiple regressors depending on hypotheses derived from the turn-level features. As a causal inference model, CF-LSTM aims to learn the underlying causes of a specific event, such as a low rating. We also bin the user ratings and perform classification experiments with all four models. In evaluation experiments on conversational data from the Alexa Prize SocialBot, we show that the CF-LSTM achieves the best performance on both dialogue rating prediction and classification.

TL;DR: The paper introduces CF-LSTM, a causal inference model that predicts dialogue ratings by modeling user behaviors and counterfactual outcomes, outperforming traditional models in both regression and classification tasks. By leveraging interpretable features and treatment-based reasoning, it enables scalable and robust evaluation of open-domain dialogue systems.

Motivation

Evaluating open-domain dialogue systems is challenging due to:

  • Sparse explicit ratings: users often do not provide ratings when asked, and the ratings they do give can be highly subjective
  • Costly expert annotation: post-hoc ratings by experts are expensive and complex to collect

The paper introduces a causal inference model to improve automated evaluation of dialogue systems.

Models Compared

  • Transformer Baseline: End-to-end model using raw text.
  • Dialogue-Level MLP: Aggregates turn-level features into dialogue-level inputs for regression.
  • Dialogue-Level LSTM: Models the sequence of turn-level features with an LSTM (sketched after this list).
  • Counterfactual LSTM (CF-LSTM): Novel causal model using treatment-based regressors and counterfactual reasoning.
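
As a concrete reference point, here is a minimal sketch of the dialogue-level LSTM regressor in Python/PyTorch. The feature dimension, hidden size, and single-layer setup are illustrative assumptions, not the paper's reported configuration.

import torch
import torch.nn as nn

class DialogueLSTM(nn.Module):
    """Regress a dialogue rating from a sequence of turn-level feature vectors."""

    def __init__(self, feature_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        # Encode the turn sequence; dimensions are illustrative assumptions.
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # map final state to one rating

    def forward(self, turns: torch.Tensor) -> torch.Tensor:
        # turns: (batch, num_turns, feature_dim)
        _, (h_n, _) = self.lstm(turns)
        return self.head(h_n[-1]).squeeze(-1)  # (batch,) predicted ratings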

Feature Engineering

Turn-level features include the following (a feature-assembly sketch follows the list):

  • ODES Classifier: Detects user dialogue behaviors (e.g., contradiction, repetition, disinterest, compliments, criticism)
  • Sentiment Analysis: Valence, satisfaction, activation
  • DialogRPT & FED Metrics: Relevance, specificity, fluency
  • ASR Confidence Scores: From speech recognition
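
A hypothetical sketch of how these signals could be assembled into a single turn-level feature vector. The field names and ordering are illustrative assumptions, not the paper's actual schema.

import numpy as np

def turn_features(turn: dict) -> np.ndarray:
    """Concatenate per-turn signals into one flat feature vector."""
    return np.concatenate([
        turn["odes_probs"],                        # ODES behavior probabilities
        [turn["valence"], turn["satisfaction"], turn["activation"]],
        [turn["dialogrpt_relevance"],              # DialogRPT / FED metrics
         turn["fed_specificity"], turn["fed_fluency"]],
        [turn["asr_confidence"]],                  # speech-recognition confidence
    ]).astype(np.float32)

# A dialogue then becomes a (num_turns, feature_dim) matrix:
#   X = np.stack([turn_features(t) for t in dialogue_turns])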

Causal Inference Framework

  • Assigns a treatment label to each dialogue based on ODES behavior signals
  • Assumes confounders are observed via the extracted turn-level features
  • Trains separate regressors for treated vs. untreated dialogues
  • Adds an Integral Probability Metric (IPM) term to the loss to reduce bias between treatment groups (see the sketch below)
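
A minimal sketch of the CF-LSTM idea in Python/PyTorch, in the spirit of counterfactual regression: a shared LSTM encoder, one regression head per treatment arm, and an IPM penalty that pulls the treated and untreated representation distributions together. The linear-MMD penalty, dimensions, and treatment encoding are assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CFLSTM(nn.Module):
    def __init__(self, feature_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head_treated = nn.Linear(hidden_dim, 1)  # e.g., negative-ODES dialogues
        self.head_control = nn.Linear(hidden_dim, 1)

    def forward(self, turns: torch.Tensor, treated: torch.Tensor):
        # turns: (batch, num_turns, feature_dim); treated: (batch,) in {0, 1}
        _, (h_n, _) = self.encoder(turns)
        z = h_n[-1]                                    # shared representation
        y1 = self.head_treated(z).squeeze(-1)
        y0 = self.head_control(z).squeeze(-1)
        return torch.where(treated.bool(), y1, y0), z  # factual prediction

def cf_loss(y_factual, ratings, z, treated, alpha: float = 1.0):
    """Factual MSE plus a simple linear-MMD IPM between treatment groups."""
    mse = F.mse_loss(y_factual, ratings)
    z1, z0 = z[treated.bool()], z[~treated.bool()]
    if len(z1) > 0 and len(z0) > 0:
        ipm = torch.norm(z1.mean(dim=0) - z0.mean(dim=0))
    else:
        ipm = torch.zeros((), device=z.device)
    return mse + alpha * ipm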

Experiments & Results

Using Alexa Prize SocialBot data, the models were evaluated on:

  • Regression: Predicting continuous ratings
  • Classification: Binary and 5-class rating prediction

CF-LSTM outperformed all other models in both tasks, showing robustness to bias and variance in noisy data.
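
For the classification tasks, continuous user ratings must first be binned. The thresholds below are illustrative assumptions; the paper's exact bin boundaries may differ.

def to_binary(rating: float, threshold: float = 3.0) -> int:
    """1 = satisfied, 0 = unsatisfied (threshold is an assumption)."""
    return int(rating > threshold)

def to_five_class(rating: float) -> int:
    """Map a 1-5 rating into bins [1,2), [2,3), [3,4), [4,5), {5}."""
    return min(int(rating) - 1, 4)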

Key Contributions

  • CF-LSTM, a novel counterfactual causal inference model that predicts dialogue ratings with treatment-specific regressors over sequential turn-level features
  • Interpretable turn-level features capturing user behaviors such as contradiction, repetition, disinterest, compliments, and criticism
  • Evaluation on Alexa Prize SocialBot conversations showing CF-LSTM outperforms transformer, MLP, and LSTM baselines on both regression and classification

Citation

@article{le2023improving,
title={Improving open-domain dialogue evaluation with a causal inference model},
author={Le, Cat P and Dai, Luke and Johnston, Michael and Liu, Yang and Walker, Marilyn and Ghanadan, Reza},
journal={arXiv preprint arXiv:2301.13372},
year={2023}
}

Contact

Questions about this work? Reach out: calvine.le@gmail.com

More: Google Scholar