Doctoral Dissertation

PhD Thesis


User-centered Adaptive Spoken Dialogue Modelling


This thesis investigates novel concepts and methods for automatically recognising dynamic user properties with supervised statistical learning, and for integrating these properties into the dialogue management process so that the course of the dialogue adapts to the user.

Current commercial spoken dialogue systems usually do not account for dynamic user properties like the user’s satisfaction level. Even in state-of-the-art research systems, this type of adaptation is usually missing. However, if the system were aware of these properties, it would understand the current situation better and could thus react more appropriately. Therefore, the goal of this thesis is to introduce this type of user-centred adaptation by separating the problem into two sub-problems: recognising the user state, i.e., the dynamic user properties, and integrating this estimate of the user state into the dialogue management.

Before these sub-problems are addressed, the background needed to understand this thesis is described, covering spoken dialogue systems and supervised machine learning. Related work by others on similar research problems is then described and clearly distinguished from ours.

For the first sub-problem of deriving the user state, we consider four different user states: the user satisfaction (US), the perceived coherence of the system reaction, the emotional state, and the intoxication level. Automatic recognition of these user states is based on supervised statistical learning. As US is a universal property, an emphasis is put on its automatic recognition, with a focus on how temporal information may be used in the recognition process. Here, we propose three novel approaches to improving the US recognition performance: introducing an error correction module into the classification process, exploiting temporal learning algorithms by using a modified Markovian model, and optimising the feature set used as input to statistical classification models. While all of our proposed approaches result in a significant performance improvement, the largest improvement is achieved with an optimised feature set. With this feature set, we are able to improve the performance by up to 10.82 % in unweighted average recall, achieving a correlation of ρ = 0.812.
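Unweighted average recall, the metric reported above, is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has. The following is a minimal sketch of that general definition, not the evaluation code of this thesis; the function name and the toy labels are assumptions made for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of the per-class recalls (each class weighted equally)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly classified samples per true class
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: class 1 is twice as frequent as class 2.
y_true = [1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 2, 2, 1]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case the recall is 0.75 for class 1 and 0.5 for class 2, so the unweighted average recall (0.625) is lower than the plain accuracy (about 0.667), which is dominated by the more frequent class.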

Research on the recognition of the remaining three user states has also resulted in significant contributions to the state of the art. For the automatic recognition of the perceived coherence of the system reaction, we are the first to connect aspects of the interaction with coherence. Exploiting this relationship, the problem is modelled as a statistical classification task, achieving an unweighted average recall of 0.623. In order to improve the performance of speech-based emotion recognition, we propose two novel approaches which add information about the speaker to the recognition process. Here, we show that speaker-dependent emotion recognition has the potential to improve the overall recognition accuracy by up to 9.42 %. In a comparative study on intoxication recognition, we compare the recognition performance of machines with that of humans, showing that machines may outperform humans on this task.

For the second sub-problem of rendering the dialogue management adaptive to the user state, we propose three novel approaches. The first approach uses rules to select the next system action. These rules are based on the current user state. Here, we are able to outperform non-adaptive strategies for a bus information dialogue system in terms of task success / dialogue completion (54.27 %) as well as dialogue length and user satisfaction. Our second approach introduces the user state into a POMDP-based dialogue manager by either extending the user state or utilising the user state for modelling the reward function. For the latter, the resulting dialogue policy achieves a better task success rate / dialogue completion rate (60.61 %) than conventional reward functions. While these two approaches incorporate mechanisms which are common for dialogue management, our third approach proposes a two-stage model: in the first stage, any dialogue manager may be used to create an ordered list of possible system actions. In the second stage, the possible change of the user state induced by each system action is predicted. The final system action is then selected accordingly. In a simulated experiment, we are able to show that our proposed approach results in a significant improvement in dialogue coherence, reducing the number of non-coherent system actions by 4.45 %.
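The two-stage model described above can be illustrated with a small sketch. It is not the implementation from this thesis: the selection rule (pick the candidate with the best predicted user-state change and break ties by the first-stage rank), the function names, the action names, and the toy predictor values are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def select_action(
    ranked_actions: Sequence[str],
    predict_state_change: Callable[[str], float],
) -> str:
    """Stage two: re-select among the stage-one candidates.

    ranked_actions is the ordered list produced by any dialogue manager
    (best candidate first). predict_state_change is a hypothetical predictor
    returning the expected change of the user state for an action, where
    higher values are better.
    """
    if not ranked_actions:
        raise ValueError("the dialogue manager returned no candidate actions")

    # Assumed rule: maximise the predicted change; on ties, prefer the
    # action that the first stage ranked higher (smaller index).
    best_index = max(
        range(len(ranked_actions)),
        key=lambda i: (predict_state_change(ranked_actions[i]), -i),
    )
    return ranked_actions[best_index]


# Stage one (assumed output of some dialogue manager, best first).
candidates = ["confirm", "ask_destination", "repeat_prompt"]

# Toy stand-in for a trained predictor; the values are made up.
toy_predictions = {"confirm": 0.0, "ask_destination": 0.5, "repeat_prompt": -1.0}

chosen = select_action(candidates, toy_predictions.__getitem__)
print(chosen)
```

Because the second stage only re-selects among candidates that the first stage already proposed, any dialogue manager that can output an ordered list of actions could be plugged in unchanged.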

During the work on these two problems, we have created open-source implementations of a Conditioned Hidden Markov Model library as well as the POMDP-based dialogue manager. Both implementations have been made accessible to the public. Furthermore, we have created an annotated corpus for the recognition of user satisfaction as used in this thesis as well as for the recognition of the perceived coherence.