Turn-Taking in Spoken Dialogue Systems CS4706 Julia Hirschberg

Joint work with Agustín Gravano
In collaboration with Stefan Benus; Hector Chavez; Gregory Ward and Elisa Sneed German; Michael Mulley
With special thanks to Hanae Koiso, Anna Hjalmarsson, KTH TMH colleagues and the Columbia Speech Lab for useful discussions

Current Limitations of IVR Systems
  Automatic Speech Recognition (ASR) + Text-To-Speech (TTS) account for most users' IVR problems
    ASR: Up to 60% word error rate
    TTS: Described as "odd", "mechanical", "too friendly"
  As ASR and TTS improve, other problems emerge, e.g. coordination of system-user exchanges
    How do users know when they can speak?
    How do systems know when users are done?
  AT&T Labs Research TOOT example

Commercial Importance
  http://www.ivrsworld.com/advanced-ivrs/usabilityguidelines-of-ivr-systems/
  "11. Avoid Long gaps in between menus or information. Never pause long for any reason. Once caller gets silence for more than 3 seconds or so, he might think something has gone wrong and press some other keys! But then a menu with short gap can make a rapid fire menu and will be difficult to use for caller. A perfectly paced menu should be adopted as per target caller, complexity of the features. The best way to achieve perfectly paced prompts are again testing by users!"
  Until then: http://www.gethuman.com

Turn-taking Can Be Hard Even for Humans
  Beattie (1982): Margaret Thatcher ("Iron Lady") vs. "Sunny" Jim Callaghan
  Public perception: Thatcher domineering in interviews but Callaghan a "nice guy"
  But Thatcher is interrupted much more often than Callaghan, and much more often than she interrupts the interviewer
  Hypothesis: Thatcher produces unintentional turn-yielding behaviors. What could those be?

Turn-taking Behaviors Important for IVR Systems
  Smooth Switch: S1 is speaking and S2 speaks and takes and holds the floor
  Hold: S1 is speaking, pauses, and continues to speak

  Backchannel: S1 is speaking and S2 speaks to indicate continued attention, not to take the floor (e.g. mhmm, ok, yeah)

Why do systems need to distinguish these?
  System understanding:
    Is the user backchanneling or is she taking the turn (does "ok" mean "I agree" or "I'm listening")?
    Is this a good place for a system backchannel?
  System generation:
    How to signal to the user that the system's turn is over?
    How to signal to the user that a backchannel might be appropriate?

Our Approach
  Identify associations between observed phenomena (e.g. turn exchange types) and measurable events (e.g. variations in acoustic, prosodic, and lexical features) in human-human conversation
  Incorporate these phenomena into IVR systems to better approximate human-like behavior

Previous Studies
  Sacks, Schegloff & Jefferson 1974
    Transition-relevance places (TRPs): The current speaker may either yield the turn, or continue speaking.
  Duncan 1972, 1973, 1974, inter alia
    Six turn-yielding cues in face-to-face dialogue:
      Clause-final level pitch
      Drawl on final or stressed syllable of terminal clause
      Sociocentric sequences (e.g. "you know")
      Drop in pitch and loudness plus sociocentric sequence
      Completion of grammatical clause
      Gesture

    Hypothesis: There is a linear relation between the number of displayed cues and the likelihood of a turn-taking attempt
  Corpus and perception studies
    Attempt to formalize/verify some turn-yielding cues hypothesized by Duncan (Beattie 1982; Ford & Thompson 1996; Wennerstrom & Siegel 2003; Cutler & Pearson 1986; Wichmann & Caspers 2001; Heldner & Edlund submitted; Hjalmarsson 2009)
  Implementations of turn-boundary detection
    Experimental (Ferrer et al. 2002, 2003; Edlund et al. 2005; Schlangen 2006; Atterer et al. 2008; Baumann 2008)
    Fielded systems (e.g., Raux & Eskenazi 2008)
    Exploiting turn-yielding cues improves performance

Columbia Games Corpus
  12 task-oriented spontaneous dialogues
  13 subjects: 6 female, 7 male
  Series of collaborative computer games of different types
  9 hours of dialogue
  Annotations
    Manual: orthographic transcription, alignment, prosodic annotations (ToBI), turn-taking behaviors
    Automatic: logging, acoustic-prosodic information

Objects Games
  Player 1: Describer
  Player 2: Follower

[Figure: Turn-Taking Labeling Scheme for Each Speech Segment]

Turn-Yielding Cues
  Cues displayed by the speaker before a turn boundary (Smooth Switch)
  Compare to turn-holding cues (Hold)

Method
  IPU (Inter-Pausal Unit): Maximal sequence of words from the same speaker surrounded by silence ≥ 50 ms (n=16257)

    Speaker A:  IPU1  (Hold)  IPU2  (Smooth Switch)
    Speaker B:                                       IPU3

  Hold: Speaker A pauses and continues with no intervening speech from Speaker B (n=8123)
  Smooth Switch: Speaker A finishes her utterance; Speaker B takes the turn with no overlapping speech (n=3247)

Method (continued)
  Compare IPUs preceding Holds (IPU1) with IPUs preceding Smooth Switches (IPU2)
  Hypothesis: Turn-Yielding Cues are more likely to occur before Smooth Switches (IPU2) than before Holds (IPU1)

Individual Turn-Yielding Cues
  1. Final intonation
  2. Speaking rate
  3. Intensity level
  4. Pitch level
  5. Textual completion
  6. Voice quality
  7. IPU duration

1. Final Intonation

  Boundary tone       Smooth Switch   Hold
  H-H%                22.1%           9.1%
  [!]H-L%             13.2%           29.9%
  L-H%                14.1%           11.5%
  L-L%                47.2%           24.7%
  No boundary tone    0.7%            22.4%
  Other               2.6%            2.4%
  Total               100%            100%
  (χ² test: p ≈ 0)

  Falling, high-rising: turn-final. Plateau: turn-medial.
  Stylized final pitch slope shows same results as hand-labeled

2. Speaking Rate
  [Figure: z-scores of syllables per second and phonemes per second, over the entire IPU and over the final word, for S (Smooth Switch) vs. H (Hold). (*) ANOVA: p < 0.01]
  Note: Rate faster before SS than H (controlling for word identity and speaker)

3/4. Intensity and Pitch Levels
  [Figure: z-scores of intensity and pitch over the IPU, the final 1.0 s and the final 0.5 s, for S (Smooth Switch) vs. H (Hold). (*) ANOVA: p < 0.01]
  Lower intensity, pitch levels before turn boundaries

5. Textual Completion
  Syntactic/semantic/pragmatic completion, independent of intonation and gesticulation.
    E.g. Ford & Thompson 1996: "in discourse context, [an utterance] could be interpreted as a complete clause"
  Automatic computation of textual completion:
    (1) Manually annotated a portion of the data.
    (2) Trained an SVM classifier.
    (3) Labeled entire corpus with SVM classifier.

5. Textual Completion
  (1) Manual annotation of training data
    Token: Previous turn by the other speaker + current turn up to a target IPU. No access to right context.
      Speaker A: the lion's left paw our front
      Speaker B: yeah and it's th- right so the {C / I}
    Guidelines: "Determine whether you believe what speaker B has said up to this point could constitute a complete response to what speaker A has said in the previous turn/segment."
    3 annotators; 400 tokens; Fleiss' κ = 0.814

5. Textual Completion
  (2) Automatic annotation
    Trained ML models on manually annotated data
    Syntactic, lexical features extracted from current turn, up to target IPU
      Ratnaparkhi's (1996) maxent POS tagger, Collins' (2003) statistical parser, Abney's (1996) CASS partial parser

    Majority-class baseline ("complete")   55.2%
    SVM, linear kernel                     80.0%
    Mean human agreement                   90.8%

5. Textual Completion
  (3) Labeled all IPUs in the corpus with the SVM model.

                    Incomplete   Complete
    Smooth switch   18%          82%
    Hold            47%          53%
    (χ² test, p ≈ 0)

  Textual completion almost a necessary condition before switches, but not before holds

5a. Lexical Cues
                      S              H
    Word fragments    10 (0.3%)      549 (6.7%)
    Filled pauses     31 (1.0%)      764 (9.4%)
    Total IPUs        3246 (100%)    8123 (100%)

  No specific lexical cues other than these

6. Voice Quality
  [Figure: z-scores of jitter, shimmer and NHR over the IPU, the final 1.0 s and the final 0.5 s, for S (Smooth Switch) vs. H (Hold). (*) ANOVA: p < 0.01]
  Higher jitter, shimmer, NHR before turn boundaries

7. IPU Duration
  [Figure: z-scores of IPU duration and IPU word count, for Smooth Switch vs. Hold. (*) ANOVA: p < 0.01]
  Longer IPUs before turn boundaries
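The plots for speaking rate, intensity, pitch, voice quality and IPU duration are all drawn on a z-score axis. The slides do not say how the z-scores were computed, so the following is only a minimal sketch under one common assumption: each raw measurement is normalized against the mean and standard deviation of the same speaker's measurements. The function name `zscore_by_speaker`, the `(speaker_id, value)` input shape and the use of the population standard deviation are all illustrative choices, not taken from the slides.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_speaker(rows):
    """Return z-scores for (speaker_id, value) rows, in input order.

    Assumption (not stated in the slides): normalization is done per
    speaker, using that speaker's own mean and population std. deviation.
    """
    values_by_speaker = defaultdict(list)
    for speaker, value in rows:
        values_by_speaker[speaker].append(value)

    # Per-speaker (mean, std) pairs.
    stats = {s: (mean(v), pstdev(v)) for s, v in values_by_speaker.items()}

    zscores = []
    for speaker, value in rows:
        mu, sigma = stats[speaker]
        # A speaker with no variation gets 0.0 rather than a division error.
        zscores.append(0.0 if sigma == 0 else (value - mu) / sigma)
    return zscores
```

Normalizing within speaker keeps a habitually loud or fast talker from dominating the Smooth Switch vs. Hold comparison; a corpus-wide normalization would be a one-line change (use a single key for all rows).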

Combining Individual Cues
  1. Final intonation
  2. Speaking rate
  3. Intensity level
  4. Pitch level
  5. Textual completion
  6. Voice quality
  7. IPU duration

Defining Cue Presence

  2-3 representative features for each cue:
    Final intonation     Absolute pitch slope over final 200 ms, 300 ms
    Speaking rate        Syllables/sec, phonemes/sec over IPU
    Intensity level      Mean intensity over final 500 ms, 1000 ms
    Pitch level          Mean pitch over final 500 ms, 1000 ms
    Voice quality        Jitter, shimmer, NHR over final 500 ms
    IPU duration         Duration in ms, and in number of words
    Textual completion   Complete vs. incomplete (binary)
  Define presence/absence of a cue based on whether its value is closer to the mean value before S (Smooth Switch) or to the mean before H (Hold)

Presence of Turn-Yielding Cues
  [Figure legend: 1: Final intonation, 2: Speaking rate, 3: Intensity level, 4: Pitch level, 5: IPU duration, 6: Voice quality, 7: Completion]

Likelihood of TT Attempts
  [Figure: percentage of turn-taking attempts (0% to 70%) against number of cues conjointly displayed in IPU (0 to 7); r² = 0.969]

Summary: Cues Distinguishing Smooth Switches from Holds

  Falling or high-rising phrase-final pitch
  Faster speaking rate
  Lower intensity
  Lower pitch
  Point of textual completion
  Higher jitter, shimmer and NHR
  Longer IPU duration

Backchannel-Inviting Cues

  Recall: Backchannels (e.g. "yeah") indicate that Speaker B is paying attention but does not wish to take the turn
  Systems must
    Distinguish backchannels from users' smooth switches (recognition)
    Know how to signal to users that a backchannel is appropriate
  In human conversations
    What contexts do Backchannels occur in?
    How do they differ from contexts where no Backchannel occurs but Speaker A continues to talk (Holds), and from contexts where Speaker B takes the floor (Smooth Switches)?

Method

    Speaker A:  IPU1  (Hold)  IPU2                       IPU3
    Speaker B:                      IPU4 (Backchannel)

  Compare IPUs preceding Holds (IPU1) (n=8123) with IPUs preceding Backchannels (IPU2) (n=553)
  Hypothesis: BC-preceding cues are more likely to occur before Backchannels than before Holds

Cues Distinguishing Backchannels from Holds
  1. Final rising intonation: H-H% or L-H%
  2. Higher intensity level
  3. Higher pitch level
  4. Longer IPU duration
  5. Lower NHR
  6. Final POS bigram: DT NN, JJ NN, or NN NN

Presence of Backchannel-Inviting Cues
  [Figure legend: 1: Final intonation, 2: Intensity level, 3: Pitch level, 4: IPU duration, 5: Voice quality, 6: Final POS bigram]

Combined Cues
  [Figure: percentage of IPUs followed by a BC (0% to 35%) against number of cues conjointly displayed (0 to 6); two fitted curves, r² = 0.993 and r² = 0.812]

[Figure: Smooth Switch, Backchannel, and Hold Differences]
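The cue-combination analysis for both turn-yielding and backchannel-inviting cues rests on two steps stated in the slides: a cue is "present" when its value lies closer to the mean observed before the target event than to the mean before Holds, and the proportion of IPUs followed by the event is then related to the number of cues present (the reported r² values). A minimal sketch of those two steps follows. It assumes one numeric feature per cue, whereas the slides use 2-3 representative features per cue without saying how they are merged; all function and parameter names are hypothetical.

```python
from statistics import mean


def cue_present(value, mean_before_event, mean_before_hold):
    """True when value is closer to the pre-event mean than the pre-Hold mean."""
    return abs(value - mean_before_event) < abs(value - mean_before_hold)


def count_cues(features, event_means, hold_means):
    """Number of cues present in one IPU.

    features, event_means, hold_means: dicts keyed by cue name.
    Assumption: one numeric feature per cue (a simplification).
    """
    return sum(
        cue_present(features[name], event_means[name], hold_means[name])
        for name in features
    )


def r_squared_linear(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate input: no variation to explain
    return (sxy * sxy) / (sxx * syy)
```

In use, one would call `count_cues` on every IPU, group IPUs by cue count, compute the share of each group that is followed by a Smooth Switch (or a Backchannel), and pass the cue counts and shares to `r_squared_linear`. A binary cue such as textual completion can be added to the count directly instead of going through `cue_present`.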

Summary
  We find major differences between Turn-yielding and Backchannel-preceding cues, and between both and Holds
    Objective, automatically computable
  Should be useful for task-oriented dialogue systems
    Recognize user behavior correctly
    Produce appropriate system cues for turn-yielding, backchanneling, and turn-holding

Future Work
  Additional turn-taking cues
    Better voice quality features
    Study cues that extend over entire turns, increasing near potential turn boundaries
  Novel ways to combine cues
    Weighting: which are more important? Which are easier to calculate?
  Do similar cues apply for behavior involving overlapping speech, e.g., how does Speaker 2 anticipate a turn change before Speaker 1 has finished?

Next Class
  Entrainment in dialogue

EXTRA SLIDES

Overlapping Speech

    Speaker A:  ipu1  (Hold)  ipu2  ipu3
    Speaker B:                      (Overlap)

  95% of overlaps start during the turn-final phrase (IPU3).
  We look for turn-yielding cues in the second-to-last intermediate phrase (e.g., IPU2).

Overlapping Speech
  Cues found in IPU2s:
    Higher speaking rate.
    Lower intensity.
    Higher jitter, shimmer, NHR.
  All cues match the corresponding cues found in (non-overlapping) smooth switches.
  Cues seem to extend further back in the turn, becoming more prominent toward turn endings.
  Future research: Generalize the model of discrete turn-yielding cues.

Columbia Games Corpus: Cards Game, Part 1

  Player 1: Describer
  Player 2: Searcher

Columbia Games Corpus: Cards Game, Part 2
  Player 1: Describer
  Player 2: Searcher

Turn-Yielding Cues: Speaker Variation
  [Figure: display of individual turn-yielding cues, by speaker]

Backchannel-Inviting Cues: Speaker Variation
  [Figure: display of individual BC-inviting cues, by speaker]

Turn-Yielding Cues: 6. Voice Quality
  Jitter: Variability in the frequency of vocal-fold vibration (measure of harshness)
  Shimmer: Variability in the amplitude of vocal-fold vibration (measure of harshness)
  Noise-to-Harmonics Ratio (NHR): Energy ratio of noise to harmonic components in the voiced speech signal (measure of hoarseness)

Turn-Yielding Cues: Speaker Variation
  [Figure: two panels, percentage (0% to 100%) against number of cues (0 to 7), one curve per speaker; speakers 101 to 113]

Backchannel-Inviting Cues: Speaker Variation
  [Figure: percentage (0% to 70%) against number of cues (0 to 6), one curve per speaker; speakers 102, 103, 105, 106, 108, 110, 111, 112, 113]
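The Voice Quality slide above defines jitter and shimmer as cycle-to-cycle variability in the frequency and amplitude of vocal-fold vibration. The slides do not say which of the several standard jitter and shimmer measures was used, so the sketch below shows only one widely used variant, the "local" measure: the mean absolute difference between consecutive glottal cycles divided by the mean value. It assumes the pitch periods and per-cycle peak amplitudes have already been extracted by some pitch tracker; the function names are illustrative.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def local_jitter(periods_s):
    """Local jitter from consecutive pitch periods (in seconds)."""
    return relative_perturbation(periods_s)


def local_shimmer(peak_amplitudes):
    """Local shimmer from the peak amplitude of each glottal cycle."""
    return relative_perturbation(peak_amplitudes)
```

Both values are dimensionless ratios, so a perfectly periodic signal yields 0.0 and larger values indicate a more irregular voice. NHR is left out because it requires a harmonic/noise decomposition of the signal itself rather than per-cycle measurements.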
