InfoMagnets: Making Sense of Corpus Data


Jaime Arguello
Language Technologies Institute

Outline
• InfoMagnets
• Applications
• Topic Segmentation
• Conclusions
• Q/A

Defining Exploratory Corpus Analysis
• Getting a "sense" of your data
• How does it relate to:
  – Information retrieval: need to understand the whole corpus
  – Data mining: need a rich interface to support serendipitous search
  – Text classification: need to find the "interesting" classes

InfoMagnets
[figure: the InfoMagnets interface]

InfoMagnets Applications
• Behavioral research
  – 2 publishable results (submitted to CHI)
• CycleTalk Project, LTI
  – New findings on mechanisms at work in guided exploratory learning
• Robert Kraut's Netscan Group, HCII
• Conversational interfaces
  – Corpus organization makes authoring conversational agents less intimidating
  – Rose, Pai, & Arguello (2005); Gweon et al. (2005)

Authoring Conversational Interfaces
• Goal: make authoring CIs easier
• Solution: guide development with pre-processed sample human–human conversations
• Addresses several issues:
  – Accessible to non-computational linguists
  – Developers ≠ domain experts
  – Consistent with user-centered design: "The user is not like me!"

Topic Segmentation
[figure: transcribed human–human conversations, with topic sequence A C B C A B, used to construct a master template]

Topic Segmentation
• A preprocessing step for InfoMagnets
• But an important computational linguistics problem in its own right!
• Previous work:
  – Marti Hearst's TextTiling (1994)
  – Beeferman, Berger, and Lafferty (1997)
  – Barzilay and Lee (2004), NAACL best paper award
  – …
• But should it all fall under "topic segmentation"?

Topic Segmentation of Dialogue
• Dialogue is different:
  – Very little training data
  – Linguistic phenomena: ellipsis, telegraphic content
  – Coherence is organized around a shared task, not primarily around a single flow of information
• Lots of places where there is no overlap in "meaningful" content

Coherence Defined Over Shared Task
• Multiple topic shifts in regions with zero lexical cohesion

Experimental Condition
• 22 student–tutor pairs
• Conversation captured through a mainstream chat client
• Thermodynamics domain
• Training and test data coded by one coder
• Results reported in terms of Pk (Beeferman, Berger & Lafferty, 1999)
• Significance tests: two-tailed t-tests

1st Attempt: TextTiling
• TextTiling (Hearst, 1997)
  – Slide two adjacent "windows" (w1, w2) down the text
  – At each step, calculate the cosine correlation between the windows
  – Use the correlation values to calculate a "depth" score
  – Depth values higher than a threshold correspond to topic shifts

TextTiling Results

  Algorithm     (avg) Pk
  NONE          0.489703303
  ALL           0.518040038
  EVEN          0.518040038
  TextTiling    0.535304902

  t-test          p-value
  TT vs. NONE     0.160036665
  TT vs. ALL      0.089668491
  TT vs. EVEN     0.221969444

• Trend for TextTiling to perform worse than the degenerate baselines
• Difference not statistically significant
• Why doesn't it work?
  – Lots of gaps where the correlation = 0
  – Boundaries must then be selected heuristically
  – And this is already a heuristic improvement on the original
• Topic shifts tend NOT to occur where the correlation > 0

2nd Attempt: Barzilay and Lee (2004)
• Cluster utterances
• Treat each cluster as a "state"
• Construct an HMM
  – Emission probabilities: state-specific language models
  – Transition probabilities: based on location and cluster membership of the utterances
• Viterbi re-estimation until convergence

B&L Results
• B&L statistically better than TextTiling, but not better than the degenerate algorithms

  Algorithm     (avg) Pk
  NONE          0.489703303
  ALL           0.518040038
  EVEN          0.518040038
  TextTiling    0.535304902
  B&L           0.5086039

  t-test                p-value
  B&L vs. NONE          0.586165431
  B&L vs. ALL           0.471531964
  B&L vs. EVEN          0.862335954
  B&L vs. TextTiling    0.038565997

• Topic boundaries too fine-grained
• Most clusters based on "fixed expressions" (e.g. "ok", "yeah", "sure")
• Remember: cohesion is based on the shared task
• Are state-based language models sufficiently different?

Incorporating Dialogue Dynamics
• Dialogue act coding scheme (Donmez, 2004)
• Not originally developed for segmentation, but for discourse analysis of human–tutor dialogues
• 4 main dimensions:
  – Action: open question, closed question, negation, etc.
  – Depth: (yes/no) is the utterance accompanied by explanation or elaboration?
  – Focus: (binary) is the focus on the speaker or the other agent?
  – Control: initiation, response, feedback (the Dialogue Exchange model; Sinclair and Coulthard, 1975)

3rd Attempt: Cross-Dimensional Learning
• Use estimated labels on some dimensions to learn other dimensions
• 3 types of features:
  – Text (discourse cues)
  – Lexical coherence (binary)
  – Dialogue act labels
• 10-fold cross-validation
• Topic boundaries learned on estimated labels, not hand-coded ones!

X-Dimensional Learning Results
• X-DIM statistically better than TextTiling and the degenerate algorithms!

  Algorithm     (avg) Pk
  NONE          0.489703303
  ALL           0.518040038
  EVEN          0.511727103
  TextTiling    0.535304902
  B&L           0.5086039
  X-DIM         0.45199219

  t-test                  p-value
  X-DIM vs. NONE          0.024460654
  X-DIM vs. ALL           0.023027116
  X-DIM vs. EVEN          0.022848002
  X-DIM vs. TextTiling    0.011833934

Statistically Significant Improvement

           TT        B&L       X-DIM
  NONE     non-sig   non-sig   sig
  ALL      non-sig   non-sig   sig
  EVEN     non-sig   non-sig   sig
  TT                 sig       sig
  B&L                          sig

Future Directions
• Merge cross-dimensional learning (with dialogue act features) with B&L's content-modeling HMM approach
• Explore other work in topic segmentation of dialogue

Recap
• InfoMagnets and its applications
• Corpus exploration and authoring of conversational interfaces

Q/A
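The TextTiling pass described in the "1st Attempt" slides (slide two adjacent windows, take the cosine correlation at each step, mark gaps whose depth score exceeds a threshold) can be sketched in a few lines. This is a minimal illustration, not Hearst's full algorithm: there is no smoothing and no restriction to local minima, and the window size and threshold here are arbitrary assumptions rather than values from the study.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine correlation between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def texttiling_boundaries(utterances, window=3, threshold=0.15):
    """Slide two adjacent `window`-utterance windows over the transcript;
    gaps whose similarity "valley" is deep enough become topic shifts."""
    tokens = [u.lower().split() for u in utterances]
    sims = []
    for gap in range(window, len(tokens) - window + 1):
        left = Counter(w for t in tokens[gap - window:gap] for w in t)
        right = Counter(w for t in tokens[gap:gap + window] for w in t)
        sims.append(cosine(left, right))
    boundaries = []
    for i, s in enumerate(sims):
        # Depth = rise from this valley to the highest point on each side.
        depth = (max(sims[:i + 1]) - s) + (max(sims[i:]) - s)
        if depth > threshold:
            boundaries.append(i + window)  # map gap index back to utterances
    return boundaries
```

On chat transcripts, many gaps have similarity exactly 0 (the "zero lexical cohesion" regions in the slides), so the depth profile goes flat and boundaries must be chosen heuristically, which is the failure mode the TextTiling results describe.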
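All of the results tables above report Pk, the segmentation error metric of Beeferman, Berger & Lafferty (1999): the probability that two positions a fixed distance k apart are judged "same segment" by one segmentation and "different segments" by the other. A minimal sketch, assuming segmentations are represented as per-utterance segment labels (the exact evaluation conventions of the study's own code are not known):

```python
def pk(reference, hypothesis, k=None):
    """Pk: slide a window of width k over the text and count positions
    where reference and hypothesis disagree on whether the window's two
    ends fall in the same segment.

    `reference` / `hypothesis` are per-utterance segment labels,
    e.g. [0, 0, 1, 1, 1] means a boundary after the second utterance."""
    n = len(reference)
    if k is None:
        # Conventional choice: half the mean reference segment length.
        k = max(1, round(n / (2 * len(set(reference)))))
    errors = 0
    for i in range(n - k):
        ref_same = reference[i] == reference[i + k]
        hyp_same = hypothesis[i] == hypothesis[i + k]
        errors += ref_same != hyp_same
    return errors / (n - k)
```

Lower is better, and the degenerate baselines in the tables (NONE: one big segment, ALL: a boundary after every utterance, EVEN: evenly spaced boundaries) tend to score near 0.5, which is why they are surprisingly hard to beat.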

