Stanford CS 224 - Clustering Semantically Similar and Related Questions - D2015895



School name Stanford University

Course CS 224N Natural Language Processing with Deep Learning

Pages 8

This preview shows page 1-2-3 out of 8 pages.

Unformatted text preview:

Clustering Semantically Similar and Related Questions

Deepa [email protected]

ABSTRACT

The success of online question answering communities that allow humans to answer questions posed by other humans has opened up a whole new set of search, browse and clustering problems. One important problem arises from the need to show similar and related questions for a particular “probe” question. The first exercise is to define a measure of similarity and relatedness that leads to showing interesting results to the user, assuming that the user is interested in the probe question. However, this exercise reveals the inherent ambiguity of defining relatedness for this problem. The proposed solution takes this ambiguity into account and does not rely on a fixed similarity measure to filter and re-rank questions. Instead, the approach uses a two-step method to show relevant questions for the probe. The first step identifies the main topic of the question, which is then used to construct the base set of questions sharing that topic. This base set is then clustered, taking into account the lexical and semantic similarity between the questions. The hypothesis is that each of the identified clusters defines a type of relatedness with the probe; one of these types is the identity relation, which encompasses paraphrases and very similar questions. The results were evaluated manually: in 90% of the cases the clustering technique was effective and the results displayed in this manner seemed appealing.

2 INTRODUCTION

Online communities that allow you to ask and answer questions have recently become very popular. Unlike automated question answering systems (such as www.ask.com, www.brainboost.com), these online communities do not promise to answer your questions automatically.
Instead, they provide a platform where other people in the community can answer your question and you can answer other questions posed on the forum. Yahoo Answers (www.answers.yahoo.com), Amazon’s AskVille (www.askville.com) and www.blurit.com are some examples of such online Q&A services. The huge success of these systems can be attributed in part to the failure of large-scale automated question answering systems. The growth of such services has been phenomenal: Yahoo Answers had 65 million answers and more than 7 million questions in November of 2006, and six months later it had grown to more than 15 million questions and 140 million answers. As a consequence, these services have created a huge repository of human-generated questions and answers.

A lot of research is under way on treating this “user generated content” differently from most documents on the web. This leads to a different set of search, browse and presentation problems for this “web” of user generated content. Different models for querying, ranking of results, clustering and categorization are evolving for data in Q&A form. Since this user generated content is in natural language, it is natural that several NLP techniques would be useful for handling these problems.

These Q&A services might not wish to offer automated answers to questions, but they do want to provide good presentation in terms of browsing, searching and grouping results. One feature that most of these services wish to offer is the display of similar and related questions for a particular “query” question. Unlike a normal text document, a question and answer page is more structured, and the most useful portion describing the intent of the page is the question. Traditional document similarity measures use bag-of-words models and approximate similarity between two documents by high overlap in the terms of those documents.
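As a rough illustration of the bag-of-words overlap just described (a minimal sketch, not the measure used in this project), the snippet below scores questions by cosine similarity over raw term counts. The tokenizer, the function names and the third example question are assumptions made for this illustration only.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens (simplistic tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    # Counter returns 0 for terms missing from b, so no key check is needed.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


probe = bag_of_words("How can I get rid of my fear of flying?")
paraphrase = bag_of_words("How can I overcome my fear of flying?")
# Hypothetical question with a different intent but many shared words.
unrelated = bag_of_words("How can I get rid of my old car?")

print(cosine_similarity(probe, paraphrase))
print(cosine_similarity(probe, unrelated))
```

On questions this short, the score is dominated by shared function words such as "how", "can" and "of", so the question about the old car scores almost as high as the true paraphrase. That is one reason a document-level overlap measure is a poor fit for comparing individual questions.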
However, to determine similarity between questions alone, it becomes important to define a similarity measure that is targeted at the sentence level.

In this project, I experimented with several methods for displaying related and similar questions for a given question. The focus was not on identifying perfectly similar questions or paraphrases but on formulating different ways of grouping questions that share the same aspect of relatedness with the given question. The solution that I propose and implement is effective for the problem defined below.

3 PROBLEM DEFINITION

The problem is defined as follows: given a question q, identify questions in the corpus that are similar and related to this question, and group/cluster them according to their “relatedness” to the question q.

I decided to focus on this problem rather than on identifying only paraphrases of a question for the following reason. When a user is “visiting” a question q, it is less interesting to look only at paraphrases or identical questions than at things that could be “related” to q. The assumption is that the user might be visiting this Q&A page through a web search or by browsing the online community. Thus, if we cluster results that potentially have some relatedness with question q, the idea is that each cluster identified would correspond to one way of defining “relatedness”.

Take the example of q being “How can I get rid of my fear of flying?” We show clusters of questions which are “related” to this question, with one criterion for relatedness being semantic identity or exact question match. Hence, one of the clusters formed would contain questions that are exact duplicates of the above question plus paraphrases of the form “How can I overcome my fear of flying?” Another cluster could define a relation which considers other fears, such as fear of height or depth, and ways to overcome those.
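The grouping described in this example can be sketched in a few lines. This is only an illustration under loud assumptions: the topic is supplied by hand as a keyword, the stopword list is made up, and the greedy Jaccard grouping with a fixed threshold stands in for whatever clustering the project actually uses, which the text so far does not specify.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"how", "can", "i", "my", "of", "to", "a", "the",
             "do", "is", "what", "get"}


def content_words(question):
    """Set of lowercase tokens that are not in the stopword list."""
    tokens = re.findall(r"[a-z']+", question.lower())
    return {t for t in tokens if t not in STOPWORDS}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def base_set(corpus, topic):
    """Step 1 (assumed form): keep questions that mention the topic word."""
    return [q for q in corpus if topic in content_words(q)]


def cluster(questions, threshold=0.5):
    """Step 2 (assumed form): greedy grouping against each cluster's first member."""
    clusters = []
    for q in questions:
        words = content_words(q)
        for members in clusters:
            if jaccard(words, content_words(members[0])) >= threshold:
                members.append(q)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([q])
    return clusters


corpus = [
    "How can I get rid of my fear of flying?",
    "How can I overcome my fear of flying?",
    "How do I overcome my fear of heights?",
    "What is the cheapest way of flying to Paris?",
]
related = base_set(corpus, topic="fear")
groups = cluster(related)
for group in groups:
    print(group)
```

With these toy inputs, the two fear-of-flying questions fall into one group (the identity or paraphrase relation), the fear-of-heights question forms a second group, and the question about flights to Paris is dropped in the first step because it does not share the topic.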
One could think of many such relations according to which the questions could be grouped.

This method imposes some constraints on the definitions of paraphrasing and relatedness. In my definition of paraphrasing, there needs to be at least one lexically overlapping concept
