Stanford CS 224 - Finding Experts in a Given Domain


Finding Experts in a Given Domain

Chihiro Fukami
Aswath Manoharan

Introduction:

The World Wide Web has vast amounts of information of various types, ranging from news articles and academic papers to resumes and blogs. Consequently, almost everything today begins with a web search. From searching for reviews of nearby Thai restaurants, to finding show times for The Da Vinci Code, to looking up the latest stock prices, to getting directions and maps before leaving on a trip, web searches have become ubiquitous.

But web searches still present data in an unstructured form. Search engines present documents containing the search query as a series of links. The onus is still on the user to open each document, scan through it, extract whatever information is relevant, and discard the rest. This is a time-consuming process, made worse by the fact that not all the documents returned by the search engine may even be relevant. Worse still, the most relevant document may sit on the tenth page of the search results, and it is hard to imagine any user opening results that deep.

One can imagine tools that automate this process. Such a tool would scan through all the documents, extract relevant information, impose some structure on the unstructured information, and finally present it in an interface suited to what is being searched. For example, when a patient searches for physicians in his area, instead of a series of links, the tool could present just the physicians' names, addresses, contact information, specializations, rates, and office hours in a neat spreadsheet format. The user could then sort this data on each of the different attributes: to find the closest physician he could sort by location; if he is penny conscious he could sort by rates.
The tool would extract all this information from the series of links returned by a search engine.

In this project, we attempt to build one such tool. Specifically, we aim to find experts in a given field. People often search the web for experts in a particular field: parents are on the lookout for the best tennis coaches for their prodigious kids; recruiters and headhunters are constantly on the search for talent in particular fields; attorneys need to scout for expert witnesses in some of their cases; journalists need to interview experts for articles they are working on. Most of these explorations begin with a simple web search such as "music industry experts" or "scholars in Latin American history". Users are then confronted with a series of links, just as always. We hope to alleviate this problem by extracting the relevant information (the experts) from that series of links.

Who is an expert?

The definition of an expert is itself quite nebulous and vague. For the purposes of this project we determined a few simple heuristics, all based on the assumption that a web search would be the initial step. The heuristics are:

- If a name occurs in more than one document (we call this its cross-document frequency), the name is likely an expert's name. If we are searching for "famous physicists", Albert Einstein's name will likely occur in more than one document; the more articles that talk about the same person, the more of an 'expert' that person is. Note that this is different from raw frequency. A name could occur 100 times in the same document (in an interview with the person, for instance), but if it does not occur in any other document, it gets a cross-document frequency of 1. A name that occurs once each in two different documents gets a cross-document frequency of 2, and therefore ranks higher than the name that occurred 100 times in just one document.

- If a name is explicitly characterized as an expert.
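The cross-document frequency heuristic can be sketched as follows. This is an illustrative Python fragment, not the project's actual (Java-based) implementation, and the document ids and names are hypothetical:

```python
from collections import Counter

def cross_document_frequency(docs_to_names):
    """Count, for each name, how many distinct documents mention it.

    docs_to_names maps a document id to the list of names extracted
    from that document; repeats within one document count only once.
    """
    counts = Counter()
    for names in docs_to_names.values():
        for name in set(names):  # set(): collapse within-document repeats
            counts[name] += 1
    return counts

# Hypothetical extraction results: a name repeated 100 times in one
# document scores 1; a name appearing once in each of two documents scores 2.
docs = {
    "doc1": ["Alice Smith"] * 100,
    "doc2": ["Bob Jones"],
    "doc3": ["Bob Jones"],
}
cdf = cross_document_frequency(docs)
```

Ranking candidates by `cdf` then prefers "Bob Jones" (score 2) over "Alice Smith" (score 1), exactly the ordering the heuristic calls for.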
There are distinct patterns by which a name is characterized as an expert, such as "Person X is a noted expert in Domain Y". We attempt to capture such instances.

- If a person has won awards, has been quoted, or has had his articles cited, he is probably an expert.

It needs to be emphasized that all these heuristics were determined before we began work on the project, and one of the goals of the project was to explore how well each of the different heuristics worked.

Extracting Names:

A key component of the project was extracting names from documents. However, this was not the primary focus of the project, so we decided to use a pre-existing package to analyze a given text document: LingPipe, a Java library that contains programs for linguistic analysis of the English language. One of the demos in the LingPipe package is a tool that takes a block of text, divides it into individual sentences, and then finds proper nouns (i.e. people's names, locations, and organizations) within each sentence. The output is an XML file that categorizes and tags each of these entities.

Because this demo connects directly to a database on a server containing statistical data on sentence structures and categorized names (in the manner that we stored data in hashtables in previous assignments), instead of creating our own software we wrote a program that communicates with the demo, feeds it our own files, and receives the resulting XML. Our program then parses the XML, extracts people's names, and writes them to a file. This data is then passed to the other modules.

We also looked at another open-source package called Yamcha. However, it required training data to be supplied, and we did not want to spend time preparing that. LingPipe, by contrast, came with a built-in model.
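The XML-parsing step can be sketched as below. This is an illustrative Python fragment, not the project's Java program, and it assumes a simplified ENAMEX-style tag format and a made-up sample sentence; LingPipe's actual output may differ in detail:

```python
import xml.etree.ElementTree as ET

def extract_person_names(xml_text):
    """Pull person names out of NER-tagged XML.

    Assumes each recognized entity is wrapped in an
    <ENAMEX TYPE="..."> element (a simplifying assumption).
    """
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("ENAMEX")
            if el.get("TYPE") == "PERSON" and el.text]

# Hypothetical tagged output for one sentence.
sample = (
    '<doc><s><ENAMEX TYPE="PERSON">Angus Wallace</ENAMEX> works at '
    '<ENAMEX TYPE="ORGANIZATION">Stanford</ENAMEX>.</s></doc>'
)
names = extract_person_names(sample)
```

The filter on `TYPE` is what discards locations and organizations, keeping only the people's names that the later modules consume.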
Hence we decided to go with LingPipe.

Patterns of Expert Characterization:

As mentioned in an earlier section, in news articles, blogs, interviews, profiles, and biographical sketches, experts are often identified using a certain set of patterns. It is quite common to see phrases like "Professor Angus Wallace, one of the foremost orthopedic surgeons…" or "Dr. Kain is an expert in science education for high school kids". These patterns not only classify a name as an expert; they also distinguish the names of experts from those of non-experts (such as the journalist who wrote the article) that occur in the same document. Our approach was to enumerate these different patterns offline and then look for occurrences of them in the search results.
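The offline pattern-matching idea can be sketched with regular expressions. The two patterns below are illustrative only, modeled on the example phrases above; the project's real pattern list is not reproduced here, and the sample text is invented:

```python
import re

# A crude two-word capitalized-name shape (an assumption for illustration).
NAME = r"[A-Z][a-z]+ [A-Z][a-z]+"

# Hypothetical expert-characterization patterns, not the project's actual list.
EXPERT_PATTERNS = [
    re.compile(rf"(?:Dr\.|Professor)?\s*({NAME}) is (?:an|a noted) expert in"),
    re.compile(rf"({NAME}), one of the foremost"),
]

def find_characterized_experts(text):
    """Return names the text explicitly characterizes as experts."""
    hits = []
    for pattern in EXPERT_PATTERNS:
        hits.extend(m.strip() for m in pattern.findall(text))
    return hits

text = ("Professor Angus Wallace, one of the foremost orthopedic "
        "surgeons, spoke today. Dr. Jane Kain is an expert in "
        "science education.")
experts = find_characterized_experts(text)
```

Note how the capturing group excludes the surrounding pattern text, so a match yields only the name; a journalist's byline elsewhere in the document would not match either pattern.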

