UMD LBSC 796 - Cross-Language and Multimedia Information Retrieval


LBSC 796/INFM 718R: Week 11
Cross-Language and Multimedia Information Retrieval
Jimmy Lin
College of Information Studies
University of Maryland
Monday, April 17, 2006

Topics covered so far…
- Evaluation of IR systems
- Inner workings of IR black boxes
- Interacting with retrieval systems
- Interfaces in support of retrieval

Questions for Today
- What if the collection contains documents in a foreign language?
- What if the collection isn't even composed of textual documents?

Cross-Language IR
- Or "finding documents in languages you can't read"
- Why would you want to do it?
- How would you do it?

Most Widely-Spoken Languages
[Figure: bar chart of number of speakers (millions), split into primary and secondary speakers, for Chinese, English, Spanish, Russian, French, Portuguese, Arabic, Bengali, Hindi/Urdu, Japanese, and German. Source: Ethnologue (SIL), 1999]

Global Trade
[Figure: bar chart of trade (billions of dollars), exports and imports, for USA, Germany, Japan, China, France, UK, Canada, Italy, Netherlands, Belgium, Korea, Mexico, Taiwan, Singapore, and Spain. Source: World Trade Organization 2000 Annual Report]

Global Internet Users and Web Pages
[Figure: two pie charts. Internet user shares are native speakers, Global Reach projection for 2004 (as of Sept. 2003)]

Language      Global Internet Users   Web Pages
English       31%                     68%
Chinese       18%                     4%
Japanese      9%                      6%
Spanish       7%                      2%
German        7%                      6%
Korean        5%                      1%
French        4%                      3%
Portuguese    3%                      1%
Italian       3%                      2%
Russian       2%                      2%
Other         11%                     5%

A Community: CLEF
- CLEF = "Cross-Language Evaluation Forum"
- 8 tracks at CLEF 2005
  - Multilingual information retrieval
  - Cross-language information retrieval
  - Interactive cross-language information retrieval
  - Multiple language question answering
  - Cross-language retrieval on image collections
  - Cross-language spoken document retrieval
  - Multilingual Web retrieval
  - Cross-language geographic retrieval

The Information Retrieval Cycle
[Figure: the information retrieval cycle. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Intermediate products: Resource, Query, Ranked List, Documents. Feedback loops: source reselection; system discovery, vocabulary discovery, concept discovery, document discovery]

If you can't understand the documents…
- How do you formulate a query?
- How do you know something is worth looking at?
- How can you understand the retrieved documents?

CLIR
- CLIR = "Cross-Language Information Retrieval"
- Typical setup
  - User speaks only English
  - Wants access to documents in a foreign language (e.g., Chinese or Arabic)
- Requirements
  - User needs to understand retrieved documents!
  - Interface must support browsing of documents in foreign languages
- How do we do it?

Two Approaches
- Query translation
  - Translate English query into Chinese query
  - Search Chinese document collection
  - Translate retrieved results back into English
- Document translation
  - Translate entire document collection into English
  - Search collection in English
- Translate both?

Query Translation
[Figure: English queries pass through a translation system to become Chinese queries; the retrieval engine searches the Chinese document collection and returns Chinese documents as results, which the user selects and examines]

Document Translation
[Figure: a translation system converts the Chinese document collection into an English document collection; the retrieval engine searches it with English queries and returns results, which the user selects and examines]

Tradeoffs
- Query Translation
  - Often easier
  - Disambiguation of query terms may be difficult with short queries
  - Translation of retrieved documents must be performed at query time
- Document Translation
  - Documents can be translated and stored offline
  - Automatic translation can be slow
- Which is better?
  - Often depends on the availability of language-specific resources (e.g., morphological analyzers)
  - Both approaches present challenges for interaction

CLIR Issues
[Figure: a Chinese example (source characters not preserved) annotated with three problems. "Which translation?": one term maps to "oil" or "petroleum", another to "probe", "survey", or "take samples". "No translation!": a term with no dictionary entry. "Wrong segmentation": a segmentation error that yields "restrain" and "cymbidium goeringii"]

Learning to Translate
- Lexicons
  - Phrase books, bilingual dictionaries, …
- Large text collections
  - Translations ("parallel")
  - Similar topics ("comparable")
- People

[Figure: the Rosetta Stone, with its three scripts labeled Hieroglyphic, Demotic, and Greek]

Modern Rosetta Stones
- Newswire:
  - DE-News (German-English)
  - Hong-Kong News, Xinhua News (Chinese-English)
- Government:
  - Canadian-Hansards (French-English)
  - Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
  - UN Treaties (Russian, English, Arabic, …)
- The Bible (many, many languages)

Parallel Corpus
- Example from DE-News (8/1/1996)

English: Diverging opinions about planned tax reform
German:  Unterschiedliche Meinungen zur geplanten Steuerreform

English: The discussion around the envisaged major tax reform continues .
German:  Die Diskussion um die vorgesehene grosse Steuerreform dauert an .

English: The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for 1999 .
German:  Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen .

Word-Level Alignment
[Figure: word-by-word alignment links drawn between two sentence pairs]

English: Diverging opinions about planned tax reform
German:  Unterschiedliche Meinungen zur geplanten Steuerreform

English: Madam President , I had asked the administration …
Spanish: Señora Presidenta, había pedido a la administración del Parlamento …

Learning Translations
- From alignments, automatically induce a translation lexicon

[Figure: multiple translations of "survey" with their translation probabilities: 探测 (p = 0.4), 试探 (p = 0.3), 测量 (p = 0.25), 样品 (p = 0.05)]

A Translation Model
- From word-aligned bilingual text, we induce a translation model $p(f_i \mid e)$, where $\sum_{f_i} p(f_i \mid e) = 1$
- Example:
  - p(探测|survey) = 0.4
  - p(试探|survey) = 0.3
  - p(测量|survey) = 0.25
  - p(样品|survey) = 0.05

Using Multiple Translations
- Weighted Structured Query Translation
  - Takes advantage of multiple translations and translation probabilities
- TF and DF of query term e are computed using TF and DF of its translations:

  $TF(e, D_k) = \sum_{f_i} p(f_i \mid e) \times TF(f_i, D_k)$

  $DF(e) = \sum_{f_i} p(f_i \mid e) \times DF(f_i)$

Experiment Setup
- Does weighted structured query translation work?
- Test collection (from CLEF 2000-2003)
  - ~44,000 documents in French
  - 153 topics in English (and French, for comparison)
- IR system: Okapi weights
- Translation resources
  - Europarl parallel corpus: ~100M on each side
  - GIZA++ Statistical MT toolkit

Does it work?
- Runs:
  - Monolingual baseline
  - One-best translation baseline
  - Weighted structured query translation
- Results:
  - Weighted structured query translation always beats one-best translation
  - Weighted structured query translation performance approaches monolingual performance

Morphology and Segmentation
- For the query translation approach
  - The retrieval engine needs to perform monolingual IR in a foreign language
  - Morphology and segmentation pose
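
The TF and DF formulas from the "Using Multiple Translations" slide can be illustrated with a minimal Python sketch. It assumes the translation probabilities for "survey" given on the translation model slide; the three toy documents, the pre-segmented token lists, and the function names (`weighted_tf`, `weighted_df`, `one_best`) are hypothetical and are not taken from the lecture or from any particular IR toolkit.

```python
from collections import Counter

# Translation model p(f_i | e) for the English query term "survey",
# using the probabilities from the slides.
SURVEY = {"探测": 0.4, "试探": 0.3, "测量": 0.25, "样品": 0.05}

# Hypothetical, already-segmented Chinese documents (illustration only).
DOCS = [
    ["探测", "石油", "探测"],
    ["测量", "样品"],
    ["石油"],
]


def weighted_tf(translations, doc_tokens):
    """TF(e, D_k) = sum over f_i of p(f_i | e) * TF(f_i, D_k)."""
    counts = Counter(doc_tokens)
    return sum(p * counts[f] for f, p in translations.items())


def weighted_df(translations, docs):
    """DF(e) = sum over f_i of p(f_i | e) * DF(f_i)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return sum(p * df[f] for f, p in translations.items())


def one_best(translations):
    """One-best baseline: keep only the most probable translation."""
    return max(translations, key=translations.get)


if __name__ == "__main__":
    for k, doc in enumerate(DOCS):
        print(f"TF(survey, D{k}) = {weighted_tf(SURVEY, doc):.2f}")
    print(f"DF(survey) = {weighted_df(SURVEY, DOCS):.2f}")
    print("one-best translation:", one_best(SURVEY))
```

In this toy collection the second document contains none of the one-best translation (探测) but still receives a nonzero weighted TF through 测量 and 样品, which is the intuition behind using every translation weighted by its probability. The resulting TF and DF values would then feed into whatever term-weighting scheme the retrieval engine uses (Okapi weights in the experiment described in the slides).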

