5/24/09

Cross-Language IR
CISC489/689-010, Lecture #23
Monday, May 11th
Ben Carterette

Cross-Language IR
• User submits a query in one language, gets results in a different language
• Documents are semi-structured and heterogeneous (as almost all data in IR), and also in multiple languages
• Information may only be available in documents written in one of the languages
• Highly useful to intelligence community

Approaches to CLIR
• Translate the documents into the users' language, and let the users submit queries in their own language
• Translate the users' queries into target language(s) and use the translated query for retrieval
• Translate both queries and documents to an "intermediate" language

Automatic Translation
• What are some approaches to automatic translation?
  – Language-to-language dictionaries
• Languages do not translate precisely
  – One word with several meanings in one language might translate to several different words in the other
  – Many words with the same meaning might all translate to a single word
  – A word in one language might only be expressible as a phrase in another (or vice versa)
  – etc…

Example
• English queries to retrieve Spanish documents
• System works by translating query to Spanish
• Query: "bank fraud"
• Translations of "bank":
  – Orilla (river bank)
  – Terraplen (bank of earth)
  – Banco (bank of clouds)
  – Bateria (bank of lights)
  – Banco (financial institution)
  – Banca (casino bank)
• Translations of "fraud":
  – Impostor (fraudulent person)
  – Fraude (deception)
• How would a dictionary-based system know which pair of translations to use?
• Possibly correct translation: Fraude bancario

Statistical Approach
• Instead of trying to translate directly, apply statistical methods
• Learn "translation probabilities" P(f | e): the probability of translating string e in language E to string f in language F
• E.g.:
  – P(orilla fraude | bank fraud), P(orilla impostor | bank fraud), P(banco fraude | bank fraud), …

Cross-Language Language Model
• Recall query-likelihood language model:

$$P(Q \mid D) = \prod_{q \in Q} P(q \mid D) = \prod_{q \in Q} \left[ (1 - \alpha_D)\,\frac{tf_{q,D}}{|D|} + \alpha_D\,\frac{ctf_q}{|C|} \right]$$

• Let's adapt this to cross-language retrieval using statistical translation

$$P(Q_f \mid D_e) = \prod_{q_f \in Q_f} P(q_f \mid D_e) = \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f \mid t_e)\, P(t_e \mid D_e)$$

$$= \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f \mid t_e) \left[ (1 - \alpha_{D_e})\,\frac{tf_{t_e,D_e}}{|D_e|} + \alpha_{D_e}\,\frac{ctf_{t_e}}{|C_e|} \right]$$

Translation Model
• What is P(q_f | t_e)?
• The translation model: probability of translating word t_e in language E to word q_f in language F
• Where does it come from?
  – Maybe a dictionary approach: every possible translation of t_e has equal probability
  – e.g. P(orilla | bank) = P(banco | bank) = P(banca | bank) = …

Statistical Translation Model
• An alternative approach: parallel corpora

Statistical Translation with Parallel Corpora
• Parallel corpora consist of documents in two or more languages that are known to be translations of one another
• The parallel corpora are aligned: string e and string f are marked as translations of each other
• We can use these alignments to estimate a translation model

Translation Model
• To estimate P(q_f | t_e), count the number of aligned string pairs (e, f) such that t_e is a word in e and q_f is a word in f
• Divide by the total number of strings in language E that contain t_e

$$P(q_f \mid t_e) = \frac{\left|\{(e, f) \mid t_e \in e \text{ and } q_f \in f\}\right|}{\left|\{e \mid t_e \in e\}\right|}$$

Simple Alignment Example
• English sentence: "The objective was clear: arrest and extradite to Mexico the woman against whom they had charged for fraud to a recognized banking institution."
• Spanish sentence: "El objetivo era claro: detener a la mujer y enviarla de regreso a México pues habían cargos en su contra por fraude a una reconocida institución bancaria."
• Every pair of words in these two sentences will have some translation probability
• Over many sentences, the highest probabilities will be the pairs of words that are most closely related

Alignments
• Alignments can be much more detailed
[Figure: word-alignment diagrams. Images from Brown et al., "The Mathematics of Statistical Machine Translation"]

Parallel Corpora
• Where do we get parallel corpora?
  – Find documents that we know to be translations
  – Canadian Hansard: transcripts of Canadian parliamentary debates in both English and French
  – European Union law in 22 languages
• Anything that's not law-related?
  – Wikipedia articles in different languages. Not necessarily translations though

CLIR Experiments
• CLIR track ran at TREC from 1998 through 2002
• Languages used include English, German, French, Italian, Chinese, and Arabic
• Other issues in CLIR:
  – Segmentation, stemming, stopping, phrases require different approaches in different languages
  – I am going to focus on high-level problem

CLIR Experiments
• In 2001 and 2002, the main CLIR task was English queries to retrieve Arabic documents
• Documents: 383,872 news articles from Agence France Presse from 1994-2000
• Information needs: 25 queries, descriptions, and narratives in English by native Arabic speakers
  – Translated into Arabic and French as well
• Participating sites could do CLIR (English to Arabic or French to Arabic) or normal IR (Arabic to Arabic)

Example Topic
<num> Number: AR26
<title> مجلس المقاومة الوطني الكردستاني
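
Sketch: translation model + cross-language query likelihood
The counting estimate of P(q_f | t_e) from aligned string pairs and the cross-language query-likelihood score P(Q_f | D_e) from the earlier slides can be sketched in a few lines of Python. This is only a minimal illustration, not the lecture's implementation: the function names, the whitespace tokenization, the fixed smoothing weight alpha = 0.5, and the three toy aligned pairs and two toy documents are all assumptions made up for the example. Note that the counting estimate, as written on the slide, is not normalized over q_f.

```python
import math
from collections import Counter, defaultdict


def estimate_translation_model(aligned_pairs):
    """Estimate P(q_f | t_e) by counting over aligned string pairs (e, f).

    Numerator: number of pairs where t_e occurs in e and q_f occurs in f.
    Denominator: number of strings e that contain t_e.
    """
    pair_counts = defaultdict(Counter)  # t_e -> {q_f: co-occurrence count}
    strings_with = Counter()            # t_e -> number of e strings containing it
    for e, f in aligned_pairs:
        e_words, f_words = set(e.split()), set(f.split())
        for te in e_words:
            strings_with[te] += 1
            pair_counts[te].update(f_words)
    return {te: {qf: n / strings_with[te] for qf, n in counts.items()}
            for te, counts in pair_counts.items()}


def log_score(query_f, doc_e, coll_tf, coll_len, trans, alpha=0.5):
    """log P(Q_f | D_e): for each query word, sum over words t_e of
    P(q_f | t_e) times the smoothed document model P(t_e | D_e)."""
    doc_tf = Counter(doc_e)
    total = 0.0
    for qf in query_f:
        p_qf = 0.0
        # Words t_e without a translation entry contribute zero, so skip them.
        for te, row in trans.items():
            p_te = ((1 - alpha) * doc_tf[te] / len(doc_e)
                    + alpha * coll_tf.get(te, 0) / coll_len)
            p_qf += row.get(qf, 0.0) * p_te
        if p_qf == 0.0:
            return float("-inf")  # query word cannot be generated at all
        total += math.log(p_qf)
    return total


# Toy data (hypothetical): language E is English, language F is Spanish.
pairs = [("bank fraud", "fraude bancario"),
         ("river bank", "orilla del rio"),
         ("fraud case", "caso de fraude")]
trans = estimate_translation_model(pairs)

docs = {"d1": ["bank", "fraud", "trial"],
        "d2": ["river", "bank", "flood"]}
coll_tf = Counter(w for d in docs.values() for w in d)
coll_len = sum(coll_tf.values())

query = ["fraude", "bancario"]
ranking = sorted(
    docs,
    key=lambda d: log_score(query, docs[d], coll_tf, coll_len, trans),
    reverse=True,
)
print(ranking)
```

Scores are summed in log space to avoid underflow on longer queries. With only a handful of aligned pairs, frequent words co-occur with everything, which is why the slides stress that the estimates become meaningful only over many sentences.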