Languages at Inxight
Ian Hersey
Co-Founder and SVP, Corporate Development and Strategy

Inxight Confidential

Inxight at a Glance
- 20+ years of Xerox PARC research, 70 patents
  - Content & linguistic analysis (27 languages today)
  - Information visualization and discovery
- Silicon Valley HQ; offices in US, Europe
- 250 major customers
- Seasoned management team
- Solid investor backing: Vantage Point, Reed Elsevier, Deutsche Bank, Dresdner Bank, Xerox, In-Q-Tel
Inxight provides the only complete solution for organizing and accessing unstructured data to increase the speed and accuracy of information discovery

What we mean by language support
- Not pure statistics
- "Language independence" is a fallacy when it comes to text
- Whitespace parsing + algorithmic stemming is a cheap hack
  - Stem-internal changes
  - Compounding
  - Agglutination
  - Vocalization or lack thereof
  - Non-breaking languages
- Phrases, terms and named entities can't be extracted effectively by n-gram indexing or pure machine learning

Text analysis fundamentals
Base layer
- Language and character set identification
- Document analysis
- Tokenization
- Stemming/normalization
Contextual analysis
- Part-of-speech tagging
- "Grouping"
Find the interesting stuff
- Named entity extraction
- Syntactic analysis (clause boundary identification, subject/object identification, etc.)
Relate the interesting stuff; analyze meaning
- Semantic analysis (fact extraction, etc.)

Don't ignore statistics
- Feed linguistic markup into probabilistic processing
  - Categorization (choose your algorithm)
  - Search/relevance ranking
  - Summarization
  - Co-occurrence analysis/entity resolution
  - Link analysis
  - Predictive analysis/data mining

Base layer (LinguistX Platform)
- Morphological analyzer
  - Lexicon + rules
  - Compiled as a finite-state machine
  - Resource efficient, very fast
    - French lexicon recognizes 5M words; takes up 300K on disk/RAM, and runs at over 2 GB/hr on a low-end machine
  - Xerox finite-state tools tested on many languages (Inxight's 27 + others in research)
- Corpora to produce statistical models
  - Language and character set detection
  - Tagged corpus to produce Hidden Markov Model for POS tagger
- Groupers
  - Finite-state "chunkers" – compiled regex

Named entity extraction (ThingFinder)
- Builds on base platform
- Requires additional resources
  - Enhanced lexicon (POS tagset insufficient for high quality extraction)
  - Entity-specific groupers
  - Tagged corpus for accuracy testing
- Sometimes you need more
  - Genre-specific document analysis
  - Specialized tokenization, tagging
  - Knowledge base ("Name Catalog")
  - Custom groupers

Statistical models
- Summarization
  - Base layer + feature model (feature weights, stopwords, cue phrases)
- Categorization
  - Labeled training data
- …and lots of interactive tools

Fact extraction
- Builds on base of linguistic markup + named entities
- Modeled on specific templates
- Rules populate the templates
Additional linguistic resources
- Intra-document
  - Document analysis/genre identification
  - Subject/object identification
  - Anaphora resolution
- Inter-document
  - Entity resolution

Developing a new language
- Resource acquisition
  - Corpora
  - Lexicon
- Team
  - Computational linguist familiar with tools
  - Native speaker
- Resource enhancement
  - Label tagged truth sets
  - Build out morphological classes
  - Fill lexical gaps
- Build, test and refine
Soup to nuts: $500K to $1M for V1.0

Challenge of low-density languages
- Commercial non-viability
- Lack of lexical resources and corpora
- Lack of native speakers, or even proficient speakers
- Greed

Future developments on the language frontier
- New languages
- Increased depth in existing languages
  - Named entity extraction
    - Added Arabic, Farsi and Chinese this year
    - Enhanced English for DoD and DOJ
  - Fact extraction
- Other challenges
  - Name transliteration
  - Translation/glossing
  - Question
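
Illustration (not from the original deck): the claim that whitespace parsing plus algorithmic stemming is a "cheap hack" can be made concrete with a toy comparison. The sketch below assumes a hypothetical four-entry lexicon and a made-up suffix list. It is not the LinguistX analyzer, which the slides describe as a lexicon plus rules compiled into a finite-state machine. It only shows why stem-internal changes defeat suffix stripping.

```python
# Toy illustration only. SUFFIXES and LEXICON are hypothetical stand-ins,
# not Inxight resources; a real analyzer would be a compiled finite-state machine.

SUFFIXES = ("ing", "es", "s", "ed")

def naive_stem(token: str) -> str:
    """Strip the first matching suffix; knows nothing about the word itself."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Hypothetical hand-built lexicon mapping surface forms to lemmas.
LEXICON = {
    "mice": "mouse",
    "geese": "goose",
    "sang": "sing",
    "houses": "house",
}

def lexicon_stem(token: str) -> str:
    """Look the token up in the lexicon; fall back to suffix stripping."""
    return LEXICON.get(token, naive_stem(token))

def analyze(text: str):
    """Whitespace-tokenize, then return (token, naive stem, lexicon stem) triples."""
    return [(t, naive_stem(t), lexicon_stem(t)) for t in text.lower().split()]

if __name__ == "__main__":
    for row in analyze("Mice sang in houses"):
        print(row)
```

Suffix stripping leaves "mice" and "sang" untouched and over-strips "houses", while the lexicon lookup recovers the lemma in each case. Compounding, agglutination, and non-breaking languages would also need a tokenizer that does not rely on whitespace, which this sketch does not attempt.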