UMD CMSC 723 - Languages at Inxight - D2721142


School: University of Maryland, College Park

Course: CMSC 723 - Computational Linguistics I

Pages: 2


Languages at Inxight
Ian Hersey, Co-Founder and SVP, Corporate Development and Strategy

Inxight at a Glance
Inxight provides the only complete solution for organizing and accessing unstructured data to increase the speed and accuracy of information discovery.
- 20 years of Xerox PARC research, 70 patents
  - Content/linguistic analysis (27 languages today)
  - Information visualization and discovery
- Silicon Valley HQ, offices in US and Europe
- 250 major customers
- Seasoned management team
- Solid investor backing: Vantage Point, Reed Elsevier, Deutsche Bank, Dresdner Bank, Xerox, In-Q-Tel

Inxight Confidential

What we mean by language support
- Not pure statistics
- Language independence is a fallacy when it comes to text
- Whitespace parsing plus algorithmic stemming is a cheap hack
  - Stem-internal changes
  - Compounding
  - Agglutination
  - Vocalization (or lack thereof)
  - Non-breaking languages
- Phrases, terms and named entities can't be extracted effectively by n-gram indexing or pure machine learning

Text analysis fundamentals
- Base layer
  - Language and character set identification
  - Document analysis
  - Tokenization
  - Stemming/normalization
- Contextual analysis
  - Part-of-speech tagging
  - Grouping
- Find the interesting stuff
  - Named entity extraction
  - Syntactic analysis (clause boundary identification, subject/object identification, etc.)
- Relate the interesting stuff (analyze meaning)
  - Semantic analysis, fact extraction, etc.

Base layer: LinguistX Platform
- Morphological analyzer
  - Lexicon and rules
  - Compiled as a finite-state machine
  - Resource efficient, very fast
  - French lexicon recognizes 5M words, takes up 300K on disk/RAM, and runs at over 2 GB/hr on a low-end machine
  - Xerox finite-state tools tested on many languages (Inxight's 27, others in research)
- Corpora to produce statistical models
  - Language and character set detection
  - Tagged corpus to produce Hidden Markov Model for POS tagger
- Groupers
  - Finite-state chunkers (compiled regex)

Don't ignore statistics
- Feed linguistic markup into probabilistic processing
  - Categorization (choose your algorithm)
  - Search relevance ranking
  - Summarization
  - Co-occurrence analysis, entity resolution
  - Link analysis
  - Predictive analysis, data mining

Named entity extraction: ThingFinder
- Builds on base platform
- Requires additional resources
  - Enhanced lexicon (POS tagset insufficient for high-quality extraction)
  - Entity-specific groupers
  - Tagged corpus for accuracy testing
- Sometimes you need more
  - Genre-specific document analysis
  - Specialized tokenization, tagging
  - Knowledge base (Name Catalog)
  - Custom groupers

Statistical models
- Summarization
  - Base layer plus feature model (feature weights, stop words, cue phrases)
- Categorization
  - Labeled training data and lots of interactive tools

Fact extraction
- Builds on base of linguistic markup, named entities
- Modeled on specific templates
- Rules populate the templates
- Additional linguistic resources
  - Intra-document
    - Document analysis, genre identification
    - Subject/object identification
    - Anaphora resolution
  - Inter-document
    - Entity resolution

Developing a new language
- Resource acquisition
  - Corpora
  - Lexicon
- Team
  - Computational linguist familiar with tools
  - Native speaker
- Resource enhancement
  - Label tagged truth sets
  - Build out morphological classes
  - Fill lexical gaps
- Build, test and refine
- Soup to nuts: 500K to 1M for V1.0

Future developments
- New languages
- Increased depth in existing languages
  - Named entity extraction
    - Added Arabic, Farsi and Chinese this year
    - Enhanced English for DoD and DOJ
  - Fact extraction

On the language frontier
- Challenge of low-density languages
  - Commercial non-viability
  - Lack of lexical resources and corpora
  - Lack of native speakers, or even proficient speakers
  - Greed
- Other challenges
  - Name transliteration
  - Translation, glossing
  - Question answering
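
The "Groupers" item under the LinguistX base layer describes finite-state chunkers built from compiled regular expressions that run over part-of-speech output. The slides give no implementation details, so the following is only a minimal sketch of that general idea, not Inxight's method. The tag names, the one-letter codes in `TAG_CODES`, the noun-phrase pattern, and the function `chunk_noun_phrases` are all assumptions made for illustration.

```python
import re

# Hypothetical mapping from POS tags to one-letter codes (the tagset is an
# assumption for this sketch, not the LinguistX tagset).
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "PROPN": "P", "VERB": "V"}

# Noun-phrase pattern over tag codes: an optional determiner, any number of
# adjectives, then one or more nouns or proper nouns. It is compiled once
# and reused for every sentence.
NP_PATTERN = re.compile(r"D?A*[NP]+")


def chunk_noun_phrases(tagged):
    """Return noun-phrase strings from a list of (token, tag) pairs."""
    # Encode the tag sequence with exactly one character per token, so that
    # regex match offsets map directly back to token positions. Tags missing
    # from TAG_CODES become "O" (other) and never match the pattern.
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    return [
        " ".join(token for token, _ in tagged[m.start():m.end()])
        for m in NP_PATTERN.finditer(codes)
    ]


sentence = [
    ("The", "DET"), ("French", "ADJ"), ("lexicon", "NOUN"),
    ("recognizes", "VERB"), ("five", "NUM"), ("million", "NUM"),
    ("words", "NOUN"),
]
chunks = chunk_noun_phrases(sentence)
print(chunks)  # ['The French lexicon', 'words']
```

Encoding each tag as a single character keeps the pattern a plain regular expression over the tag sequence, which is what lets it be compiled once and applied quickly. An entity-specific grouper of the kind mentioned for ThingFinder would presumably swap in a different pattern and a richer tagset.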
