UMD CMSC 723 - Languages at Inxight

Languages at Inxight
Ian Hersey
Co-Founder and SVP, Corporate Development and Strategy
Inxight Confidential

Inxight at a Glance
- 20+ years of Xerox PARC research, 70 patents
  - Content & linguistic analysis (27 languages today)
  - Information visualization and discovery
- Silicon Valley HQ; offices in US, Europe
- 250 major customers
- Seasoned management team
- Solid investor backing: Vantage Point, Reed Elsevier, Deutsche Bank, Dresdner Bank, Xerox, In-Q-Tel
Inxight provides the only complete solution for organizing and accessing unstructured data to increase the speed and accuracy of information discovery

What we mean by language support
- Not pure statistics
- "Language independence" is a fallacy when it comes to text
- Whitespace parsing + algorithmic stemming is a cheap hack
  - Stem-internal changes
  - Compounding
  - Agglutination
  - Vocalization or lack thereof
  - Non-breaking languages
- Phrases, terms and named entities can't be extracted effectively by n-gram indexing or pure machine learning

Text analysis fundamentals
Base layer
- Language and character set identification
- Document analysis
- Tokenization
- Stemming/normalization
Contextual analysis
- Part-of-speech tagging
- "Grouping"
Find the interesting stuff
- Named entity extraction
- Syntactic analysis (clause boundary identification, subject/object identification, etc.)
Relate the interesting stuff; analyze meaning
- Semantic analysis (fact extraction, etc.)

Don't ignore statistics
- Feed linguistic markup into probabilistic processing
  - Categorization (choose your algorithm)
  - Search/relevance ranking
  - Summarization
  - Co-occurrence analysis/entity resolution
  - Link analysis
  - Predictive analysis/data mining

Base layer (LinguistX Platform)
- Morphological analyzer
  - Lexicon + rules
  - Compiled as a finite-state machine
  - Resource efficient, very fast
  - French lexicon recognizes 5M words; takes up 300K on disk/RAM, and runs at over 2 GB/hr on a low-end machine
  - Xerox finite-state tools tested on many languages (Inxight's 27 + others in research)
- Corpora to produce statistical models
  - Language and character set detection
  - Tagged corpus to produce Hidden Markov Model for POS tagger
- Groupers
  - Finite-state "chunkers": compiled regex

Named entity extraction (ThingFinder)
- Builds on base platform
- Requires additional resources
  - Enhanced lexicon (POS tagset insufficient for high-quality extraction)
  - Entity-specific groupers
  - Tagged corpus for accuracy testing
- Sometimes you need more
  - Genre-specific document analysis
  - Specialized tokenization, tagging
  - Knowledge base ("Name Catalog")
  - Custom groupers

Statistical models
- Summarization
  - Base layer + feature model (feature weights, stopwords, cue phrases)
- Categorization
  - Labeled training data
- …and lots of interactive tools

Fact extraction
- Builds on base of linguistic markup + named entities
- Modeled on specific templates
- Rules populate the templates
Additional linguistic resources
- Intra-document
  - Document analysis/genre identification
  - Subject/object identification
  - Anaphora resolution
- Inter-document
  - Entity resolution

Developing a new language
- Resource acquisition
  - Corpora
  - Lexicon
- Team
  - Computational linguist familiar with tools
  - Native speaker
- Resource enhancement
  - Label tagged truth sets
  - Build out morphological classes
  - Fill lexical gaps
- Build, test and refine
Soup to nuts: $500K to $1M for V1.0

Challenge of low-density languages
- Commercial non-viability
- Lack of lexical resources and corpora
- Lack of native speakers, or even proficient speakers
- Greed

Future developments on the language frontier
- New languages
- Increased depth in existing languages
  - Named entity extraction
    - Added Arabic, Farsi and Chinese this year
    - Enhanced English for DoD and DOJ
  - Fact extraction
- Other challenges
  - Name transliteration
  - Translation/glossing
  - Question
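
The base-layer slide describes groupers only as finite-state "chunkers" built from compiled regular expressions. As a rough illustration of that idea, and not Inxight's actual LinguistX implementation, the Python sketch below runs a compiled regex over the part-of-speech tag sequence rather than over raw text, then maps each match back to the tokens it covers. The tag names (DET, ADJ, NOUN, VERB), the sample sentence, and the function name chunk_noun_phrases are all assumptions made for this example.

```python
import re

# Hypothetical tagged input: (token, POS tag) pairs such as a POS tagger might emit.
TAGGED = [
    ("the", "DET"), ("seasoned", "ADJ"), ("management", "NOUN"),
    ("team", "NOUN"), ("backs", "VERB"), ("new", "ADJ"), ("languages", "NOUN"),
]

# Assumed noun-phrase grammar over tags: optional determiner, any number of
# adjectives, then one or more nouns. Each tag is followed by a single space,
# and \b keeps a match from starting in the middle of a tag.
NP_PATTERN = re.compile(r"\b(?:DET )?(?:ADJ )*(?:NOUN )+")


def chunk_noun_phrases(tagged):
    """Return the noun phrases found by matching NP_PATTERN on the tag sequence."""
    # Build the tag string and remember where each token's tag starts in it.
    offset_to_index = {}
    pieces = []
    pos = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[pos] = index
        pieces.append(tag + " ")
        pos += len(tag) + 1
    offset_to_index[pos] = len(tagged)  # end-of-sequence offset
    tag_string = "".join(pieces)

    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        first = offset_to_index[match.start()]
        last = offset_to_index[match.end()]
        phrases.append(" ".join(token for token, _ in tagged[first:last]))
    return phrases


print(chunk_noun_phrases(TAGGED))
```

Because the pattern sees only tags, the same grouper applies to any language whose tagger emits this tagset, which is one reason such chunkers sit on top of the tokenization and tagging layers rather than replacing them.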

