Languages at Inxight
  - Inxight at a Glance
  - What we mean by language support
  - Text analysis fundamentals
  - Don’t ignore statistics
  - Base layer (LinguistX Platform)
  - Named entity extraction (ThingFinder)
  - Statistical models
  - Fact extraction
  - Developing a new language
  - Challenge of low-density languages
  - Future developments on the language frontier

Languages at Inxight
Ian Hersey
Co-Founder and SVP, Corporate Development and Strategy
Inxight Confidential

Inxight at a Glance
  - 20+ years of Xerox PARC research - 70 patents
      - Content & linguistic analysis (27 languages today)
      - Information visualization and discovery
  - Silicon Valley HQ; offices in US, Europe
  - 250 major customers
  - Seasoned management team
  - Solid investor backing: Vantage Point, Reed Elsevier, Deutsche Bank, Dresdner Bank, Xerox, In-Q-Tel
  Inxight provides the only complete solution for organizing and accessing unstructured data to increase the speed and accuracy of information discovery

What we mean by language support
  - Not pure statistics
  - “Language independence” is a fallacy when it comes to text
  - Whitespace parsing + algorithmic stemming is a cheap hack
      - Stem-internal changes
      - Compounding
      - Agglutination
      - Vocalization or lack thereof
      - Non-breaking languages
  - Phrases, terms and named entities can’t be extracted effectively by n-gram indexing or pure machine learning

Text analysis fundamentals
  - Base layer
      - Language and character set identification
      - Document analysis
      - Tokenization
      - Stemming/normalization
  - Contextual analysis
      - Part-of-speech tagging
      - “Grouping”
  - Find the interesting stuff
      - Named entity extraction
      - Syntactic analysis (clause boundary identification, subject/object identification, etc.)
  - Relate the interesting stuff; analyze meaning
      - Semantic analysis (fact extraction, etc.)

Don’t ignore statistics
  - Feed linguistic markup into probabilistic processing
      - Categorization (choose your algorithm)
      - Search/relevance ranking
      - Summarization
      - Co-occurrence analysis/entity resolution
      - Link analysis
      - Predictive analysis/data mining

Base layer (LinguistX Platform)
  - Morphological analyzer
      - Lexicon + rules
      - Compiled as a finite-state machine
      - Resource efficient, very fast
          - French lexicon recognizes 5M words; takes up 300K on disk/RAM, and runs at over 2 GB/hr on a low-end machine
      - Xerox finite-state tools tested on many languages (Inxight’s 27 + others in research)
  - Corpora to produce statistical models
      - Language and character set detection
      - Tagged corpus to produce Hidden Markov Model for POS tagger
  - Groupers
      - Finite-state “chunkers” (compiled regex)

Named entity extraction (ThingFinder)
  - Builds on base platform
  - Requires additional resources
      - Enhanced lexicon (POS tagset insufficient for high quality extraction)
      - Entity-specific groupers
      - Tagged corpus for accuracy testing
  - Sometimes you need more
      - Genre-specific document analysis
      - Specialized tokenization, tagging
      - Knowledge base (“Name Catalog”)
      - Custom groupers

Statistical models
  - Summarization
      - Base layer + feature model (feature weights, stop words, cue phrases)
  - Categorization
      - Labeled training data
  - …and lots of interactive tools

Fact extraction
  - Builds on base of linguistic markup + named entities
  - Modeled on specific templates
  - Rules populate the templates
  - Additional linguistic resources
      - Intra-document
          - Document analysis/genre identification
          - Subject/object identification
          - Anaphora resolution
      - Inter-document
          - Entity resolution

Developing a new language
  - Resource acquisition
      - Corpora
      - Lexicon
  - Team
      - Computational linguist familiar with tools
      - Native speaker
  - Resource enhancement
      - Label tagged truth sets
      - Build out morphological classes
      - Fill lexical gaps
  - Build, test and refine
  - Soup to nuts: $500K to $1M for V1.0

Challenge of low-density languages
  - Commercial non-viability
  - Lack of lexical resources and corpora
  - Lack of native speakers, or even proficient speakers
  - Greed

Future developments on the language frontier
  - New languages
  - Increased depth in existing languages
      - Named entity extraction
          - Added Arabic, Farsi and Chinese this year
          - Enhanced English for DoD and DOJ
      - Fact extraction
  - Other challenges
      - Name transliteration
      - Translation/glossing

Question
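
Illustration: a grouper as a compiled regex

The "Base layer (LinguistX Platform)" slide describes groupers as finite-state "chunkers" compiled from regular expressions and run over part-of-speech output. The sketch below is one minimal way to picture that idea in Python; it is not the LinguistX or ThingFinder API. The tag names (DT, JJ, NN), the single-letter encoding, the noun-phrase pattern, and the function name chunk_noun_phrases are all assumptions made for illustration.

```python
import re

# Hypothetical tag set: each POS tag is encoded as one letter so that a
# character offset in the encoded string equals a token index.
TAG_CODES = {"DT": "D", "JJ": "J", "NN": "N"}

# Assumed noun-phrase pattern: optional determiner, any number of
# adjectives, then one or more nouns. Python compiles it to a matcher.
NOUN_PHRASE = re.compile(r"D?J*N+")


def chunk_noun_phrases(tagged_tokens):
    """Group (token, tag) pairs into noun-phrase strings.

    Tags missing from TAG_CODES are encoded as "O" (other), which the
    pattern never matches, so they act as chunk boundaries.
    """
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_tokens)
    phrases = []
    for match in NOUN_PHRASE.finditer(codes):
        words = [token for token, _ in tagged_tokens[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


# Hand-tagged example; in a real pipeline the tags would come from the tagger.
sentence = [
    ("the", "DT"), ("French", "JJ"), ("lexicon", "NN"),
    ("recognizes", "VB"), ("five", "CD"), ("million", "CD"), ("words", "NN"),
]
print(chunk_noun_phrases(sentence))
```

Because the pattern runs over tags rather than surface strings, the quality of the chunks depends entirely on the tokenization and tagging beneath it, which is the layering the deck argues for. An entity-specific grouper would, under the same assumptions, be another pattern over a richer tag set.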