Project Status
Daniel Bevis, William King
Villanova University, Spring 2006, CS9010

Project Overview
- Complete a subset of the Ontology Project (Project Archive)
- Generate an ontology from existing documentation
- Determine whether it is possible to generate an ontological classification (categories) from raw-data characteristics
- Support flexibility to define a process that allows the ontology to be naturally extended as raw data is incorporated

Development Plan Review
- Select a subset of subject areas
  - Initially select a limited subject area; important to support reasonably quick review and analysis of results
  - Expand the subject area iteratively if time permits
- Define characteristics associated with a subset of the raw data from the web site
  - Consider processing of subject documentation
  - Natural-language indexing search with cross references
  - Consider simple keyword searches

Development Plan Review (continued)
- Build categories from the characteristics
  - Consider generating a tool that can describe a different subset from the rest of the raw data
  - Create higher-level categories based upon common subsets of characteristics
  - Repeat the process until top-level categories or characteristics conform to existing high-level classifications, or prove alternate categories
- Place subjects into categories
- Review the categorization
  - Manually analyze results
  - Test the existing categorization on the remaining subjects of the initially selected subset

Development Tools
- Natural-language recognition via NLTK is the basis for initial research
  - Slow, but well documented and supported
- Installation details (Win32):
  - NLTK Lite 0.6.3 with Corpora package
  - Python 2.4.2
  - PyWordNet
  - WordNet 2.1
  - Numarray 1.5

Ontology Subset
- Take the SIGMICRO category (International Symposium on Microarchitecture) as a single subject set
- Break the data into subsets
  - The initial subset allows for simpler manual verification and validation
  - Initially a small subset of the available archive material will be used
  - Remaining subsets provide further testing and validation of the technique
- Additional subsets from the ACM documentation will be added as time permits

Defined Process
- Take a subset of the raw data elements and define the elements' characteristics
  - Read text in for processing
  - Tokenize the text
  - Perform probabilistic parsing via ViterbiParse
  - Consider other parsing techniques if time permits
  - Consider training the parsing process
- Select tokens for analysis
  - Supposition: nouns will provide adequate tokens to define characteristics
  - Potential goal: identify a 'reasonable' subset of tokens for use as characteristics

Defined Process (continued)
- Select tokens for analysis (continued)
  - It may be reasonable to use only a subset of nouns
  - Proper nouns are likely to have little impact if removed
  - Redundant and synonymous terms should likely be consolidated
  - What impact would the use of other word types (e.g., verbs) have in generating characteristics?
- Limiting to nouns will greatly reduce the amount of information to be processed
  - Reduces processing time, allowing faster generation of results in a time-consuming process
  - Defines a bound on what constitutes a characteristic, thereby reducing the volume of data to be manually reviewed during development
  - Will initially require additional testing to verify the concept

Defined Process (continued)
- Based on common characteristics, develop categories
  - Analyze each individual document's parse tree
  - Use statistical analysis of parse trees across documents
  - Supposition: higher frequency of a term relative to all documents implies a higher-level characteristic
  - Potential goal: identify a 'reasonable' subset of term inter-relations for use as characteristics
  - Assume that some raw data values will cross categories
- Group elements into those categories
  - Identify common characteristics associated with other characteristics
  - Identify higher-level characteristics and categories from the categories generated from the raw data
  - Recursive categorization approach

Current Development Focus
- Automate retrieval of documents
  - Obtain documents from web sources automatically
  - Convert documents for use in the NLTK environment
- Automate execution of the analysis of documents
  - Python-based code to handle processing in batch-style execution
  - Use existing NLTK tools where
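The noun-selection and term-frequency suppositions above can be sketched without NLTK. In the real pipeline NLTK Lite's tagger and ViterbiParse would identify nouns; here a fixed noun set and a three-document mini-corpus stand in as purely hypothetical examples, so only the counting logic reflects the deck's approach.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for SIGMICRO archive text (illustrative only).
DOCS = [
    "the cache design reduces branch misprediction in the pipeline",
    "a pipeline with a trace cache improves instruction fetch",
    "branch prediction accuracy depends on the predictor table",
]

# Stand-in noun list; the deck proposes deriving nouns via NLTK
# part-of-speech tagging rather than a fixed set.
NOUNS = {"cache", "pipeline", "branch", "misprediction", "design",
         "prediction", "predictor", "table", "instruction", "fetch", "trace"}

def characteristics(doc):
    """Tokenize a document and keep only noun tokens as its characteristics."""
    return [tok for tok in doc.lower().split() if tok in NOUNS]

def high_level_terms(docs, min_docs=2):
    """Terms appearing in at least `min_docs` documents are treated as
    higher-level characteristics (the deck's frequency supposition)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(characteristics(doc)))
    return {term for term, n in doc_freq.items() if n >= min_docs}

print(sorted(high_level_terms(DOCS)))  # ['branch', 'cache', 'pipeline']
```

The same counts could then be recomputed over the grouped categories to build the next level up, which is the recursive categorization step the plan describes.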
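The batch-style execution described under Current Development Focus could be organized as below. This is a minimal sketch: `analyze` is a placeholder for the NLTK processing chain (tokenize, tag, parse), and the directory layout is an assumption, not part of the original plan.

```python
import os

def analyze(text):
    # Placeholder for the NLTK pipeline (tokenize, tag, ViterbiParse);
    # here it simply reports a token count.
    return len(text.split())

def batch_analyze(directory):
    """Run the analysis over every .txt document in `directory`,
    mirroring the batch-style execution the deck proposes."""
    results = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            path = os.path.join(directory, name)
            with open(path, encoding="utf-8") as fh:
                results[name] = analyze(fh.read())
    return results
```

Document retrieval and conversion would populate the directory first; keeping retrieval, conversion, and analysis as separate batch stages makes each step independently testable.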