Project Status
Daniel Bevis, William King
Villanova University, Spring 2006, CS9010

Project Overview
- Complete a subset of the Ontology Project (Project Archive)
- Generate an ontology from existing documentation
- Determine whether it is possible to generate an ontological classification (categories) from raw-data characteristics
- Support flexibility to define a process that allows the ontology to be naturally extended as raw data is incorporated

Development Plan Review
- Select a subset of subject areas
  - Initially select a limited subject area; important to support reasonably quick review and analysis of results
  - Expand the subject area iteratively if time permits
- Define characteristics associated with a subset of the raw data from the web site
  - Consider processing of subject documentation
  - Natural-language indexing search with cross references
  - Consider simple keyword searches

Development Plan Review (continued)
- Build categories from the characteristics
  - Consider generating a tool that can describe a different subset from the rest of the raw data
  - Create higher-level categories based upon common subsets of characteristics
  - Repeat the process until top-level categories or characteristics conform to existing high-level classifications, or prove alternate categories
- Place subjects into categories
- Review the categorization
  - Manually analyze results
  - Test the existing categorization on the remaining subjects of the initially selected subset

Development Tools
- Natural-language recognition via NLTK is the basis for initial research
  - Slow, but well documented and supported
- Installation details (Win32):
  - NLTK Lite 0.6.3 with Corpora package
  - Python 2.4.2
  - PyWordNet
  - WordNet 2.1
  - Numarray 1.5

Ontology Subset
- Take the SIGMICRO category (International Symposium on Microarchitecture) as a single subject set
- Break the data into subsets
  - The initial subset allows for simpler manual verification and validation
  - Initially a small subset of the available archive material will be used
  - Remaining subsets provide further testing and validation of the technique
- Additional subsets from the ACM documentation will be added as time permits

Defined Process
- Take a subset of the raw data elements and define the elements' characteristics
  - Read text in for processing
  - Tokenize the text
  - Perform probabilistic parsing via ViterbiParse
  - Consider other parsing techniques if time permits
  - Consider training the parsing process
- Select tokens for analysis
  - Supposition: nouns will provide adequate tokens to define characteristics
  - Potential goal: identify a 'reasonable' subset of tokens for use as characteristics

Defined Process (continued)
- Select tokens for analysis (continued)
  - It may be reasonable to use only a subset of nouns
  - Proper nouns are likely to have little impact if removed
  - Redundant and synonymous terms should likely be consolidated
  - What impact would the use of other word types (e.g., verbs) have in generating characteristics?
- Limiting to nouns will greatly reduce the amount of information to be processed
  - Reduces processing time, allowing faster generation of results in a time-consuming process
  - Defines a bound on what constitutes a characteristic, thereby reducing the volume of data to be manually reviewed during development
  - Will initially require additional testing to verify the concept

Defined Process (continued)
- Based on common characteristics, develop categories
  - Analyze each individual document's parse tree
  - Use statistical analysis of parse trees across documents
  - Supposition: higher frequency of a term relative to all documents implies a higher-level characteristic
  - Potential goal: identify a 'reasonable' subset of term inter-relations for use as characteristics
  - Assume that some raw data values will cross categories
- Group elements into those categories
  - Identify common characteristics associated with other characteristics
  - Identify higher-level characteristics and categories from the categories generated from the raw data
  - Recursive categorization approach

Current Development Focus
- Automate retrieval of documents
  - Obtain documents from web sources automatically
  - Convert documents for use in the NLTK environment
- Automate execution of the analysis of documents
  - Python-based code to handle processing in batch-style execution
  - Use existing NLTK tools where
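The noun-selection and term-frequency suppositions above can be sketched without NLTK. In the real pipeline NLTK Lite's tagger and ViterbiParse would identify nouns; here a fixed noun set and a three-document mini-corpus stand in as purely hypothetical examples, so only the counting logic reflects the deck's approach.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for SIGMICRO archive text (illustrative only).
DOCS = [
    "the cache design reduces branch misprediction in the pipeline",
    "a pipeline with a trace cache improves instruction fetch",
    "branch prediction accuracy depends on the predictor table",
]

# Stand-in noun list; the deck proposes deriving nouns via NLTK
# part-of-speech tagging rather than a fixed set.
NOUNS = {"cache", "pipeline", "branch", "misprediction", "design",
         "prediction", "predictor", "table", "instruction", "fetch", "trace"}

def characteristics(doc):
    """Tokenize a document and keep only noun tokens as its characteristics."""
    return [tok for tok in doc.lower().split() if tok in NOUNS]

def high_level_terms(docs, min_docs=2):
    """Terms appearing in at least `min_docs` documents are treated as
    higher-level characteristics (the deck's frequency supposition)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(characteristics(doc)))
    return {term for term, n in doc_freq.items() if n >= min_docs}

print(sorted(high_level_terms(DOCS)))  # ['branch', 'cache', 'pipeline']
```

The same counts could then be recomputed over the grouped categories to build the next level up, which is the recursive categorization step the plan describes.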
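The batch-style execution described under Current Development Focus could be organized as below. This is a minimal sketch: `analyze` is a placeholder for the NLTK processing chain (tokenize, tag, parse), and the directory layout is an assumption, not part of the original plan.

```python
import os

def analyze(text):
    # Placeholder for the NLTK pipeline (tokenize, tag, ViterbiParse);
    # here it simply reports a token count.
    return len(text.split())

def batch_analyze(directory):
    """Run the analysis over every .txt document in `directory`,
    mirroring the batch-style execution the deck proposes."""
    results = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            path = os.path.join(directory, name)
            with open(path, encoding="utf-8") as fh:
                results[name] = analyze(fh.read())
    return results
```

Document retrieval and conversion would populate the directory first; keeping retrieval, conversion, and analysis as separate batch stages makes each step independently testable.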