ODU CS 791 - Efficient, Automatic Web Resource Harvesting - D3004730


School: Old Dominion University

Course: CS 791 - Graduate Seminar

Pages: 48

Efficient, Automatic Web Resource Harvesting

Michael L. Nelson, Joan A. Smith and Ignacio Garcia del Campo
Old Dominion University
Computer Science Dept
Norfolk VA 23529 USA
{mln, jsmit, dgarcia}@cs.odu.edu

Herbert Van de Sompel and Xiaoming Liu
Los Alamos National Laboratory
Research Library
Los Alamos NM 87545 USA
{herbertv, liu_x}@lanl.gov

Presentation Overview

- Introduction
- OAI-PMH
- Complex data formats as metadata
- mod_oai
- Demo
- Quantitative Evaluation
- Representation Problem
- Counting Problem
- Sitemaps
- Conclusion
- References

Introduction

What is a web crawler? A program or automated script that browses the World Wide Web in a methodical, automated manner.

What makes web crawling difficult?
- Large volume
- Fast rate of change
- Dynamic page generation

Crawling Difficulties

The large volume implies that the crawler can download only a fraction of the web pages within a given time, so it needs to prioritize its downloads.
The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that existing pages have been updated or even deleted.

Problems

Two problems are associated with conventional web crawling techniques:
1. Counting Problem: A crawler cannot know whether all resources at a web site have been discovered and crawled.
2. Representation Problem: The human-readable format of a resource is not always suitable for machine processing.

Mod_oai: A Solution For Counting And Representation

Solution: an Apache module, mod_oai, that implements OAI-PMH + MPEG-21 DIDL.
- OAI-PMH: count everything (linked or not) using “List” verbs
- MPEG-21 DIDL: capture everything using a complex-object format and automated metadata extraction

OAI-PMH

Web servers do not have the capability to answer questions of the form “What resources do you have?” and “What resources have changed since 2004-12-27?”
Using OAI-PMH, a web crawler can quickly get an update on the latest changes to a site. It requests only the resources that are new or have changed since its last visit, and can therefore restrict its crawls.

OAI-PMH Data Model

[Figure: a resource is modeled as an item. The item's OAI-PMH identifier is the entry point to all records pertaining to the resource. Each record carries either metadata pertaining to the resource or a modeled representation of the resource. Record formats shown: Dublin Core metadata (simple model), MARCXML metadata (more expressive model), MPEG-21 DIDL (complex model), METS (complex model).]

OAI-PMH

Queries return records containing metadata.
The verbs Identify, ListMetadataFormats and ListSets help a harvester understand the nature of the repository.
ListIdentifiers, ListRecords and GetRecord are used for the actual harvesting of the metadata.

OAI-PMH

A powerful feature of OAI-PMH is that it can support any metadata format defined by an XML schema.
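
The selective harvesting described in the OAI-PMH slides (asking only for what has changed since a given date, in a chosen metadata format) can be sketched as a request URL builder. This is a minimal illustration, not code from mod_oai or from any harvester: the function name `list_records_url` is hypothetical and the base URL is a placeholder. The `verb`, `metadataPrefix`, `from` and `set` arguments are the standard OAI-PMH request parameters.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Restrict the harvest to records whose datestamp is on or after this date.
        params["from"] = from_date
    if set_spec is not None:
        # Restrict the harvest to one set of the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder base URL; the date is the one used in the slide's example question.
print(list_records_url("http://www.foo.edu/mod_oai", "oai_dc", from_date="2004-12-27"))
```

A harvester that remembers the date of its last visit would pass that date as `from_date` on the next run, so that only new or changed records come back.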
But !!!In most cases we are interested in transmitting the actual resource and not just the metadata.Complex Object Formats As MetadataTo enable resource harvesting we use XML based complex object formats.Dublin core metadata format is simple but its flat structure cannot be used for complex objects.Hence we use DIDL (Digital Item Declaration Language)Digital ItemA Digital Item is a combination of : Resources (such as videos, audio tracks, images, etc) Metadata (such as descriptors, identifiers, etc), and  Structure (describing the relationships between resources).Data ModelA Container is a grouping of Containers and/or Items. An Item is a grouping of Items and/or Components. A Component is a grouping of Resources. Multiple Resources in the same Component are considered equivalent and consequently an agent may use any one of them. A Resource is an individual datastream. A Descriptor conveys secondary information pertaining to a Container, an Item, or a Component.Xml View<DIDL> <CONTAINER> <ITEM> <ITEM> . . . </ITEM> <ITEM> . . . </ITEM> </ITEM> </CONTAINER></DIDL>Mod_oaimod_oai began as a research project at ODU.mod_oai is an Apache module that responds to OAI-PMH requests on behalf of a web server.Goal :To bring the efficiency of OAI-PMH to everyday web sites. If Apache and mod_oai are installed at http://www.foo.edu/, then the baseURL for OAI-PMH requests is http://www.foo.edu/mod_oai.Mod_oai View Of OAI-PMH Data Model:OAI-PMH identifier: The URL of the resource serves as the OAI-PMH identifier. OAI-PMH datestamp: The modification time of the resource is used as the OAI-PMH datestamp of all 3 metadata formats. OAI-PMH sets: A set organization is introduced based on the MIME type of resource.Supported Metadata Formats : oai_dc: Dublin Core is supported as mandated. 
  Only technical metadata that can be derived from HTTP header information is included.
- http_header: Contains all HTTP response headers that would be returned if a web resource were obtained by means of an HTTP GET.
- oai_didl: Introduced to allow harvesting of the resource itself. The web resource is represented by means of an XML wrapper document compliant with MPEG-21 DIDL.

Structural View

Does the server at http://www.foo.edu/ support mod_oai?
Request: http://www.foo.edu/modoai?verb=Identify
If the response is valid, then the site supports mod_oai.

Demo

http://whiskey.cs.odu.edu/

Quantitative Evaluation

To examine the performance of mod_oai, the authors compared OAI-PMH harvesting using the OCLC Harvester with the wget web crawling utility.
http://www.cs.odu.edu served as the testbed.
Overall, the testbed included 5268 files that used
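
The support check from the Structural View slide (send an Identify request and see whether the response is valid) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names `identify_url` and `is_valid_identify_response` are hypothetical, "valid" is taken to mean a well-formed OAI-PMH 2.0 document containing an Identify element and no error element, and the sample response is hand-written rather than captured from a real server.

```python
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def identify_url(base_url):
    """Return the Identify request URL for a given OAI-PMH base URL."""
    return base_url + "?verb=Identify"


def is_valid_identify_response(xml_text):
    """Assumed validity test: OAI-PMH root, an Identify element, no error element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not XML at all, e.g. an HTML error page
    if root.tag != f"{{{OAI_NS}}}OAI-PMH":
        return False
    if root.find(f"{{{OAI_NS}}}error") is not None:
        return False
    return root.find(f"{{{OAI_NS}}}Identify") is not None


# Hand-written sample response, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-01-01T00:00:00Z</responseDate>
  <request verb="Identify">http://www.foo.edu/modoai</request>
  <Identify>
    <repositoryName>Example repository</repositoryName>
    <baseURL>http://www.foo.edu/modoai</baseURL>
    <protocolVersion>2.0</protocolVersion>
  </Identify>
</OAI-PMH>"""

print(identify_url("http://www.foo.edu/modoai"))
print(is_valid_identify_response(SAMPLE))
```

In practice the XML text would come from fetching the Identify URL over HTTP; the fetch is left out here so the sketch runs without network access.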
