Chapter 12: Web Usage Mining - An introduction
Chapter written by Bamshad Mobasher
Many slides are from a tutorial given by B. Berendt, B. Mobasher, M. Spiliopoulou

Introduction
- Web usage mining: automatic discovery of patterns in clickstreams and associated data collected or generated as a result of user interactions with one or more Web sites.
- Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.
- The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common interests.

Bing Liu 3

Introduction
- Data in Web Usage Mining:
  - Web server logs
  - Site contents
  - Data about the visitors, gathered from external channels
  - Further application data
- Not all these data are always available.
- When they are, they must be integrated.
- A large part of Web usage mining is about processing usage/clickstream data.
- After that, various data mining algorithms can be applied.

Web server logs

Web usage mining process

Data preparation

Pre-processing of web usage data

Data cleaning
- Remove irrelevant references and fields in server logs
- Remove references due to spider navigation
- Remove erroneous references
- Add missing references due to caching (done after sessionization)

Identify sessions (sessionization)
- In Web usage analysis, the data of interest are the sessions of the site visitors: the activities performed by a user from the moment she enters the site until the moment she leaves it.
- Reliable usage data is difficult to obtain because of proxy servers and anonymizers, dynamic IP addresses, missing references due to caching, and the inability of servers to distinguish among different visits.

Sessionization strategies

Sessionization heuristics

Sessionization example

User identification

User identification: an example

Pageview
- A pageview is an aggregate representation of a collection of Web objects contributing to the display on a user’s browser resulting from a single user action (such as a click-through).
- Conceptually, each pageview can be viewed as a collection of Web objects or resources representing a specific “user event,” e.g., reading an article, viewing a product page, or adding a product to the shopping cart.

Path completion
- Client- or proxy-side caching can often result in missing access references to those pages or objects that have been cached.
- For instance, if a user returns to a page A during the same session, the second access to A will likely result in viewing the previously downloaded version of A that was cached on the client side, and therefore no request is made to the server.
- As a result, the second reference to A is not recorded in the server logs.

Missing references due to caching

Path completion
- Path completion is the problem of inferring missing user references due to caching.
- Effective path completion requires extensive knowledge of the link structure within the site.
- Referrer information in server logs can also be used in disambiguating the inferred paths.
- The problem gets much more complicated in frame-based sites.

Integrating with e-commerce events
- Events are either product-oriented or visit-oriented.
- Used to track and analyze conversion of browsers to buyers.
- The major difficulty with e-commerce events is defining and implementing the events for a site. In contrast to clickstream data, however, getting reliable preprocessed data is not a problem.
- Another major challenge is the successful integration with clickstream data.

Product-Oriented Events
- Product View
  - Occurs every time a product is displayed on a page view
  - Typical types: image, link, text
- Product Click-through
  - Occurs every time a user “clicks” on a product to get more information

Product-Oriented Events
- Shopping Cart Changes
  - Shopping Cart Add or Remove
  - Shopping Cart Change - quantity or other feature (e.g. size) is changed
- Product Buy or Bid
  - Separate buy event occurs for each product in the shopping cart
  - Auction sites can track bid events in addition to the product purchases

Web usage mining process

Integration with page content

Integration with link structure

E-commerce data analysis

Session analysis
- Simplest form of analysis: examine individual or groups of server sessions and e-commerce data.
- Advantages:
  - Gain insight into typical customer behaviors.
  - Trace specific problems with the site.
- Drawbacks:
  - LOTS of data.
  - Difficult to generalize.

Session analysis: aggregate reports

OLAP

Data mining

Data mining (cont.)

Some usage mining applications

Personalization application

Standard approaches

Summary
- Web usage mining has emerged as the essential tool for realizing more personalized, user-friendly and business-optimal Web services.
- The key is to use the user-clickstream data for many mining purposes.
- Traditionally, Web usage mining is used by e-commerce sites to organize their sites and to increase profits.
- It is now also used by search engines to improve search quality and to evaluate search results, etc., and by many other
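
Returning to the sessionization step described earlier, the following is a minimal sketch of one possible time-oriented approach: a user's requests are split into a new session whenever the gap between two consecutive requests exceeds an inactivity timeout. The 30-minute threshold, the (user, timestamp, url) record layout, and the function name sessionize are assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict

# Assumed inactivity threshold in seconds (30 minutes is a commonly used
# value, not one given in the slides).
TIMEOUT = 30 * 60


def sessionize(requests, timeout=TIMEOUT):
    """Group (user, timestamp, url) records into per-user sessions.

    A new session starts whenever the gap between two consecutive
    requests from the same user exceeds `timeout` seconds.
    """
    by_user = defaultdict(list)
    for user, ts, url in requests:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by timestamp
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


# Hypothetical log records: (user id, seconds since start, requested page)
log = [
    ("u1", 0, "A"),
    ("u1", 60, "B"),
    ("u2", 90, "A"),
    ("u1", 4000, "C"),  # more than 30 minutes after u1's previous request
]
print(sessionize(log))
# [('u1', ['A', 'B']), ('u1', ['C']), ('u2', ['A'])]
```

The sketch assumes users have already been identified. In practice, the proxy servers and dynamic IP addresses mentioned above make that step unreliable, so the user key here stands in for whatever identification scheme a site actually uses.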
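
The path completion problem described earlier can be sketched in a similar spirit. The example below assumes that every logged request carries its referrer and that the user only moves backwards with the browser's back button. When a request's referrer is not the page viewed just before it, the sketch re-inserts the cached pages the user must have passed through. The function name complete_path and the (page, referrer) layout are hypothetical. The sketch uses only referrers and a simulated back-stack, so it does not resolve the ambiguous cases for which the slides say knowledge of the site's link structure is needed.

```python
def complete_path(session):
    """Infer cached back-navigation steps within one session.

    `session` is a list of (page, referrer) pairs in request order.
    Returns the page sequence with inferred cached views re-inserted.
    """
    path = []     # completed path, including inferred cached views
    history = []  # simulated browser back-stack
    for page, referrer in session:
        if referrer in history and history[-1] != referrer:
            # Pop pages until the referrer is on top; each page revealed
            # by a "back" step was served from cache and never logged.
            while history[-1] != referrer:
                history.pop()
                path.append(history[-1])
        history.append(page)
        path.append(page)
    return path


# Hypothetical session: the user visits A, B, C, then goes back to A
# (through cached copies of B and A) and follows a link to D.
logged = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
print(complete_path(logged))
# ['A', 'B', 'C', 'B', 'A', 'D']
```

If a referrer never appears in the simulated history, for example because the user arrived from an external site, the request is appended unchanged, since nothing can be inferred from the log alone.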