INFM 700: Session 14
Understanding PubMed Users for Enhanced Text Retrieval
Jimmy Lin
The iSchool, University of Maryland
Monday, May 5, 2008

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Contents: Context; Executive Summary; Understanding Users; Dataset; Sequence Analysis; Distribution of User Actions; Sessions and Episodes; Episode Length: Transactions; Episode Length: Duration; Singleton Episodes; Language Models; N-Gram Activity Models; Prediction Accuracy; So what?; Activity Collocates; Activity Sequences in PubMed; Are PubMed users like rats?; Why are related links useful?; Right Tool for the Job; TREC 2005 Genomics Track; Experimental Design; Viz Tool: SocialAction; High-Density Network; Medium-Density Network; Low-Density Network; Density of Networks; Expected Recall; Recall by Browsing; Findings; Exploiting Network Structure; Brief Detour: What's PageRank?; PageRank: Visually; PageRank: Defined; Computing PageRank; Detailed Setup; PageRank + Terrier; More Observations; Acknowledgements

Context
- Enhancing text retrieval with PubMed: deliver a better result set to users; support serendipitous knowledge discovery
- How? First, understand the behavior of current users. See what works: enhance it. See what doesn't work: fix it.

Executive Summary
How do users interact with PubMed?
- Finding: There is some predictability in users' interactions with PubMed.
- Finding: Related article links appear to be very useful.
- Finding: Relevant articles tend to cluster together, so browsing related article links is useful.
- Finding: Related document networks can be exploited using PageRank to improve retrieval effectiveness.
- Methodology: statistical analysis of log data
Why is related article search useful?
- Methodology: visual analysis and statistical characterization of related article networks
Can we better exploit related article networks?
- Methodology: reranking experiments with ad hoc retrieval test collections

Understanding Users
- PubMed users leave a record of their activities
- Mine logs to characterize users? Mine logs to improve search results?
- Everyone's doing it!
- Privacy issues need to be thought through…

Dataset
Collection characteristics:
- Collected over an 8-day span (June 20-27, 2007)
- 8.68 million browser sessions
- 41.8 million transactions
Pre-processing steps:
- Removed singleton sessions (5.5m sessions, 63%)
- Removed sessions with over 500 transactions (162 sessions, 271k transactions)
- Removed sessions not primarily involving PubMed (2.72m sessions)
Working data set: 476k sessions, 7.65m transactions

Sequence Analysis
- Treat user modeling as a sequence analysis problem: develop an alphabet of user actions, then encode user activity as string sequences
- Why? Leverage techniques from natural language processing and from bioinformatics

Distribution of User Actions

Action        | Description                              | Count  | Fraction
Query         | user issued a query                      | 2.52m  | 32.9%
Retrieve      | user looked at an abstract               | 3.04m  | 39.7%
Next          | user asked for next page of results      | 658k   | 8.6%
Related Link  | user clicked on a related article        | 285k   | 3.7%
More Links    | user requested more related articles     | 53k    | 0.7%
Modify View   | user employed an advanced feature        | 516k   | 6.7%
P             | other MEDLINE                            | 288k   | 3.8%
X             | other PubMed                             | 291k   | 3.8%
Total         |                                          | 7.65m  |

Examples of real sessions, one letter per action (e.g. Q = Query, R = Retrieve, N = Next, L = Related Link, V = Modify View):
QNRRRRLRQNRQQQQQQRR…
QNQQQQQQQNQNQQQQN…
QNNNNNQNRQVNRRQNRQNRNRLNRNVNRRRQQQQNQRR…

Sessions and Episodes
- Sessions can be divided into multiple meaningful units of activity; call these "episodes"
- The standard technique is to use an inactivity threshold
- What's the distribution of PubMed user episodes, based on different inactivity thresholds?
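The encoding and episode-splitting steps above can be sketched in Python. This is an illustrative sketch, not the study's actual implementation: the action names, the single-letter codes, and the `split_episodes` helper are assumptions, and the 30-minute default threshold is arbitrary.

```python
from datetime import datetime, timedelta

# Hypothetical single-letter codes, following the deck's convention
# (Q = Query, R = Retrieve, N = Next, L = Related Link, V = Modify View).
CODES = {"query": "Q", "retrieve": "R", "next": "N",
         "related": "L", "modify_view": "V"}

def encode_session(transactions):
    """Turn a list of (timestamp, action) transactions into an action string."""
    return "".join(CODES.get(action, "X") for _, action in transactions)

def split_episodes(transactions, threshold=timedelta(minutes=30)):
    """Split one session into episodes wherever the gap between
    consecutive transactions exceeds the inactivity threshold."""
    episodes, current = [], []
    for ts, action in transactions:
        if current and ts - current[-1][0] > threshold:
            episodes.append(current)
            current = []
        current.append((ts, action))
    if current:
        episodes.append(current)
    return episodes
```

Varying `threshold` here is exactly the knob the slides ask about: each setting yields a different episode-length distribution.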
Episode Length: Transactions
[Figure: distribution of episode length, measured in number of transactions; x-axis: episode length (number of transactions), y-axis: fraction]

Episode Length: Duration
[Figure: distribution of episode length by duration; x-axis: episode length (increments of 5 minutes), y-axis: fraction]

Singleton Episodes
[Figure: analysis of singleton episodes; x-axis: inactivity threshold, y-axis: count (thousands)]

Language Models
- Language models define a probability distribution over string sequences: P(w1 w2 w3 … wn)
- Why are they useful? A good model ranks fluent sequences above garbled ones, e.g.:
  P(colorless green ideas sleep furiously) > P(furiously sleep ideas green colorless)
- How do you compute the probability of a sequence? Apply the chain rule:
  P(w1 w2 … wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | w1 … wn-1)
  For example:
  P(Mary had a little lamb) = P(Mary) × P(had | Mary) × P(a | Mary had) × P(little | Mary had a) × P(lamb | Mary had a little)
- That's a lot of probabilities to keep track of!
- Markov assumption: consider only the N preceding symbols
  Bigrams:  P(wn | w1 … wn-1) ≈ P(wn | wn-1)
  Trigrams: P(wn | w1 … wn-1) ≈ P(wn | wn-2 wn-1)
  N-grams:  P(wn | w1 … wn-1) ≈ P(wn | wn-N+1 … wn-1)
  For example, with bigrams:
  P(Mary had a little lamb) ≈ P(Mary) × P(had | Mary) × P(a | had) × P(little | a) × P(lamb | little)
- What's the tradeoff with longer histories?

N-Gram Activity Models
- N-gram language models in NLP tasks: automatic speech recognition, machine translation, …
- Can we apply n-gram language models to activity sequences?
- Experimental setup: build models of episodes (2-grams to 8-grams); use them in a prediction task (predict the most likely next action); evaluate in terms of prediction accuracy

Prediction Accuracy
[Figure: user action prediction accuracy with different n-gram language models, compared against a baseline; x-axis: n-gram language model, y-axis: prediction accuracy]

So what?
- There's signal here! Some level of predictability of user actions, even with impoverished data (no privacy concerns); possible improvements with richer features
- Implications: it is possible to build user models to capture strategies, topics, etc.
- Demographics is one key to good Web search
- Lots of future work here…
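The prediction task described above can be sketched for the simplest (bigram) case. This is a hypothetical illustration, not the study's code: the class name is invented, and a real n-gram model would need smoothing for unseen histories, which this sketch sidesteps by returning None.

```python
from collections import Counter, defaultdict

class BigramActivityModel:
    """Bigram model over action strings: predict the most likely
    next action given only the previous one."""

    def __init__(self):
        # counts[prev][next] = how often `next` followed `prev`
        self.counts = defaultdict(Counter)

    def train(self, episodes):
        for ep in episodes:
            for prev, nxt in zip(ep, ep[1:]):
                self.counts[prev][nxt] += 1

    def predict(self, prev):
        """Most frequent action observed after `prev` (None if unseen)."""
        if prev not in self.counts:
            return None
        return self.counts[prev].most_common(1)[0][0]

    def accuracy(self, episodes):
        """Fraction of next-action predictions that are correct."""
        correct = total = 0
        for ep in episodes:
            for prev, nxt in zip(ep, ep[1:]):
                total += 1
                if self.predict(prev) == nxt:
                    correct += 1
        return correct / total if total else 0.0
```

Extending the history from one previous action to N-1 previous actions gives the 2-gram to 8-gram sweep the slides describe, at the cost of sparser counts (the tradeoff with longer histories).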
- What's the equivalent of targeted advertising in PubMed?

Activity Collocates
- Collocates in natural language: words that co-occur much more frequently than chance; these are usually meaningful multi-word phrases
- Common techniques for learning collocates: PMI, log-likelihood ratio, …
- Activity collocates: patterns of activities that co-occur much more frequently than chance
- What do they mean? My hypothesis: fragments of information seeking strategies, or search tactics
- Examples: hot
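The PMI technique mentioned above can be sketched for adjacent action pairs. The function name and the `min_count` cutoff are assumptions for illustration; the deck does not specify how its collocates were computed.

```python
import math
from collections import Counter

def activity_collocates(episodes, min_count=2):
    """Score adjacent action pairs by pointwise mutual information:
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ).
    High-PMI pairs co-occur more often than chance predicts."""
    unigrams, bigrams = Counter(), Counter()
    for ep in episodes:
        unigrams.update(ep)
        bigrams.update(zip(ep, ep[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:  # skip rare pairs with unreliable estimates
            continue
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The `min_count` cutoff matters because PMI notoriously overweights rare events; the log-likelihood ratio the slide also mentions is one standard way to correct for that.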