UMD INFM 700 - Understanding PubMed Users for Enhanced Text Retrieval

Outline

1. INFM 700: Session 14 Understanding PubMed Users for Enhanced Text Retrieval
2. Context
3. Executive Summary
4. Understanding Users
5. Dataset
6. Sequence Analysis
7. Distribution of User Actions
8. Sessions and Episodes
9. Episode Length: Transactions
10. Episode Length: Duration
11. Singleton Episodes
12. Language Models
13. Slide 13
14. Slide 14
15. N-Gram Activity Models
16. Prediction Accuracy
17. So what?
18. Activity Collocates
19. Activity Sequences in PubMed
20. Are PubMed users like rats?
21. Slide 21
22. Why are related links useful?
23. Right Tool for the Job
24. TREC 2005 Genomics Track
25. Slide 25
26. Experimental Design
27. Viz Tool: SocialAction
28. High-Density Network
29. Medium-Density Network
30. Low-Density Network
31. Density of Networks
32. Expected Recall
33. Recall by Browsing
34. Findings
35. Slide 35
36. Exploiting Network Structure
37. Brief Detour: What’s PageRank?
38. PageRank: Visually
39. PageRank: Defined
40. Computing PageRank
41. Slide 41
42. Detailed Setup
43. PageRank + Terrier
44. More Observations
45. Slide 45
46. Acknowledgements

INFM 700: Session 14
Understanding PubMed Users for Enhanced Text Retrieval

Jimmy Lin
The iSchool
University of Maryland
Monday, May 5, 2008

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Context

Enhancing text retrieval with PubMed
- Deliver better result sets to users
- Support serendipitous knowledge discovery
How?
- First, understand the behavior of current users
- See what works: enhance it
- See what doesn’t work: fix it

Executive Summary

How do users interact with PubMed?
- Methodology: statistical analysis of log data
- Finding: There is some predictability in users’ interactions with PubMed.
- Finding: Related article links appear to be very useful.
Why is related article search useful?
- Methodology: visual analysis and statistical characterization of related article networks
- Finding: Relevant articles tend to cluster together, thus browsing related article links is useful.
Can we better exploit related article networks?
- Methodology: reranking experiments with ad hoc retrieval test collections
- Finding: Related document networks can be exploited using PageRank to improve retrieval effectiveness.

Understanding Users

- PubMed users leave a record of their activities
- Mine logs to characterize users?
- Mine logs to improve search results?
- Everyone’s doing it!
- Privacy issues need to be thought through…

Dataset

Collection characteristics
- Collected over 8-day span (June 20-27, 2007)
- 8.68 million browser sessions
- 41.8 million transactions
Pre-processing steps:
- Removed singleton sessions (5.5m, 63%)
- Removed sessions with over 500 transactions (162 sessions, 271k transactions)
- Removed sessions not primarily involving PubMed (2.72m sessions)
Working data set: 476k sessions, 7.65m transactions

Sequence Analysis

Treat user modeling as a sequence analysis problem
- Develop an alphabet of user actions
- Encode user activity as string sequences
Why?
- Leverage techniques from natural language processing
- Leverage techniques from bioinformatics

Distribution of User Actions

Action         Description                             Count    Share
Query          user issued a query                     2.52m    32.9%
Retrieve       user looked at an abstract              3.04m    39.7%
Next           user asked for next page of results     658k      8.6%
Related Link   user clicked on a related article       285k      3.7%
More Links     user requested more related articles    53k       0.7%
Modify View    user employed advanced feature          516k      6.7%
P              other MEDLINE                           288k      3.8%
X              other PubMed                            291k      3.8%
Total                                                  7.65m

Examples of real sessions:
    QNRRRRLRQNRQQQQQQRR…
    QNQQQQQQQNQNQQQQN…
    QNNNNNQNRQVNRRQNRQNRNRLNRNVNRRRQQQQNQRR…

Sessions and Episodes

- Sessions can be divided into multiple meaningful units of activity
- Call these “episodes”
- Standard technique is to use an inactivity threshold
What’s the distribution of PubMed user episodes?
- Based on different inactivity thresholds

Episode Length: Transactions

[Figure: Distribution of Episode Length: Number of Transactions. x-axis: Episode Length (Number of Transactions); y-axis: Fraction.]

Episode Length: Duration

[Figure: Distribution of Episode Length: Duration. x-axis: Episode Length (Increments of 5 minutes); y-axis: Fraction.]

Singleton Episodes

[Figure: Analysis of Singleton Episodes. x-axis: Inactivity Threshold; y-axis: Count (thousands).]

Language Models

Language models define a probability distribution over string sequences:

$$P(w_1^n) = P(w_1 w_2 w_3 \ldots w_{n-1} w_n)$$

Why are they useful? Compare:

$$P(\text{Mary had a little lamb})$$
$$P(\text{colorless green ideas sleep furiously})$$
$$P(\text{colorless furiously green sleep ideas})$$

Language Models

How do you compute the probability of a sequence?

$$
\begin{aligned}
P(w_1^n) &= P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \ldots w_{n-1}) \\
         &= P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})
\end{aligned}
$$

$$P(\text{Mary had a little lamb}) = P(\text{Mary})\,P(\text{had} \mid \text{Mary})\,P(\text{a} \mid \text{Mary had})\,P(\text{little} \mid \text{Mary had a})\,P(\text{lamb} \mid \text{Mary had a little})$$

That’s a lot of probabilities to keep track of!

Language Models

Markov assumption: consider only the preceding N-1 symbols

Bigrams:
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$$
Trigrams:
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-2} w_{n-1})$$
N-grams:
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$

For example, with bigrams:
$$P(\text{Mary had a little lamb}) \approx P(\text{Mary})\,P(\text{had} \mid \text{Mary})\,P(\text{a} \mid \text{had})\,P(\text{little} \mid \text{a})\,P(\text{lamb} \mid \text{little})$$

What’s the tradeoff with longer histories?

N-Gram Activity Models

N-gram language models in NLP tasks:
- Automatic speech recognition
- Machine translation
- …
Can we apply n-gram language models to activity sequences?
Experimental setup:
- Build models of episodes: 2-grams to 8-grams
- Use in a prediction task: predict most likely next action
- Evaluate in terms of prediction accuracy

Prediction Accuracy

[Figure: User action prediction accuracy with different n-gram language models, compared against a baseline. x-axis: n-gram language model; y-axis: Prediction Accuracy.]

So what?

There’s signal here!
- Some level of predictability of user actions
- Impoverished data (no privacy concerns)
- Possible improvements with richer features
Implications
- It is possible to build user models to capture strategies, topics, etc.
- Demographics is one key to good Web search
- Lots of future work here…
What’s the equivalent of targeted advertising in PubMed?

Activity Collocates

- Collocates in natural language: words that co-occur much more frequently than chance
- These are usually meaningful multi-word phrases
- Common techniques for learning collocates: PMI, log-likelihood ratio, …
- Activity collocates: patterns of activities that co-occur much more frequently than chance
What do they mean?
- My hypothesis: fragments of information seeking strategies, or search tactics
Examples: hot

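
The N-Gram Activity Models slide proposes building n-gram models of episodes and predicting the most likely next action. The sketch below shows one simple way this could be done with raw counts and no smoothing. All function names, the `^` start-of-episode padding symbol, and the fallback to a fixed action for unseen histories are assumptions for illustration, not the setup used in the reported experiments. The toy episodes are the truncated example sessions from the slides.

```python
from collections import Counter, defaultdict

def train_ngram_model(episodes, n):
    """Count which action follows each history of n-1 actions."""
    counts = defaultdict(Counter)
    for episode in episodes:
        padded = "^" * (n - 1) + episode  # "^" marks episode start (assumed symbol)
        for i in range(n - 1, len(padded)):
            history = padded[i - (n - 1):i]
            counts[history][padded[i]] += 1
    return counts

def predict_next(counts, history, n, fallback):
    """Return the most frequent action seen after the last n-1 actions."""
    padded = "^" * (n - 1) + history
    key = padded[len(padded) - (n - 1):]
    if key in counts:
        return counts[key].most_common(1)[0][0]
    return fallback  # unseen history: fall back to a fixed action

def accuracy(counts, episodes, n, fallback):
    """Fraction of non-initial actions that are predicted correctly."""
    correct = total = 0
    for episode in episodes:
        for i in range(1, len(episode)):
            total += 1
            if predict_next(counts, episode[:i], n, fallback) == episode[i]:
                correct += 1
    return correct / total if total else 0.0

# Toy data only: far too small to say anything about real accuracy.
train = ["QNRRRRLRQNRQQQQQQRR", "QNQQQQQQQNQNQQQQN"]
held_out = ["QNNNNNQNRQVNRRQNRQNRNRLNRNVNRRRQQQQNQRR"]
for n in range(2, 5):
    model = train_ngram_model(train, n)
    print(n, round(accuracy(model, held_out, n, fallback="R"), 3))
```

Falling back to Retrieve reflects that it is the most frequent action in the Distribution of User Actions table; whether the slide's baseline was defined this way is not stated.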

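
The Activity Collocates slide lists PMI as a common technique for finding items that co-occur more often than chance. As a rough sketch, pointwise mutual information for adjacent action pairs can be computed as log2 of P(a, b) / (P(a) P(b)). The function name `bigram_pmi`, the restriction to adjacent pairs, and the `min_count` cutoff are assumptions for illustration; the slides do not specify how activity collocates were extracted.

```python
import math
from collections import Counter

def bigram_pmi(episodes, min_count=2):
    """Score adjacent action pairs by pointwise mutual information.

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
    estimated from raw counts. Rare pairs are skipped because PMI
    is unreliable for low counts.
    """
    unigrams = Counter()
    bigrams = Counter()
    for episode in episodes:
        unigrams.update(episode)
        bigrams.update(zip(episode, episode[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / total_bigrams
        p_a = unigrams[a] / total_unigrams
        p_b = unigrams[b] / total_unigrams
        scores[a + b] = math.log2(p_ab / (p_a * p_b))
    return scores

# Toy data: the truncated example sessions from the slides.
toy = ["QNRRRRLRQNRQQQQQQRR", "QNQQQQQQQNQNQQQQN"]
for pair, score in sorted(bigram_pmi(toy).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than independence would predict; longer patterns would need an extension beyond this pairwise sketch.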