INFM 700: Session 14
Understanding PubMed Users for Enhanced Text Retrieval
Jimmy Lin
The iSchool, University of Maryland
Monday, May 5, 2008

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Contents: Context; Executive Summary; Understanding Users; Dataset; Sequence Analysis; Distribution of User Actions; Sessions and Episodes; Episode Length: Transactions; Episode Length: Duration; Singleton Episodes; Language Models; N-Gram Activity Models; Prediction Accuracy; So what?; Activity Collocates; Activity Sequences in PubMed; Are PubMed users like rats?; Why are related links useful?; Right Tool for the Job; TREC 2005 Genomics Track; Experimental Design; Viz Tool: SocialAction; High-Density Network; Medium-Density Network; Low-Density Network; Density of Networks; Expected Recall; Recall by Browsing; Findings; Exploiting Network Structure; Brief Detour: What's PageRank?; PageRank: Visually; PageRank: Defined; Computing PageRank; Detailed Setup; PageRank + Terrier; More Observations; Acknowledgements

Context
- Enhancing text retrieval with PubMed: deliver a better result set to users; support serendipitous knowledge discovery
- How? First, understand the behavior of current users. See what works: enhance it. See what doesn't work: fix it.

Executive Summary
How do users interact with PubMed?
- Finding: There is some predictability in users' interactions with PubMed.
- Finding: Related article links appear to be very useful.
- Finding: Relevant articles tend to cluster together, so browsing related article links is useful.
- Finding: Related document networks can be exploited using PageRank to improve retrieval effectiveness.
- Methodology: statistical analysis of log data
Why is related article search useful?
- Methodology: visual analysis and statistical characterization of related article networks
Can we better exploit related article networks?
- Methodology: reranking experiments with ad hoc retrieval test collections

Understanding Users
- PubMed users leave a record of their activities
- Mine logs to characterize users? Mine logs to improve search results?
- Everyone's doing it!
- Privacy issues need to be thought through…

Dataset
Collection characteristics:
- Collected over an 8-day span (June 20-27, 2007)
- 8.68 million browser sessions
- 41.8 million transactions
Pre-processing steps:
- Removed singleton sessions (5.5m sessions, 63%)
- Removed sessions with over 500 transactions (162 sessions, 271k transactions)
- Removed sessions not primarily involving PubMed (2.72m sessions)
Working data set: 476k sessions, 7.65m transactions

Sequence Analysis
- Treat user modeling as a sequence analysis problem: develop an alphabet of user actions, then encode user activity as string sequences
- Why? Leverage techniques from natural language processing and from bioinformatics

Distribution of User Actions

Action        | Description                              | Count  | Fraction
Query         | user issued a query                      | 2.52m  | 32.9%
Retrieve      | user looked at an abstract               | 3.04m  | 39.7%
Next          | user asked for next page of results      | 658k   | 8.6%
Related Link  | user clicked on a related article        | 285k   | 3.7%
More Links    | user requested more related articles     | 53k    | 0.7%
Modify View   | user employed an advanced feature        | 516k   | 6.7%
P             | other MEDLINE                            | 288k   | 3.8%
X             | other PubMed                             | 291k   | 3.8%
Total         |                                          | 7.65m  |

Examples of real sessions, one letter per action (e.g. Q = Query, R = Retrieve, N = Next, L = Related Link, V = Modify View):
QNRRRRLRQNRQQQQQQRR…
QNQQQQQQQNQNQQQQN…
QNNNNNQNRQVNRRQNRQNRNRLNRNVNRRRQQQQNQRR…

Sessions and Episodes
- Sessions can be divided into multiple meaningful units of activity; call these "episodes"
- The standard technique is to use an inactivity threshold
- What's the distribution of PubMed user episodes, based on different inactivity thresholds?
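The encoding and episode-splitting steps above can be sketched in Python. This is an illustrative sketch, not the study's actual implementation: the action names, the single-letter codes, and the `split_episodes` helper are assumptions, and the 30-minute default threshold is arbitrary.

```python
from datetime import datetime, timedelta

# Hypothetical single-letter codes, following the deck's convention
# (Q = Query, R = Retrieve, N = Next, L = Related Link, V = Modify View).
CODES = {"query": "Q", "retrieve": "R", "next": "N",
         "related": "L", "modify_view": "V"}

def encode_session(transactions):
    """Turn a list of (timestamp, action) transactions into an action string."""
    return "".join(CODES.get(action, "X") for _, action in transactions)

def split_episodes(transactions, threshold=timedelta(minutes=30)):
    """Split one session into episodes wherever the gap between
    consecutive transactions exceeds the inactivity threshold."""
    episodes, current = [], []
    for ts, action in transactions:
        if current and ts - current[-1][0] > threshold:
            episodes.append(current)
            current = []
        current.append((ts, action))
    if current:
        episodes.append(current)
    return episodes
```

Varying `threshold` here is exactly the knob the slides ask about: each setting yields a different episode-length distribution.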
Episode Length: Transactions
[Figure: distribution of episode length, measured in number of transactions; x-axis: episode length (number of transactions), y-axis: fraction]

Episode Length: Duration
[Figure: distribution of episode length by duration; x-axis: episode length (increments of 5 minutes), y-axis: fraction]

Singleton Episodes
[Figure: analysis of singleton episodes; x-axis: inactivity threshold, y-axis: count (thousands)]

Language Models
- Language models define a probability distribution over string sequences: P(w1 w2 w3 … wn)
- Why are they useful? A good model ranks fluent sequences above garbled ones, e.g.:
  P(colorless green ideas sleep furiously) > P(furiously sleep ideas green colorless)
- How do you compute the probability of a sequence? Apply the chain rule:
  P(w1 w2 … wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | w1 … wn-1)
  For example:
  P(Mary had a little lamb) = P(Mary) × P(had | Mary) × P(a | Mary had) × P(little | Mary had a) × P(lamb | Mary had a little)
- That's a lot of probabilities to keep track of!
- Markov assumption: consider only the N preceding symbols
  Bigrams:  P(wn | w1 … wn-1) ≈ P(wn | wn-1)
  Trigrams: P(wn | w1 … wn-1) ≈ P(wn | wn-2 wn-1)
  N-grams:  P(wn | w1 … wn-1) ≈ P(wn | wn-N+1 … wn-1)
  For example, with bigrams:
  P(Mary had a little lamb) ≈ P(Mary) × P(had | Mary) × P(a | had) × P(little | a) × P(lamb | little)
- What's the tradeoff with longer histories?

N-Gram Activity Models
- N-gram language models in NLP tasks: automatic speech recognition, machine translation, …
- Can we apply n-gram language models to activity sequences?
- Experimental setup: build models of episodes (2-grams to 8-grams); use them in a prediction task (predict the most likely next action); evaluate in terms of prediction accuracy

Prediction Accuracy
[Figure: user action prediction accuracy with different n-gram language models, compared against a baseline; x-axis: n-gram language model, y-axis: prediction accuracy]

So what?
- There's signal here! Some level of predictability of user actions, even with impoverished data (no privacy concerns); possible improvements with richer features
- Implications: it is possible to build user models to capture strategies, topics, etc.
- Demographics is one key to good Web search
- Lots of future work here…
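The prediction task described above can be sketched for the simplest (bigram) case. This is a hypothetical illustration, not the study's code: the class name is invented, and a real n-gram model would need smoothing for unseen histories, which this sketch sidesteps by returning None.

```python
from collections import Counter, defaultdict

class BigramActivityModel:
    """Bigram model over action strings: predict the most likely
    next action given only the previous one."""

    def __init__(self):
        # counts[prev][next] = how often `next` followed `prev`
        self.counts = defaultdict(Counter)

    def train(self, episodes):
        for ep in episodes:
            for prev, nxt in zip(ep, ep[1:]):
                self.counts[prev][nxt] += 1

    def predict(self, prev):
        """Most frequent action observed after `prev` (None if unseen)."""
        if prev not in self.counts:
            return None
        return self.counts[prev].most_common(1)[0][0]

    def accuracy(self, episodes):
        """Fraction of next-action predictions that are correct."""
        correct = total = 0
        for ep in episodes:
            for prev, nxt in zip(ep, ep[1:]):
                total += 1
                if self.predict(prev) == nxt:
                    correct += 1
        return correct / total if total else 0.0
```

Extending the history from one previous action to N-1 previous actions gives the 2-gram to 8-gram sweep the slides describe, at the cost of sparser counts (the tradeoff with longer histories).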
- What's the equivalent of targeted advertising in PubMed?

Activity Collocates
- Collocates in natural language: words that co-occur much more frequently than chance; these are usually meaningful multi-word phrases
- Common techniques for learning collocates: PMI, log-likelihood ratio, …
- Activity collocates: patterns of activities that co-occur much more frequently than chance
- What do they mean? My hypothesis: fragments of information seeking strategies, or search tactics
- Examples: hot
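The PMI technique mentioned above can be sketched for adjacent action pairs. The function name and the `min_count` cutoff are assumptions for illustration; the deck does not specify how its collocates were computed.

```python
import math
from collections import Counter

def activity_collocates(episodes, min_count=2):
    """Score adjacent action pairs by pointwise mutual information:
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ).
    High-PMI pairs co-occur more often than chance predicts."""
    unigrams, bigrams = Counter(), Counter()
    for ep in episodes:
        unigrams.update(ep)
        bigrams.update(zip(ep, ep[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:  # skip rare pairs with unreliable estimates
            continue
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The `min_count` cutoff matters because PMI notoriously overweights rare events; the log-likelihood ratio the slide also mentions is one standard way to correct for that.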