
UW-Madison LIS 341 - HW1: Text Preprocessing & Unigram Language Model

HW1: Text Preprocessing & Unigram Language Model
Due: 11:59pm, Feb 21, 2021
Points: 15% of your total grade

The Excel file "data_book_history.xlsx" contains the metadata of 500 articles related to book history, collected from the ISI Web of Science database. In HW1, we will examine the NLP preprocessing and parsing results for the column named "Article Title", which stores the titles of these articles. We concatenate all the article titles into one long text (separated by ". ", a period followed by a white space). Then we use spaCy for text preprocessing and parsing (please check Week 02's lecture). The Excel file "tokens.xlsx" stores the results for each token, including its original text, lowercased text, lemma, part-of-speech (POS) tag, and named-entity information in the IOB format.

Please complete either Part I or Part II (you only need to finish one of them) and submit to Canvas:

Part I – analysis

Q1. Check at least the first 500 tokens in the "tokens.xlsx" file. Discuss 1) your impression of how well the NLP preprocessing and parsing perform on this dataset, and 2) the possible reasons why the tools performed well or badly. Include specific examples of tokens and results in your discussion.

Q2. Based on the first 50 tokens in the "tokens.xlsx" file, estimate the probability of each of the following word stems (the "lemma" column) using add-one smoothing:
• of
• culture
• history

Submit a report with answers to Q1 and Q2 to Canvas.

Part II – programming

Follow the instructions in hw1.ipynb (the starter code) and implement the following:
• Q1.1 Count the top 50 most frequent tokens (excluding punctuation)
• Q1.2 Count the top 50 most frequent tokens (excluding stop words and punctuation)
• Q1.3 Count the top 50 most frequent tokens that are nouns (excluding stop words and punctuation)
• Q1.4 Count the top 50 most frequent named entities
• Discuss whether the statistics in Q1.1–1.4 are useful for understanding the topics of the articles.
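The counting in Q1.1–Q1.2 boils down to a `collections.Counter` pass over filtered tokens. A minimal sketch follows; the hard-coded token list is a hypothetical stand-in for the real data, where each entry would come from a spaCy `Token`'s `lower_`, `is_punct`, and `is_stop` attributes:

```python
from collections import Counter

# Hypothetical stand-in for spaCy output: (lowercased text, is_punct, is_stop)
tokens = [
    ("the", False, True), ("history", False, False), (".", True, False),
    ("of", False, True), ("book", False, False), ("history", False, False),
]

# Q1.1-style count: exclude punctuation only
no_punct = Counter(t for t, is_punct, _ in tokens if not is_punct)

# Q1.2-style count: exclude punctuation and stop words
content_only = Counter(
    t for t, is_punct, is_stop in tokens if not (is_punct or is_stop)
)

print(no_punct.most_common(2))      # [('history', 2), ('the', 1)]
print(content_only.most_common(2))  # [('history', 2), ('book', 1)]
```

With real data you would replace `most_common(2)` with `most_common(50)`; Q1.3 adds a POS filter (keep only tokens whose `pos_` is `"NOUN"`), and Q1.4 counts entity spans instead of tokens.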
• Q2.1 Implement a function to estimate the probability of a word stem, based on the articles' titles, using add-one smoothing.
• Q2.2 Implement a function to calculate the log probability of a text based on Q2.1's function.

Submit your implementation (.ipynb) and a short report to Canvas (submit as a zip file).
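The two Part II functions fit together naturally: the add-one (Laplace) estimate in Q2.1 is P(w) = (count(w) + 1) / (N + V), where N is the total number of tokens and V is the vocabulary size, and Q2.2 sums log probabilities over the words of a text. A minimal sketch, with function names and a toy lemma corpus that are illustrative rather than taken from the starter code:

```python
import math
from collections import Counter

def add_one_prob(word, counts, total, vocab_size):
    """Add-one (Laplace) smoothed unigram probability.

    Unseen words get probability 1 / (total + vocab_size) instead of 0.
    """
    return (counts.get(word, 0) + 1) / (total + vocab_size)

def log_prob(words, counts, total, vocab_size):
    """Log probability of a text under the smoothed unigram model."""
    return sum(
        math.log(add_one_prob(w, counts, total, vocab_size)) for w in words
    )

# Toy corpus of lemmas (a stand-in for the "lemma" column)
lemmas = ["history", "of", "book", "culture", "of", "history"]
counts = Counter(lemmas)
total = len(lemmas)        # N = 6
vocab_size = len(counts)   # V = 4

print(add_one_prob("of", counts, total, vocab_size))      # (2+1)/(6+4) = 0.3
print(add_one_prob("unseen", counts, total, vocab_size))  # (0+1)/(6+4) = 0.1
```

Summing log probabilities rather than multiplying raw probabilities avoids numeric underflow on longer texts, which is why Q2.2 is specified in terms of log probability.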

