
UW-Madison LIS 341 - HW1: Text Preprocessing & Unigram Language Model

HW1: Text Preprocessing & Unigram Language Model
Due: 11:59pm, Feb 21, 2021
Points: 15% of your total grade

The Excel file "data_book_history.xlsx" contains the metadata of 500 articles related to book history, collected from the ISI Web of Science database. In HW1, we will examine the NLP preprocessing and parsing results for the column named "Article Title", which stores the titles of these articles. We concatenate all the article titles into one long text (separated by ". ", a period followed by a white space). Then we use spaCy for text preprocessing and parsing (please check Week 02's lecture). The Excel file "tokens.xlsx" stores the results for each token, including its original text, lowercased text, lemma, part-of-speech (POS) tag, and named-entity information in the IOB format.

Please complete either Part I or Part II (you only need to finish one of them) and submit to Canvas:

Part I – analysis

Q1. Check at least the first 500 tokens in the "tokens.xlsx" file. Discuss 1) your impression of how well the NLP preprocessing and parsing perform on this dataset, and 2) the possible reasons why the tools performed well or badly. Include specific examples of tokens and results in your discussion.

Q2. Based on the first 50 tokens in the "tokens.xlsx" file, estimate the probability of each of the following word stems (the "lemma" column) using add-one smoothing:
• of
• culture
• history

Submit a report with answers to Q1 and Q2 to Canvas.

Part II – programming

Follow the instructions in hw1.ipynb (the starter code) and implement the following:
• Q1.1 Count the top 50 most frequent tokens (excluding punctuation)
• Q1.2 Count the top 50 most frequent tokens (excluding stop words and punctuation)
• Q1.3 Count the top 50 most frequent tokens that are nouns (excluding stop words and punctuation)
• Q1.4 Count the top 50 most frequent named entities
• Discuss whether the statistics in Q1.1–1.4 are useful for understanding the topics of the articles.
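The counting in Q1.1–Q1.2 boils down to a `collections.Counter` pass over filtered tokens. A minimal sketch follows; the hard-coded token list is a hypothetical stand-in for the real data, where each entry would come from a spaCy `Token`'s `lower_`, `is_punct`, and `is_stop` attributes:

```python
from collections import Counter

# Hypothetical stand-in for spaCy output: (lowercased text, is_punct, is_stop)
tokens = [
    ("the", False, True), ("history", False, False), (".", True, False),
    ("of", False, True), ("book", False, False), ("history", False, False),
]

# Q1.1-style count: exclude punctuation only
no_punct = Counter(t for t, is_punct, _ in tokens if not is_punct)

# Q1.2-style count: exclude punctuation and stop words
content_only = Counter(
    t for t, is_punct, is_stop in tokens if not (is_punct or is_stop)
)

print(no_punct.most_common(2))      # [('history', 2), ('the', 1)]
print(content_only.most_common(2))  # [('history', 2), ('book', 1)]
```

With real data you would replace `most_common(2)` with `most_common(50)`; Q1.3 adds a POS filter (keep only tokens whose `pos_` is `"NOUN"`), and Q1.4 counts entity spans instead of tokens.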
• Q2.1 Implement a function to estimate the probability of a word stem, based on the articles' titles, using add-one smoothing.
• Q2.2 Implement a function to calculate the log probability of a text based on Q2.1's function.

Submit your implementation (.ipynb) and a short report to Canvas (submit as a zip file).
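The two Part II functions fit together naturally: the add-one (Laplace) estimate in Q2.1 is P(w) = (count(w) + 1) / (N + V), where N is the total number of tokens and V is the vocabulary size, and Q2.2 sums log probabilities over the words of a text. A minimal sketch, with function names and a toy lemma corpus that are illustrative rather than taken from the starter code:

```python
import math
from collections import Counter

def add_one_prob(word, counts, total, vocab_size):
    """Add-one (Laplace) smoothed unigram probability.

    Unseen words get probability 1 / (total + vocab_size) instead of 0.
    """
    return (counts.get(word, 0) + 1) / (total + vocab_size)

def log_prob(words, counts, total, vocab_size):
    """Log probability of a text under the smoothed unigram model."""
    return sum(
        math.log(add_one_prob(w, counts, total, vocab_size)) for w in words
    )

# Toy corpus of lemmas (a stand-in for the "lemma" column)
lemmas = ["history", "of", "book", "culture", "of", "history"]
counts = Counter(lemmas)
total = len(lemmas)        # N = 6
vocab_size = len(counts)   # V = 4

print(add_one_prob("of", counts, total, vocab_size))      # (2+1)/(6+4) = 0.3
print(add_one_prob("unseen", counts, total, vocab_size))  # (0+1)/(6+4) = 0.1
```

Summing log probabilities rather than multiplying raw probabilities avoids numeric underflow on longer texts, which is why Q2.2 is specified in terms of log probability.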

