Stat 133: Concepts in Computing with Data

THEME:
Use the computer expressively to conduct statistical analysis of data.
We will use existing software rather than build routines from the ground up.
We focus on various aspects of computing to conduct statistical analysis, NOT the computational aspects of statistical methods.
Statistical Thinking in the context of computing with data.
DATA Technologies – Statisticians' work includes interfacing/working closely with the original data and those who own it.

What are DATA?

– Typeset by FoilTEX –

Numbers

EXAMPLE: Daily precipitation amounts from a network of stations from the Colorado Front Range
• 56 weather stations
• Daily precipitation – hundredths of an inch (~400,000 measurements)
• Date – year (1948 to 2001) and day
• Location of weather station – latitude, longitude, and elevation

[1] 0 10 11 1 0 0 0 0 0 0 0 0 10 0 0 0 0 0
[19] 0 0 0 7 18 0 0 0 0 0 0 0 0 0 0 0 0 0
[37] 4 15 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0
[55] 0 0 0 45 0 3 28 0 0 0 0 41 2 0 0 0 0 3
[73] 0 0 0 0 112 0 0 0 0 0 0 0 0 0 0 0 9 2
[91] 0 0 2 18 0 0 0 0 0 8 7 3 0 0 14 53 0 0
[109] 0 0 0 10 0 0 0 0 0 63 5 0 0 6 0 0 0 0
[127] 66 76 5 13 2 2 103 8 25 0 1 2 0 0 0 0 0 0
[145] 0 0 0 0 4 0 0 0 0 4 6 90 257 2 159 6 18 30
[163] 55 5 33 16 1 0 0 0 0 0 0 0 0 0 0 1 11 4
[181] 0 0 0 0 0 18 22 31 16 25 42 0 2 9 0 0 0 19
[199] 0 0 16 0 0 0 0 0 30 0 0 0 0 0 0 0 0 0
[217] 9 0 9 0 0 25 32 1 9 5 0 0 0 0 4 0 0 0

Statistical problem:
GOAL:
• Plan for floods – how should land and roadways be developed?
• Agriculture and vegetation – does precipitation come in a limited series of intense events or is it more evenly distributed over many days?
• Climate change – global warming, how will extreme precipitation events change?

Statistical Investigations
• What is the distribution of large precipitation events and how does this distribution vary over space?
• How can irregular station observations be extrapolated to locations where measurements are not made?
• How well does a climate model simulation reproduce the features in the observed meteorology?

Text

EXAMPLE: SPAM = Unsolicited, mass, junk email
• > 50% of electronic mail is SPAM
• Offensive, time-consuming

Return-Path: [email protected]: Fri Sep 6 20:53:36 2002
From: [email protected] (David LeBlanc)
Date: Fri, 6 Sep 2002 12:53:36 -0700
Subject: [Spambayes] Deployment
In-Reply-To: <[email protected]>
Message-ID: <[email protected]>

You missed the part that said that spam is kept in the "eThunk" and was
viewable by a simple viewer for final disposition?

Of course, with Outbloat, you could fire up PythonWin and stuff the spam
into the Junk Email folder... but then you loose the ability to retrain on
the user classified ham/spam.

David LeBlanc
Seattle, WA USA

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of Tim
> Peters
> Sent: Friday, September 06, 2002 12:24
> To: [email protected]
> Subject: RE: [Spambayes] Deployment
>
> [Guido]
> > ...
> > - A program that acts both as a pop client and a pop server. You
> > configure it by telling it about your real pop servers. You then
> > point your mail reader to the pop server at localhost. When it
> > receives a connection, it connects to the remote pop servers, reads
> > your mail, and gives you only the non-spam.
>
> FYI, I’ll never trust such a scheme: I have no tolerance for false
> positives, and indeed do nothing to try to block spam on any of my email
> accounts now for that reason. Deliver all suspected spam to a Spam folder
> instead and I’d love it.
> _______________________________________________

Statistical problem:
GOAL: Identify SPAM before we read it.
Use statistical methodology to filter our electronic mail.
• Get sample, classified messages
• Convert or transduce text to response and predictor variables
• Fit statistical model to data
– use information from mail headers (i.e. sender, routing information, date, return address, etc.)
– use information in the content of the message body
• Tune the algorithm/model
– how often do we reject a regular message as SPAM?
– how often do we accept SPAM as a regular message?
• Deploy classifier as filter.

Images, Sound, Video

EXAMPLE: Traffic flow on highways in California
Video recordings 24/7; loop detectors at 22,000 locations transmit data every 30 seconds, collect 2 GB a day, and store 4 TB

Statistical problem:
GOAL: Understand how traffic flows under various road conditions
• What is the distribution of lane occupancy and how does occupancy in different lanes relate to each other?
• When traffic flow breaks down and then recovers at a later time, is the flow level at which it breaks down higher than the flow level at which it recovers? This phenomenon is called hysteresis.
• Researchers validate theories such as hysteresis and calibrate simulation models.

Statistical Thinking and the Data Analysis Cycle

Learn how to think about the data process
• Data ACQUISITION – Input/output, regular expressions
• Data CLEANING, verification, and manipulation – graphics, exploratory data analysis
• Data ORGANIZATION – data frames, XML, databases
• MODEL the data – fit statistical models to the data
• Data as a PSEUDO-POPULATION – assess the fit of the model via the bootstrap, cross-validation
• SIMULATED data – simulation studies

In this cycle we encounter:
• Statistical Concepts
• Computing Concepts
• Software

Statistical Concepts
• Graphics
– elements of graphing data
– grammar of graphics
– advanced plotting
• Computationally intensive methods
– Classification and Regression Trees
– Kth Nearest Neighbor clustering
– Thin plate splines
• Simulation tools
– Bootstrap
– Cross-validation
– Markov chain Monte Carlo

Computing Concepts
• Programming concepts - e.g. loops, recursion, trees
• Regular expressions and text manipulation
• Relational Databases
• Random number generation
• Representation of numbers in the computer
• Event handling and GUI development

Software
• R - statistical software
• Unix - shell commands
• SQL - Structured Query Language for relational databases
• XML - Extensible Markup Language
• Gtk -
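
The SPAM filtering steps listed earlier (get classified messages, convert text to predictor variables, fit a model, tune it by counting both kinds of error) can be sketched in a few lines. This is only an illustration under stated assumptions: the four training messages are invented, the choice of a Bernoulli naive Bayes model is ours rather than a method named in the slides, and the names tokenize, fit, spam_log_odds, classify, and error_counts are hypothetical. Python is used here for brevity; the course itself lists R as its statistical software.

```python
import math
import re
from collections import Counter

# Hypothetical toy messages, invented for illustration (not course data).
TRAIN = [
    ("win cash now claim your free prize", "spam"),
    ("free offer click now to win", "spam"),
    ("meeting notes attached for the stat class", "ham"),
    ("please review the homework for the class", "ham"),
]


def tokenize(text):
    """Convert raw text to predictor variables: the set of words present."""
    return set(re.findall(r"[a-z']+", text.lower()))


def fit(messages):
    """Count, per class, how many messages contain each word."""
    n_msgs = Counter()
    word_counts = {"spam": Counter(), "ham": Counter()}
    for text, label in messages:
        n_msgs[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return n_msgs, word_counts, vocab


def spam_log_odds(text, model):
    """Log odds of spam versus ham under a Bernoulli naive Bayes model."""
    n_msgs, word_counts, vocab = model
    words = tokenize(text)
    score = math.log(n_msgs["spam"] / n_msgs["ham"])
    for w in vocab:
        # Add-one smoothing keeps every probability strictly inside (0, 1).
        p_spam = (word_counts["spam"][w] + 1) / (n_msgs["spam"] + 2)
        p_ham = (word_counts["ham"][w] + 1) / (n_msgs["ham"] + 2)
        if w in words:
            score += math.log(p_spam / p_ham)
        else:
            score += math.log((1 - p_spam) / (1 - p_ham))
    return score


def classify(text, model, threshold=0.0):
    """Label a message spam when its log odds exceed the threshold."""
    return "spam" if spam_log_odds(text, model) > threshold else "ham"


def error_counts(messages, model, threshold=0.0):
    """Return (regular messages rejected as spam, spam accepted as regular)."""
    false_pos = false_neg = 0
    for text, label in messages:
        predicted = classify(text, model, threshold)
        if label == "ham" and predicted == "spam":
            false_pos += 1
        elif label == "spam" and predicted == "ham":
            false_neg += 1
    return false_pos, false_neg


model = fit(TRAIN)
print(classify("claim your free prize now", model))
print(error_counts(TRAIN, model))
```

Raising threshold makes the filter more reluctant to call a message spam, trading fewer rejected regular messages for more accepted SPAM; that is the tuning question posed on the slide. In practice the error counts would be computed on held-out messages rather than the training sample.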

