STAT 312: Statistics for Biology
(Some slides are modified from those of Holmes and Huber)
Instructor: Rebecca Lee
Lecture 1: Introduction
Fall 2024

Rebecca Lee, Stat 312, Lecture 1

Course information
- TR 2:20 pm to 3:35 pm at BLOC 169
- Instructor: Rebecca Lee (email: llrebecca21@tamu.edu)
- Instructor's Zoom office hours: 2:00 pm to 3:00 pm every Wed (by appointment), and Q&A every Wed 3:00 pm to 4:00 pm (by appointment)
- TA: Sayantan Roy (email: rivurox@tamu.edu)
- TA's Zoom office hours: Mon/Wed, 4:00 pm to 5:00 pm (by appointment)
- Optional recitation in BLOC 162, 1 pm to 2 pm every other Thursday

Course information
- Canvas will be used by the instructor for uploading course materials, posting announcements, and answering questions.
- Textbook: Modern Statistics for Modern Biology by Holmes and Huber, freely available at
  http://web.stanford.edu/class/bios221/book/
  https://www.huber.embl.de/msmb/index.html
- Additional references are listed in the syllabus.

Grading
- 40% homework, 25% midterm, 25% final, 10% quizzes
- In-class midterm (delivered online): Oct 3rd
- Take-home final: Dec 10th
- Dates are subject to change.

Course overview
This course is an integration of biological domain knowledge, statistical methodology, and programming skills.

Course overview
Our goals:
- Understand basic concepts and principles in statistical analysis and learning.
- Learn how to use R to visualize and analyze biological data.
Prerequisites: MATH 147 or its equivalent, and STAT 201 or its equivalent.

Introduction to R
- You can either install R on your personal laptop or use a web-based R coding environment. For the latter, we recommend https://rstudio.cloud
- We will use RStudio, a free IDE for R, to write and run R scripts.
- You need to use R and RStudio to do homework and exams.
- Lecture notes/slides will also be prepared using R.

Web-based RStudio
https://rstudio.cloud is free and easy to use. You only need to register with your email.

Install R and RStudio on personal laptops
- R runs on Unix, Linux, macOS, and Windows.
- You need to first install R (a programming language) and then RStudio (an interactive programming environment for R).
- Install R: http://cran.us.r-project.org
- Install RStudio (download the free version): https://www.rstudio.com/products/rstudio/download/

RStudio on your laptop
[image slide]

Homework
Complete two R tutorial courses:
- "Learning R" by Barton Poulson on LinkedIn Learning: https://linkedinlearning.tamu.edu
  You can sign in using your TAMU email address. Please see the homework assignment for details. Some chapters are not necessary and should be skipped.
- "Introduction to R" on DataCamp: https://www.datacamp.com
  Please use the invitation link in HW 0.5 to register so that you have free access to all DataCamp courses. If time allows, challenge yourself with the intermediate level.

Homework
In particular, make sure you understand:
- how to use RStudio
- data types and data structures in R
- arithmetic operations, including exponentiation and modulo
- variable assignment and comparison
- how to create and use a sequence
- how to create a vector and insert, delete, or access its elements
- how to create a matrix and access its rows, columns, and elements
- arithmetic operations involving vectors and matrices

R tutorial on DataCamp
[image slides]

Advantages of R
- Availability of 10,000 packages
- All statistical and machine learning methods available
- High-quality graphics (ggplot2)
- Reproducibility (R Markdown)
- Free and open source
- Seamless interaction with standard databases such as GenBank, Gene Ontology Consortium, UCSC, Human Genome Project, KEGG
Our goal is to understand the general principles of R programming and get familiar with the uses of the basic built-in functions, operators, and keywords.

R packages are tailored to various data formats
- Trees, networks, and graphs
- Microarrays, mass spectroscopy
- Maps
- Text
- Images
- DNA/AA sequencing data
- HTS, RNA-seq
- QIIME, mothur formats

Specificities of modern biological data
- Genetic data are discrete (counts, transitions, states).
- Independence is not the norm.
- Large, heterogeneous data sets
- Need to interface statistics programs with databases
- Nonstandard parameters that we need to estimate (trees, graphs, etc.)
- Complex plotting procedures
- Reproducibility of all research: diaries, write-ups, documentation

General principles do apply
- Generative probability models (Poisson, binomial)
- Statistical methods (maximum likelihood estimation)
- High-quality graphics at three different levels
- Data transformations
- Removing unwanted variation
- Experimental design

Some aims of the course
Learn useful probabilistic tools for analyzing genotype, expression, protein, metabolic, immunological, or microbial data:
- Discrete random variables: binomial, multinomial, Poisson
- Monte Carlo simulation, nonparametric testing, power
- Filtering, de-noising, modeling, and transforming data
- Mixture models, latent variables
- Expectation, conditional probability, variance (we only need the basics)

Some aims of the course
Learn the statistical tools for analyzing large data sets:
- Data preprocessing, variance stabilization
- Maximum likelihood estimation, Bayesian methods
- Multiple testing
- Machine learning
- Multivariate analysis: PCA, clustering
- High-quality visualization (ggplot2)

Worlds of variability
- Biology cannot be easily summarized into simple principles because it is a world of complex variation.
- Variation: differences between organisms.
- It is variability that has enabled evolution, and it is variability that ensures the robustness of complex biological systems. This is the rule rather than the exception in biological systems.
- Statistics and probability provide many tools for decomposing the signals in medical, genetic, and ecological data.

Worlds of variability
[image slide]

Particularities of genomic data
- Genetic sequence data are often discrete: either binary or categorical (A, C, G, T).
- Most of the data come in the form of counts or frequency tables, which we call contingency tables.
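
As a minimal sketch of how categorical sequence data turn into counts and contingency tables, the R snippet below tabulates nucleotides with base R's `table()`. The two DNA strings and the names `seq1`, `seq2`, `counts1`, and `tab` are made up for illustration and are not course data.

```r
# Hypothetical toy data: two short, made-up DNA sequences of equal length.
seq1 <- "ACGTTGCAAC"
seq2 <- "ACGATGCTAC"

# Split each string into a vector of single-letter nucleotides.
x <- strsplit(seq1, "")[[1]]
y <- strsplit(seq2, "")[[1]]

# Fix the category levels so every nucleotide gets a cell,
# even if its count happens to be zero.
nuc <- c("A", "C", "G", "T")
x <- factor(x, levels = nuc)
y <- factor(y, levels = nuc)

# One-way frequency table: how often each nucleotide occurs in seq1.
counts1 <- table(x)
print(counts1)

# Two-way contingency table: nucleotide in seq1 (rows) versus
# nucleotide in seq2 (columns), compared position by position.
tab <- table(seq1 = x, seq2 = y)
print(tab)
```

The diagonal of `tab` counts positions where the two sequences agree, and the off-diagonal cells count mismatches; `sum(diag(tab))` gives the number of matching positions.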