Laboratory 28

Preamble script block to identify host, user, and kernel

    import sys
    ! hostname
    ! whoami
    print(sys.executable)
    print(sys.version)
    print(sys.version_info)

    DESKTOP-425VE1E
    desktop-425ve1e\brent
    C:\Users\brent\anaconda3\python.exe
    3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)]
    sys.version_info(major=3, minor=9, micro=12, releaselevel='final', serial=0)

Full name: Brentyn Melton
R#: 11784727
Title of the notebook: Lab 288
Date: 11/20/2022

Let's go over some important terminology:

Linear Regression: a basic predictive analytics technique that uses historical data to predict an output variable.

The Predictor variable (input): the variable(s) that help predict the value of the output variable. It is commonly referred to as X.

The Output variable: the variable that we want to predict. It is commonly referred to as Y.

To estimate Y using linear regression, we assume the equation

$$Y_e = \alpha + \beta X$$

where $Y_e$ is the estimated or predicted value of Y based on our linear equation. Our goal is to find statistically significant values of the parameters $\alpha$ and $\beta$ that minimise the difference between $Y$ and $Y_e$. If we are able to determine the optimum values of these two parameters, then we will have the line of best fit that we can use to predict the values of Y, given the value of X.

So, how do we estimate $\alpha$ and $\beta$? We can use a method called Ordinary Least Squares (OLS). The objective of the least squares method is to find values of $\alpha$ and $\beta$ that minimise the sum of the squared differences between $Y$ and $Y_e$ (the distance between the linear fit and the observed points). We will not go through the derivation here, but using calculus we can show that the values of the unknown parameters are as follows:

$$\beta = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

$$\alpha = \bar{Y} - \beta \bar{X}$$

where $\bar{X}$ is the mean of the X values and $\bar{Y}$ is the mean of the Y values. $\beta$ is simply the covariance of X and Y, Cov(X, Y), divided by the variance of X, Var(X).

Covariance

In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values (i.e., the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other (i.e., the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.

The Correlation Coefficient

Correlation coefficients are used in statistics to measure how strong a relationship is between two variables. There are several types of correlation coefficient, but the most popular is Pearson's. Pearson's correlation (also called Pearson's R) is a correlation coefficient commonly used in linear regression. Correlation coefficient formulas return a value between -1 and 1, where:

1: A correlation coefficient of 1 means that for every positive increase in one variable, there is a positive increase of a fixed proportion in the other. For example, shoe sizes go up in (almost) perfect correlation with foot length.

-1: A correlation coefficient of -1 means that for every positive increase in one variable, there is a decrease of a fixed proportion in the other. For example, the amount of gas in a tank decreases in (almost) perfect correlation with speed.

0: Zero means that for every increase, there isn't a positive or negative increase. The two just aren't related.

We had a table of recorded times and speeds from some experimental observations:

    Elapsed Time (s)    Speed (m/s)
    0                   0
    1.0                 3
    2.0                 7
    3.0                 12
    4.0                 20
    5.0                 30
    6.0                 45.6
    7.0                 60.3
    8.0                 77.7
    9.0                 97.3
    10.0                121.1

First, let's create a dataframe:

    # Load the necessary packages
    import numpy as np
    import pandas as pd
    import statistics
    from matplotlib import pyplot as plt

    # Create a dataframe:
    time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
    speed = [0, 3, 7, 12, 20, 30, 45.6, 60.3, 77.7, 97.3, 121.2]
    data = pd.DataFrame({'Time': time, 'Speed': speed})
    data

        Time   Speed
    0    0.0     0.0
    1    1.0     3.0
    2    2.0     7.0
    3    3.0    12.0
    4    4.0    20.0
    5    5.0    30.0
    6    6.0    45.6
    7    7.0    60.3
    8    8.0    77.7
    9    9.0    97.3
    10  10.0   121.2

Now, let's explore the data:

    data.describe()

                Time       Speed
    count  11.000000   11.000000
    mean    5.000000   43.100000
    std     3.316625   41.204077
    min     0.000000    0.000000
    25%     2.500000    9.500000
    50%     5.000000   30.000000
    75%     7.500000   69.000000
    max    10.000000  121.200000

Is there a relationship (based on covariance, correlation) between time and speed?

    time_var = statistics.variance(time)
    speed_var = statistics.variance(speed)
    print("Variance of recorded times is ", time_var)
    print("Variance of recorded speed is ", speed_var)

    Variance of recorded times is  11.0
    Variance of recorded speed is  1697.7759999999998

    # To find the covariance
    data.cov()

             Time     Speed
    Time    11.00   131.750
    Speed  131.75  1697.776

    # To find the correlation among the columns using the Pearson method
    data.corr(method='pearson')

               Time     Speed
    Time   1.000000  0.964082
    Speed  0.964082  1.000000

Let's do linear regression with primitive Python:

To estimate y using the OLS method, we need to calculate xmean and ymean, the covariance of X and y (xycov), and the variance of X (xvar) before we can determine the values for alpha and beta. In our case, X is Time and y is Speed.

    # Calculate the mean of X and y
    xmean = np.mean(time)
    ymean = np.mean(speed)

    # Calculate the terms needed for the numerator and denominator of beta
    data['xycov'] = (data['Time'] - xmean) * (data['Speed'] - ymean)
    data['xvar'] = (data['Time'] - xmean)**2

    # Calculate beta and alpha
    beta = data['xycov'].sum() / data['xvar'].sum()
    alpha = ymean - (beta * xmean)
    print(f'alpha = {alpha}')
    print(f'beta = {beta}')

    alpha = -16.78636363636363
    beta = 11.977272727272727

We now have an estimate for alpha and beta. Our model can be written as $Y_e = 11.977X - 16.786$, and we can make predictions:

    X = np.array(time)
    ypred = alpha + beta * X
    print(ypred)

    [-16.78636364  -4.80909091   7.16818182  19.14545455  31.12272727
      43.1         55.07727273  67.05454545  79.03181818  91.00909091
     102.98636364]

Let's plot our prediction ypred against the actual values of y, to get a better visual understanding of our model:

    # Plot regression against …
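
As a cross-check on the hand-computed estimates, the slope and intercept can also be recovered from library routines. The sketch below is an illustration only, not part of the original lab: it re-enters the time and speed lists from the lab's code, computes the slope as Cov(X, Y) / Var(X) with sample (n - 1) statistics, compares it with a degree-1 least-squares fit from `np.polyfit`, and recomputes Pearson's r as the normalized covariance. The names `slope`, `intercept`, `fit_slope`, `fit_intercept`, and `r` are hypothetical and chosen for this example.

```python
import numpy as np

# Data re-entered from the lab's code cell (the last speed is 121.2 there).
time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
speed = [0, 3, 7, 12, 20, 30, 45.6, 60.3, 77.7, 97.3, 121.2]

x = np.array(time, dtype=float)
y = np.array(speed, dtype=float)

# OLS slope as Cov(X, Y) / Var(X), both with the sample (n - 1) divisor.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)
slope = cov_xy / var_x
intercept = y.mean() - slope * x.mean()

# Independent check: least-squares fit of a degree-1 polynomial.
# np.polyfit returns coefficients from highest degree to lowest.
fit_slope, fit_intercept = np.polyfit(x, y, deg=1)

# Pearson's r as covariance divided by the product of sample standard deviations.
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"polyfit slope = {fit_slope:.3f}, polyfit intercept = {fit_intercept:.3f}")
print(f"Pearson r = {r:.6f}")
```

If the two routes agree, the slope and intercept should match the beta and alpha printed earlier, and r should match the off-diagonal entry of `data.corr(method='pearson')`.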
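
The covariance discussion notes that the magnitude of the covariance depends on the units of the variables, while the correlation coefficient is normalized. A minimal sketch of that point, assuming the lab's time and speed data and a hypothetical unit change from m/s to km/h (a factor of 3.6); the helper names `sample_cov` and `pearson_r` are invented for this example:

```python
import numpy as np

# Data re-entered from the lab's code cell.
time = np.array([0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
speed_ms = np.array([0, 3, 7, 12, 20, 30, 45.6, 60.3, 77.7, 97.3, 121.2])

# Same measurements expressed in km/h; only the unit changes.
speed_kmh = speed_ms * 3.6


def sample_cov(a, b):
    """Sample covariance (n - 1 divisor) of two 1-D arrays."""
    return np.cov(a, b, ddof=1)[0, 1]


def pearson_r(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


cov_ms = sample_cov(time, speed_ms)
cov_kmh = sample_cov(time, speed_kmh)
r_ms = pearson_r(time, speed_ms)
r_kmh = pearson_r(time, speed_kmh)

# The covariance scales with the unit change; the correlation does not.
print(f"Cov(time, speed) in m/s:  {cov_ms:.3f}")
print(f"Cov(time, speed) in km/h: {cov_kmh:.3f}")
print(f"Pearson r in m/s:  {r_ms:.6f}")
print(f"Pearson r in km/h: {r_kmh:.6f}")
```

The covariance grows by the same factor as the rescaled variable, which is why its magnitude alone says little about the strength of the relationship.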