When to Picnic

Peter Barnum and Vinithra Varadharajan
The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213
pbarnum@cs.cmu.edu, vvaradha@ri.cmu.edu

1 Introduction

The quantitative and reliable prediction of the level of precipitation is important for scientific, economic, and ecological reasons [3]. Currently, there are many global climate models, but there is still a great deal of uncertainty in their predictions. Not only do they not agree among themselves, they do not even reproduce the past correctly [7], which is analogous to having high training error. Such poor performance is primarily due to the vast number of local and global factors that influence weather. Hence, accurately predicting precipitation levels from datasets of historical records from different areas is an ideal machine learning problem. Many supervised learning techniques are applicable, with different levels of accuracy. In addition, the choice of technique needs to take into consideration that these datasets often suffer from missing values.

This report discusses the use of one such machine learning technique, namely nearest neighbor, to predict the level of precipitation expected on a particular day at a particular location, given data on the level of precipitation that occurred on previous days at the same location and at neighboring locations. We begin by stating the problem and describing the data provided. A discussion of previous work on the topic sets the scene for our approach to the problem. This is followed by a description of the nearest neighbor machine learning technique and how it is used in weather prediction. We then describe the design and implementation of the experiments and discuss the results obtained. The report ends with a discussion of future work and conclusions.

2 Problem definition

We use a machine learning technique to predict the level of precipitation based on historical precipitation data. We use the Widmann and
Bretherton dataset, which includes 45 years of daily precipitation data on a 50 km x 50 km grid covering the Northwest of the US, in netCDF format. The data has three dimensions: latitude, longitude, and time in days. The unit for each entry is mm/day, and each entry refers to the precipitation that occurred at a particular location, specified by the latitude and longitude values, on a particular day, specified by the time value. Meeting the prediction objective requires understanding the data and its features, selecting a machine learning technique, applying the technique under explicit assumptions, and evaluating the entire approach by analyzing the results.

The data extracted from the netCDF file is scaled by a factor of 0.1 and has several missing values, each indicated by a value of 32767. We prepared the data by descaling it and setting each missing entry to 0 instead of 32767. Dealing with missing data explicitly adds unnecessary complexity, and we have no intelligent way to pick a prior for the missing entries except that there is more often no rain than some rain.

3 Related Work

People have tried to predict weather with various techniques and models for millennia. Early weather prediction involved memorizing lists of predictive rules such as "red sky at night, sailor's delight," which was probably based on a data-driven approach that recognized that if the sky was red at night, the weather was often fair the next day. Advances in modeling have led to additional techniques. According to Beniston [1], a variety of models are used, depending on the resolution that is needed. For example, much more specific physical effects are used for local weather prediction than for global climate prediction. Wikipedia [12] divides these models into two categories: global models and regional models. Common global models are GFS, NOGAPS, GEM, ECMWF, UKMET, and GME. Common regional models are WRF, NAM, NMM, WRF AR, WRF MM, and HIRLAM. These models predict a variety of factors, such as temperature, dew point, wind speed and
direction, precipitation, and precipitation type. In contrast, our work tries to predict only the amount of precipitation. We have at our disposal a smaller set of features than those that these models take advantage of, so we cannot use these models directly.

Many different machine learning methods and assumptions have been suggested for predicting weather, and their accompanying difficulties have been listed. In [4], Palmer approaches the problem of uncertainty in forecasts of weather and climate using ensembles of integrations of comprehensive weather and climate prediction models, with explicit perturbations to both initial conditions and model formulation, resulting in an ensemble of forecasts that can be interpreted as a probabilistic prediction. He then uses singular vector methods to determine the linearly unstable component of the initial probability density function. He bases his prediction systems on timescales of days, seasons, and decades. He states that many of the difficulties in forecasting predictability arise from the large dimensionality of the climate system.

In [7], the Bayesian approach to model-based data interpretation is used to investigate global climate modeling and prediction. It has been found to be particularly useful in applications where a large amount of prior domain knowledge is available. The Bayesian approach can not only find the most probable model, but it can also say how accurate the prediction is.

Two methods that have been suggested specifically for the prediction of precipitation are neural networks [5, 9, 2] and prognostic equations [10]. In [3], Ehrendorfer states that quantification of atmospheric predictability asks for the rate at which two initially close trajectories diverge for given atmospheric dynamics. Such estimates place upper bounds on the time horizons over which useful forecasts may be expected.

The literature discussed here has led us to believe that, given our dataset, two key features worth analyzing are the influence of time and space on
the precipitation at a particular location.

4 Using nearest neighbors for prediction

As discussed above, weather prediction is a complex and unsolved problem. The problem is complex largely due to the huge number of hidden factors. It would not be unreasonable to say that if weather were a graphical model, there would be a million latent variables for every observed one. Given this complexity, we do not want to use a parametric method, as it is not clear that such a model would be able to capture subtle
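
To make the overall idea concrete, the following is a minimal sketch, not the implementation used in this report, of the data preparation from Section 2 followed by a nearest neighbor prediction. It assumes a single location's daily series, a feature vector made of the previous `window` days, Euclidean distance, and an unweighted average over the `k` closest historical examples. The function names, `window`, `k`, and the sample values are all hypothetical. Precipitation from neighboring grid cells could be appended to each feature vector in the same way.

```python
import math

SCALE_FACTOR = 0.1   # scale factor stated in Section 2
MISSING_RAW = 32767  # missing-value marker stated in Section 2


def prepare(raw):
    """Descale raw values to mm/day and replace the missing marker with 0."""
    return [0.0 if v == MISSING_RAW else v * SCALE_FACTOR for v in raw]


def make_examples(series, window):
    """Pair each run of `window` consecutive days with the day that follows."""
    return [(series[t - window:t], series[t])
            for t in range(window, len(series))]


def knn_predict(examples, query, k):
    """Average the targets of the k examples nearest to `query` (Euclidean)."""
    ranked = sorted(examples, key=lambda ex: math.dist(ex[0], query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical raw values for one grid cell (illustration only, not real data).
raw = [0, 12, 30, 32767, 0, 10, 31, 5, 0, 11]
series = prepare(raw)
examples = make_examples(series, window=2)

# Predict the next day from the two most recent days.
prediction = knn_predict(examples, query=series[-2:], k=2)
```

Because the method is non-parametric, all of the work happens at query time: no model is fit, and the prediction is determined entirely by which historical days most resemble the recent past.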