UI STAT 5400 - Simulation - Quantiles of Large Datasets Using Liu & Su Method

Simulation: Quantiles of Large Datasets Using Liu & Su Method

Rashidat Brobbey, Shih-Wei Su

December 4, 2006

1 Background Information

Businesses all over the world are developing rapidly, and information gathering and storage are becoming an increasing challenge. Thousands of companies, such as insurance companies, banks, and manufacturing firms, need to store huge amounts of private data. For example, customers' salaries, average customer lifetimes, and other personal attributes are some of the important data an insurance company may collect. As the population grows, these companies need large storage systems to hold their customers' data, and they also need good statistical software to extract useful information from these huge datasets. The development of the computer has solved the first problem, namely how to save and store huge amounts of data. While we are satisfied with the large storage capacity of today's computers, we unfortunately face another issue: at present, no business software can analyze more than one million records. For example, given a dataset of 10 million records, it is a challenge for an insurance company to find the median human life expectancy in order to determine the price it should charge for life insurance.

Solutions to this problem are now widely discussed. One interesting remedy for finding quantiles of large datasets is the frequency table method proposed by Liu and Su in 2001, here called the L&S method. So the natural question is: what is the L&S method? The main idea is to use a frequency table to store the frequencies of the data instead of storing all of the data values, which reduces the amount of memory the data would occupy. For example, if we use a frequency table with 2000 intervals to store the frequencies of one million observations, the memory requirement obviously drops from one million values to 2000. So how do we get a quantile of the data using the information in the frequency table?
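As a rough sketch of this memory-reduction idea (using standard R functions, not the exact code from the paper), a frequency table with 2000 intervals can be built like this:

```r
# Replace the raw observations with interval counts: 2000 numbers
# instead of 100000 (or one million) stored values.
set.seed(1)
x <- rnorm(1e5)            # stand-in for a "large" dataset
A <- 2000                  # number of intervals
breaks <- seq(min(x), max(x), length.out = A + 1)
idx <- findInterval(x, breaks, rightmost.closed = TRUE)
counts <- tabulate(idx, nbins = A)
sum(counts) == length(x)   # TRUE: every observation lands in exactly one interval
```

Only the vectors `breaks` and `counts` need to be kept; quantiles are then recovered from the table rather than from the raw data.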
The L&S method suggests fitting a fourth-degree polynomial to model the distribution of the data within each interval of the frequency table. For example, to fit a fourth-degree polynomial in the kth interval, there are five unknown parameters, so we can use the five frequencies lying in the (k-2)th, (k-1)th, kth, (k+1)th, and (k+2)th intervals to solve the five resulting equations. In other words, the L&S method uses the distribution of the data among the (k-2)th through (k+2)th intervals to model the distribution of the data within the kth interval.

2 Method of Simulation

We investigate how good the L&S method is by simulation. We chose three factors of interest:

1. Distribution: Normal(0,1), Gamma(2,4), Exp(0.1), and a mixed distribution
2. Sample size: 500000 and 1000000
3. Number of intervals: 500 and 2000

In the mixed-distribution case, we mixed the first three distributions to form our new distribution. To keep the total sample sizes consistent, each component distribution contributes 166667 observations when the sample size is 500000 and 333333 when it is 1000000.

The evaluation criterion we chose is the average difference between the estimated quantiles and the true quantiles of the data. Because we generated 1000 independent datasets for the simulation, we obtain the average difference by summing all the differences and dividing by 1000.

3 Tables of Results

              No. of Intervals = 500      No. of Intervals = 2000
Quantile     N=500000     N=1000000      N=500000     N=1000000
ξ.25           -6.64       -20.10           5.40        -0.19
ξ.5            13.99         0.50           5.11         3.49
ξ.75           29.92        26.69          12.35         5.018

Table 1: Average bias (×10^-6) of quantile estimation for N(0,1)

              No. of Intervals = 500      No. of Intervals = 2000
Quantile     N=500000     N=1000000      N=500000     N=1000000
ξ.25           -1.68        -0.17           1.16         0.98
ξ.5            11.91        10.28           1.23         1.99
ξ.75           16.51        19.10           5.07         3.98

Table 2: Average bias (×10^-6) of quantile estimation for Gamma(2,4)
              No. of Intervals = 500      No. of Intervals = 2000
Quantile     N=500000     N=1000000      N=500000     N=1000000
ξ.25            7.91       -79.84           1.26         7.78
ξ.5           -15.95       -78.19           1.52        -0.48
ξ.75           48.71        46.27          17.15        12.48

Table 3: Average bias (×10^-6) of quantile estimation for Exp(0.1)

              No. of Intervals = 500      No. of Intervals = 2000
Quantile     N=500000     N=1000000      N=500000     N=1000000
ξ.25        -4322.71     -4453.87        -291.85      -331.45
ξ.5          2806.36      3046.92         467.14       566.64
ξ.75         -239.09      -361.40         181.25       199.94

Table 4: Average bias (×10^-6) of quantile estimation for the mixed distribution of N(0,1), Gamma(2,4), and Exp(0.1)

4 Summary of Results

From the tables of results, the larger the number of intervals we choose, the better the estimated quantiles: when the number of intervals was 2000, the estimated quantiles were closer to the true quantiles than when it was 500. This is because the range of the data is fixed, so a larger number of intervals narrows the range of each interval of the frequency table. The authors of the L&S method suggest that the bigger the sample size, the more accurate the estimates will be, but in our simulation the results for sample sizes of 500000 and 1000000 are not significantly different. The biases in Table 4 for the mixed distribution look larger than those for the other three distributions. This may be because the mixed distribution is more complicated than the other three, so a fourth-degree polynomial is not a good way to model it; perhaps a higher-degree polynomial is needed. In conclusion, the results obtained from the simulation seem reasonable.
The average bias is very small in absolute terms, since the values shown above are scaled by 10^-6. This shows that the L&S method is a good way of estimating quantiles of large datasets.

A  R Code for Simulation

Below is the code for the normal distribution. For the other distributions, we just make the corresponding substitutions.

q <- function(x) {
  dd1 <- 0
  dd2 <- 0
  dd3 <- 0
  for (j in 1:1000) {
    x <- rnorm(1000000)             # N(0,1) sample; substitute rexp()/rgamma() for the other cases
    N <- length(x)
    A <- 2000                       # number of intervals
    quan <- matrix(0, A, 6)
    count <- rep(0, A)
    constant <- matrix(0, A, 5)
    int <- (max(x) - min(x)) / A    # width of each interval
    indix <- rep(0, A + 1)          # the A + 1 interval endpoints
    for (i in 1:(A + 1)) {
      indix[i] <- min(x) + (i - 1) * int
    }
    # interval index of each observation; pmax() keeps the minimum,
    # for which ceiling() would otherwise return 0
    xx <- pmax(1, ceiling((x - min(x)) / int))
    for (i in 1:N) {
      count[xx[i]] <- count[xx[i]] + 1
    }
    p1 <- 0
    p2 <- 0
    p3 <- 0
    for (i in   # (remainder of the listing is truncated)
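The fourth-degree polynomial step described earlier does not appear in the truncated listing above. As an illustration of how five interval counts can determine five polynomial coefficients, one possible setup is to interpolate the counts at the interval midpoints; the exact system of five equations used by Liu and Su (2001) may be set up differently, and the function name `fit_poly4` is hypothetical:

```r
# Hypothetical sketch: fit a degree-4 polynomial through the counts of the
# five intervals (k-2)..(k+2), as one way to obtain its five coefficients.
fit_poly4 <- function(counts, mids, k) {
  idx <- (k - 2):(k + 2)            # five neighbouring intervals
  V <- outer(mids[idx], 0:4, `^`)   # 5 x 5 Vandermonde matrix
  solve(V, counts[idx])             # coefficients a0..a4
}

# toy check: counts generated by a known quartic are recovered exactly
mids <- 1:7
true_a <- c(2, -1, 0.5, 0, 0.1)
counts <- drop(outer(mids, 0:4, `^`) %*% true_a)
a_hat <- fit_poly4(counts, mids, k = 4)
all.equal(as.numeric(a_hat), true_a)   # TRUE
```

Once the coefficients are known, the fitted polynomial models the density within the kth interval, and the quantile is located by inverting the implied cumulative frequency.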

