DOC PREVIEW
Berkeley STAT 133 - Accessing a Database from R

This preview shows page 1 out of 4 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 4 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 4 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1 Accessing a Database from RWe have noted already that SQL has limited numerical and statistical fea-tures. For example, it has no least squares fitting procedures, and to findquantiles requires a sophisticated query. (Celko discusses the pros and consof more than eight different advanced queries to find a median [?].) Notonly are basic statistical functions missing from SQL, but in many cases thenumerical algorithms used in the basic aggregate functions are not imple-mented to safeguard numerical accuracy. Also, the wide range of data typesmay have drawbacks when it comes to performing arithmetic calculationsacross a row, as some of the conversions from one numeric type to anothermay produce unexpected truncation and rounding. For these reasons, itmay be desireable or even necessary to perform a statistical analysis in astatistical package rather than in the database. One way to do this, is toextract the data from the database and import it into statistical software.The statistical software may either reside on the server-side, i.e. onthe machine which hosts the database, or it may reside on the client-side,i.e. the user’s machine. The DBI package in R provides a uniform, client-side interface to different database management systems, such as MySQL,PostgreSQL, and Oracle. The basic model breaks the interface between theclient and the server into three main elements: the driver facilitates thecommunication between the R s ess ion and a particular type of databasemanagement system (e.g. MySQL); the connection encapsulates the actualconnection (with the aid of the driver) to a particular database managementsystem and carries out the requested queries; and the result which tracksthe status of a query, such as the number of rows that have been fetchedand whether or not the query has completed.The DBI package provides a general interface to a database manage-ment system. Additional packages that handle the specifics for particulardatabase management systems are required. For example, the RMySQL1package extends the DBI package to provide a MySQL driver and the de-tailed inner workings for the generic functions to connect, disconnect, andsubmit and track queries. The RMySQL package uses client-side softwareprovided by the database vendor to manage the connection, send queries,and fetch results. The R code the user writes to establish a MySQL driver,connect to a MySQL database, and request results is the same code for allSQL-standard database managers.We provide a simple example here of how to extract data from a MySQLdatabase in an R session. The first step: load a driver for a MySQL-typedatabase:drv = dbDriver("MySQL")The next step is to make a connection to the database management serverof interest. This connection stays alive for as long as you want it. Forsome types of database management systems, such as MySQL, the user canestablish multiple connections: each one to a different database or differentserver. Below, the user s133cs establishes a connection, called con, to thedatabase named BaseballDataBank on the host statdocs.berkeley.edu. Sincethe database is not password protected, the user need not provide a passwordto gain access to it.con = dbConnect(drv, user="s133cs", dbname="BaseballDataBank",host="statdocs.berkeley.edu")Once the connection is established, queries can be sent to the database.Some queries are sent via R functions. For example, the following call to thedbListTables function submits a SHOW TABLES query that gets remotelyexecuted on the database s erver. It returns the names of the tables in theBaseballDataBank database.dbListTables(con)2As another example, the dbReadTable function p erforms simple SE-LECT queries that mimics the R counterpart ’get.’ That is, dbReadTableimports the Allstar table from the database into R as a data frame, usingthe attribute PlayerID as the row.names for the data frame.dbReadTable(con, "Allstar", row.names = "PlayerID")Other RMySQL functions are dbWriteTable, dbExistsTable, and dbRe-moveTable, which are equivalent to the R functions ’assign’, ’exists’, and’remove’, respectively.Other queries can be executed by supplying the SQL statement. Forexample, to perform a simple aggregate query, there is no need to pull adatabase table into R and apply an R function to the data frame. Instead,we issue a select statement and retrieve the results table as a data frame.Below is an example where we obtain the number of tuples in the Allstartable of BaseballDataBank.dbGetQuery(con,"SELECT COUNT(*) FROM Allstar;")When the result table is huge, we may not want to bring it into R inits entirety, but instead fetch the tuples in batches, possibly reducing thebatches to simple summaries before requesting the next batch. We providea detailed example of this approach in Section ??. Instead of dbGetQuery,we use dbSendQuery to fetch results in batches. The DBI package providesfunctions to keep track of whether the statement produces output, how manyrows were affected by the operation, how many rows have been fetched (ifstatement is a query), and whether there are more rows to fetch.In the example below, rather than using dbReadTable to pull over theentire TCPConnections table, the dbSendQuery function is used to sendthe query to the database without retrieving the results. Then, the fetchfunction pulls over tuples in blocks. In this example, the first 500 tuplesare retrieved, then the next 200, after which we determine that there are3more results to be fetched (dbHasCompleted) and clear the results object(dbClearResult) without bringing over any more tuples from the SQL server.rs = dbSendQuery(con2, "SELECT * FROM TCPConnections;")firstBatch = fetch(rs, n = 500)secondBatch = fetch(rs, n = 200)dbHasCompleted(rs)dbClearResult(rs)In addition, the n = −1 assignment for the parameter specifies that allremaining tuples are to fetched. The fetch function converts each attribute inthe result set to the corresponding type in R. In addition, dbListResults(con)gives a list of all currently active result set objects for the connection con, anddbGetRowCount(rs) provides a status of the number of rows that have beenfetched in the query. When finished, we free up resources by disconnectingand unloading the


View Full Document
Download Accessing a Database from R
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Accessing a Database from R and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Accessing a Database from R 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?