BU CS 105 - Data Mining for Factors That Affect US Crime Rates


Data Mining for Factors That Affect US Crime Rates
Yoni Yudin

Introduction

Crime rate is an important consideration when deciding where to live. It is a statistical measure that reflects the quality of life in a given location, so anyone looking to move to a new place is likely to treat it as a crucial factor. This project explores some of the factors that influence crime rates in the US and uses data to predict and generalize trends. It therefore addresses the problem of estimating a numeric value for the crime rate in a given location. It uses SQLite3, a database management system, and Weka, a data mining program, to determine a model that calculates crime rate. Furthermore, by using SQLite3 to query the statistical data and Many-Eyes to produce graphic visualizations, I discovered trends in US crime rates and gained additional insights from combining other factors. For data mining, I used numeric estimation, specifically the Linear Regression, SMOreg, and Multilayer Perceptron algorithms, to create models that predict crime rates and generalize about the circumstances that contribute to them.

Dataset Description

The data I used to compile the dataset came from the U.S. Department of Justice, Federal Bureau of Investigation, and from the U.S. Census Bureau. The data is for the year 2005.

Websites:
http://www.fbi.gov/ucr/05cius/
http://www.census.gov/

The dataset includes 2934 counties from 49 states in the U.S. and has 38 attributes. I picked these attributes because I wanted to see what models Weka would produce based on attributes such as age, race, and economic conditions.

List of Attributes in the Dataset:
County- The name of the county.
State- The name of the county’s state (two-letter abbreviation, e.g., NY, MA, TX…).
POP_SIZE- The size of the population.
Under_5_years- Percentage of the population under the age of 5.
BTW_5_14- Percentage of the population between the ages 5 and 14.
BTW_15_24- Percentage of the population between the ages 15 and 24.
BTW_25_34- Percentage of the population between the ages 25 and 34.
BTW_35_44- Percentage of the population between the ages 35 and 44.
BTW_45_54- Percentage of the population between the ages 45 and 54.
BTW_55_64- Percentage of the population between the ages 55 and 64.
BTW_65_74- Percentage of the population between the ages 65 and 74.
Over_75- Percentage of the population over 75.
White- Percentage of the population that is Caucasian.
Black- Percentage of the population that is African-American.
Asian- Percentage of the population that is Asian.
AMR_IND- Percentage of the population that is Native American.
Hisp_Latin- Percentage of the population of Hispanic or Latin origin.
Male_per_100_FEM- The number of males per 100 females.
High_School- Percentage of the population that has a high school degree.
Bachelor- Percentage of the population that has a bachelor's degree or higher.
Foreign_Born- Percentage of the population born outside the U.S.
Poverty- The poverty rate.
Total_Labor- The total number of people in the labor force.
Num_Unemp- The total number of people unemployed.
Unemp_Rate- The unemployment rate.
Numb_Establishment- The number of business establishments.
Numb_Employed- The total number of people employed in non-farming businesses.
Tot_Violent- The total number of reported violent crimes.
MUR_MAN_SLA- The total number of reported murders and nonnegligent manslaughters.
FOR_RAPE- The total number of reported forcible rapes.
ROBBERY- The total number of reported robberies.
AGG_ASSAU- The total number of reported aggravated assaults.
TOT_PROP- The total number of reported property crimes.
BURGLARY- The total number of reported burglaries.
LARCENY_THEFT- The total number of reported larcenies and thefts.
MOT_VEH_THF- The total number of reported motor vehicle thefts.
TOTAL_CRIME- The total number of reported violent and property crimes.
CRIME_RATE- The crime rate, expressed as the total number of reported crimes per 100,000 inhabitants.

These attributes came from separate Excel spreadsheets and were divided into several tables in SQLite3. At a later stage, the attributes were combined into one large table in a different database to make it easier to run SQL commands and to mine the data in Weka.

The tables and their schemas are:
Population(County VARCHAR(40), State VARCHAR(5), Pop_Size INTEGER);
Age_Race_Sex(County VARCHAR(40), State VARCHAR(5), Under_5_years FLOAT, BTW_5_14 FLOAT, BTW_15_24 FLOAT, BTW_25_34 FLOAT, BTW_35_44 FLOAT, BTW_45_54 FLOAT, BTW_55_64 FLOAT, BTW_65_74 FLOAT, Over_75 FLOAT, White FLOAT, Black FLOAT, Asian FLOAT, AMR_IND FLOAT, Hisp_Latin FLOAT, Male_per_100_FEM FLOAT);
County_Edu_Pov(County text, State text, High_School real, Bachelor real, Foreign_Born real, Poverty real);
Crime(County VARCHAR(40), State VARCHAR(5), Tot_Violent INTEGER, MUR_MAN_SLA INTEGER, FOR_RAPE INTEGER, ROBBERY INTEGER, AGG_ASSAU INTEGER, TOT_PROP INTEGER, BURGLARY INTEGER, LARCENY_THEFT INTEGER, MOT_VEH_THF INTEGER, TOTAL_CRIME INTEGER, CRIME_RATE FLOAT);
Labor_Unemployed(County VARCHAR(40), State VARCHAR(5), Total_Labor INTEGER, Num_Unemp INTEGER, Unemp_Rate FLOAT);
Private_Nonfarm_Business(County VARCHAR(40), State VARCHAR(50), Numb_Establishment INTEGER, Numb_Employed INTEGER);

The second database contains one big table, named “Combination,” which includes the same attributes mentioned above for every county. I ran SQL queries against this database rather than the first because SQLite3 would stall on queries with several join conditions over large tables. Furthermore, the data in the second database was denormalized and organized for Weka.

Dataset Preparation

After acquiring the appropriate spreadsheets and converting them to CSV files, I wanted to transfer the data to SQLite3 in order to perform queries and create one file that included all the attributes I needed for data mining.
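The idea of merging the separate tables into a single denormalized “Combination” table can be illustrated with a short Python sketch using the standard sqlite3 module. This is not the code used in the project: the function names build_combination and demo, the in-memory database, the shortened Crime column list, and the two sample rows are all assumptions made up for illustration. Only three of the six tables are joined to keep the example short.

```python
import sqlite3


def build_combination(conn):
    """Join per-topic tables on (County, State) into one wide table.

    Hypothetical sketch: only a few columns from three tables are joined.
    """
    conn.execute("DROP TABLE IF EXISTS Combination")
    conn.execute(
        """
        CREATE TABLE Combination AS
        SELECT p.County, p.State, p.Pop_Size,
               e.Poverty,
               c.TOTAL_CRIME, c.CRIME_RATE
        FROM Population AS p
        JOIN County_Edu_Pov AS e
          ON e.County = p.County AND e.State = p.State
        JOIN Crime AS c
          ON c.County = p.County AND c.State = p.State
        """
    )
    conn.commit()


def demo():
    """Create toy tables, fill them with invented rows, and combine them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Population(County VARCHAR(40), State VARCHAR(5),
                                Pop_Size INTEGER);
        CREATE TABLE County_Edu_Pov(County text, State text,
                                    High_School real, Bachelor real,
                                    Foreign_Born real, Poverty real);
        -- Column list shortened for the example (assumption).
        CREATE TABLE Crime(County VARCHAR(40), State VARCHAR(5),
                           TOTAL_CRIME INTEGER, CRIME_RATE FLOAT);
        """
    )
    # Invented sample rows, not real 2005 figures.
    conn.executemany(
        "INSERT INTO Population VALUES (?, ?, ?)",
        [("Alpha", "NY", 200000), ("Beta", "MA", 50000)],
    )
    conn.executemany(
        "INSERT INTO County_Edu_Pov VALUES (?, ?, ?, ?, ?, ?)",
        [("Alpha", "NY", 85.0, 30.0, 10.0, 12.5),
         ("Beta", "MA", 90.0, 40.0, 5.0, 8.0)],
    )
    conn.executemany(
        "INSERT INTO Crime VALUES (?, ?, ?, ?)",
        [("Alpha", "NY", 4000, 2000.0), ("Beta", "MA", 500, 1000.0)],
    )
    build_combination(conn)
    return conn.execute(
        "SELECT County, State, Pop_Size, Poverty, TOTAL_CRIME, CRIME_RATE "
        "FROM Combination ORDER BY County"
    ).fetchall()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Building the wide table once means later queries and the Weka export read from a single table instead of repeating the joins each time, which matches the motivation given above for the second database.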
I started by creating a database in SQLite3 (final1.db), which had six tables that would hold information on 3140 counties in 49 states (see Dataset Description for the table schemas). However, I encountered a problem when converting the spreadsheets to CSV files: some of the values included characters that would not work well in SQLite3 or Weka, such as spaces (“ ”), dashes (“-”), and apostrophes (“’”), as well as various null values (“###”, “VALUE!”, “(Z)”). I addressed this problem by using a Python program
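The program itself is not reproduced in this excerpt. As a rough sketch only, the kind of cleaning described above might look like the following. The function names clean_value and clean_csv_text are hypothetical, as are the choices to turn null markers into empty fields, to replace spaces and dashes with underscores, and to drop apostrophes. Numeric values are left untouched so that a leading minus sign is not mistaken for a dash.

```python
import csv
import io

# Null markers listed in the text; treating them as empty fields is an assumption.
NULL_MARKERS = {"###", "VALUE!", "(Z)"}


def is_number(text):
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def clean_value(value):
    """Normalize one CSV field (hypothetical cleaning rules)."""
    value = value.strip()
    if value in NULL_MARKERS:
        return ""          # mark as missing
    if is_number(value):
        return value       # leave numeric fields as they are
    # Drop straight and curly apostrophes.
    for apostrophe in ("'", "\u2019"):
        value = value.replace(apostrophe, "")
    # Replace inner spaces and dashes with underscores.
    return value.replace(" ", "_").replace("-", "_")


def clean_csv_text(text):
    """Clean every field of CSV text and return the cleaned CSV text."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([clean_value(field) for field in row])
    return out.getvalue()


if __name__ == "__main__":
    sample = "County,Poverty\nPrince George's,###\nMiami-Dade,14.2\n"
    print(clean_csv_text(sample))
```

Working on CSV text through the csv module, rather than splitting lines on commas by hand, keeps quoted fields that contain commas intact.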

