Age of Abalones using Physical Characteristics: A Classification Problem

ECE 539 Fall 2010 Project Report
Hiran Mayukh
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
[email protected]

Abstract

Abalones [1], also called ear-shells or sea ears, are sea snails (marine gastropod mollusks) found worldwide. The age of an abalone can be determined by counting the number of layers in its shell. However, age determination is a cumbersome process: it involves cutting a sample of the shell, staining it, and counting the number of rings through a microscope. A data set provided by the University of California, Irvine Machine Learning Repository [2] consists of physical characteristics of abalones and their ages. This study treats prediction of the age range of abalones from their physical characteristics as a classification problem. The data set is preprocessed to transform the age-prediction problem into a classification problem. Two clustering algorithms are used to cluster the training data set without supervision. Cluster centers are then grouped together and assigned class labels based on votes by the data points within each cluster's region. The testing data set is used to estimate the accuracy of the classifier. Experiments are also run to obtain an ordering of the physical characteristics that reflects their contribution to classification accuracy.

1. Introduction

Machine learning (ML) algorithms are used to recognize patterns and make decisions based on empirical data. The problem of classifying a data set, that is, assigning a class label to each sample, can be complex, especially if the data set is large or has a large feature dimension. If the feature vector representing a sample is of n dimensions, classification boils down to carving out regions in n-dimensional feature space, with the understanding that any point within a region is assigned that region's class label.
In this study, unsupervised clustering algorithms are used to divide the feature space into a large number of regions; based on the class labels of the training-set samples present in each region, each region (specified by its cluster center) is assigned a class label.

In classification problems with feature vectors of dimension greater than one, an interesting question is the relative importance of each feature to classification. However, features may be correlated with each other in complex ways, and the contribution of each feature to classification accuracy may be difficult to determine. This study experiments with reduced feature sets (i.e., by elision of different combinations of features) in order to generate an approximate ordering of the data set's features by their contribution to the classification problem dealt with in this paper.

This paper is organized as follows: Section 2 introduces the data set and the preprocessing done to recast it as a classification problem. Section 3 describes the methodology, Section 4 contains the experiments and results, and Section 5 concludes the paper.

2. Data Set

The University of California, Irvine Machine Learning Repository [2] provides a data set consisting of 4177 samples of physical characteristics of abalones and their ages. Abalones are sea snails that are fished for their shells and meat. Scientific studies on abalones require knowing an abalone's age, but the process of determining age is complicated: it involves counting the number of shell layers ("rings") that make up the abalone's shell, which is done by taking a sample of the shell, staining it, and counting the rings under a microscope. To circumvent this cumbersome process, the data set has been used to build learning algorithms that predict age from easily and quickly measurable physical characteristics. This project recasts the data set as a classification problem rather than a prediction problem.
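The feature-elision experiment outlined in the Introduction can be sketched as a leave-one-feature-out ranking. The following is a hypothetical Python illustration, not the report's actual code (which was written in MATLAB); `train_and_score` is a placeholder for any routine that trains the classifier on the given feature subset and returns its test accuracy.

```python
def rank_features(feature_names, train_and_score):
    """Rank features by the test accuracy lost when each one is elided."""
    baseline = train_and_score(frozenset(feature_names))
    drops = {}
    for f in feature_names:
        # Retrain with feature f removed and record the accuracy drop.
        drops[f] = baseline - train_and_score(frozenset(feature_names) - {f})
    # Largest drop first: the most important feature is the one whose
    # removal hurts accuracy the most.
    return sorted(feature_names, key=lambda f: drops[f], reverse=True)
```

A single pass like this ignores correlations between features, which is one reason to try different combinations of elided features rather than only one feature at a time.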
The data set consists of 8 features and the number of rings (which is directly related to the age). The 8 features are sex, length, diameter, height, whole weight, shucked weight, viscera weight, and shell weight; for brevity, we refer to these as F1, F2, ..., F8 in this paper. The number of rings varies from 1 to 29, so it could be viewed as a class label with 29 possible values. To reduce the running time of the experiments, the number of classes is reduced from 29 to 8: samples with 1 to 4 rings are assigned class label 1, 5 to 8 rings class label 2, and so on, until 25 to 28 rings gets class label 7 and samples with 29 rings form class 8. Each class thus corresponds to an age range. Figure 1 shows the histogram of the entire data set.

Other minor preprocessing includes converting the three possible values of the sex feature (male, female, infant) to numbers, and dividing the data set into a training set and a testing set along the same lines as other studies using this data set [4, 5]: the first 75% of the samples (3133) form the training set and the remaining 1044 form the testing set.

Figure 1. Histogram of the Abalone data set

3. Methodology

Two clustering algorithms, the k-means algorithm and a hierarchical clustering algorithm, are developed in MATLAB [6] and run to generate a large number of cluster centers. All training data samples that lie within the Voronoi region created by each cluster center are then polled to find their most common class label, and the cluster center (and hence the cluster's Voronoi region) is assigned that label. The idea is that if the number of clusters is made large enough, we get a fine-grained division of the feature space by class label, which can then be used to predict the class label of a testing sample from the region it lies in.
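The preprocessing described above can be summarized in a few lines. This is a hypothetical Python rendering (the project itself used MATLAB), with the ring-to-class binning and the 3133/1044 split taken from the text:

```python
# Hypothetical Python sketch of the preprocessing described above.

SEX_CODE = {"M": 1, "F": 2, "I": 3}  # male, female, infant -> numbers

def ring_to_class(rings):
    """Bin a ring count (1..29) into one of 8 age-range class labels:
    1-4 -> 1, 5-8 -> 2, ..., 25-28 -> 7, and 29 rings -> 8."""
    if rings == 29:
        return 8
    return (rings - 1) // 4 + 1

def split_train_test(samples, n_train=3133):
    """First 75% of the 4177 samples (3133) train; remaining 1044 test."""
    return samples[:n_train], samples[n_train:]
```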
The algorithm is spelled out in Table 1. The experiment has to be repeated many times because the clustering algorithms are not deterministic: the clustering configuration depends strongly on the initial positions of the cluster centers, which are randomly initialized to decrease the chances of
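The cluster-then-label procedure can be sketched as follows. This is a minimal Python/NumPy illustration of the idea rather than the report's MATLAB implementation: plain k-means generates the centers, each center's Voronoi region takes the majority class label of the training points inside it, and a test point inherits the label of its nearest center.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: random initial centers, then alternate
    nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):  # leave empty clusters in place
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def label_clusters(centers, X_train, y_train):
    """Give each center the majority class label of the training
    samples falling in its Voronoi region (0 if the region is empty)."""
    dists = np.linalg.norm(X_train[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    labels = np.zeros(len(centers), dtype=int)
    for j in range(len(centers)):
        members = y_train[assign == j]
        if len(members):
            labels[j] = np.bincount(members).argmax()
    return labels

def predict(centers, cluster_labels, X_test):
    """Classify each test sample by the label of its nearest center."""
    dists = np.linalg.norm(X_test[:, None, :] - centers[None, :, :], axis=2)
    return cluster_labels[dists.argmin(axis=1)]
```

With only eight clusters this would reduce to roughly one center per class; the point of using a large number of clusters is that each class can then occupy many disjoint regions of the feature space.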

