Exploratory Data Analysis with Categorical Variables: An Improved Rank-by-Feature Framework and a Case Study

Jinwook Seo and Heather Gordish-Dressman
{jseo, hgordish}@cnmcresearch.org
Research Center for Genetic Medicine
Children’s Research Institute
111 Michigan Ave NW, Washington, DC 20010

RUNNING HEAD: RANK-BY-FEATURE FRAMEWORK FOR CATEGORICAL DATA

Acknowledgement: This work was supported by NIH 5R24HD050846-02 (Integrated molecular core for rehabilitation medicine) and NIH 1P30HD40677-01 (MRDDRC Genetics Core). We also thank the FMS study group, especially Joseph Devaney and Eric Hoffman, for providing genotype data.

Corresponding Author’s Contact Information:
Jinwook Seo
Research Center for Genetic Medicine
Children’s Research Institute
111 Michigan Ave NW, Washington, DC 20010
Tel: +1 202-884-4942
Fax: +1 202-884-6014

ABSTRACT

Multidimensional datasets often include categorical information. When most dimensions have categorical information, clustering the dataset as a whole can reveal interesting patterns in the dataset. However, the categorical information is often more useful as a way to partition the dataset: gene expression data for healthy vs. diseased samples or stock performance for common, preferred, or convertible shares. We present novel ways to utilize categorical information in exploratory data analysis by enhancing the rank-by-feature framework. First, we present ranking criteria for categorical variables and ways to improve the score overview. Second, we present a novel way to utilize the categorical information together with clustering algorithms. Users can partition the dataset according to categorical information vertically or horizontally, and the clustering result for each partition can serve as new categorical information. We report the results of a longitudinal case study with a biomedical research team, including insights gained and potential future work.

Color figures are available at www.cs.umd.edu/hcil/ben60

1 INTRODUCTION

In many analytic domains, multidimensional datasets frequently include categorical information that plays an important role in the data analysis. In our work in bioinformatics, many of the biologists we collaborate with have datasets that include categorical information, such as labels for healthy vs. diseased samples. In that case, the goal is to compare gene expression levels (quantitative measurements of gene activity) to determine which genes might have higher or lower expression levels in the diseased samples as compared to the healthy samples. Other biology researchers compare male and female patients because some genes are differentially expressed in one gender but not in the other. We have received similar requirements from stock market analysts, meteorologists, and others.

The inclusion of such categorical information in a multidimensional dataset imposes a different challenge to the way researchers analyze the dataset. First, different test statistics are necessary for the dataset. For example, a chi-square test is typically used for testing an association between categorical variables while a linear correlation coefficient is typically used for testing an association between real (continuous) variables. Secondly, stratification by categorical information is crucial to delve into such datasets, without which features hidden in the stratified subgroups cannot be identified during exploratory data analysis. Ignoring or mistreating such information could result in a flawed conclusion costing days or even months of effort.

Most statistical packages support functionalities for categorical data, but biologists and biostatisticians are in need of exploratory visualization tools with which they can interactively examine their large multidimensional datasets containing categorical information.
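
To make the contrast between the two test statistics mentioned above concrete, the following is a minimal sketch in plain Python, not the implementation used in HCE. The function names and the toy data are hypothetical; the sketch computes the Pearson chi-square statistic from a contingency table of two categorical variables and the linear (Pearson) correlation coefficient of two continuous variables. It returns only the statistics and does not compute p-values.

```python
from collections import Counter
from math import sqrt


def chi_square_statistic(xs, ys):
    """Pearson chi-square statistic for two categorical sequences.

    Builds the contingency table of (x, y) pairs and compares each
    observed cell count with the count expected under independence.
    """
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def pearson_correlation(xs, ys):
    """Linear correlation coefficient for two continuous sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: a categorical pair and a continuous pair.
gender = ["F", "F", "M", "M", "F", "M"]
genotype = ["AA", "AG", "AG", "GG", "AA", "GG"]
expression_a = [1.0, 2.1, 2.9, 4.2, 5.0]
expression_b = [2.0, 3.9, 6.1, 8.0, 10.2]

print(chi_square_statistic(gender, genotype))
print(pearson_correlation(expression_a, expression_b))
```

A ranking criterion for categorical variables could order variable pairs by such a statistic, in the same way that pairs of continuous variables can be ordered by their correlation coefficient.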
To address these issues and requests, we developed new features in our order-based data exploration framework, or rank-by-feature framework (Seo & Shneiderman, 2005b) in the Hierarchical Clustering Explorer (HCE, www.cs.umd.edu/hcil/hce/, also see Section 2), which enabled users to effectively explore multidimensional datasets containing categorical entities or variables.

First, we added to the rank-by-feature framework new ranking criteria for categorical or categorized variables in multidimensional datasets. The score overview (see Section 2) that gives users a brief overview of the ranking result was also improved by introducing size-coding by strength in addition to the color coding. The inclusion of strength information from a ranking criterion for categorical data can make the score overview of the rank-by-feature framework more informative at a glance.

Second, we enabled users to stratify their datasets according to the categorical information in the datasets. We support two different partitioning mechanisms according to the direction they split the datasets: vertical and horizontal partitioning. In this paper, we assume that the input dataset is organized in a tabular way such that the rows represent items or entities and the columns represent dimensions or attributes.

Users can stratify their datasets vertically by separating columns according to the categorical information conveyed by a special row and then conduct comparisons among items in different partitions. We call this vertical partitioning. For example, biologists are often interested in partitioning samples according to their phenotypes (normal vs. diseased) in case-control microarray projects. Clustering of the rows is then performed in each partition to generate two clustering results of the rows, each of which is homogeneous (i.e. only includes the same value for the special categorical row).
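
Vertical partitioning as described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than HCE's actual code: the function name `vertical_partition`, the dictionary-based table layout, and the toy gene and sample names are all hypothetical. The clustering step that would follow on each partition is not shown.

```python
def vertical_partition(column_names, category_row, rows):
    """Split the columns of a table by the labels in a special categorical row.

    column_names: list of column (e.g. sample) names.
    category_row: list of labels, one per column, aligned with column_names.
    rows: dict mapping an item (e.g. gene) name to its list of values,
          aligned with column_names.
    Returns a dict mapping each label to a sub-table that keeps every row
    but only the columns carrying that label.
    """
    # Collect the column indices that belong to each label.
    indices_by_label = {}
    for idx, label in enumerate(category_row):
        indices_by_label.setdefault(label, []).append(idx)

    partitions = {}
    for label, idxs in indices_by_label.items():
        partitions[label] = {
            "columns": [column_names[i] for i in idxs],
            "rows": {name: [values[i] for i in idxs]
                     for name, values in rows.items()},
        }
    return partitions


# Hypothetical case-control layout: genes in rows, samples in columns.
samples = ["s1", "s2", "s3", "s4"]
phenotype = ["normal", "diseased", "normal", "diseased"]
expression = {
    "geneA": [1.0, 5.0, 1.2, 4.8],
    "geneB": [3.0, 3.1, 2.9, 3.0],
}

parts = vertical_partition(samples, phenotype, expression)
print(parts["normal"]["columns"])
print(parts["diseased"]["rows"]["geneA"])
```

Each resulting sub-table contains all rows but only one phenotype's columns, so a row clustering run on it is homogeneous with respect to the special categorical row.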
By comparing the partitioned clustering results, users can get meaningful insights into finding an interesting group of genes that are differentially or similarly expressed in the normal and the diseased groups.

Users can also stratify their datasets horizontally by separating rows according to a column of a categorical attribute such as gender and ethnicity. We call this horizontal partitioning, where the rows of the dataset are partitioned and each partition has the same set of columns. For example, in genome-wide association studies where the study subjects are in the rows and the genotype and the phenotype information are in the columns, it is often inevitable to partition the subjects (or the rows) according to the gender or the ethnicity column
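
Horizontal partitioning, as defined above, can be sketched as follows. This is a minimal illustration, not the HCE implementation: the function name `horizontal_partition`, the list-of-dictionaries record layout, and the toy subject data are hypothetical.

```python
from collections import defaultdict


def horizontal_partition(records, category_column):
    """Split the rows of a table by the value of one categorical column.

    records: list of dicts, one per row (e.g. study subject), all sharing
             the same keys (columns).
    category_column: name of the categorical column to partition on.
    Returns a dict mapping each category value to the list of rows that
    carry it; every partition keeps the full set of columns.
    """
    partitions = defaultdict(list)
    for record in records:
        partitions[record[category_column]].append(record)
    return dict(partitions)


# Hypothetical association-study layout: subjects in rows,
# genotype and phenotype information in columns.
subjects = [
    {"id": 1, "gender": "F", "genotype": "AA", "strength": 31.5},
    {"id": 2, "gender": "M", "genotype": "AG", "strength": 42.0},
    {"id": 3, "gender": "F", "genotype": "GG", "strength": 29.8},
    {"id": 4, "gender": "M", "genotype": "AA", "strength": 45.1},
]

by_gender = horizontal_partition(subjects, "gender")
print([s["id"] for s in by_gender["F"]])
print([s["id"] for s in by_gender["M"]])
```

Because every partition retains the same columns, the same ranking criteria or clustering can then be applied to each subgroup separately and the results compared.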

