Berkeley COMPSCI 294 - Visualizing Relationships among Categorical Variables


Visualizing Relationships among Categorical Variables

Seth Horrigan

Abstract: Centuries of chart-making have produced some outstanding charts tailored specifically to the data being visualized. They have also produced a myriad of less-than-outstanding charts in the same vein. I instead present a set of techniques that may be applied to arbitrary datasets with specific properties. In particular, I describe two techniques, Nested Category Maps and Correlation Maps, for visualizing, analyzing, and exploring multi-dimensional sets of categorical and ordinal data. I also describe an implementation of these two techniques.

Index Terms: information visualization, questionnaires, multi-dimensional data visualization, statistical analysis, treemaps

1 INTRODUCTION

Many surveys, both professional and amateur, are based on banks of questions presented in a questionnaire. These surveys may be created and distributed by major institutions, or they may be impromptu constructions by students using tools like SurveyMonkey [1], Zoomerang [2], or QuestionPro [3]. Information collected by institutions like the Pew Charitable Trust in their annual Pew Internet Survey undergoes a great deal of educated, thorough analysis. Statisticians can make entire careers from analyzing such results and subsequently drawing and publishing conclusions based on the data. Marketing researchers collect information about potential customers, or reviews of new and existing products, using questionnaires administered online, in malls, on street corners, or by random-digit telephone dialing. Basic social science statistics such as pair-wise correlations, chi-square tests of independence, and analysis of variance (ANOVA) can reveal vital information hidden beneath the distribution of answers. Unfortunately, this data is seldom presented in a format that makes visual exploration simple.
Researchers customarily expect specific correlations, and they confirm or disprove their hypotheses by testing the empirically obtained numeric values against those expectations. Cross tables of raw sums constrained by responses on related variables can reveal much information to the highly trained eye, and statistical packages such as STATA and SPSS provide simple ways to issue these queries, thus providing a limited degree of interactive exploration [4, 5]. The increase in processing power of personal computers has allowed such comparisons to be rendered on demand in near real time. Still, in all these cases, the data is seen as row upon row of numbers and text.

1.1 Analysis

The communities built around this data have become highly skilled at analyzing these numbers and running the proper tests to find the information they expect, as well as occasionally finding unexpected results that warrant further study. Unfortunately, many potentially interesting comparisons may go completely ignored simply for lack of a skilled analyst with the time and motivation to thoroughly explore the dataset. This problem is compounded by the staggering number of surveys conducted by non-experts using ready-made tools like SurveyMonkey. Such sites provide very simple aggregation of numbers by question, which allows unskilled investigators to identify basic trends in response, but they offer little or no support for the more interesting comparison of interrelations among responses (see Fig. 4).

Happily, in most cases the data collected via the online tool can be exported to common spreadsheet applications such as Microsoft Excel, or in commonly shared formats like comma-separated values (CSV) files, for later analysis. When the data is collected through secondary agencies or directly via paper questionnaires, it will likely also be recorded and distributed in spreadsheet formats that could be analyzed given the proper tools.
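As a rough illustration of the cross tables and chi-square tests of independence mentioned above, the following minimal sketch builds a cross table from a CSV export and computes the Pearson chi-square statistic by hand. The survey data, the column names (`owns_dishwasher`, `satisfied`), and the helper functions are all hypothetical, invented for this example; they are not taken from the paper or from any particular survey tool.

```python
import csv
import io
from collections import Counter

# Hypothetical survey export: column names and answers are invented
# purely for illustration.
SURVEY_CSV = """respondent,owns_dishwasher,satisfied
1,yes,yes
2,yes,yes
3,yes,no
4,no,no
5,no,no
6,no,yes
7,yes,yes
8,no,no
"""


def cross_table(rows, row_var, col_var):
    """Count respondents for each (row_var, col_var) answer pair."""
    return Counter((r[row_var], r[col_var]) for r in rows)


def chi_square(table):
    """Pearson chi-square statistic for independence (no continuity correction).

    Returns the statistic and its degrees of freedom.
    """
    row_totals = Counter()
    col_totals = Counter()
    for (r, c), n in table.items():
        row_totals[r] += n
        col_totals[c] += n
    total = sum(table.values())

    stat = 0.0
    for r in row_totals:
        for c in col_totals:
            # Expected count if the two variables were independent.
            expected = row_totals[r] * col_totals[c] / total
            observed = table.get((r, c), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


rows = list(csv.DictReader(io.StringIO(SURVEY_CSV)))
table = cross_table(rows, "owns_dishwasher", "satisfied")
stat, dof = chi_square(table)
print(dict(table))
print(f"chi-square = {stat:.3f}, degrees of freedom = {dof}")
```

The statistic would then be compared against a chi-square distribution with the reported degrees of freedom to judge significance. With only eight invented respondents the expected counts are too small for that comparison to be meaningful, so the sketch shows only the mechanics.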
1.1.1 Textual

Many of the questions on such questionnaires are open-ended, “free-response” inquiries. Answers to such questions are notoriously difficult to analyze and categorize. Often analysts will sort through them, searching for keywords or subjectively categorizing each response. If the number of respondents is small enough, humans can manually parse all individual responses and present their own subjective evaluation of the responses in aggregate, but as the number of respondents grows, this becomes an increasingly daunting task.

With the growth of the internet, the question of visualizing large corpora of computerized text becomes ever more important. Research in this area has produced very useful techniques like Word Trees, ThemeRiver, and TextArc [6, 7, 8]. ManyEyes, in particular, provides an interface for employing such techniques to visualize arbitrary datasets [6]. Applied correctly, such textual visualization methods can be used to visualize and explore the results of free-response survey questions, an invaluable tool when the number of responses grows far too large to analyze manually.

1.1.2 Interval

Because interacting with, summarizing, exploring, and quantifying large numbers of free-form textual responses is so complex, survey designers who expect many respondents often attempt to construct the survey so that the responses can be easily represented numerically and analyzed using the statistical methods mentioned earlier. Certain types of inquiries, such as the respondent’s age or the number of hours spent weekly washing dishes, lend themselves to numerical definition. These interval variables allow robust interaction and aggregation. Their continuous nature lends itself to representing the values using simple two-dimensional encodings like scatterplots and line graphs that rely on position on an x-y grid (see Fig. 1).
Such interval variables allow analysts to quickly identify groupings along the continuum of possible values. For example, they may find that, although respondents can specify any number of hours a week for dishwashing, they generally group themselves around 3 hours or around 6 hours

