UMD CMSC 828G - Inventing discovery tools - D2355872


School name University of Maryland, College Park

Course Cmsc 828g- Advanced Topics in Information Processing:Data-Intensive Computing with MapReduce


Inventing discovery tools: combining information visualization with data mining

Ben Shneiderman
Department of Computer Science, University of Maryland, College Park, Maryland, U.S.A.

Correspondence: Ben Shneiderman, Department of Computer Science, Human-Computer Interaction Laboratory, Institute for Advanced Computer Studies, and Institute for Systems Research, University of Maryland, College Park, MD 20742 U.S.A. E-mail: [email protected] for Discovery Science 2001 Conference, November 25–28, 2001, Washington, DC.

Abstract

The growing use of information visualization tools and data mining algorithms stems from two separate lines of research. Information visualization researchers believe in the importance of giving users an overview and insight into the data distributions, while data mining researchers believe that statistical algorithms and machine learning can be relied on to find the interesting patterns. This paper discusses two issues that influence design of discovery tools: statistical algorithms vs visual data presentation, and hypothesis testing vs exploratory data analysis. I claim that a combined approach could lead to novel discovery tools that preserve user control, enable more effective exploration, and promote responsibility.

Introduction

Genomics researchers, financial analysts, and social scientists hunt for patterns in vast data warehouses using increasingly powerful software tools. These tools are based on emerging concepts such as knowledge discovery, data mining, and information visualization. They also employ specialized methods such as neural networks, decision trees, principal components analysis, and a hundred others.

Computers have made it possible to conduct complex statistical analyses that would have been prohibitive to carry out in the past. However, the dangers of using complex computer software grow when user comprehension and control are diminished.
Therefore, it seems useful to reflect on the underlying philosophy and appropriateness of the diverse methods that have been proposed. This could lead to a better understanding of when to use given tools and methods, as well as contribute to the invention of new discovery tools and refinement of existing ones.

Each tool conveys an outlook about the importance of human initiative and control as contrasted with machine intelligence and power.[1] The conclusion deals with the central issue of responsibility for failures and successes. Many issues influence design of discovery tools, but I focus on two: statistical algorithms vs visual data presentation and hypothesis testing vs exploratory data analysis.

Statistical algorithms vs visual data presentation

Early efforts to summarize data generated means, medians, standard deviations, and ranges. These numbers were helpful because their compactness, relative to the full data set, and their clarity supported understanding, comparisons, and decision making. Summary statistics appealed to the rational thinkers who were attracted to the objective nature of data comparisons that avoided human subjectivity. However, they also hid interesting features such as whether distributions were uniform, normal, skewed, bi-modal, or distorted by outliers. A remedy to these problems was the presentation of data as a visual plot so interesting features could be seen by a human researcher.

Information Visualization (2001) 00, 00–00. © 2001 Palgrave. All rights reserved 1473-8716 $15.00. www.palgrave-journals.com/ivs

The invention of time-series plots and statistical graphics for economic data is usually attributed to William Playfair (1759–1823), who published The Commercial and Political Atlas in 1786 in London. Visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps.
Visual presentations can give users a richer sense of what is happening in the data and suggest possible directions for further study. Visual presentations speak to the intuitive side and the sense-making spirit that is part of exploration. Of course, visual presentations have their limitations in terms of dealing with large data sets, occlusion of data, disorientation, and misinterpretation.

By early in the 20th century statistical approaches, encouraged by the Age of Rationalism, became prevalent in many scientific domains. Ronald Fisher (1890–1962) developed modern statistical methods for experimental designs related to his extensive agricultural studies. His development of analysis of variance for design of factorial experiments[2] helped advance scientific research in many fields.[3] His approaches are still widely used in cognitive psychology and have influenced most experimental sciences.

The appearance of computers heightened the importance of this issue. Computers can be used to carry out far more complex statistical algorithms, and they can also be used to generate rich visual, animated, and user-controlled displays. Typical presentation of statistical data mining results is by brief summary tables, induced rules, or decision trees. Typical visual data presentations show data-rich histograms, scattergrams, heatmaps, treemaps, dendrograms, parallel coordinates, etc. in multiple coordinated windows that support user-controlled exploration with dynamic queries for filtering (Figure 1). Comparative studies of statistical summaries and visual presentations demonstrate the importance of user familiarity and training with each approach and the influence of specific tasks. Of course, statistical summaries and visual presentations can both be misleading or confusing.

An example may help clarify the distinction.
Promoters of statistical methods may use linear correlation coefficients to detect relationships between variables, which works wonderfully when there is a linear relationship between variables and when the data is free from anomalies. However, if the relationship is quadratic (or exponential, sinusoidal, etc.) a linear algorithm may fail to detect the relationship. Similarly, if there are data collection problems that add outliers or if there are discontinuities over the range (e.g. freezing or boiling points of water), then linear correlation may fail. A visual presentation is more likely to help researchers find such phenomena and suggest richer hypotheses.

Hypothesis testing vs exploratory data analysis

Fisher’s approach not only promoted statistical methods over visual presentations, but also
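
The linear-correlation example in the section on statistical algorithms vs visual data presentation can be sketched numerically. The following is only an illustrative sketch, not part of the original paper: the function name pearson_r and the three synthetic data sets are assumptions made up for this illustration, and the coefficient is the standard Pearson formula written out by hand. It suggests how a perfectly deterministic quadratic relationship can yield a correlation near zero, and how a single outlier can distort the coefficient for otherwise linear data.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: x values symmetric around zero.
xs = list(range(-50, 51))

# Case 1: an exactly linear relationship.
linear = [2 * x + 1 for x in xs]
r_linear = pearson_r(xs, linear)

# Case 2: an exactly quadratic relationship over the same symmetric range.
quadratic = [x * x for x in xs]
r_quadratic = pearson_r(xs, quadratic)

# Case 3: a linear relationship with one bad reading (an assumed
# data collection error) replacing the last value.
xs_small = list(range(21))
with_outlier = [float(x) for x in xs_small]
with_outlier[-1] = -200.0
r_outlier = pearson_r(xs_small, with_outlier)

print(f"linear:       r = {r_linear:.3f}")
print(f"quadratic:    r = {r_quadratic:.3f}")
print(f"with outlier: r = {r_outlier:.3f}")
```

In this sketch the quadratic case is fully determined by x, yet the coefficient does not reveal it because the data are symmetric around zero; a scatterplot of the same points would show the parabola and the outlier at a glance.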
