UMD CMSC 828G - Principles of Data Mining - D2354578

Home> Schools> University of Maryland, College Park> Computer Science (CMSC) > CMSC 828G> Principles of Data Mining

DOC PREVIEW

UMD CMSC 828G - Principles of Data Mining

School name University of Maryland, College Park

Course Cmsc 828g- Advanced Topics in Information Processing:Data-Intensive Computing with MapReduce

Pages 10

This preview shows page 1-2-3 out of 10 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

1CMSC828G Principles of Data Mining Lecture #25• Today’s Reading:– HMS ch. 12, articles on web page•Today’s Lecture:– Data Organization– Link Analysis• Upcoming Due Dates:– Project Writeup due today– Project Reviews due 5/7Peer Review• 1-2 pages•summary summarize the paper succinctly and dispassionately. This is not the place to criticize. Perhaps discuss how it fits into the big picture. • general comments Give the big critical picture, before sinking into the details. This is the place to take a breath, keep your perspective, and explain what the papers weaknesses are and whether they are serious, or intrinsic to our current state of knowledge, or whatever. • constructive criticism not only of technical issues, but also organization and clarity. • minor errors typos and grammatical errors, and minor textual problems. It's not the reviewer's job to copy edit the paper, so don't go out of your way to look for typos. If there are serious grammatical problems, please say so but please be charitable, especially if English is not the author's native language. • from http://www.cs.unm.edu/~bap/how-to-review.htmlProject Presentation• 10 minutes– describe your data mining objective– problem definition and algorithms– experimental evaluation • methodology•results– lessons learned/adviceData Organization• data mining distinguished from other data analysis by the sheer quantity of data• n number of rows (may be millions)– algorithms that are exponential in n are unusable– algorithms that are O(n) or O(n log n) are generally feasible such as:• counting frequencies• finding the mode•sorting– multiple passes over the data may not be feasible (even if algorithm is linear time)• p number of columns (may be thousands)– for large p, even O(p2) algorithm may be infeasibleComputer Architecture 101• Memory hierarchysequential accessseconds, minutesgigabytes, terabytesmagnetic tapesequential access10 ms1 – 1000s GBdisk memorydirect access70 ns16 MB – several GBmain memoryassociative access20 ns16-1000 KBon-board cachedirect access100 wordsregistersaccessaccess timesizememorygoal: increase locality of referenceIndexing Structures•An index on an attribute A is data structure that makes it possible to access data points with a particular value of A more efficiently than a linear scan.2Data Structures 101• Set of S data vectors = {x(1),…,x(n)}•Queries: – point queries: want to find all points having a particular value of an ordinal attribute x.A– range queries: all points with b ≤ x.A ≤ c• Simplest solution? binary tree•Time complexity?if tree balanced, O(log n) * # of recordsB*-trees• idea similar to BST, used to access data on disk B*-tree of degree M is tree where:- all leaves are at same depth- each leaf contains between M/2 and M keys- each interior node (except possibly the root) has M/2 ≤ K ≤ M children, with values a1,…,ak-1- values stored at the leaves of subtree Ciare larger than ai-1and at most aiM chosen so that each node fits into single disk pageFor M = 100, 3 disk accesses required to find any pointin 300 million point datasetHash Indices• based on idea: If number of possible values for attribute A is small, keep list with pointer to all points with attribute value A = a.• If number of possible values for A is large, use hash function h: Dom(A) →{1,…,M}• r[j] is list of pointer to records xisuch that h(ai) = j• To find all points with A = a– for each element xiin r[h(a)]• check whether xi.A = a• Typical hash function: a mod MMultidimensional Indexing• especially relevant in multimedia DB applications such as image databases•example:–user: retrieve all images similar to a given example image – The system extracts features from the query image and searches the database for images having similar features. The features from all of the images in the DB are indexed using appropriate set of features• types of queries: point queries, range queries, nearest neighbor queriesTypes of Multidimensional Indexes• space filling curves/z ordering/linear quadtreesThe data is “linearized” and then one of the traditional index structures is used• grid files: The data space is divided into a “grid” and indexed in each dimension.• R-trees: Similar to B-trees where space is hierarchically partitioned and indexed at each level by region.R-trees• A R+-tree is a spatial data structure optimized for searching areas in N dimensional space.• R+-tree is a spatial access methodwhich splits space with hierarchically nested boxes. Objects are indexed in each box which intersects them. The tree is height-balanced.• The R+-tree data structure was created by Timos Sellis, Nick Roussopoulos, and Christos Faloutsos and is described in their paper "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects3What is Data Warehouse?• Defined in many different ways, but not rigorously.– A decision support database that is maintained separately from the organization’s operational database– Support information processing by providing a solid platform of consolidated, historical data for analysis.• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon• Data warehousing:– The process of constructing and using data warehousesData Warehouse—Subject-Oriented• Organized around major subjects, such as customer, product, sales.• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.•Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.Data Warehouse vs. Operational DBMS• OLTP (on-line transaction processing)– Major task of traditional relational DBMS– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.• OLAP (on-line analytical processing)– Major task of data warehouse system– Data analysis and decision making• Distinct features (OLTP vs. OLAP):– User and system orientation: customer vs. market– Data contents: current, detailed vs. historical, consolidated– Database design: ER + application vs. star + subject– View: current, local vs. evolutionary, integrated– Access patterns: update vs. read-only but complex queriesFrom

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3 out of 10 pages.

UMD CMSC 828G - Principles of Data Mining

Sign up for free to view:

Please select your school