
1. Data Reduction Fundamentals
   Data reduction is essential for making large datasets manageable for analysis, storage, and processing.
   Key techniques:
   - Sampling: selecting a representative subset of the data.
   - Dimensionality Reduction: reducing the number of attributes (variables) by transformation or elimination.

2. Sampling Techniques
   - Random Sampling: selecting points at random, though this may miss outliers.
   - Stratified Sampling: sampling within clusters or strata to better represent the population.

3. Similarity and Distance Measures
   Important for clustering and data reduction. Common metrics:
   - Euclidean Distance: measures straight-line distance.
   - Manhattan Distance: measures distance along axes at right angles.
   - Cosine Similarity: measures the angle between vectors; good for high-dimensional data.
   - Jaccard Similarity: measures similarity in binary or categorical data.

4. Clustering for Data Reduction
   Clustering groups similar data points and reduces redundancy.
   - K-Means Clustering: divides data into k clusters by minimizing the mean squared error within each cluster.
   - Elbow Method: determines the optimal k by observing where the error-reduction curve flattens.

5. Attribute Reduction
   - Drop attributes that duplicate one another, e.g., the same distance stored in both kilometers and miles.

6. Advanced Reduction Techniques
   - Eliminate Correlated Attributes: remove attributes that provide similar information.
   - Cluster Centroids: store only cluster centers to represent similar data points, reducing storage needs.
   - Reservoir Sampling: a streaming-data technique to maintain a representative sample over time.
   - CURE Algorithm: effective for high-dimensional clustering, preserving outliers and using k-d trees for efficiency.

7. Using Distance and Correlation for Clustering
   - Correlation: helps identify relationships between variables for clustering and data elimination.
   - Combining clustering and reduction can create an unbiased, representative dataset for efficient analysis.
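The four similarity and distance measures named above can be written out in a few lines each. The following is a minimal sketch in plain Python, not taken from the course material; the function names are illustrative, and the Jaccard function assumes its inputs are collections of categorical items that can be turned into sets.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Distance travelled along the axes (sum of absolute differences)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)
```

For example, `euclidean((0, 0), (3, 4))` gives the hypotenuse of a 3-4-5 triangle, while `manhattan` on the same pair adds the two legs instead.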
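Reservoir sampling, listed above as a streaming technique, keeps a fixed-size uniform sample without knowing the stream length in advance. The sketch below uses the classic single-pass scheme often called Algorithm R; the guide does not say which variant the course covers, so treat that choice, the function name `reservoir_sample`, and the optional `rng` parameter as assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 5 values from a stream of 1000 without storing the stream.
sample = reservoir_sample(range(1000), 5, rng=random.Random(42))
```

Only the `k` retained items are held in memory at any time, which is what makes the approach suitable for data that arrives over time.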
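To make the K-Means and Elbow Method items above concrete, here is a small pure-Python sketch of Lloyd's iteration that returns the centroids and the within-cluster sum of squared errors (SSE). The random initialization, the fixed seed, the iteration cap, and the toy `points` list are all assumptions made for illustration; a real analysis would more likely use a library implementation.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm on a list of tuples; returns (centroids, SSE)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed init: k distinct data points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                mean = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    sse = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, sse


# Toy data with three visually separate groups (made up for this example).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
          (1.0, 9.0), (1.5, 8.5), (2.0, 9.0)]

# Elbow method: compute SSE for several k and look for where the drop levels off.
sse_by_k = {k: kmeans(points, k)[1] for k in range(1, 6)}
```

Plotting `sse_by_k` against k and picking the bend in the curve is the elbow heuristic. The returned centroids also serve as the compact stand-in for the data described under cluster centroids.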


SBU CSE 332 - Midterm Guide 4
