1. Data Reduction Fundamentals
Data reduction is essential to make large datasets manageable for analysis, storage, and processing. Key techniques:
- Sampling: selecting a representative subset of the data.
- Dimensionality reduction: reducing the number of attributes (variables) by transformation or elimination.

2. Sampling Techniques
- Random sampling: selecting points at random, though a small sample may miss outliers.
- Stratified sampling: sampling from each cluster or stratum so that every subgroup of the population is represented.

3. Similarity and Distance Measures
These are important for clustering and data reduction. Common metrics:
- Euclidean distance: measures straight-line distance.
- Manhattan distance: measures distance along axes at right angles.
- Cosine similarity: measures the angle between vectors; good for high-dimensional data.
- Jaccard similarity: measures similarity in binary or categorical data.

4. Clustering for Data Reduction
Clustering groups similar data points and reduces redundancy.
- K-means clustering: divides data into k clusters by minimizing the mean squared error within each cluster.
- Elbow method: determines the optimal k by observing where the curve of error against number of clusters flattens.

5. Attribute Reduction
- Attributes that carry the same information can be dropped, e.g., a distance stored in both kilometers and miles.

6. Advanced Reduction Techniques
- Eliminate correlated attributes: remove attributes that provide similar information.
- Cluster centroids: store only cluster centers to represent similar data points, reducing storage needs.
- Reservoir sampling: a streaming-data technique to maintain a representative sample over time.
- CURE algorithm: effective for high-dimensional clustering, preserving outliers and using k-d trees for efficiency.

7. Using Distance and Correlation for Clustering
- Correlation: helps identify relationships between variables for clustering and data elimination.
- Combining clustering and reduction can create an unbiased, representative dataset for efficient analysis.
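The two sampling techniques in section 2 can be compared with a minimal sketch. The function names `random_sample` and `stratified_sample`, the "keep at least one row per stratum" rule, and the toy data are assumptions made for illustration, not part of the original notes.

```python
import random
from collections import defaultdict

def random_sample(rows, n, seed=None):
    """Simple random sample of n rows, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

def stratified_sample(rows, key, fraction, seed=None):
    """Sample the same fraction from every stratum (rows sharing key(row))."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        # Assumption: keep at least one row per stratum so rare groups survive.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: 95 "common" rows and 5 "rare" rows.
rows = [("common", i) for i in range(95)] + [("rare", i) for i in range(5)]

picked = stratified_sample(rows, key=lambda r: r[0], fraction=0.1, seed=0)
labels = [label for label, _ in picked]
print(labels.count("common"), labels.count("rare"))
```

A plain random sample of 10 rows from this data can easily contain no "rare" rows at all, which is the weakness the notes mention; the stratified version always includes the rare stratum.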
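The four metrics in section 3 can be written out directly. This is a minimal sketch using only the standard library; the function names are chosen here for illustration, and inputs are assumed to be equal-length numeric sequences (or sets, for Jaccard).

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Distance travelled along axes at right angles."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(euclidean((0, 0), (3, 4)))                         # 5.0
print(manhattan((0, 0), (3, 4)))                         # 7
print(cosine_similarity((1, 0), (0, 1)))                 # 0.0
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))         # 0.5
```

Cosine similarity ignores vector length, so (1, 2) and (2, 4) count as identical in direction even though their Euclidean distance is non-zero.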
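K-means, the elbow method (section 4), and storing only cluster centroids (section 6) fit together in one short sketch. It is a plain-Python version of Lloyd's algorithm under stated assumptions: initial centers are chosen by a deterministic farthest-point rule rather than at random (a swap made here only so the run is reproducible), the error is the sum of squared distances to the nearest center, and the twelve points are made-up toy data.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def farthest_first(points, k):
    """Assumed initialization: start at the first point, then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iter=100):
    """Return (centers, sum of squared errors) after Lloyd's iterations."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        updated = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse

# Toy data: three tight 2x2 squares of points.
points = [(x + dx, y + dy)
          for x, y in [(0, 0), (10, 10), (20, 0)]
          for dx in (0, 1) for dy in (0, 1)]

# Elbow method: look for the k after which the error stops dropping sharply.
errors = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, sse in errors.items():
    print(k, round(sse, 2))

# Data reduction: 12 points are represented by 3 centroids.
centers3, sse3 = kmeans(points, 3)
print(sorted(centers3))
```

On this data the error falls steeply up to k = 3 and only slightly afterwards, which is the "elbow"; keeping `centers3` instead of `points` is the centroid-based reduction described in section 6.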
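Eliminating correlated attributes (sections 5 to 7) can be sketched with a Pearson correlation and a greedy filter. The helper names, the 0.95 threshold, and the column values are all assumptions made for this example; the kilometers and miles columns mirror the example in the notes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length columns.
    Assumes neither column is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def drop_correlated(columns, threshold=0.95):
    """Greedy filter: keep a column only if its |r| with every
    already-kept column is below the threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

km = [1.0, 5.0, 10.0, 42.2, 100.0]
columns = {
    "distance_km": km,
    "distance_mi": [d * 0.621371 for d in km],   # same information, other unit
    "duration_min": [30.0, 12.0, 55.0, 20.0, 41.0],
}

print(drop_correlated(columns))
```

The miles column is a constant multiple of the kilometers column, so their correlation is 1 and the second one is dropped; the result depends on column order, since the filter keeps whichever of a correlated pair it sees first.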
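Reservoir sampling (section 6) keeps a fixed-size uniform sample from a stream whose length is not known in advance. The sketch below follows the classic single-pass scheme often called Algorithm R; the function name and the `seed` parameter are choices made here for illustration.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=5, seed=42)
print(sample)
```

Only k items are held in memory at any time, so the sample stays representative as the stream grows without the stream ever being stored.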