Storing and Processing Multi-dimensional Scientific Datasets


Alan Sussman
UMIACS & Department of Computer Science
http://www.cs.umd.edu/~als
Alan Sussman - 3/5/08

Data Exploration and Analysis

• Large data collections emerge as important resources
  – Data collected from sensors and large-scale simulations
  – Multi-resolution, multi-scale, multi-dimensional
    o data elements often correspond to points in multi-dim attribute space
    o medical images, satellite data, hydrodynamics data, etc.
  – Terabytes to petabytes today
• Low-cost, high-performance, high-capacity commodity hardware
  – 5 PCs, 5 Terabytes of disk storage for << $10,000

Large Data Collections

• Scientific data exploration and analysis
  – To identify trends or interesting phenomena
  – Only requires a portion of the data, accessed through spatial index
    o e.g., Quad-tree, R-tree
• Spatial (range) query often used to specify iterator
  – computation on data obtained from spatial query
  – computation aggregates data (MapReduce) - resulting data product size significantly smaller than results of range query

Typical Query

[Figure annotations]
• Specify portion of raw sensor data corresponding to some search criterion
• Output grid onto which a projection is carried out

Target example applications

Processing Remotely-Sensed Data
[Figure labels: NOAA Tiros-N w/ AVHRR sensor; AVHRR Level 1 Data]
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite's track.
• At regular intervals along a scan line, measurements are gathered to form an instantaneous field of view (IFOV).
• Scan lines are aggregated into Level 1 data sets.
A single file of Global Area Coverage (GAC) data represents:
• ~one full earth orbit
• ~110 minutes
• ~40 megabytes
• ~15,000 scan lines
One scan line is 409 IFOVs.

[Figure panel labels: Water Contamination Study; Pathology; Satellite Data Processing; Multi-perspective volume reconstruction]

Outline

• Active Data Repository
• Overall architecture
• Query planning
• Query execution
• Experimental Results
• DataCutter

Active Data Repository (ADR)

• An object-oriented framework (class library + runtime system) for building parallel databases of multi-dimensional datasets
  – enables integration of storage, retrieval and processing of multi-dimensional datasets on distributed memory parallel machines
  – can store and process multiple datasets
  – provides support and runtime system for common operations such as
    o data retrieval,
    o memory management,
    o scheduling of processing across a parallel machine
  – customizable for application specific processing

ADR Architecture

[Figure: ADR architecture. Labels: Dataset Service; Attribute Space Service; Data Aggregation Service; Indexing Service; Query Execution Service; Query Planning Service; Query Interface Service; Query Submission Service; Front End; Application Front End; Back End; Client 1 (parallel); Client 2 (sequential); Query; Results]

Active Data Repository (ADR)

• Dataset is collection of user-defined data chunks
  – a data chunk contains a set of data elements
  – multi-dim bounding box (MBR) for each chunk, used by spatial index
  – chunks declustered across disks to maximize aggregate I/O bandwidth
• Separate planning and execution phases for queries
  – Tile output if too large to fit entirely in memory
  – Plan each tile's I/O, data movement and computation
    o Identify all chunks of input that map to tile
    o Distribute processing for chunks among processors
  – All processors work on one tile at a time

Query Planning

[Diagram: Index lookup, Tiling, Workload partitioning]
• Index lookup
  – Select data chunks of interest
  – Compute mapping between input and output chunks
• Tiling
  – Partition output chunks so that each tile fits in memory
  – Use Hilbert curve to minimize total length of tile boundaries
• Workload partitioning
  – Each aggregation operation involves an input/output chunk pair
  – Want good load balance and low communication overhead

Query Execution

• Broadcast query plan to all processors
• For each output tile:
  – Initialization phase
    o Read output chunks into memory, replicate if necessary
  – Reduction phase
    o Read and process input chunks that map to current tile
  – Combine phase
    o Combine partial results in replicated output chunks, if any
  – Output handling
    o Compute final output values

ADR Processing Loop

    O ← Output dataset, I ← Input dataset
    A ← Accumulator (for intermediate results)
    [S_I, S_O] ← Intersect(I, O, R_query)
    foreach o_e in S_O do
        read o_e
        a_e ← Initialize(o_e)
    foreach i_e in S_I do
        read i_e
        S_A ← Map(i_e) ∩ S_O
        foreach a_e in S_A do
            a_e ← Aggregate(i_e, a_e)
    foreach a_e in S_O do
        o_e ← Output(a_e)
        write o_e

Query Execution Strategies

• Distributed Accumulator (DA)
  – Assign aggregation operation to owner of output chunk
• Fully Replicated Accumulator (FRA)
  – Assign aggregation operation to owner of input chunk
  – Requires combine phase
• Sparsely Replicated Accumulator (SRA)
  – similar to FRA, but only replicate output chunk when needed

Performance Evaluation

• 128-node IBM SP, with 256MB memory per node
• Datasets generated by Application Emulators
  – Satellite Data Processing (SAT) – non-uniform mapping
  – Virtual Microscope (VM)

    App   Input      Output   Fan-in     Fan-out (avg)   Comp (ms) t_init-t_red-t_comb
    SAT   1.6-26GB   25MB     161-1307   4.6             1-40-20
    VM    1.5-24GB   192MB    16-128     1.0             1-5-1

[Figure: Query Execution Time (sec) vs. Number of Processors (8, 16, 32, 64, 128) for FRA, DA and SRA; panels: SAT and VM (fixed input size)]

Summary of Experimental Results

• Communication volume
  – Comm. Volume_DA ∝ fan-out
  – Comm. Volume_FRA/SRA ∝ fan-in
• DA may have computational load imbalance due to non-uniform mapping
• Relative performance depends on
  – Query characteristics (e.g., fan-in, fan-out)
  – Machine configurations (e.g., number of processors)
• No strategy always outperforms the others

ADR queries vs. Other Approaches

• Similar to out-of-core reductions (more general MapReduce)
  – Commutative & associative
  – Most reduction optimization techniques target in-core data
  – Out-of-core techniques require data redistribution
• Similar to relational group-by queries
  – Distributive & algebraic [Gray96]
  – spatial-join + group-by
  – For ADR, output data items and extents known prior to processing

double x[max_nodes],y[max_nodes];integer
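
The ADR Processing Loop pseudocode can be made concrete with a small sequential sketch. The version below is only an illustration under stated assumptions, not ADR's actual API: the names InputElement, intersect, map_to_cells, initialize, aggregate, output and process_query are hypothetical, the output is assumed to be a grid of unit-size cells keyed by (col, row), and the aggregation is assumed to be a per-cell mean. Disk reads and writes, tiling, and distribution across processors are left out.

```python
import math
from dataclasses import dataclass


@dataclass
class InputElement:
    """Hypothetical input data element: a point in 2-D attribute space."""
    x: float
    y: float
    value: float


def in_range(px, py, rq):
    """True if point (px, py) lies in the half-open range rq."""
    xmin, ymin, xmax, ymax = rq
    return xmin <= px < xmax and ymin <= py < ymax


def intersect(inputs, output_cells, rq):
    """[S_I, S_O] <- Intersect(I, O, R_query)."""
    s_i = [e for e in inputs if in_range(e.x, e.y, rq)]
    # Assumption: a cell is selected when its lower-left corner is in range.
    s_o = {c for c in output_cells if in_range(c[0], c[1], rq)}
    return s_i, s_o


def map_to_cells(ie):
    """Map(i_e): the set of output cells an input element contributes to."""
    return {(math.floor(ie.x), math.floor(ie.y))}


def initialize(cell):
    """a_e <- Initialize(o_e): accumulator holds (running sum, count)."""
    return (0.0, 0)


def aggregate(ie, acc):
    """a_e <- Aggregate(i_e, a_e): commutative and associative update."""
    total, count = acc
    return (total + ie.value, count + 1)


def output(acc):
    """o_e <- Output(a_e): final value is the mean, or None if empty."""
    total, count = acc
    return total / count if count else None


def process_query(inputs, output_cells, rq):
    s_i, s_o = intersect(inputs, output_cells, rq)
    # Initialization phase: one accumulator per selected output element.
    acc = {cell: initialize(cell) for cell in s_o}
    # Reduction phase: fold each input element into the cells it maps to.
    for ie in s_i:
        for cell in map_to_cells(ie) & s_o:
            acc[cell] = aggregate(ie, acc[cell])
    # Output handling: turn accumulators into final output values.
    return {cell: output(acc[cell]) for cell in s_o}


if __name__ == "__main__":
    data = [InputElement(0.2, 0.3, 10.0), InputElement(0.8, 0.1, 20.0),
            InputElement(1.5, 0.5, 6.0), InputElement(5.0, 5.0, 99.0)]
    grid = {(c, r) for c in range(4) for r in range(4)}
    for cell, value in sorted(process_query(data, grid, (0.0, 0.0, 2.0, 2.0)).items()):
        print(cell, value)
```

Because aggregate is commutative and associative, the same accumulators could in principle be updated on different processors and merged afterwards, which is the property the combine phase of the replicated-accumulator strategies relies on.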

