CMS Grid Virtual Microscope
Malina Kirn
May 1 2007

CMS Grid

Physics Terminology
- Monte Carlo (MC, aka generation or simulation): simulated physics used to test code and to compare to real data
- Reconstruction (RECO): turning data in the form of electronic response (such as voltage) into particle physics events (such as particle type, trajectory, and energy)
- Analysis: comparison of MC to reconstructed data for extraction of interesting physics

CMS Grid Structure
- Tier 0: CERN
- Tier 1: CERN, FNAL (this paper)
- Tier 2: Moscow State U, MIT, UCSD

CMS Grid User Level Tasks
- Analysis of published datasets
  - Automatic resource and data discovery
- Production of specialized samples
  - User provided resources
- Detector studies
  - Dedicated and localized resources

Test Environment
- All sites provide CE (computing element), SE (storage element), WN (worker nodes)
- Use Condor-G, srmcp
- Validation job is run before mass submission to make sure site is responsive

Software
- ALPGEN: theoretical physics simulator, produces generation-level output, used for MC only
- PYTHIA: slightly more complicated theoretical physics simulator, produces generation-level output, used for MC only
- FAMOS: takes generation-level input and produces simulation-level output (turns theoretical physics into what it looks like in an experiment), used for MC only
- ORCA: takes simulation-level and real data input and produces reconstruction-level output (turns experiment output into theoretical physics)

Workflow (newer than paper)
[Figure: workflow diagram with the labels Generation (no input), ALPGEN, PYTHIA, FAMOS, real data, ORCA, Skimming, Simulation, Reconstruction, Offline]

Statistics
[Slide content not recovered]

Observations
- Data splitting and merging to keep the number of events per file constant
- PYTHIA data not kept (simulation level) to reduce storage consumption
- Output directed to only 1 or 2 SEs for simple retrieval
- No logical file naming due to fast evolution
  - File names and locations provide metadata
- No automatic resubmission of failed jobs because human intuition is better (most failed jobs are never resubmitted)
- Some jobs have access to information allowing them to know if an equivalent job has already been performed
- Grid job failure rate is similar to normal cluster failure rate

Parameter Tuning
- The length of execution time for each job and output data are largely tuned to minimize the pressure on the infrastructure and maintain high efficiency
- Running time: 6 hours (for failures)
- Output: between 10 MB and 200 MB (for dCache, a hierarchical SRM)
- No directory holds more than a few thousand files (for PNFS)

Virtual Microscope

Client GUI
[Figure: client GUI]

Server Frontend
- Interprets client queries as database queries
- Allows asynchronous client queries to the data server
- Handles clients behind firewalls
- Generally, responses are sent directly to the client from the data server, so the frontend generally handles queries, not responses

Data Server
- Fast retrieval of data from disk to memory
  - Low latency
  - Efficient directory structure
- Fast processing of queries on data in memory
  - Project high resolution onto resolution corresponding to magnification requested
- Scalable (parallel or computing cluster)
  - Asynchronous operation on multiple blocks

Active Data Repository (ADR)
- Describes every piece of data with coordinates in n-dimensional coordinate space (VM is 3D)
- Retrieves all data inside a range query (range specified in each dimension)
- Aggregates appropriate input data to output data through intermediate accumulator provided by application

Application Deployment on ADR
[Figure: application deployment, with the labels parallel and sequential]
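The ADR idea of a range query followed by aggregation through an application-provided accumulator can be illustrated with a small sketch. This is not ADR's actual API: `Item`, `range_query`, `aggregate`, and the sample data are hypothetical names chosen for illustration, and the linear scan stands in for whatever index ADR really uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

Acc = TypeVar("Acc")
Out = TypeVar("Out")
Range = Sequence[Tuple[float, float]]  # one inclusive (low, high) pair per dimension


@dataclass(frozen=True)
class Item:
    """A piece of data tagged with its coordinates in n-dimensional space."""
    coords: Tuple[float, ...]
    value: float


def in_range(coords: Sequence[float], query: Range) -> bool:
    """True if every coordinate lies inside the range given for its dimension."""
    return all(low <= c <= high for c, (low, high) in zip(coords, query))


def range_query(items: Iterable[Item], query: Range) -> List[Item]:
    """Return all items inside the query range (linear scan, no index)."""
    return [item for item in items if in_range(item.coords, query)]


def aggregate(
    items: Iterable[Item],
    init: Callable[[], Acc],
    accumulate: Callable[[Acc, Item], Acc],
    finalize: Callable[[Acc], Out],
) -> Out:
    """Fold the selected items into an accumulator, then produce the output.

    All three callables are supplied by the application (hypothetical roles).
    """
    acc = init()
    for item in items:
        acc = accumulate(acc, item)
    return finalize(acc)


# Hypothetical 3D data set, as in the Virtual Microscope case.
items = [
    Item((0, 0, 0), 10.0),
    Item((1, 1, 0), 20.0),
    Item((5, 5, 0), 30.0),  # outside the x and y ranges below
    Item((1, 0, 1), 40.0),  # outside the z range below
]
query = [(0, 2), (0, 2), (0, 0)]

selected = range_query(items, query)
mean_value = aggregate(
    selected,
    init=lambda: (0.0, 0),                                 # (sum, count)
    accumulate=lambda acc, it: (acc[0] + it.value, acc[1] + 1),
    finalize=lambda acc: acc[0] / acc[1] if acc[1] else None,
)
print(mean_value)  # 15.0
```

Keeping the accumulator separate from the final output is what lets partial results be computed independently and combined later, which is the role the slides assign to the intermediate accumulator.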

ADR Back End
- Attribute space service: manages registration and use of application-defined mapping functions
- Dataset service: manages datasets stored in the ADR back end and provides utility functions for loading datasets into ADR
- Indexing service: manages various indices (default and user provided) for the datasets stored in ADR
- Data aggregation service: manages application-provided functions to be used in aggregation operations and functions to generate the final outputs

Datasets in ADR
- Data stored and retrieved as chunks (for bandwidth), where chunks are specified by range in n-dimensional space
- Chunks nearby in n-dimensional space distributed evenly across disk space (using Hilbert curves by default)

Processing in ADR
- Query planning
  - Index lookup: find blocks in coordinate range
  - Tiling: if all data doesn't fit in memory
  - Workload partitioning: local accumulator processing
- Query execution
  - Initialization: memory allocated, process started
  - Local reduction: local accumulator
  - Global combine: inter-processor combination of local reductions
  - Output handling: output from combined accumulators

VM on ADR
- 2D space used for Hilbert distribution of data across disks
  - Third dimension of data kept clustered
- ADR block size tweaked for both disk access efficiency and minimal data waste
- JPEG compression
- Query process: retrieve, decompress, clip, crop, subsample
- Results of local processing on ADR blocks sent directly to client (no global combine)

DataCutter
- Indexing service
  - Indexes data by position in n-dimensional space
- Filtering service
  - Supports execution of groups of processes (filters) on different machines concurrently, independently, or in order (emulate parallel, asynchronous, or sequential over different machines)
  - Communication between filters termed a stream: uni-directional pipes that deliver data in fixed-size buffers

Multi-Level Indexing
- Summary index files: associate metadata with one or more data chunks and/or index files
- Detailed index files: provide metadata for all data chunks in some data set

Filter Stream Optimizations
- Filters placed as locally as possible to the data source
- Transparent copies: identical filters that run concurrently, with the goal of no single filter causing a bottleneck
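The filter, stream, and transparent-copy ideas can be sketched with threads and queues. This is only an illustration under stated assumptions: it does not use DataCutter's real interface, and `source`, `worker`, `run_pipeline`, and `BUFFER_SIZE` are hypothetical names. Each `queue.Queue` plays the role of a uni-directional stream, and several identical `worker` threads reading the same stream stand in for transparent copies.

```python
import queue
import threading
from typing import Callable, List

BUFFER_SIZE = 4  # items per buffer; hypothetical value
_END = None      # end-of-stream marker (real buffers are always lists)


def source(data: List[int], out_stream: queue.Queue) -> None:
    """Read filter: cut the data into fixed-size buffers and push them downstream.

    The final buffer may be shorter if len(data) is not a multiple of BUFFER_SIZE.
    """
    for i in range(0, len(data), BUFFER_SIZE):
        out_stream.put(data[i:i + BUFFER_SIZE])
    out_stream.put(_END)


def worker(in_stream: queue.Queue, out_stream: queue.Queue,
           fn: Callable[[int], int]) -> None:
    """One copy of a processing filter: apply fn to every buffer it receives."""
    while True:
        buf = in_stream.get()
        if buf is _END:
            in_stream.put(_END)  # pass the marker on so the other copies stop too
            return
        out_stream.put([fn(x) for x in buf])


def run_pipeline(data: List[int], fn: Callable[[int], int],
                 copies: int = 3) -> List[int]:
    """Run source -> N identical worker copies and collect their output."""
    s1: queue.Queue = queue.Queue()  # stream: source -> workers
    s2: queue.Queue = queue.Queue()  # stream: workers -> collector
    threads = [threading.Thread(target=source, args=(data, s1))]
    threads += [threading.Thread(target=worker, args=(s1, s2, fn))
                for _ in range(copies)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    results: List[int] = []
    while not s2.empty():  # safe here: all producers have finished
        results.extend(s2.get())
    return results


squares = run_pipeline(list(range(10)), lambda x: x * x)
print(sorted(squares))
```

Because the copies run concurrently, buffers may reach the collector out of order, which is why the usage line sorts the result. In CPython the threads illustrate the structure rather than real CPU parallelism; separate processes or machines, as the slides describe, would be needed for that.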