Local Decisions and Triggering Mechanisms for Adaptive Fault Tolerance


Phillip Stanley-Marbell, Diana Marculescu
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213-3890
{pstanley, dianam}@ece.cmu.edu

Abstract

Dynamic fault-tolerance management (DFTM) was previously introduced as a means of providing environment- and workload-driven adaptation for failure-prone battery-powered systems.

This paper introduces and analyzes the role of local decision policies in a DFTM environment, and presents a precise formulation for when it is beneficial to activate a given DFTM algorithm with respect to metrics that combine performance, reliability, power consumption and battery life. In particular, local decision algorithms are described in the context of an imaging array application running on a network of resource-constrained processing elements. It is demonstrated that DFTM algorithms, in conjunction with appropriately chosen activation times, increase the mean computation before battery failure for a single battery by a factor of between 1.1 and 5.8 for the application investigated.

1. Introduction

Failure, in its many forms, is becoming an increasingly important design constraint in computing systems. Unlike traditional fault-tolerant computing systems, where failure is the exceptional case that must be addressed with appropriate preventive and recovery techniques, technology trends are making it increasingly difficult to guarantee device operability, with the result that larger fractions of manufactured devices must be declared defective when they fail testing.
These factors make it increasingly important to include reliability as a design constraint, in much the same manner as power consumption has, in recent years, gained importance as a critical constraint alongside performance. By including failure as a design constraint, and characterizing systems in terms of a combination of performance, reliability, power consumption and, where appropriate, battery life, it will be possible to derive greater utility from devices that would otherwise have to be discarded.

Failures in a system may be the result of manufacture-time defects, or may be the manifestation of runtime effects such as depletion of battery resources. In this regard, traditional power management may be considered a subset of the more general idea of dynamic adaptation of a system to failures. Low battery levels can then be seen as a predictable failure event, whose occurrence must be delayed as much as possible by taking appropriate actions, embodied by traditional power management algorithms. However, in the general case of failures, as is the case when final battery resource depletion does occur, fault-free devices must be used as replacements for failing or failed devices. This requires the existence of such replacement devices or "spares" in a system, to be harnessed on the occurrence, or on the verge, of a failure.

Dynamic fault-tolerance management (DFTM) [8] was proposed as a framework in which dynamic adaptation to failures in the presence of redundantly deployed devices could be performed.
It consists of three classes of algorithms which address adaptation with regard to local actions (subsuming traditional power management algorithms), redundantly deployed devices, and the choice of which of these spare devices should be utilized, respectively.

Contributions

Dynamic fault-tolerance management [8], as previously proposed, provided neither an investigation of DFTM policies which adapt the behavior of a device in a system without taking advantage of other redundantly deployed resources, nor a precise formulation for when a specific DFTM algorithm should be activated.

This paper presents an investigation of a set of DFTM L-class algorithms, and presents a precise formulation for when it is beneficial to activate a given DFTM algorithm, with respect to metrics which combine performance, reliability, power consumption and battery life. As previously introduced [8], L-class algorithms control local adaptation decisions, M-class algorithms control decisions of when to employ redundantly deployed resources, and D-class algorithms control the choice of which redundantly deployed devices are to be harnessed. The results of this paper are applicable to systems which must adapt to the presence of failures such as mechanical failures or energy resource depletion, and must do so while providing maximal battery lifetime and sustained performance. Such system requirements are embodied by many emerging technology platforms, such as ambient intelligent systems [1] and integrated computational sensing and actuation surfaces.

The remainder of this paper is organized as follows. The following section presents motivation for performing dynamic adaptation to failures in the context of a driver application, a network of embedded processing elements acting as an imaging array.
It is followed in Section 3 by a review of previously proposed measures [8] for quantifying the efficacy of a system in terms of a combination of energy-efficiency, battery life, performance and reliability, known as ebformability measures [8]. A precise formulation for determining the conditions under which an algorithm should be activated is presented in Section 4, followed by an experimental evaluation of the driver application in the context of the presented ideas, in Section 5. The paper concludes with a summary of contributions in Section 6.

2. Driver application: Imaging array

The driver application employed was one in which computing devices, arranged in a 2-dimensional grid, sample values from sensors and send these samples via a multi-hop network to a designated member of the grid. Such a configuration of devices is representative of several applications of relevance in ambient intelligent systems, such as an ultrasound imaging array for detecting the position, shape and motion of objects, or an acoustic beamformer. It is also representative of an active antenna array, used, for example, to provide spatial filtering for an ultra-wideband radio.

Many of the issues raised by the application are applicable to any system utilizing multiple resource-constrained processing elements that communicate via a multi-hop network. In the case of an ultrasound imaging array, the samples collected at each device in the network will contain information about the distance of objects from the particular sensor, and the final image (the collection of samples from all nodes in the
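
The grid-and-sink arrangement described in Section 2 can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name hop_counts, the 4-neighbour link layout, the choice of sink and the use of breadth-first search as a stand-in for multi-hop routing are all assumptions made for illustration. The sketch counts how many hops each live node needs to reach the designated node, and shows how failed nodes lengthen routes or cut nodes off entirely.

```python
from collections import deque


def hop_counts(rows, cols, sink, failed=frozenset()):
    """Hops from every reachable live node to the sink (hypothetical helper).

    Assumes each node links only to its up/down/left/right neighbours and
    that samples follow shortest paths; neither is stated in the text.
    """
    if sink in failed:
        return {}
    hops = {sink: 0}
    queue = deque([sink])
    while queue:
        r, c = queue.popleft()
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and nxt not in failed and nxt not in hops:
                hops[nxt] = hops[(r, c)] + 1
                queue.append(nxt)
    return hops


if __name__ == "__main__":
    healthy = hop_counts(3, 3, sink=(0, 0))
    degraded = hop_counts(3, 3, sink=(0, 0), failed={(0, 1)})
    print("hops from (0, 2), no failures:", healthy[(0, 2)])
    print("hops from (0, 2), node (0, 1) failed:", degraded[(0, 2)])
```

Under these assumptions, a single failed node next to the sink forces its neighbour to route around it, which is the kind of situation in which spare devices or local adaptation would become relevant.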

