Keys to Writing Efficient Embedded Code

by BILL TRUDELL

EMBEDDED SYSTEMS PROGRAMMING, OCTOBER 1997

A key to writing efficient real-time embedded software is to understand clearly your processor's architecture, the programming language, the compiler's features, and the object model used by the compiler. With this understanding, you can identify potentially slow code, make the code faster, and thus write more efficient applications.

Nance Paternoster

I recently had an assignment in which I was responsible for identifying ways to write efficient embedded code. What I discovered isn't rocket science: common mistakes, misunderstandings, or assumptions about the demands made on the compiler, and over-estimating the power of the microprocessor, can adversely impact the execution time of an application. Most of my effort focused on implementing code that doesn't enable floating-point operations, but instead relies on the math libraries supplied by the compiler vendor. Examples are presented primarily in C, but compiled in C++. I will leave the analysis of virtual tables and the like to the C++ experts; I hope a good compiler vendor does a reasonable job of implementing such things. Most of what I learned, though, can be applied to any programming language.

Inefficient code seems to be more closely related to the human condition than to the chosen programming language. Slow code is probably slow because that's the way it was written, however unintentionally. I do believe that it's better to first write code that is correct and then to optimize it. There will always be another compiler switch, faster clock chip, or newer processor around the bend. Well-written test drivers can prove equality between two implementations.

My general assumption is that to implement efficient embedded software, a developer must be familiar with the code that the compiler generates, as well as with the microprocessor architecture. Writing efficient code, though, can sometimes make it less portable.
Code written in Assembler is usually processor-specific and not portable. The code you write today will very likely need to run on a different processor in a year or two. Using Assembler to make the code fast, then, might not be prudent, except for interrupt service routines or frequently-used functions.

Analysis of any improvements made to working code is very important. Validate changes to ensure that errors have not been introduced. Make sure that the desired level of precision and the accuracy with which calculations are performed is maintained or is adequate. You can easily overlook rounding and truncation errors.

DATA TYPE SPECIFICATIONS

One common oversight is specifying the wrong data type and then allowing the compiler or preprocessor to convert the type automatically. Automatic type conversion is generally taken for granted, but it does chew up valuable processor time. Without looking at the related Assembler, the code might compile, link, run, and produce the right output, yet be very inefficient.

Table 1 contrasts two code segments generated for a processor without an FPU. The segment on the left omitted the single-precision floating-point specifier "f", an easily overlooked mistake. The value 10.0 defaulted to double precision and forced the numerator to first be converted to double precision before the double-precision divide. The result of the division, a double-precision value, is then converted back to single precision. Without an FPU, these conversion operations, as implemented in a software math library, are very expensive compared to integer operations.

Correctly specifying the divisor as a single-precision value produces the code shown in the right-hand segment of Table 1. A single-precision division is used and the type conversions are avoided. If the numerator were an integer, a conversion from long to float would be required, a time-consuming operation that could be avoided if the data type were specified as float.
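The Table 1 situation is easy to reproduce. Here is a minimal sketch of the idea; the function names are my own illustration, not the article's code, and only the divisor value 10.0 comes from the discussion above:

```c
/* A minimal sketch of the Table 1 mistake; function names are mine.
 * Without the 'f' suffix, 10.0 is a double constant, so on a target
 * without an FPU the compiler emits three software math-library
 * calls: float-to-double, double divide, and double-to-float. */
float scale_slow(float raw)
{
    return raw / 10.0;    /* divisor defaults to double precision */
}

/* With the 'f' suffix the whole expression stays in single
 * precision, so only the single-precision divide is emitted. */
float scale_fast(float raw)
{
    return raw / 10.0f;   /* single-precision divisor */
}
```

Both functions return the same value to within rounding; the difference only shows in the generated Assembler, where the extra library calls become visible if you have the compiler emit an assembly listing.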
(See section A.6 of The C Programming Language by Kernighan and Ritchie.)

Single-precision accuracy is probably used more frequently than double precision. Therefore, if double precision is required, a simple comment in the code would remove all doubt as to the developer's intention and design.

AUTOMATIC PROMOTIONS

Implied promotions can easily be taken for granted. In some cases you'll find it desirable or necessary to write and debug the code first in a PC environment, and later port or recompile it for the embedded processor. The clock speed on the PC will usually be much faster than the embedded hardware, and the PC will surely have a floating-point unit. The implied promotions performed by the compiler for the embedded code might not execute as fast as they did on the workstation. Standard math routines usually take double-precision inputs and return double-precision outputs. If only single precision is required, the return value should immediately be cast back to single precision, provided that accuracy and overflow conditions are satisfied. If this isn't done, further promotions can be precipitated, causing slower execution.

Table 2 contrasts using the sqrt() function's return value as is with casting it back to single precision. Using the sqrt() function as is forces the other variables to be promoted. Casting the return of the sqrt() function replaces the double-precision multiply and divides with single-precision versions, which should execute faster because, in this case, they're implemented in software. If the input to sqrt() were of type double instead of float, the costly call to convert the float to double could be avoided.
REWRITING AND REARRANGING EXPRESSIONS

Rearranging operands and operators in an equation can give the preprocessor a better chance at pre-evaluating expressions at compile time instead of run time, saving clock cycles of execution for other important operations. The equation used in Table 2 can be rearranged for faster execution without losing readability, as shown in Table 3. Significant savings result because a single-precision division is no longer necessary: 17.0f/10.0f is equivalent to 1.7f.

In general, for both native instruction sets and floating-point emulation, divides take much longer to execute than multiplies.
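A sketch of the rearrangement, with function names of my own; only the constants 17.0f and 10.0f come from the discussion above:

```c
/* C evaluates x * 17.0f / 10.0f left to right, as (x * 17.0f) / 10.0f,
 * so the two constants are never adjacent and a run-time
 * single-precision divide remains in the generated code. */
float scale_original(float x)
{
    return x * 17.0f / 10.0f;     /* multiply, then divide */
}

/* Grouping the constants lets the compiler fold 17.0f / 10.0f into
 * the single constant 1.7f at compile time, leaving one multiply.
 * Verify that the slightly different rounding is acceptable. */
float scale_rearranged(float x)
{
    return x * (17.0f / 10.0f);   /* equivalent to x * 1.7f */
}
```

Since divides take much longer than multiplies, the same idea applies to any division by a constant: dividing by 10.0f can often be rewritten as multiplying by 0.1f, again subject to a check on rounding behavior.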