UW-Madison ECE 734 - “MMX Technology” An Optimization Outlook

This preview shows page 1-2-3 out of 9 pages.


“MMX Technology” An Optimization Outlook
for ECE 734, Fall 2000
Data Path and Control Unit Implementation.

Manoj Geo Varghese
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
Madison, WI 53706, [email protected]

My study aims to provide an overview of the factors that are key to MMX Technology optimizations. The inspiration for this work is my in-class presentation dated 10/18/2000, through which I started to understand the performance issues facing MMX today. “Matching Algorithms to the MMX instruction capabilities and the micro-architecture of the processor is the key to extracting the best performance. Algorithms, software and hardware must be designed together for good implementation.” [3]. This study is mainly focused on the hardware capabilities needed for optimization in MMX. The reference of the study is Intel MMX™ Technology and the Intel micro-architecture of the Pentium 4.

Keywords: MMX Technology, Intel Micro-Architecture, Pipelining, Cache, Multimedia.

1. INTRODUCTION

Developing History: MMX Technology started to boom early in the 1990s, and in 1996 Intel released MMX™ into Intel Pentium processors. 57 new instructions that treated data in SIMD [1,7] fashion were added to the existing Pentium micro-architecture. These instructions could multiply, shift, add/subtract with saturation or wraparound, perform logical operations, and so on, on 64 bits at a time, and these bits could be packed as bytes, words, doublewords, or a quadword [1] (data parallelism). The parallel operations yielded a speedup of 2 to 8 times over the existing integer implementation [3,10].

Following a similar underlying concept, in 1999 Intel introduced a new generation of IA-32 microprocessor, the Pentium III. The processor introduced 70 new instructions, which included MMX™ technology enhancements, SIMD floating-point instructions, and cacheability instructions [1]. These Streaming SIMD Extensions (SSE) [1,5] enabled 4 single-precision floating-point operations per clock cycle.

Then came the “Next Generation Technology” [7], the Intel Pentium 4 processor, in 2000, which is the focus of this study.

2. DESCRIPTION OF THE ISSUES

The first generations of MMX Technology faced many issues:

Latency: The use of MMX instructions should follow certain latency rules [1,4,6,11]. Some of these rules, with reference to the Pentium processor, are:

1. After modifying an MMX register, wait until the next clock cycle before reading the same register [1,3]. For example, consider the code:

    movq mm0,[eax]   ;U-Pipe
    movq mm3,mm2     ;V-Pipe
    paddw mm0,mm1    ;U-Pipe
    movq mm2,mm0     ;STALL

   This sequence should normally take only two clock cycles to execute, but it takes one more clock cycle because of the stall that occurs when the latency rule is violated. If the sequence is shifted by half a clock, so that it becomes

    movq mm0,[eax]   ;V-Pipe
    movq mm3,mm2     ;U-Pipe
    paddw mm0,mm1    ;V-Pipe
    movq mm2,mm0     ;U-Pipe

   no stall occurs.

2. After issuing a multiply instruction (PMADDWD, PMULHW or PMULLW), wait three clocks before using the result.

3. After modifying an MMX register, wait two clocks before storing the result to either memory or an integer register.

To avoid these latencies, the programmer has to follow many pairing rules. For example, in each clock cycle only one of pmaddwd, pmulhw, pmullw or any other multiplication instruction should be executed [1,3]. Otherwise, execution takes two or more clock cycles if there are stalls due to the latency rules.

Pairing MMX Instructions [1]: There are four basic MMX instruction pairing rules on the Pentium processor. Two of the major rules are:

1. In each clock, at most one MMX multiplication instruction (PMADDWD, PMULHW or PMULLW) can be executed.

2. In each clock, at most one MMX shift or unpack instruction can be executed.

Pairing of Integer and MMX instructions: There are many rules for pairing these two types of instructions. An integer instruction is a pairable instruction for the pipe [5] where it is being executed, provided the MMX instruction doesn't refer to memory or to any integer register.

Pairing of Floating Point and MMX instructions [1]: These types of instructions should not be mixed at the instruction level [1,7]. If needed, they can be mixed at the module level, but each transition between the two instruction types takes about 50 clocks, and hence performance decreases.

Cache Consideration: If we are trying to implement an image processing application, we need to work with large images, say 640*480. If bpp is 8, then it requires about 2.3 Mbit (roughly 300 KB) if DPCM is used [1], wherein each pixel after the first is stored as the difference in intensity from the previous pixel. Such a large cache is very complex and costly, and hence this becomes a bottleneck in such applications [8].

Pipelining: Pipelining was the key to increasing processing speed for MMX Technology. The Pentium 4 has 20 pipeline stages, and thus we can anticipate great results from it. Only time and competitors can tell the exact worth of all the welcome changes that Intel has introduced into this field. But we will study what the Pentium 4 has to offer as a new-generation processor.

The major issues mentioned above are due to the limitations of the hardware support available to the instruction set. The Pentium 4 micro-architecture is a revolution in that respect, and it provides one of the best micro-architectures that the field of processor technology has seen until now [1,5,7].

What determines true processor performance [1,5]? Processor performance is determined by the time taken to execute the given application. The performance can be described by the formula:

Performance = Clock Frequency * Instructions Per Clock Cycle [9]
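As a rough illustration of the formula above, the following Python sketch applies it to the four-instruction MMX sequence from Section 2, which the text says takes two clocks when paired correctly and three clocks when the latency rule is violated. The function names and the 200 MHz clock frequency are assumptions chosen only for this example; they do not come from the original study.

```python
def ipc(instructions: int, clocks: int) -> float:
    """Instructions per clock cycle for a code sequence."""
    if clocks <= 0:
        raise ValueError("clocks must be positive")
    return instructions / clocks


def performance(clock_hz: float, ipc_value: float) -> float:
    """Performance = Clock Frequency * Instructions Per Clock Cycle."""
    return clock_hz * ipc_value


# Assumed clock frequency, used purely for illustration.
CLOCK_HZ = 200e6

# Four-instruction MMX sequence from the latency example in Section 2.
paired = performance(CLOCK_HZ, ipc(4, 2))   # no stall: two clocks
stalled = performance(CLOCK_HZ, ipc(4, 3))  # one extra clock caused by the stall

print(f"paired:  {paired / 1e6:.1f} M instructions/s")
print(f"stalled: {stalled / 1e6:.1f} M instructions/s")
print(f"ratio paired/stalled: {paired / stalled:.2f}")
```

Because the clock frequency cancels in the ratio, the single stall costs the same relative amount (two clocks versus three) at any clock speed. This is why the pairing and latency rules matter independently of how fast the processor is clocked.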

