MIT 6.375 - Data Movement Control on a PowerPC - D507514

School name: Massachusetts Institute of Technology

Course: 6.375 - Complex Digital Systems

Pages: 48

This preview shows page 1-2-3-23-24-25-26-46-47-48 out of 48 pages.

Data Movement Control on a PowerPC
Silas Boyd-Wickizer and Asif Khan

What this presentation is about
  Intuition for why multicore caches are underutilized
  Preliminary design for three new instructions
  Toy benchmarks show improved performance

Caches are crucial for performance
  [Diagram: four cores; latency label: 250 cycles]

Potential solution is one giant shared cache
  [Diagram: four cores sharing one cache; latency label: 50 cycles]
  Applications have access to entire cache capacity, no false sharing issues, etc.

Multicore caches are distributed
  [Diagram: four cores with L1, L2, and L3 caches; latency labels added in turn: 3 cycles, 13 cycles, 50 cycles]

Difficult to use multicore caches efficiently
  [Diagram: four cores with L1, L2, and L3 caches]

Hard to access all of on-chip cache
  [Diagram: four cores with L1, L2, and L3 caches; size labels: 64 Kbytes, 2 Mbytes, 512 Kbytes, 1.7 Mbytes]

Expensive to access far away caches
  [Diagram: four cores with L1, L2, and L3 caches; latency label: 100 cycles]

Prototype extensions to hardware: DMC instructions
  cpush: store a cache line in another core's cache
  clookup: look up which cache holds an address
  cmsg: efficient access to data in another core's cache
  Provide some of the benefits of a single fast shared cache

Memory hierarchy
  Per-core L1 caches
  Inclusive shared L2
  MSI cache coherence protocol

cpush: copy cache line to another core's cache
  cpush address, core id
  Copies cache line at address to core with core id
  If address is marked S in source L1: copy to destination and mark S
  If address is marked M in source L1: set source copy to I, copy to destination and mark M
  If address is marked I in source L1: ignore

cpush example: thread migration
  To migrate thread:
    source core saves register values in buffer
    source core puts buffer on destination core's runqueue
    destination core restores register values to execute thread
  Source core's cache will hold the buffer and thread's working set
  Use cpush to move the buffer and thread's working set

clookup: look up location of an address
  clookup address
  Returns the nearest core ID that caches address
  If address is M or S in source L1: return source ID
  If address is invalid in source L1 and marked S or M in L2 directory: return remote ID
  If address is invalid in source L1 and invalid in L2 directory: return 1

clookup example: cache management run-times
  Originally implemented clookup to help test cmsg
  Some software run-times try to manage cache contents
  Maintain a map from object address to cache
  Essentially duplicates hardware state
    Inaccurate
    Expensive
  Replace software map with clookup

cmsg: efficient access to data in another core's cache
  cmsg address, pc, argument
  Looks up the core that caches address
  Interrupts the core, causing it to execute the function at pc, passing argument as an argument
  Cost roughly equivalent to L2 cache miss, or the cost of an inter-core miss
  If address is M or S in source L1: return 0, drop message
  If address is I in L2: return 0, drop message
  If address is cached in a remote L1: return 1, send message

cmsg example: shared data structures
  Many applications use shared data structures
  E.g., Linux uses linked lists in many subsystems

    do_something()
      spin_lock(lock)
      item = list_pop(list)
      update_metadata(list)
      spin_unlock(lock)

  Three cache misses: list_entry, list_entry->next->prev, list_entry->prev->next
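The cpush and clookup rules on the slides can be sketched as a small software model. This is only an illustration under stated assumptions, not the authors' hardware design: the `Machine` class, the `install` helper, and the `NOT_CACHED` sentinel are hypothetical names, "nearest" core is modelled as the lowest core id, and the sentinel value -1 is an assumption (the slide text as extracted shows the not-cached return value as "1").

```python
from enum import Enum


class State(Enum):
    M = "modified"
    S = "shared"
    I = "invalid"


NOT_CACHED = -1  # assumed sentinel; not confirmed by the slides


class Machine:
    """Toy model: per-core L1 state tables plus an inclusive L2 directory."""

    def __init__(self, num_cores):
        # l1[core][address] -> State; a missing entry means I.
        self.l1 = [dict() for _ in range(num_cores)]
        # l2_dir[address] -> set of core ids whose L1 holds the line in M or S.
        self.l2_dir = {}

    def state(self, core, address):
        return self.l1[core].get(address, State.I)

    def install(self, core, address, new_state):
        """Hypothetical helper: set a line's L1 state and update the directory."""
        if new_state is State.I:
            self.l1[core].pop(address, None)
            self.l2_dir.get(address, set()).discard(core)
        else:
            self.l1[core][address] = new_state
            self.l2_dir.setdefault(address, set()).add(core)

    def cpush(self, source, address, dest):
        src_state = self.state(source, address)
        if src_state is State.S:
            # S in source L1: copy to destination and mark S.
            self.install(dest, address, State.S)
        elif src_state is State.M:
            # M in source L1: invalidate source, copy to destination, mark M.
            self.install(source, address, State.I)
            self.install(dest, address, State.M)
        # I in source L1: ignore.

    def clookup(self, source, address):
        if self.state(source, address) in (State.M, State.S):
            return source
        holders = self.l2_dir.get(address, set())
        if holders:
            return min(holders)  # assumption: lowest id stands in for "nearest"
        return NOT_CACHED


# Thread-migration flavoured usage: core 0 holds the register buffer in M,
# then pushes it to core 2 before core 2 resumes the thread.
BUF = 0x1000
m = Machine(4)
m.install(0, BUF, State.M)
m.cpush(0, BUF, 2)
owner_seen_by_core0 = m.clookup(0, BUF)
missing = m.clookup(0, 0x2000)
```

After the push, the source copy is invalid and a lookup from core 0 reports core 2 as the holder, which is the behaviour the thread-migration slide relies on.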
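The cmsg return-value rules and the shared-list example can be sketched in the same spirit. Everything named here is hypothetical: the `Core` class, the `pending` queue standing in for an interrupt, and `handle_interrupts` are illustration only, the first remote holder found stands in for the core the hardware would pick, and the slides do not say how a result gets back to the sender.

```python
from collections import deque


class Core:
    """Toy core: a set of cached addresses and a queue of pending interrupts."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.l1 = set()         # addresses held in M or S in this core's L1
        self.pending = deque()  # (function, argument) pairs awaiting execution

    def handle_interrupts(self):
        # Run every queued function locally, where its data is already cached.
        results = []
        while self.pending:
            fn, argument = self.pending.popleft()
            results.append(fn(argument))
        return results


def cmsg(cores, source, address, fn, argument):
    """Return 1 if a message was sent to a remote core, else 0 (dropped)."""
    if address in cores[source].l1:
        return 0  # M or S in source L1: drop message
    for core in cores:
        if core.core_id != source and address in core.l1:
            core.pending.append((fn, argument))  # stands in for the interrupt
            return 1  # cached in a remote L1: send message
    return 0  # not cached anywhere (I in L2): drop message


def pop_item(lst):
    # The list operation runs on the core that already caches the list,
    # instead of pulling the list entries into the sender's cache.
    return lst.popleft()


LIST_ADDR = 0x2000
shared_list = deque(["a", "b", "c"])
cores = [Core(i) for i in range(4)]
cores[1].l1.add(LIST_ADDR)  # core 1 currently caches the list

sent = cmsg(cores, 0, LIST_ADDR, pop_item, shared_list)
popped = cores[1].handle_interrupts()
local = cmsg(cores, 1, LIST_ADDR, pop_item, shared_list)
uncached = cmsg(cores, 0, 0x9999, pop_item, shared_list)
```

When cmsg returns 0, a caller would presumably fall back to running the function itself, for example under the spin lock shown on the slide.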
