Reducing Control Power in CGRAs with Token Flow

Hyunchul Park, Yongjun Park, and Scott Mahlke, University of Michigan

Coarse-Grained Reconfigurable Architecture (CGRA)
- Array of PEs connected in a mesh-like interconnect
- High throughput with a large number of resources
- Distributed hardware offers low cost and power consumption
- High flexibility with dynamic reconfiguration

CGRA: An Attractive Alternative to ASICs
- Suitable for running multimedia applications for future embedded systems: high throughput, low power consumption, high flexibility
- Morphosys: 8×8 array with a RISC processor
- SiliconHive: hierarchical systolic array
- ADRES: 4×4 array with a tightly coupled VLIW; runs Viterbi at 80 Mbps and H.264 at 30 fps at 50-60 MOps/mW
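To make the array structure concrete, here is a minimal Python sketch (my own illustration; the PE class and its fields are invented, not taken from Morphosys, SiliconHive, or ADRES) of a 4×4 PE array with a mesh-like interconnect:

```python
# Illustrative model of a CGRA processing-element (PE) array.
# Each PE connects to its north/south/east/west neighbors.
from dataclasses import dataclass, field

@dataclass
class PE:
    row: int
    col: int
    neighbors: list = field(default_factory=list)  # (row, col) of linked PEs

def build_mesh(rows: int = 4, cols: int = 4) -> dict:
    """Build a mesh-connected PE array."""
    pes = {(r, c): PE(r, c) for r in range(rows) for c in range(cols)}
    for (r, c), pe in pes.items():
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            if (r + dr, c + dc) in pes:
                pe.neighbors.append((r + dr, c + dc))
    return pes

mesh = build_mesh()
print(len(mesh), "PEs,", sum(len(p.neighbors) for p in mesh.values()) // 2, "links")
# -> 16 PEs, 24 links
```

Every PE and interconnect link in such a mesh is a resource that needs configuration bits each cycle, which is where the control power problem below comes from.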


Control Power Explosion
- Large number of configuration signals: a distributed interconnect with many resources to control
- Nearly 1000 bits each cycle
- No code compression technique has been developed for CGRAs
- Fully decoded instructions are stored in memory: 45% of total power

Code Compressions
- Huffman encoding: high efficiency, but a sequential decoding process
- Dictionary-based: recurring patterns are stored in a dictionary, but not many patterns are found in CGRAs
- Instruction-level code compression (no-op compression: Itanium, DSPs): only 17% of instructions are no-ops in a CGRA

Fine-grain Code Compression
- Compress unused fields rather than the whole instruction: opcode, MUX selection, register address
- Only 35% of fields contain valid information
- The instruction format needs to be stored in the memory: information about which fields exist
- Significant overhead: 172 bits (20%) for a 4×4 CGRA
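As a rough illustration of the trade-off, the following sketch (field names and widths are invented for the example) stores only the fields an instruction actually uses, plus a per-instruction presence bitmask saying which fields exist; that bitmask is the instruction-format overhead quoted above:

```python
# Fine-grain code compression: keep only valid fields + a format bitmask.
FIELDS = {"opcode": 4, "mux0": 3, "mux1": 3, "reg_addr": 4}  # name -> bit width

def compressed_bits(used_fields):
    """Bits per instruction: one presence bit per possible field + used fields."""
    format_bits = len(FIELDS)                      # overhead stored in memory
    data_bits = sum(FIELDS[f] for f in used_fields)
    return format_bits + data_bits

full_bits = sum(FIELDS.values())                      # fully decoded instruction
alu_op = compressed_bits(["opcode", "mux0", "mux1"])  # ALU op, no register write
route  = compressed_bits(["mux0"])                    # pure routing through one MUX
print(full_bits, alu_op, route)                       # -> 14 14 7
```

The presence bits shrink sparse instructions, but they must themselves be fetched every cycle; eliminating exactly this overhead is the goal of the token network introduced next.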

Dynamic Instruction Format Discovery
- Resources need configuration only when data flows through them, so the instruction format can be discovered by looking at the data flow
- The token network from dataflow machines can be utilized: a token is 1 bit of information indicating incoming data in the next cycle
- Each PE observes incoming tokens and determines the instruction format (FU: dest <- src0 + src1; RF: reg write)

Dynamic Configuration of PEs
- Each cycle, tokens are sent to the consuming PEs
- Consuming resources collect incoming tokens, discover instruction formats, and fetch only the necessary instruction fields
- In the next cycle, the resources can execute the scheduled operations

Token Generation
- Tokens are generated at the beginning of the dataflow: live-in nodes in RFs
- Each RF read port needs token generation info: 26 read ports in a 4×4 CGRA
- 26 bits for token generation vs. 172 bits for the instruction format

Token Network
- A token network sits between the datapath and the decoder: no instruction format in the memory, only token generation info
- Adds 1 cycle between the IF and EX stages
- Created by cloning the datapath: a 1-bit interconnect with the same topology, where each resource is translated to a token processing module
- Encodes dest fields, not src fields

Register File Token Module
- Write port MUXes are converted to token receivers, which determine the selection bits
- Read ports are converted to token senders; tokens are initially generated here
- Token generation information is stored in a separate memory

FU Token Module
- Input MUXes are converted to token receivers
- An opcode processor fetches the opcode field if necessary and determines the token type (data/pred) and latency

System Overview (figure: the token network between the datapath and the decoder)
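A minimal simulation of the discovery step, assuming a two-input FU whose port and field names are invented here (an illustration of the idea, not the authors' hardware):

```python
# Each FU input MUX has several sources; a token on source i of a MUX both
# marks the operand as live and yields the MUX selection (= token position).
def discover(mux_tokens):
    """mux_tokens: {"src0": [token bit per source], "src1": [...]}.
    Returns (instruction fields to fetch, derived MUX selections)."""
    select = {}
    for mux, bits in mux_tokens.items():
        live = [i for i, b in enumerate(bits) if b]
        if live:
            select[mux] = live[0]          # selection bits come for free
    # Both operands live -> a real FU op: fetch opcode and dest fields only.
    fetch = ["opcode", "dest"] if {"src0", "src1"} <= select.keys() else []
    return fetch, select

print(discover({"src0": [0, 1, 0, 0], "src1": [1, 0, 0, 0]}))
# -> (['opcode', 'dest'], {'src0': 1, 'src1': 0})
print(discover({"src0": [0, 0, 0, 0], "src1": [0, 0, 0, 0]}))
# -> ([], {})  nothing flows through this PE, so nothing is fetched
```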
Experimental Setup
- Target: multimedia applications for embedded systems
- Modulo scheduling of compute-intensive loops in 3D graphics, an AAC decoder, and an AVC decoder (214 loops)
- Three control path designs: baseline (fully decoded instructions), static (fine-grain code compression with the instruction format stored in memory), and token (fine-grain code compression with the token network)

Code Size / Performance
- Fine-grain code compression increases code efficiency, and the token network improves it further
- Performance degradation comes from the sharing of fields and from allowing only 2 dests

Power / Area
- SRAM read power is greatly reduced with the token network
- Introducing the token network slightly increases power and area; the area overhead can be mitigated by the reduced SRAM area
- The hardware overhead for migrating staging predicates into the token network is minimal

Staging Predicates Optimization
- Modulo scheduled loops consist of a prolog (filling the pipeline), kernel code (steady state), and an epilog (draining the pipeline)
- Only the kernel code is stored in memory; staging predicates control the prolog/epilog phases
- (figure: overlapped execution of iterations i0, i1, i2 spaced by the initiation interval II)

Migrating Staging Predicates
- A staging predicate is control information, not data dependent; 10% of configurations are used just for routing staging predicates
- Move staging predicates into the control path by widening each token by 1 bit: the staging predicate
- Only top nodes are guarded; the staging predicate flows along with the tokens
- Benefits: code size reduction and performance increase
- (figure: the staging predicate flows with the data through stages 0-3)

Code Size / Performance
- Code size is reduced by 9%
- Migrating staging predicates improves performance by 7%, a 5% increase over the baseline

Power / Area
- Power and area of the token network increase due to the valid bit
- The reduced code size decreases SRAM power/area
- The overall overhead for migrating staging predicates is minimal

Overall Power
- System power measured for a kernel loop in AVC
- Introducing the token network reduces overall system power by 25% (226.4 mW to 170.0 mW) while achieving a 5% performance gain

Conclusion
- Fine-grain code compression is a good fit for CGRAs
- The token network eliminates the instruction format overhead through dynamic discovery of the instruction format, with small overhead (< 3%)
- Migrating staging predicates to the token network improves performance
- The approach is applicable to other highly distributed architectures

Questions

Token Sender
- Each output port of a resource is converted into a token sender: FU outputs, routing MUX outputs, and register file read ports
- Tokens are sent only to the consumers specified in the dest fields
- Only two destinations are allowed for each output, which potentially limits performance
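A sketch of the sender side under the same toy model (names are invented; the two-destination limit mirrors the slide above):

```python
# Token sender: dest fields (not src fields) are kept in configuration
# memory; each output may name at most two consumers.
class TokenSender:
    MAX_DESTS = 2  # hard limit per output; can constrain the schedule

    def __init__(self, dests):
        assert len(dests) <= self.MAX_DESTS, "at most two destinations"
        self.dests = dests          # consumer ports named by the dest fields

    def fire(self, token_wires):
        """Assert a 1-bit token toward each consumer for the next cycle."""
        for consumer in self.dests:
            token_wires[consumer] = True

wires = {}
TokenSender(["pe5.src0", "rf0.write0"]).fire(wires)
print(wires)  # -> {'pe5.src0': True, 'rf0.write0': True}
```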

Token Receiver
- Input MUXes are converted to token receivers
- Dest fields are stored in the memory, not src fields
- MUX selection bits are determined from the position of the incoming token (see the sketch after these slides)

Who Generates Tokens
- Tokens are generated at the start of the dataflow: live-ins
- Tokens terminate when they get into a register file, but terminated tokens can be regenerated: the read ports of register files generate tokens
- Token generation information for RF read ports is stored separately: 26 read ports in a 4×4 CGRA
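A matching receiver sketch (assumed port ordering, invented names): the MUX selection bits never live in memory; they are recovered from which input the token arrived on:

```python
# Token receiver: derive MUX selection bits from the incoming token position.
def mux_select(port_tokens):
    """port_tokens: one token bit per MUX input, in port order.
    Returns the selection index, or None when no data arrives."""
    arriving = [i for i, t in enumerate(port_tokens) if t]
    assert len(arriving) <= 1, "two producers must not target one MUX at once"
    return arriving[0] if arriving else None

print(mux_select([0, 0, 1, 0]))  # token on port 2 -> select input 2
print(mux_select([0, 0, 0, 0]))  # idle cycle -> no selection needed
```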

Reducing Decoder Complexity
- Partition the configuration memory and the decoder: a trade-off between the number of memories and decoder complexity
- Design space exploration for memory partitioning: which fields are stored in the same memory
- Share field entries in the memory among under-utilized fields
- (figure: four partitioned memories, each with its own decoder, feeding the token network)

Token Network Memory Partitioning
- Bundle fields of the same type for field-width uniformity
- Design space exploration result for a 4×4 CGRA, where sharing degree = total entries / total fields
- Reduces decoder complexity by 33% over naïve partitioning
- Sharing incurs less than 1% performance degradation
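A rough sketch of the bundling idea (the field list, widths, and entry count are made up for illustration): group fields of the same width into one memory so that entries can be shared among under-utilized fields, then measure the sharing degree:

```python
# Bundle configuration fields by bit width for field-width uniformity,
# then share memory entries among the fields in each bundle.
from collections import defaultdict

fields = [("opcode", 4), ("mux_a", 3), ("mux_b", 3), ("reg", 4), ("mux_c", 3)]

bundles = defaultdict(list)            # bit width -> fields in one memory
for name, width in fields:
    bundles[width].append(name)

ENTRIES_PER_MEMORY = 2                 # shared entries provisioned per memory
total_entries = ENTRIES_PER_MEMORY * len(bundles)
sharing_degree = total_entries / len(fields)   # total entries / total fields
print(dict(bundles))   # -> {4: ['opcode', 'reg'], 3: ['mux_a', 'mux_b', 'mux_c']}
print(sharing_degree)  # -> 0.8, i.e. fewer entries than fields
```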
