A Just-in-Time Customizable Processor

Liang Chen, Joseph Tarango†, Tulika Mitra, Philip Brisk†
School of Computing, National University of Singapore
†Department of Computer Science & Engineering, University of California, Riverside
{chenliang, tulika}@comp.nus.edu.sg, {jtarango, philip}@cs.ucr.edu
Session 7A: Efficient and Secure Embedded Processors

What is a Customizable Processor
- Application-specific instruction set
- Extension to a traditional processor
- Complex multi-cycle instruction set extensions (ISEs)
- Specialized data movement instructions
Figure labels: Control Logic Unit; Extended Arithmetic Logic Unit; Instruction & Data in; Data out

ASIP Model
Properties: high parallelism; low energy; high performance; no flexibility with ISEs
- An Application-Specific Instruction-set Processor (ASIP) is tailored to benefit a specific application, combining the flexibility of a CPU with the performance of an Application-Specific Integrated Circuit (ASIC).
- ASIPs use static logic to speed up specific operator chains that occur frequently and are usually costly within the CPU.
- These ISEs are tightly coupled into the CPU pipeline and significantly reduce energy and CPU time.
- ASIPs lack flexibility: ISEs must be known at ASIC design time, which requires the firmware (software application) to be developed before the ASIC is designed.

Dynamically Extendable Processor Model
Properties: very flexible ISEs; medium energy; medium performance; slow to swap; programmability
- These processors use dynamic logic to speed up specific operator chains that occur frequently and are usually costly within the CPU.
- These ISEs are loosely coupled into the CPU pipeline and significantly reduce energy and CPU time.
- Very flexible: ISEs can be defined after design time, allowing the firmware (software application) to be developed in parallel with the ASIC design.
- High cost to reconfigure the fabric, usually in the millisecond range or larger, depending on the size of the reconfigurable fabric.
- Developing ISEs requires hardware synthesis, design, and planning.

JiTC Processor Model
Properties: fast swapping; programmability; medium-flexible ISEs; high performance; low-medium energy
- These processors use near-ideal logic to speed up specific operator chains that occur frequently and are usually costly within the CPU.
- These ISEs are tightly coupled into the CPU pipeline and significantly reduce energy and CPU time.
- Flexible with respect to the ISA; accelerator programming is transparent to firmware (software application) development.
- Low reconfiguration cost: the fabric takes one to two cycles to fully reconfigure.
- ISEs are developed within the compiler, so software is automatically mapped onto the fabric.
- Profiling and compiler optimizations can be done on the fly, and binaries can be swapped.

Comparison of ISE Models
- ASIP: high parallelism; low energy; high performance; no flexibility with ISEs; high development costs
- Dynamically extendable processor: very flexible ISEs; medium energy; medium performance; slow to swap; difficult to program
- JiTC processor: fast swapping; automatic and easily programmed; medium-flexible ISEs; high performance; low-medium energy

Supporting Instruction-Set Extension
Figure labels (pipeline): I$, RF, D$, RF; Fetch, Decode, Execute, Memory, Write-back; Specialized Functional Unit (SFU); ISE; OP
Figure labels (tool flow): Application; Compile; Profile; Identification; ISE Select & Map; Binary with ISEs

ISE Design Space Exploration
Figure: Dataflow Graph (DFG) of an Instruction Set Extension (ISE) with inputs R1, Imm, R2, R3; operators &, >>, >>, >>, +; outputs 1 and 2. Annotations: Instruction-Level Parallelism (ILP); Inter-operation Parallelism.
- The compiler extracts ISEs from an application (domain).
- Average parallelism is stable across our application domain.
- 4 inputs and 2 outputs suffice.
- The critical path is constrained to a single cycle through operator chaining and hardware optimizations.

Exploring Inner-Operator Parallelism
Benchmarks: MediaBench, MiBench
- Very minimal amount of parallelism detected
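The 4-input, 2-output limit stated above can be checked mechanically for any candidate subgraph of a DFG. The following is a minimal sketch, not the authors' compiler pass: the graph representation (each node mapped to the operands it reads), the function names, and the example graph are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

def io_counts(dfg: Dict[str, List[str]], cut: Set[str], live_out: Set[str]) -> Tuple[int, int]:
    """Count the external inputs and outputs of the candidate subgraph `cut`.

    `dfg` maps each operation to the values it reads (hypothetical encoding).
    `live_out` holds operations whose results are needed after the DFG ends.
    """
    # Inputs: operands read by the cut but produced outside of it.
    inputs = {src for node in cut for src in dfg[node] if src not in cut}
    # Outputs: cut results that are live-out or read by an operation outside the cut.
    outputs = {node for node in cut if node in live_out}
    for node, operands in dfg.items():
        if node not in cut:
            outputs.update(src for src in operands if src in cut)
    return len(inputs), len(outputs)

def fits_ise(dfg, cut, live_out, max_in: int = 4, max_out: int = 2) -> bool:
    """True if the cut respects the 4-input / 2-output limit from the slides."""
    n_in, n_out = io_counts(dfg, cut, live_out)
    return n_in <= max_in and n_out <= max_out

# Hypothetical DFG loosely shaped like the slide figure (inputs r1, imm, r2, r3).
EXAMPLE_DFG = {
    "shl1": ["r1", "imm"],
    "shr1": ["r2", "r3"],
    "and1": ["shl1", "shr1"],
    "add1": ["and1", "r3"],
    "shr2": ["add1", "imm"],
}
EXAMPLE_CUT = set(EXAMPLE_DFG)
EXAMPLE_LIVE_OUT = {"and1", "shr2"}

if __name__ == "__main__":
    print(io_counts(EXAMPLE_DFG, EXAMPLE_CUT, EXAMPLE_LIVE_OUT))
    print(fits_ise(EXAMPLE_DFG, EXAMPLE_CUT, EXAMPLE_LIVE_OUT))
```

A real ISE identifier would enumerate many candidate cuts and also enforce the single-cycle critical-path constraint; this sketch covers only the port-count test.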

Operator Critical Path Exploration
- ISEs with a longer critical path tend to achieve higher speedups.

Hot Operator Sequences
Legend:
- A: Arithmetic (Add, Sub)
- L: Logic (And, Or, Not, Xor, etc.)
- M: Multiply
- S: Shift (Logical & Arithmetic)
- W: Bit Select or Data Movement

Selected Operator Sequences
The 11 hot sequences are: AA, AS, LL, SA, SL, ASA, LLS, LSA, SAS, MWA, WMW.

Data path merging for the arithmetic, logic, and shift sequences:
(a) Identified hot sequences
    Two-operator chains: A+A, A+S, L+L, S+A, S+L
    Three-operator chains: A+S+A, L+L+S, L+S+A, S+A+S
(b) Optimized sequences (consider A and L as equivalent)
    Two-operator chains: A/L+A/L, S+A/L, A/L+S
    Three-operator chains: A/L+S+A/L, A/L+A/L+S, S+A/L+S
(c) Merged sequence (data path): A/L+S+A/L+S

Data path merging for the multiply sequences:
    Three-operator chains: M+W+A, W+M+W
    Consider W as a configurable wire connection: M+A, M
    Merged sequence (data path): M+A
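One way to read the merging step above is that every hot sequence must appear, in order, within a merged data path once A and L are treated as one class and W is treated as a wire. The sketch below tests that property. It assumes that any stage of the merged path can be bypassed when unused, which the slides state only for the ALU; the class label `X` and all function names are hypothetical.

```python
# Merged data paths from the slides, with "X" standing for the A/L class.
MERGED_ALU_SHIFT_PATH = ["X", "S", "X", "S"]  # A/L + S + A/L + S
MERGED_MULTIPLY_PATH = ["M", "X"]             # M + A

HOT_SEQUENCES = ["AA", "AS", "LL", "SA", "SL", "ASA", "LLS", "LSA", "SAS", "MWA", "WMW"]

def normalize(sequence: str) -> list:
    """Map A and L to the shared class X and drop W (a configurable wire)."""
    chain = []
    for op in sequence:
        if op in "AL":
            chain.append("X")
        elif op in "SM":
            chain.append(op)
        elif op != "W":
            raise ValueError(f"unknown operator class: {op!r}")
    return chain

def is_subsequence(chain: list, path: list) -> bool:
    """True if `chain` occurs in `path` in order, skipping (bypassing) stages."""
    stages = iter(path)
    return all(op in stages for op in chain)

def maps_onto_merged_path(sequence: str) -> bool:
    """Check one operator sequence against the matching merged data path."""
    chain = normalize(sequence)
    path = MERGED_MULTIPLY_PATH if "M" in chain else MERGED_ALU_SHIFT_PATH
    return is_subsequence(chain, path)

if __name__ == "__main__":
    for seq in HOT_SEQUENCES:
        print(seq, maps_onto_merged_path(seq))
```

Under these assumptions all 11 hot sequences fit, while a chain such as SSA does not, because the merged path never places two shifts ahead of an A/L stage.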

Basic Functional Unit Design
Inputs (figure color legend): black = inputs from the Register File; blue = from the Complex unit; green = neighbor BFU; red = this BFU; Rcontrol = reconfiguration stream
Functionality:
- ALU includes a bypass
- Shift amount can be set from an input or from the reconfiguration stream
- Local feedback from register

Complex Functional Unit Design
Inputs (figure color legend): black = inputs from the Register File; blue = from the Complex unit; green = neighbor BFU; red = this BFU; Rcontrol = reconfiguration stream
Functionality:
- MAC in parallel with ALU + Shift
- ALU bypass removed to save opcode space

Merged Functional Unit Design
Inputs (figure color legend): black = inputs from the Register File; blue = from the Complex unit; green = neighbor BFU; red = this BFU; Rcontrol = reconfiguration stream
Functionality:
- Independent or chained operation mode
- Chained operation mode has a critical path equal to that of the MAC
- Carry-out from the first unit to the second unit enables 64-bit operations

Interconnect Structure
- Fully connected topology between FUs
- Chained 1-cycle operation for two SFUs in any order
- Result selection for any time step in the interconnect
- Up to two results produced per time step
- Control sequencer enables multiple configurations for the different cycles of one ISE (62 configuration bits total)

Modified In-Order Pipeline
- Instruction buffer allows the control memory to meet timing requirements
- We support up to 1024 ISEs
- ASIPs support up to 20 ISEs

Modified Out-of-Order Pipeline
Figure labels: Fetch 1; Fetch 2; Decode; Rename Registers; Dispatch; Rename Map; Issue; Load Store Queue; Register Read; Execution Units; Write Back; Re-order Buffer; CISE Detect; CISE Configure; Configuration Look-Up Cache; Specialized Functional Units; In-Order; Out-Of-Order

ISE Profiling
Figure labels (CDFG): Start; Stop; Multiply; Load Store; Loop Conditional Check; Add; Shift; Subtract. DFG detail: Inputs 1-4; operators +, -; Outputs 1-2.
- Control Data Flow Graph (CDFG) representation
- Apply standard compiler optimizations: loop unrolling, instruction reordering, memory optimizations, etc.
- Insert cycle delay times for operations
- Ball-Larus profiling
- Execute code
- Evaluate CDFG hotspots

ISE Identification
Figure labels (CDFG): Start; Stop; Multiply (Complex); Load Store; Conditional Check; Add, Add, Shift, Subtract, Shift (Simple)
Example DFG: Inputs 1-4; operators +, -, >>; Outputs 1-2
Stages: Stage 1: Start; Stage 2: ½ Cycle; Stage 3: 1 Cycle

Custom Instructions Mapping
Figure labels (CDFG): Start; Stop; Multiply; Load Store; Conditional Check; Add; Add; Shift; Subtract; Shift. Mapped units: BFU0; BFU1; CFU.
- Reduced 6 cycles to 1 cycle (a 5-cycle reduction)

Schedule ISE using ALAP
Figure: DFG of a custom instruction with 4 inputs and 2 outputs. Inputs: r1, Imm 3, r2, r3; operators: +, >>, >>, >>; outputs: r4, r5.

Routing Resource Graph (RRG)
Figure labels: Inputs r1, Imm 3, r2, r3; Outputs r4, r5; Cycle 0, reconfiguration; Cycle 1, reconfiguration; Multi-Cycle Mapping
- JiTC supports 4 time steps
- Within the RRG mapping we exclude memory accesses

Map ISE onto the Reconfigurable Unit
Figure labels: Inputs r1, Imm 3, r2, r3; operators +, >>, >>, &, >>; Outputs r4, r5; Cycle 0, reconfiguration; Cycle 1, reconfiguration
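As-late-as-possible (ALAP) scheduling, named on the scheduling slide above, places each operation at the latest time step that still lets all of its consumers finish by a deadline. The following is a minimal textbook sketch rather than the authors' scheduler: the successor-list graph format, the unit latencies, and the example DFG are assumptions for illustration.

```python
from typing import Dict, List

def alap_schedule(succs: Dict[str, List[str]],
                  latency: Dict[str, int],
                  deadline: int) -> Dict[str, int]:
    """Return the latest start step of every operation.

    `succs` maps each operation to the operations that consume its result.
    All operations must finish no later than `deadline` (in time steps).
    """
    start: Dict[str, int] = {}

    def latest_start(op: str) -> int:
        if op not in start:
            # An operation must finish before its earliest-starting consumer;
            # operations without consumers finish exactly at the deadline.
            finish_by = min((latest_start(s) for s in succs[op]), default=deadline)
            start[op] = finish_by - latency[op]
        return start[op]

    for op in succs:
        latest_start(op)
    if start and min(start.values()) < 0:
        raise ValueError("deadline is shorter than the critical path")
    return start

# Hypothetical DFG: one add feeding two shifts, plus an independent shift.
EXAMPLE_SUCCS = {"add": ["shr_a", "shr_b"], "shr_a": [], "shr_b": [], "shr_c": []}
EXAMPLE_LATENCY = {op: 1 for op in EXAMPLE_SUCCS}

if __name__ == "__main__":
    # Two time steps here; the slides state that JiTC supports up to four.
    print(alap_schedule(EXAMPLE_SUCCS, EXAMPLE_LATENCY, deadline=2))
```

Note that the independent shift `shr_c` lands in the last step, whereas an as-soon-as-possible schedule would place it in step 0; that late placement is the defining behavior of ALAP.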

Experimental Setup
- Modified SimpleScalar to reflect synthesis results
- Decompiled the binary to detect custom instructions
- Runtime analysis used to select the best candidates to replace with ISEs
- Recompiled a new JITC binary with reconfiguration memory initialization files
- SFU operates at 606 MHz (Synopsys DC, compile-ultra)
- The configuration parameters are chosen to closely match a realistic in-order embedded processor (ARM Cortex-A7) and an out-of-order embedded processor (ARM Cortex-A15).

Experimental Out-of-Order Execution Unit Determination
- No further speedup is achieved beyond 4 SFUs within out-of-order execution

Experimental Runtime Results
- In-order processor: average speedup of 18%, compared with 21% for ASIPs and 23% theoretical
- Out-of-order processor: average speedup of 23%, compared with 26% for ASIPs and 28% theoretical
- Achieved 94.3-97.4% (in-order) and 95.98-97.54% (out-of-order) of the speedup of ASIPs

Summary
- Average speedup of 18% (in-order) and 23% (out-of-order)
- 94.3-97.4% (in-order) and 95.98-97.54% (out-of-order) of the speedup of ASIPs
- On average, the SFU occupies 3.21% to 12.46% of the area of ASIPs
- ISE latency is nearly identical from ASIP to JITC
- For JITC, ISEs contain 2.53 operators on average
- JITC ISEs can take from 1 to 4 time steps for an individual custom instruction
- 90% of ISEs can be executed in one time step
- 99.77% of ISEs can be mapped in 4 time steps
- (7%, 4%) overhead compared to a (simple, complex) execution path

Conclusion
- We proposed a Just-in-Time Customizable (JITC) processor core that can accelerate execution across application domains.
- We systematically design and integrate a specialized functional unit (SFU) into the processor pipeline.
- With support from a modified compiler and an enhanced decoding mechanism, the experimental results show that the JITC architecture offers ASIP-like performance with far superior flexibility.

Questions

Experimental Synthesis Results
- SFU operates at 555 MHz and 606 MHz using ultra optimizations for synthesis
- SFU occupies 80502 µm² of area

Benchmark Details

JiTC Capability
- ISE latency is nearly identical from ASIP to JITC
- For JITC, ISEs contain 2.53 operators on average
- JITC ISEs can take from 1 to 4 time steps for an individual custom instruction
- 90% of ISEs can be executed in one time step
- 99.77% of ISEs can be mapped in 4 time steps
- 32-bit ISA (Instruction Set Architecture)
- Merge two to five instruction entries to have full ISE use
- 8-bit opcode (operation code)
- 4 bits per register
- 10 bits encode the CID (Custom Instruction Identification)
- 4 addressing modes (RRRR, RRRI, RRII, RIII)

Latency Distribution of ISEs on ASIP and SFU
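The field widths listed above (8-bit opcode, 10-bit CID, 4 bits per register, 4 addressing modes) can be illustrated with a small pack and unpack routine. The slides do not give the bit positions, so the layout below is purely an assumption: opcode, then CID, then a 2-bit addressing mode, then three 4-bit register fields, which happens to total 32 bits. Operands beyond those three would have to live in the additional merged instruction entries, which this sketch does not model.

```python
# Field widths from the slides; the field ORDER and the 2-bit mode field
# are assumptions made for this illustration only.
OPCODE_BITS = 8
CID_BITS = 10
MODE_BITS = 2
REG_BITS = 4
NUM_REGS = 3  # assumed number of register fields in one 32-bit entry
MODES = ("RRRR", "RRRI", "RRII", "RIII")

def _check(value: int, bits: int, name: str) -> int:
    """Validate that `value` fits in an unsigned field of `bits` bits."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return value

def encode(opcode: int, cid: int, mode: str, regs: tuple) -> int:
    """Pack one instruction entry into a 32-bit word (hypothetical layout)."""
    if len(regs) != NUM_REGS:
        raise ValueError(f"expected {NUM_REGS} register fields")
    word = _check(opcode, OPCODE_BITS, "opcode")
    word = (word << CID_BITS) | _check(cid, CID_BITS, "cid")
    word = (word << MODE_BITS) | MODES.index(mode)
    for reg in regs:
        word = (word << REG_BITS) | _check(reg, REG_BITS, "register")
    return word

def decode(word: int) -> tuple:
    """Unpack a word produced by `encode` into (opcode, cid, mode, regs)."""
    regs = []
    for _ in range(NUM_REGS):
        regs.append(word & ((1 << REG_BITS) - 1))
        word >>= REG_BITS
    regs.reverse()  # fields were read from the least significant end
    mode = MODES[word & ((1 << MODE_BITS) - 1)]
    word >>= MODE_BITS
    cid = word & ((1 << CID_BITS) - 1)
    word >>= CID_BITS
    opcode = word & ((1 << OPCODE_BITS) - 1)
    return opcode, cid, mode, tuple(regs)

if __name__ == "__main__":
    word = encode(opcode=0xA5, cid=1023, mode="RRRI", regs=(1, 2, 15))
    print(f"{word:#010x}", decode(word))
```

A 10-bit CID distinguishes 2^10 = 1024 custom instructions, which is consistent with the limit of 1024 ISEs stated on the pipeline slide.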
