9. The Execute Pipeline

Dual Issue Pipeline

Fig. 9.1 An example pipeline for a dual-issue BOOM. The first issue port schedules Micro-Ops onto Execute Unit #0, which can accept ALU operations, FPU operations, and integer multiply instructions. The second issue port schedules ALU operations, integer divide instructions (unpipelined), and load/store operations. The ALU operations can bypass to dependent instructions. Note that the ALU in EU#0 is padded with pipeline registers to match latencies with the FPU and iMul units to make scheduling for the write-port trivial. Each Execution Unit has a single issue-port dedicated to it but contains within it a number of lower-level Functional Units.

The Execution Pipeline covers the execution and write-back of Micro-Ops. Although the Micro-Ops will travel down the pipeline one after the other (in the order they have been issued), the Micro-Ops themselves are likely to have been issued to the Execution Pipeline out-of-order. Fig. 9.1 shows an example Execution Pipeline for a dual-issue BOOM.

9.1. Execution Units

Example Execution Unit

Fig. 9.2 An example Execution Unit. This particular example shows an integer ALU (that can bypass results to dependent instructions) and an unpipelined divider that becomes busy during operation. Both functional units share a single write-port. The Execution Unit accepts both kill signals and branch resolution signals and passes them to the internal functional units as required.

An Execution Unit is a module onto which a single issue port schedules Micro-Ops; it contains some mix of functional units. Phrased another way, each issue port from the Issue Queue talks to one and only one Execution Unit. An Execution Unit may contain just a single simple integer ALU, or it could contain a full complement of floating point units, an integer ALU, and an integer multiply unit.

The purpose of the Execution Unit is to provide a flexible abstraction that gives the architect a lot of control over what kinds of Execution Units they can add to their pipeline.

9.1.1. Scheduling Readiness

An Execution Unit provides the issue scheduler with a bit-vector of the functional units it has available. The issue scheduler will only schedule Micro-Ops that the Execution Unit supports. For functional units that may not always be ready (e.g., an un-pipelined divider), the appropriate bit in the bit-vector will be disabled (see Fig. 9.1).
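The readiness check can be sketched as a small behavioral model in plain Scala (not Chisel, and not BOOM's actual code). The bit assignments and the names `FuBits`, `ExeUnitModel`, and `IssueCheck.canIssue` are hypothetical, chosen only for illustration.

```scala
// Behavioral sketch only; bit positions and names are hypothetical.
object FuBits {
  val ALU = 1 << 0
  val MUL = 1 << 1
  val DIV = 1 << 2
  val MEM = 1 << 3
}

// An execution unit advertises the functional units it can accept this cycle.
final case class ExeUnitModel(supported: Int, divBusy: Boolean) {
  // Clear the divider's bit while the un-pipelined divider is occupied.
  def fuAvailable: Int =
    if (divBusy) supported & ~FuBits.DIV else supported
}

object IssueCheck {
  // A Micro-Op may be scheduled only if its required unit's bit is set.
  def canIssue(uopFu: Int, unit: ExeUnitModel): Boolean =
    (uopFu & unit.fuAvailable) != 0
}
```

In this model, a unit built with `FuBits.ALU | FuBits.DIV` keeps accepting ALU Micro-Ops while its divider is busy, but rejects divides until the divider frees up.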

9.2. Functional Units

Abstract Functional Unit

Fig. 9.3 The abstract Pipelined Functional Unit class. An expert-written, low-level functional unit is instantiated within the Functional Unit. The request and response ports are abstracted and bypass and branch speculation support is provided. Micro-Ops are individually killed by gating off their response as they exit the low-level functional unit.

Functional units are the muscle of the CPU, computing the necessary operations as required by the instructions. Functional units typically require a knowledgeable domain expert to implement them correctly and efficiently.

For this reason, BOOM uses an abstract Functional Unit class to “wrap” expert-written, low-level functional units from the Rocket repository (see The Rocket-Chip Repository). However, the expert-written functional units created for the Rocket in-order processor make assumptions about in-order issue and commit points (namely, that once an instruction has been dispatched to them it will never need to be killed). These assumptions break down for BOOM.

Instead of rewriting or forking the functional units, BOOM provides an abstract Functional Unit class (see Fig. 9.3) that “wraps” the lower-level functional units with the parameterized, auto-generated support code needed to make them work within BOOM. The request and response ports are abstracted, allowing Functional Units to provide a unified, interchangeable interface.

9.2.1. Pipelined Functional Units

A pipelined functional unit can accept a new Micro-Op every cycle. Each Micro-Op will take a known, fixed latency.

Speculation support is provided by auto-generating a pipeline that passes down the Micro-Op meta-data and branch mask in parallel with the Micro-Op within the expert-written functional unit. If a Micro-Op is misspeculated, its response is de-asserted as it exits the functional unit.

An example pipelined functional unit is shown in Fig. 9.3.
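As a rough illustration of the side pipeline described above, the following plain-Scala model carries each Micro-Op's branch mask alongside a fixed-latency unit and gates off the response of any Micro-Op killed while in flight. It is a behavioral sketch under stated assumptions, not BOOM's Chisel implementation: the names `UopMeta`, `Slot`, and `MetaPipeModel` are hypothetical, and the model omits clearing mask bits when a branch resolves correctly.

```scala
// Behavioral sketch only; names and field layout are hypothetical.
final case class UopMeta(id: Int, brMask: Int)
final case class Slot(uop: UopMeta, killed: Boolean)

// Models the meta-data pipeline that runs in parallel with a fixed-latency unit.
final class MetaPipeModel(latency: Int) {
  require(latency >= 1, "latency must be at least one cycle")
  private var stages: Vector[Option[Slot]] = Vector.fill(latency)(None)

  // Advance one cycle: accept `in`, return the Micro-Op leaving the unit.
  // `killMask` has a bit set for each branch found mispredicted this cycle.
  def step(in: Option[UopMeta], killMask: Int = 0): Option[UopMeta] = {
    val entering: Option[Slot] = in.map(u => Slot(u, killed = false))
    // Mark every in-flight Micro-Op that was speculated under a killed branch.
    val shifted = (entering +: stages).map { slot =>
      slot.map(s => s.copy(killed = s.killed || (s.uop.brMask & killMask) != 0))
    }
    stages = shifted.init
    // Gate off the response of a killed Micro-Op as it exits.
    shifted.last.filterNot(_.killed).map(_.uop)
  }
}
```

The low-level unit itself is never told about the kill in this model; only the wrapper's valid output is suppressed, which mirrors the idea that the expert-written unit can stay unmodified.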

9.2.2. Un-pipelined Functional Units

Un-pipelined functional units (e.g., a divider) take a variable (and unknown) number of cycles to complete a single operation. Once occupied, they de-assert their ready signal and no additional Micro-Ops may be scheduled to them.

Speculation support is provided by tracking the branch mask of the Micro-Op in the functional unit.

The only requirement of the expert-written un-pipelined functional unit is to provide a kill signal to quickly remove misspeculated Micro-Ops. [1]
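The ready/kill behavior of an un-pipelined unit can be sketched with a plain-Scala model. This is an illustrative assumption-laden sketch rather than BOOM's code: `UnpipelinedUnitModel` is a hypothetical name, and the `cycles` argument stands in for a latency that real hardware would not know in advance.

```scala
// Behavioral sketch only; names are hypothetical.
final class UnpipelinedUnitModel {
  private var occupant: Option[(Int, Int)] = None // (Micro-Op id, branch mask)
  private var cyclesLeft: Int = 0

  // Ready is de-asserted while an operation occupies the unit.
  def ready: Boolean = occupant.isEmpty

  // Returns false (and changes nothing) if the unit is busy.
  def enqueue(uopId: Int, brMask: Int, cycles: Int): Boolean = {
    require(cycles >= 1, "an operation takes at least one cycle")
    if (!ready) false
    else {
      occupant = Some((uopId, brMask))
      cyclesLeft = cycles
      true
    }
  }

  // Advance one cycle; returns the id of a Micro-Op that completes this cycle.
  def step(killMask: Int = 0): Option[Int] = occupant match {
    case Some((_, mask)) if (mask & killMask) != 0 =>
      // Misspeculated: the kill signal frees the unit immediately.
      occupant = None
      cyclesLeft = 0
      None
    case Some((id, _)) =>
      cyclesLeft -= 1
      if (cyclesLeft == 0) { occupant = None; Some(id) } else None
    case None => None
  }
}
```

Because only one Micro-Op is ever in the unit, tracking a single branch mask is enough to decide whether a kill applies.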

Functional Unit Hierarchy

Fig. 9.4 The dashed ovals are the low-level functional units written by experts, the squares are concrete classes that instantiate the low-level functional units, and the octagons are abstract classes that provide generic speculation support and interfacing with the BOOM pipeline. The floating point divide and square-root unit fits cleanly into neither the Pipelined nor the Unpipelined abstract class, and so directly inherits from the FunctionalUnit super class.

9.3. Branch Unit & Branch Speculation

The Branch Unit handles the resolution of all branch and jump instructions.

All Micro-Ops that are “inflight” in the pipeline (have an allocated ROB entry) are given a branch mask, where each bit in the branch mask corresponds to an un-executed, inflight branch that the Micro-Op is speculated under. Each branch in Decode is allocated a branch tag, and all following Micro-Ops will have the corresponding bit in the branch mask set (until the branch is resolved by the Branch Unit).

If the branches (or jumps) have been correctly speculated by the Front-end, then the Branch Unit’s only action is to broadcast the corresponding branch tag to all inflight Micro-Ops, signaling that the branch has been resolved correctly. Each Micro-Op can then clear the corresponding bit in its branch mask, and that branch tag can then be allocated to a new branch in the Decode stage.

If a branch (or jump) is misspeculated, the Branch Unit must redirect the PC to the correct target, kill the Front-end and fetch buffer, and broadcast the misspeculated branch tag so that all dependent, inflight Micro-Ops may be killed. The PC redirect signal goes out immediately, to decrease the misprediction penalty. However, the kill signal is delayed a cycle for critical path reasons.
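The two outcomes of branch resolution can be sketched as operations on the branch masks of the inflight Micro-Ops. The following plain-Scala model is illustrative only; `InflightUop` and `BranchMaskModel` are hypothetical names, and a branch tag is assumed to be the bit position within the mask.

```scala
// Behavioral sketch only; names are hypothetical.
final case class InflightUop(id: Int, brMask: Int)

object BranchMaskModel {
  // Correct prediction: every Micro-Op clears the resolved branch's bit,
  // after which the tag can be reallocated to a new branch.
  def resolve(uops: Seq[InflightUop], tag: Int): Seq[InflightUop] =
    uops.map(u => u.copy(brMask = u.brMask & ~(1 << tag)))

  // Misprediction: every Micro-Op speculated under the branch is killed.
  def kill(uops: Seq[InflightUop], tag: Int): Seq[InflightUop] =
    uops.filter(u => (u.brMask & (1 << tag)) == 0)
}
```

A Micro-Op with mask `0x3` is speculated under branches 0 and 1: resolving branch 0 correctly leaves it with mask `0x2`, while a misprediction on either branch removes it.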

The Front-end must pass down the pipeline the appropriate branch speculation meta-data, so that the correct direction can be reconciled with the prediction. Jump Register instructions are evaluated by comparing the correct target with the PC of the next instruction in the ROB (if not available, then a misprediction is assumed). Jumps are evaluated and handled in the Front-end (as their direction and target are both known once the instruction can be decoded).

BOOM (currently) only supports having one Branch Unit.

9.4. Load/Store Unit

The Load/Store Unit (LSU) handles the execution of load, store, atomic, and fence operations.

BOOM (currently) only supports having one LSU (and thus can only send one load or store per cycle to memory). [2]

See The Load/Store Unit (LSU) for more details on the LSU.

9.5. Floating Point Units

Functional Unit for FPU

Fig. 9.5 The class hierarchy of the FPU is shown. The expert-written code is contained within the hardfloat and rocket repositories. The “FPU” class instantiates the Rocket components and is itself further wrapped by the abstract Functional Unit classes (which provide the out-of-order speculation support).

The low-level floating point units used by BOOM come from the Rocket processor (https://github.com/freechipsproject/rocket-chip) and hardfloat (https://github.com/ucb-bar/berkeley-hardfloat) repositories. Fig. 9.5 shows the class hierarchy of the FPU.

To make the scheduling of the write-port trivial, all of the pipelined FP units are padded to have the same latency. [3]

9.6. Floating Point Divide and Square-root Unit

BOOM fully supports floating point divide and square-root operations using a single FDiv/Sqrt unit (or fdiv for short). BOOM accomplishes this by instantiating a double-precision unit from the hardfloat repository. The unit comes with the following features/constraints:

  • expects 65-bit recoded double-precision inputs
  • provides a 65-bit recoded double-precision output
  • can execute a divide operation and a square-root operation simultaneously
  • operations are unpipelined and take an unknown, variable latency
  • provides an unstable FIFO interface

Single-precision operations have their operands upscaled to double-precision (and then the output downscaled). [4]

Although the unit is unpipelined, it does not fit cleanly into the Pipelined/Unpipelined abstraction used by the other functional units (see Fig. 9.4). This is because the unit provides an unstable FIFO interface: although the fdiv unit may provide a ready signal on Cycle i, there is no guarantee that it will continue to be ready on Cycle i+1, even if no operations are enqueued. This proves to be a challenge, as the Issue Queue may attempt to issue an fdiv instruction but cannot be certain the fdiv unit will accept it once it reaches the fdiv unit on a later cycle.

The solution is to add extra buffering within the unit to hold instructions until they can be released directly into the unit. If the buffering of the unit fills up, back pressure can be safely applied to the Issue Queue. [5]
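The buffering idea can be sketched as a small queue whose ready signal is stable even though the underlying unit's is not. This plain-Scala model is an assumption-based illustration, not BOOM's implementation: `FDivBufferModel`, its `depth` parameter, and the `unitReady` input are hypothetical, and the source does not state how deep the real buffer is.

```scala
import scala.collection.mutable

// Behavioral sketch only; names and the depth parameter are hypothetical.
final class FDivBufferModel(depth: Int) {
  require(depth >= 1, "the buffer needs at least one entry")
  private val buf = mutable.Queue.empty[Int] // buffered Micro-Op ids

  // Stable ready: it only falls when the Issue Queue itself enqueues,
  // so the Issue Queue can rely on it across cycles.
  def ready: Boolean = buf.size < depth

  // Returns false (back pressure) when the buffer is full.
  def enqueue(uopId: Int): Boolean =
    if (ready) { buf += uopId; true } else false

  // Each cycle, release the oldest Micro-Op only if the unit is ready now.
  def step(unitReady: Boolean): Option[Int] =
    if (unitReady && buf.nonEmpty) Some(buf.dequeue()) else None
}
```

The unstable signal is sampled only at the moment of release, so a Micro-Op that arrives while the unit has become busy simply waits in the buffer instead of being dropped.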

9.7. Parameterization

BOOM provides flexibility in specifying the issue width and the mix of functional units in the execution pipeline. Listing 9.1 shows how to instantiate an execution pipeline in BOOM.

Listing 9.1 Instantiating the Execution Pipeline (in dpath.scala). Adding execution units is as simple as instantiating another ExecutionUnit module and adding it to the exe_units ArrayBuffer.
import scala.collection.mutable.ArrayBuffer

// One Execution Unit per issue port.
val exe_units = ArrayBuffer[ExecutionUnit]()

if (ISSUE_WIDTH == 2) {
   // Port 0: ALU, branch unit, and integer multiplier.
   exe_units += Module(new ALUExeUnit(is_branch_unit = true, has_mul = true))
   // Port 1: ALU, integer divider, and memory operations.
   exe_units += Module(new ALUMemExeUnit(has_div = true))
} else if (ISSUE_WIDTH == 3) {
   exe_units += Module(new ALUExeUnit(is_branch_unit = true, has_mul = true))
   exe_units += Module(new ALUExeUnit(has_div = true))
   // Memory operations get their own Execution Unit on the third port.
   exe_units += Module(new MemExeUnit())
}

Additional parameterization, such as the latency of the FP units, can be found within the Configuration settings (configs.scala).

9.8. Control/Status Register Instructions

A set of Control/Status Register (CSR) instructions allows the atomic read and write of the Control/Status Registers. These architectural registers are separate from the integer and floating-point registers, and include the cycle count, retired instruction count, status, exception PC, and exception vector registers (and many more!). Each CSR has its own required privilege levels for reading and writing, and some have their own side-effects upon reading (or writing).

BOOM (currently) does not rename any of the CSRs. Because of this, and because reading or writing a CSR can have side-effects, BOOM will only execute a CSR instruction non-speculatively. [6] This is accomplished by marking the CSR instruction as a “unique” (or “serializing”) instruction: the ROB must be empty before it may proceed to the Issue Queue (and no instruction may follow it until it has finished execution and been committed by the ROB). It is then issued by the Issue Queue, reads the appropriate operands from the Physical Register File, and is then sent to the CSRFile. [7] The CSR instruction executes in the CSRFile and then writes back data as required to the Physical Register File. The CSRFile may also emit a PC redirect and/or an exception as part of executing a CSR instruction (e.g., a syscall).

[1] This constraint could be relaxed by waiting for the un-pipelined unit to finish before de-asserting its busy signal and suppressing the valid output signal.
[2] Relaxing this constraint could be achieved by allowing multiple LSUs to talk to their own bank(s) of the data-cache, but the added complexity comes in allocating entries in the LSU before knowing the address, and thus which bank, a particular memory operation pertains to.
[3] Rocket instead handles write-port scheduling by killing and refetching the offending instruction (and all instructions behind it) if there is a write-port hazard detected. This would be far more heavy-handed to do in BOOM.
[4] It is cheaper to perform the SP-DP conversions than it is to instantiate a single-precision fdivSqrt unit.
[5] It is this ability to hold multiple inflight instructions within the unit simultaneously that breaks the “only one instruction at a time” assumption required by the UnpipelinedFunctionalUnit abstract class.
[6] There is a lot of room to play with regarding the CSRs. For example, it is probably a good idea to rename the register (dedicated for use by the supervisor) as it may see a lot of use in some kernel code and it causes no side-effects.
[7] The CSRFile is a Rocket component.

9.9. The Rocket Custom Co-Processor Interface (RoCC)

The RoCC interface accepts a RoCC command and up to two register inputs from the Control Processor’s scalar register file. The RoCC command is actually the entire RISC-V instruction fetched by the Control Processor (a “RoCC instruction”). Thus, each RoCC queue entry is at least 2*XPRLEN + 32 bits in size (additional RoCC instructions may use the longer instruction formats to encode additional behaviors).

As BOOM does not store the instruction bits in the ROB, a separate data structure (a “RoCC Shim”) holds the instructions until the RoCC instruction can be committed and the RoCC command sent to the co-processor.

The source operands will also require access to BOOM’s register file. RoCC instructions are dispatched to the Issue Window, and scheduled so that they may access the read ports of the register file once the operands are available. The operands are then written into the RoCC Shim, which stores the operands and the instruction bits until they can be sent to the co-processor. This requires significant state.

After issue to RoCC, we track a queue of in-flight RoCC instructions, since we need to translate the logical destination register identifier from the RoCC response into the previously renamed physical destination register identifier.
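The translation step can be sketched as a queue of rename mappings that is consulted when a response arrives. This plain-Scala model is a guess at the general shape rather than BOOM's code: `RoccInflight` and `RoccResponseTracker` are hypothetical names, and the model assumes that a response for a given logical register belongs to the oldest in-flight instruction targeting that register.

```scala
import scala.collection.mutable

// Behavioral sketch only; names and the matching policy are hypothetical.
final case class RoccInflight(ldst: Int, pdst: Int)

final class RoccResponseTracker {
  private val inflight = mutable.Queue.empty[RoccInflight]

  // Record the rename mapping when the command is sent to the co-processor.
  def issue(ldst: Int, pdst: Int): Unit = {
    inflight += RoccInflight(ldst, pdst)
  }

  // A response names a logical register; return the physical one to write.
  // Matches (and removes) the oldest in-flight entry with that logical register.
  def respond(ldst: Int): Option[Int] =
    inflight.dequeueFirst(_.ldst == ldst).map(_.pdst)
}
```

Keeping the mapping per in-flight instruction matters because the same logical register may have been renamed to different physical registers by successive RoCC instructions.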

Currently the RoCC interface does not support interrupts, exceptions, reusing the BOOM FPU, or direct access to the L1 data cache. This should all be straightforward to add, and will be completed as demand arises.