Easy Tutorial
6.3 Verilog RTL Level Low Power Design (Part 1)

Category Advanced Verilog Tutorial

The table below shows how much power consumption can be reduced at each level of a digital design. Below the RTL level, the achievable reduction is very limited.

| Design Level | Power Reduction |
| --- | --- |
| System level | 50% ~ 90% |
| RTL level | 20% ~ 50% |
| Gate level | 10% ~ 15% |
| Transistor level | 5% ~ 10% |
| Layout level | < 5% |

As an engineer writing Verilog, one can also contribute to power reduction at the system level, but the focus should be on reducing power consumption at the RTL level.

Common methods for reducing power consumption at the RTL level are introduced in two parts.


## Parallelism and Pipelining

A functional module can be implemented in parallel form or in pipelined form. Both approaches trade resources for speed, and in certain situations either one can be used to reduce power consumption.

### Parallel Processing

Parallel processing executes several operations at the same time, so more work is done per clock cycle. As long as the throughput requirement is still met, a parallel design can therefore run at a lower clock frequency, which reduces power consumption.

For example, the sum of two products (4 input operands in total) can be computed with 1 multiplier running at high speed or with 2 multipliers working in parallel at half the clock frequency. The two descriptions are as follows:

Example

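```verilog
//===========================================
// 1 multiplier, high speed
module mul1_hs
    (
        input        clk ,           // 200MHz
        input        rstn ,
        input        en ,
        input  [3:0] mul1 ,          // data in
        input  [3:0] mul2 ,          // data in
        output       dout_en ,
        output [8:0] dout
     );

    // flag toggles while en is high: 0 = first product, 1 = second product
    reg flag ;
    reg en_r ;
    always @(posedge clk or negedge rstn) begin
        if (!rstn) begin
            flag   <= 1'b0 ;
            en_r   <= 1'b0 ;
        end
        else if (en) begin
            flag   <= ~flag ;
            en_r   <= 1'b1 ;
        end
        else begin
            flag   <= 1'b0 ;
            en_r   <= 1'b0 ;
        end
    end

    // single shared multiplier
    wire [7:0]           result = mul1 * mul2 ;

    // capture the two products on alternate cycles
    reg [7:0]            res1_r, res2_r ;
    always @(posedge clk or negedge rstn) begin
        if (!rstn) begin
            res1_r         <= 'b0 ;
            res2_r         <= 'b0 ;
        end
        else if (en & !flag) begin
            res1_r         <= result ;
        end
        else if (en & flag) begin
            res2_r         <= result ;
        end
    end

    // the sum is valid once both products have been captured
    assign dout_en = en_r & !flag ;
    assign dout = res1_r + res2_r ;

endmodule

//===========================================
// 2 multipliers, low speed
module mul2_ls
    (
        input        clk ,           // 100MHz
        input        rstn ,
        input        en ,
        input  [3:0] mul1 ,          // data in
        input  [3:0] mul2 ,          // data in
        input  [3:0] mul3 ,          // data in
        input  [3:0] mul4 ,          // data in
        output       dout_en,
        output [8:0] dout
     );

    // two multipliers working in parallel
    wire [7:0]           result1 = mul1 * mul2 ;
    wire [7:0]           result2 = mul3 * mul4 ;

    // en delay
    reg                  en_r ;
    always @(posedge clk or negedge rstn) begin
        if (!rstn) begin
            en_r           <= 1'b0 ;
        end
        else begin
            en_r           <= en ;
        end
    end

    // NOTE: the source listing is cut off after the en delay above.
    // The rest of this module is a reconstruction by analogy with mul1_hs.
    reg [7:0]            res1_r, res2_r ;
    always @(posedge clk or negedge rstn) begin
        if (!rstn) begin
            res1_r         <= 'b0 ;
            res2_r         <= 'b0 ;
        end
        else if (en) begin
            res1_r         <= result1 ;
            res2_r         <= result2 ;
        end
    end

    assign dout_en = en_r ;
    assign dout = res1_r + res2_r ;

endmodule
```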
 
The simulation results are as follows.

As the figure shows, both implementations produce the same output. The parallel version halves the operating frequency, which reduces power consumption, at the cost of a larger design area.

### Pipeline Processing

The [Verilog Tutorial](verilog-tutorial.html) explains that a continuously operating N-stage pipeline improves throughput by roughly a factor of N. As with a parallel design, a pipelined design can therefore run at a suitably lower clock frequency to reduce power consumption.

From another perspective, pipelining divides a long combinational path into N stages, so each stage's path is 1/N of the original length. If the clock frequency stays the same, each stage only has to charge and discharge a capacitance of C/N per cycle instead of the original C. For the same frequency requirement, a lower supply voltage can therefore be used to drive the system, which reduces power consumption.
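As a minimal sketch of cutting a combinational path into stages, the sum of two products from the earlier example could be split into a multiply stage and an add stage. The module name `mac_pipe2`, its ports, and the choice of cut point are assumptions made for illustration, not the tutorial's own design.

```verilog
// Hypothetical two-stage pipeline: stage 1 multiplies, stage 2 adds.
module mac_pipe2
    (
        input            clk ,
        input            rstn ,
        input            en ,
        input  [3:0]     a, b, c, d ,
        output reg       dout_en ,
        output reg [8:0] dout
    );

    // Stage 1: register the two products
    reg       en_s1 ;
    reg [7:0] p1_s1, p2_s1 ;
    always @(posedge clk or negedge rstn) begin
        if (!rstn) begin
            en_s1 <= 1'b0 ;
            p1_s1 <= 8'b0 ;
            p2_s1 <= 8'b0 ;
        end
        else begin
            en_s1 <= en ;
            if (en) begin
                p1_s1 <= a * b ;
                p2_s1 <= c * d ;
            end
        end
    end

    // Stage 2: register the sum of the stage-1 products
    always @(posedge clk or negedge rstn) begin
        if (!rstn) begin
            dout_en <= 1'b0 ;
            dout    <= 9'b0 ;
        end
        else begin
            dout_en <= en_s1 ;
            if (en_s1) begin
                dout <= p1_s1 + p2_s1 ;
            end
        end
    end

endmodule
```

Each stage now contains only part of the original multiply-then-add path, so the longest register-to-register delay is shorter. The result appears two clock cycles after `en`, and a new set of operands can still be accepted every cycle.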

Suppose the critical path of a design is a 32-bit by 32-bit multiplier whose overall capacitance is C and whose supply voltage is V.

Without pipelining, reaching the required operating frequency takes a supply voltage of V.

With a two-stage pipeline, the path is divided into two parts, each with a capacitance of C/2. To reach the original operating frequency, the supply voltage can be lowered to βV (β < 1), and the overall system power consumption drops to β² times its original value.
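The β² factor can be checked against the usual first-order model of CMOS dynamic power. That model is an assumption brought in here for illustration rather than something the text states. The symbols α (switching activity factor) and f (clock frequency) are introduced for this sketch, and the extra pipeline registers are ignored.

$$
\begin{aligned}
P_{\text{orig}} &\approx \alpha\, C\, V^{2} f \\
P_{\text{pipe}} &\approx \alpha \left(\tfrac{C}{2} + \tfrac{C}{2}\right) (\beta V)^{2} f
                 = \beta^{2}\, \alpha\, C\, V^{2} f
                 = \beta^{2} P_{\text{orig}}
\end{aligned}
$$

As a purely illustrative number, β = 0.8 would give β² = 0.64, roughly a 36% reduction in dynamic power at the same clock frequency.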

For the pipeline design method itself, see chapter [6.7 Verilog Pipeline Design](verilog-pipeline-design.html) of the [Verilog Tutorial](verilog-tutorial.html).

---

## Resource Sharing and State Encoding

**Resource Sharing**

When the same computational logic is used in several places in a design, resource sharing avoids instantiating that logic repeatedly and so reduces resource usage.

For example, a block of comparison logic written without resource sharing looks like this:

## Example

```verilog
always @(*) begin
    case (mode)
        3'b000: result = 1'b1;
        3'b001: result = 1'b0;
        3'b010: result = value1 == value2;
        3'b011: result = value1 != value2;
        3'b100: result = value1 > value2;
        3'b101: result = value1 < value2;
        3'b110: result = value1 >= value2;
        3'b111: result = value1 <= value2;
    endcase
end
```

With resource sharing, the code above can be rewritten as follows:

```verilog
// only two comparators; every other relation is derived from them
wire equal_con = value1 == value2;
wire great_con = value1 > value2;
always @(*) begin
    case (mode)
        3'b000: result = 1'b1;
        3'b001: result = 1'b0;
        3'b010: result = equal_con;
        3'b011: result = !equal_con;
        3'b100: result = great_con;
        3'b101: result = !great_con && !equal_con;
        3'b110: result = great_con || equal_con;
        3'b111: result = !great_con;
    endcase
end
```

If the synthesis tool does not optimize well, the first version may synthesize to 6 comparators. The resource-sharing version needs only 2 comparators for the same logical function, so it reduces power consumption to some extent.
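One way to gain confidence that a shared-comparator rewrite preserves behavior is to simulate both descriptions side by side over every input. The sketch below does this under stated assumptions: the wrapper `cmp_both`, the testbench `tb_cmp_both`, and the 4-bit unsigned operand width are all hypothetical, since the tutorial does not give the widths of `value1` and `value2`.

```verilog
// Hypothetical wrapper computing the comparison both ways.
module cmp_both
    (
        input  [2:0] mode ,
        input  [3:0] value1 ,          // width is an assumption
        input  [3:0] value2 ,
        output reg   result_direct ,   // one operator per branch
        output reg   result_shared     // derived from two comparators
    );

    always @(*) begin
        case (mode)
            3'b000: result_direct = 1'b1;
            3'b001: result_direct = 1'b0;
            3'b010: result_direct = value1 == value2;
            3'b011: result_direct = value1 != value2;
            3'b100: result_direct = value1 > value2;
            3'b101: result_direct = value1 < value2;
            3'b110: result_direct = value1 >= value2;
            3'b111: result_direct = value1 <= value2;
        endcase
    end

    wire equal_con = value1 == value2;
    wire great_con = value1 > value2;
    always @(*) begin
        case (mode)
            3'b000: result_shared = 1'b1;
            3'b001: result_shared = 1'b0;
            3'b010: result_shared = equal_con;
            3'b011: result_shared = !equal_con;
            3'b100: result_shared = great_con;
            3'b101: result_shared = !great_con && !equal_con;
            3'b110: result_shared = great_con || equal_con;
            3'b111: result_shared = !great_con;
        endcase
    end

endmodule

// Exhaustive check over all modes and operand values.
module tb_cmp_both;
    reg  [2:0] mode ;
    reg  [3:0] value1, value2 ;
    wire       result_direct, result_shared ;
    integer    m, i, j, errors ;

    cmp_both dut (.mode(mode), .value1(value1), .value2(value2),
                  .result_direct(result_direct), .result_shared(result_shared));

    initial begin
        errors = 0;
        for (m = 0; m < 8; m = m + 1)
            for (i = 0; i < 16; i = i + 1)
                for (j = 0; j < 16; j = j + 1) begin
                    mode = m; value1 = i; value2 = j;
                    #1;   // let the combinational logic settle
                    if (result_direct !== result_shared) errors = errors + 1;
                end
        if (errors == 0) $display("PASS: both descriptions agree");
        else             $display("FAIL: %0d mismatches", errors);
        $finish;
    end
endmodule
```

The `>=` branch is the easiest one to get wrong: it is the OR of "greater" and "equal", because the two conditions can never be true together.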

**State Encoding**

Signals that change frequently have a high toggle rate and therefore consume more power. State encoding can be used to reduce switching activity and hence power consumption.

For example, if a high-speed counter uses Gray code instead of binary encoding, only 1 bit flips on each count, so the toggle rate and the power consumption both drop.
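A Gray-coded counter output could be sketched as follows. The module name `gray_cnt4`, the 4-bit width, and the approach of keeping an internal binary count and converting it are assumptions for illustration, and other implementations exist.

```verilog
// Hypothetical 4-bit counter with a Gray-coded output.
module gray_cnt4
    (
        input            clk ,
        input            rstn ,
        input            en ,
        output reg [3:0] gray
    );

    reg  [3:0] bin ;                                  // internal binary count
    wire [3:0] bin_nxt  = bin + 1'b1 ;
    wire [3:0] gray_nxt = bin_nxt ^ (bin_nxt >> 1) ;  // binary-to-Gray conversion

    always @(posedge clk or negedge rstn) begin
        if (!rstn) begin
            bin  <= 4'b0 ;
            gray <= 4'b0 ;
        end
        else if (en) begin
            bin  <= bin_nxt ;
            gray <= gray_nxt ;
        end
    end

endmodule
```

In this sketch the internal `bin` register still toggles like an ordinary binary counter, so the saving applies to `gray` and to whatever bus or downstream logic it drives.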

Similarly, when designing a state machine, if the encodings of the states before and after each transition differ by only 1 bit, the toggle rate is also reduced.
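For a state machine, the same idea amounts to choosing state codes so that each transition changes a single bit. The sketch below assumes a hypothetical four-state machine (`fsm_gray`, with made-up states and inputs `start` and `done`) whose states advance in a ring.

```verilog
// Hypothetical 4-state FSM with Gray-ordered state codes.
module fsm_gray
    (
        input            clk ,
        input            rstn ,
        input            start ,
        input            done ,
        output reg [1:0] state
    );

    // Consecutive states differ in exactly one bit: 00 -> 01 -> 11 -> 10 -> 00
    localparam IDLE = 2'b00 ;
    localparam LOAD = 2'b01 ;
    localparam RUN  = 2'b11 ;
    localparam SAVE = 2'b10 ;

    always @(posedge clk or negedge rstn) begin
        if (!rstn) begin
            state <= IDLE ;
        end
        else begin
            case (state)
                IDLE:    if (start) state <= LOAD ;
                LOAD:    state <= RUN ;
                RUN:     if (done)  state <= SAVE ;
                SAVE:    state <= IDLE ;
                default: state <= IDLE ;
            endcase
        end
    end

endmodule
```

With plain binary codes in the same order (00, 01, 10, 11), the second and fourth transitions would each flip two bits. A one-bit-per-transition assignment is only possible when the state graph allows it, as this simple ring does.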

## Operand Isolation

The principle of operand isolation: if the output of a datapath is not used during some period, hold its inputs at a fixed value during that period so that nothing in the datapath toggles, which reduces power consumption.
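The principle can be sketched in RTL by gating the operands of a multiplier with a "result is needed" qualifier. Everything named here is hypothetical: the module `mul_iso`, the 8-bit operands, and the single `use_mul` signal, which stands in for whatever select condition the real design uses.

```verilog
// Hypothetical operand-isolated multiplier.
module mul_iso
    (
        input             clk ,
        input             rstn ,
        input             use_mul ,   // assumed qualifier: 1 = product is needed
        input      [7:0]  a ,
        input      [7:0]  b ,
        output reg [15:0] prod
    );

    // Isolation gates: force both operands to zero when the product is unused,
    // so changes on a and b do not propagate into the multiplier.
    wire [7:0]  a_iso  = a & {8{use_mul}} ;
    wire [7:0]  b_iso  = b & {8{use_mul}} ;
    wire [15:0] prod_c = a_iso * b_iso ;

    // The register only captures the product when it is needed.
    always @(posedge clk or negedge rstn) begin
        if (!rstn)        prod <= 16'b0 ;
        else if (use_mul) prod <= prod_c ;
    end

endmodule
```

The isolation gates add some area and delay on the operand path, so the technique pays off mainly for large blocks, such as multipliers, that are idle for a significant share of the time.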

A multiplier circuit is shown below.

When sel0 = 0 or sel1 = 1, the output result of the Multiplier cannot be passed through the two Mux to the input end of the register. That is, the register cannot save the
