6.7 Verilog Pipelining



Category Verilog Tutorial

Keywords: Pipelining, Multiplier

A prominent advantage of hardware description languages is that statements execute in parallel: multiple statements can process multiple signals within the same clock cycle.

However, when data arrives serially, this parallelism alone does not help. In addition, some calculations cannot be completed within one or two clock cycles. If every serial input has to wait for the previous calculation to finish before the next one can start, efficiency is very low. Pipelining solves this problem of low efficiency when serial data needs multi-cycle calculations.

Pipelining

The basic idea of pipelining is to decompose a repetitive process into several sub-processes, each implemented by a dedicated functional unit. Successive runs of the process are staggered in time and pass through the functional stages one after another, so each sub-process works in parallel with the others.

Suppose the process of washing clothes in a laundry shop is divided into four stages: picking up clothes, washing, drying, and packaging. Each stage takes half an hour to complete, so washing a batch of clothes takes 2 hours.

Consider the worst case, where the laundry shop has only one set of equipment (one washing machine, one dryer, and so on). If a batch of clothes is delivered every half hour, but each batch has to wait the full 2 hours until the previous batch is finished, then washing 4 batches of clothes takes 8 hours.

The illustration is as follows:

Now upgrade the laundry shop with 4 sets of laundry equipment and 4 staff members, so that every batch can be started as soon as it arrives. Because the batches are staggered in time, in any half-hour period each batch is in a different stage (picking up, washing, drying, or packaging). The illustration is as follows.

It can be seen that washing 4 batches of clothes only takes 3 and a half hours, and the efficiency is significantly improved.

In fact, after 2 hours the first set of laundry equipment has finished its batch and is idle. If a 5th batch of clothes is delivered at this time, the first set of equipment can start working again. As long as clothes keep arriving, the 4 sets of equipment keep working without interruption. Apart from the first batch, which takes 2 hours to come out, one finished batch is produced every half hour.

The more batches of clothes there are, the more obvious the time saved. For N batches of clothes, the required time is (4 + N - 1) half-hour periods, that is, (N + 3) / 2 hours, compared with 2N hours without pipelining.

Of course, the upgraded laundry process also has disadvantages. The additional equipment and staff increase the investment cost, leave less free space in the shop, and keep everyone busier.
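Counting in half-hour slots makes the comparison explicit. The symbols below are notation introduced here for illustration: k is the number of stages, t the time per stage in hours, and N the number of batches.

```latex
T_{\text{serial}} = N \cdot k \cdot t, \qquad T_{\text{pipe}} = (k + N - 1) \cdot t
```

For the laundry shop, k = 4 and t = 0.5, so 4 batches take 4 * 4 * 0.5 = 8 hours one after another, and (4 + 4 - 1) * 0.5 = 3.5 hours when pipelined.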

Similar to the laundry process, the data processing path can also be regarded as a production line, and each digital processing unit on the path can be regarded as a stage, which will produce a delay.

Pipeline design divides the data path into several digital processing units (stages) and inserts registers between the units to hold the intermediate data. The stages then work in parallel without affecting each other. As a result, pipeline design improves data throughput, that is, the rate at which data is processed.

The disadvantage of pipeline design is that every stage needs additional registers to hold the intermediate calculation state, and more logic is active in parallel, which inevitably increases power consumption.

Next, a multiplier is designed both without and with pipelining, and the two designs are compared.
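A minimal sketch of the idea, assuming a made-up computation y = (a + b) * c that is split into two stages with a register in between. The module name pipe2_sketch, the parameter W, and all signal names are hypothetical and are not part of the multiplier designs in this section.

```verilog
// Hypothetical two-stage pipeline: stage 1 adds, stage 2 multiplies.
module pipe2_sketch
    #(parameter W=8)
    (
    input                clk,
    input                rstn,
    input      [W-1:0]   a,
    input      [W-1:0]   b,
    input      [W-1:0]   c,
    output reg [2*W:0]   y      // (W+1)-bit sum times W-bit c
    );

    reg [W:0]   sum_r;  // stage-1 register: a + b, one extra bit for the carry
    reg [W-1:0] c_r;    // c delayed one cycle so it stays aligned with sum_r

    always @(posedge clk or negedge rstn) begin
        if (!rstn) begin
            sum_r <= 'b0;
            c_r   <= 'b0;
            y     <= 'b0;
        end
        else begin
            // Stage 1: operates on the inputs of the current cycle
            sum_r <= a + b;
            c_r   <= c;
            // Stage 2: operates on the values registered one cycle earlier
            y     <= sum_r * c_r;
        end
    end

endmodule
```

A new set of inputs can be applied on every clock cycle, and the result for a given set appears two clock edges later. Both stages are busy in the same cycle, each with a different set of data.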

General Multiplier Design

Preface

Some people may ask, isn't it faster and simpler to directly use the multiplication sign * to complete the multiplication of two numbers?

This question reflects a common misunderstanding of hardware description languages. As mentioned earlier, Verilog describes hardware circuits. When the multiplication sign is used directly, the synthesis tool maps the expression to a default multiplier, but the structure of that multiplier is unknown to the designer.
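For reference, the direct form might look like the sketch below. The module name mult_direct_sketch is hypothetical; the parameter and port names simply follow the ones used elsewhere in this section.

```verilog
// Hypothetical module: leaves the multiplier structure to the synthesis tool.
module mult_direct_sketch
    #(parameter N=4,
      parameter M=4)
    (
    input  [N-1:0]   mult1,
    input  [M-1:0]   mult2,
    output [N+M-1:0] res
    );

    // The expression is evaluated at the width of res (N+M bits),
    // so the full product is kept.
    assign res = mult1 * mult2;

endmodule
```

Nothing in this description says how many logic levels the multiplier has or where registers could be inserted, which is why the designs in this section build the multiplier from explicit shift and add steps.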

For example, in FPGA design, you can directly call IP cores to generate a high-performance multiplier.

    reg data_rdy_low;
    reg [N-1:0] mult1_low;
    reg [M-1:0] mult2_low;
    wire [M+N-1:0] res_low;
    wire res_rdy_low;

    //Using task to stimulate periodically
    task mult_data_in;
        input [M+N-1:0] mult1_task, mult2_task;
        begin
            wait(!test.u_mult_low.res_rdy); //not output state
            @(negedge clk);
            data_rdy_low = 1'b1;
            mult1_low = mult1_task;
            mult2_low = mult2_task;
            @(negedge clk);
            data_rdy_low = 1'b0;
            wait(test.u_mult_low.res_rdy); //test the output state
        end
    endtask

    //driver
    initial begin
        #55;
        mult_data_in(25, 5);
        mult_data_in(16, 10);
        mult_data_in(10, 4);
        mult_data_in(15, 7);
        mult_data_in(215, 9);
    end

    mult_low #(.N(N), .M(M))
    u_mult_low (
        .clk(clk),
        .rstn(rstn),
        .data_rdy(data_rdy_low),
        .mult1(mult1_low),
        .mult2(mult2_low),
        .res_rdy(res_rdy_low),
        .res(res_low));

    //simulation finish
    initial begin
        forever begin
            #100;
            if ($time >= 10000) $finish;
        end
    end

endmodule // test

The simulation results are as follows.

As shown in the figure, each pair of input data produces the correct multiplication result after a delay of 4 cycles. Including the time spent between inputs, calculating 4 multiplications takes about 20 clock cycles.


Pipeline Multiplier Design

The design below stores the intermediate state of each multiplication step in registers so that the steps can be pipelined.

The module for a single shift-and-accumulate step is as follows (mult_cell.v):

Example

module mult_cell
    #(parameter N=4,
    parameter M=4)
    (
    input clk,
    input rstn,
    input en,
    input [M+N-1:0] mult1, //Multiplicand
    input [M-1:0] mult2, //Multiplier
    input [M+N-1:0] mult1_acci, //Previous accumulation result

    output reg [M+N-1:0] mult1_o, //Multiplicand shift and save value
    output reg [M-1:0] mult2_shift, //Multiplier shift and save value
    output reg [N+M-1:0] mult1_acco, //Current accumulation result
    output reg rdy
    );

    always @(posedge clk or negedge rstn) begin
        if (!rstn) begin
            rdy <= 'b0;
            mult1_o <= 'b0;
            mult1_acco <= 'b0;
            mult2_shift <= 'b0;
        end
        else if (en) begin
            rdy <= 1'b1;
            mult2_shift <= mult2 >> 1;
            mult1_o <= mult1 << 1;
            if (mult2[0]) begin
                //If the corresponding bit of the multiplier is 1, accumulate
                mult1_acco <= mult1_acci + mult1;
            end
            else begin
                //If the corresponding bit of the multiplier is 0, keep it
                mult1_acco <= mult1_acci;
            end
        end
        else begin
            rdy <= 'b0;
            mult1_o <= 'b0;
            mult1_acco <= 'b0;
            mult2_shift <= 'b0;
        end
    end

endmodule

Top-level instantiation

Chaining several instances of this cell performs the successive accumulations. The code file is as follows (mult_man.v):

Example

```verilog
module mult_man
    #(parameter N=4,
    parameter M=4)
    (
    input clk,
    input rstn,
    input data_rdy,
    input [N-1:0] mult1,
    input [M-1:0] mult2,

    output res_rdy,
    output [N+M-1:0] res
    );

    wire [N+M-1:0] mult1_t
```

Simulation Results

The simulation results for the first few dozen clock cycles are as follows.

As shown in the figure, the simulation results indicate that the signal error_flag remains 0, which means the multiplication design is correct.
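A check of this kind might be sketched as follows. This is not the tutorial's testbench: the module name mult_check_sketch, the parameter LAT (the assumed pipeline latency in clock cycles), and the ref_pipe delay line are all assumptions made for illustration. The idea is to compute a reference product with the `*` operator, delay it by the pipeline latency, and compare it with the design output whenever res_rdy is high.

```verilog
// Hypothetical checker: error_flag goes high when a valid result
// differs from the reference product of the matching inputs.
module mult_check_sketch
    #(parameter N=4,
      parameter M=4,
      parameter LAT=4)              // assumed latency of the design under test
    (
    input            clk,
    input            rstn,
    input  [N-1:0]   mult1,         // same inputs that feed the multiplier
    input  [M-1:0]   mult2,
    input            res_rdy,       // outputs of the multiplier under test
    input  [N+M-1:0] res,
    output           error_flag
    );

    // Delay line that keeps the reference product aligned with res.
    reg [N+M-1:0] ref_pipe [LAT-1:0];
    integer k;

    always @(posedge clk or negedge rstn) begin
        if (!rstn) begin
            for (k = 0; k < LAT; k = k + 1)
                ref_pipe[k] <= 'b0;
        end
        else begin
            ref_pipe[0] <= mult1 * mult2;
            for (k = 1; k < LAT; k = k + 1)
                ref_pipe[k] <= ref_pipe[k-1];
        end
    end

    // Only compare while the design flags its output as valid.
    assign error_flag = res_rdy && (res != ref_pipe[LAT-1]);

endmodule
```

With LAT set to the number of registered stages in the multiplier, the reference value and the design output refer to the same input pair in every cycle.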

Input data is fed in continuously, one pair per clock cycle. After an initial latency of 4 clock cycles, one multiplication result is output on every clock cycle, which is exactly the pipelined behavior.
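The chaining that yields this behavior can be sketched as follows. This is a minimal illustration, not the tutorial's mult_man.v: the module name mult_chain_sketch and the internal names mult2_t, acc_t, and rdy_t are hypothetical, and the sketch assumes the mult_cell module shown earlier, with one cell per multiplier bit.

```verilog
// Hypothetical top level: chains M copies of mult_cell so that each
// pipeline stage consumes one bit of the multiplier.
module mult_chain_sketch
    #(parameter N=4,
      parameter M=4)
    (
    input            clk,
    input            rstn,
    input            data_rdy,
    input  [N-1:0]   mult1,
    input  [M-1:0]   mult2,
    output           res_rdy,
    output [N+M-1:0] res
    );

    // Outputs of stage k are stored at index k (0 .. M-1).
    wire [N+M-1:0] mult1_t [M-1:0];  // shifted multiplicand
    wire [M-1:0]   mult2_t [M-1:0];  // shifted multiplier
    wire [N+M-1:0] acc_t   [M-1:0];  // running sum
    wire [M-1:0]   rdy_t;            // valid flag of each stage

    // First stage: zero-extend the multiplicand, start the sum at 0.
    mult_cell #(.N(N), .M(M)) u_mult_step0 (
        .clk(clk), .rstn(rstn), .en(data_rdy),
        .mult1({{M{1'b0}}, mult1}),
        .mult2(mult2),
        .mult1_acci({(N+M){1'b0}}),
        .mult1_o(mult1_t[0]),
        .mult2_shift(mult2_t[0]),
        .mult1_acco(acc_t[0]),
        .rdy(rdy_t[0]));

    // Remaining stages: each one is fed by the registers of the previous one.
    genvar i;
    generate
        for (i = 1; i < M; i = i + 1) begin : g_step
            mult_cell #(.N(N), .M(M)) u_mult_step (
                .clk(clk), .rstn(rstn), .en(rdy_t[i-1]),
                .mult1(mult1_t[i-1]),
                .mult2(mult2_t[i-1]),
                .mult1_acci(acc_t[i-1]),
                .mult1_o(mult1_t[i]),
                .mult2_shift(mult2_t[i]),
                .mult1_acco(acc_t[i]),
                .rdy(rdy_t[i]));
        end
    endgenerate

    assign res_rdy = rdy_t[M-1];
    assign res     = acc_t[M-1];

endmodule
```

Because every cell registers its outputs, a result leaves the last cell M clock cycles after its inputs entered the first one (4 cycles for M=4), while the earlier cells are already working on later input pairs.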


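Compared with the multiplier that does not use pipelining, throughput improves greatly: once the pipeline is full, a result is produced on every clock cycle, whereas the non-pipelined design needed about 20 clock cycles for 4 multiplications.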

However, the pipelined multiplier also uses about 4 times the register resources compared to the previous non-pipelined one.

Therefore, whether to use a pipeline design in a digital design needs to be weighed from both resource and efficiency perspectives.
