CN113805840B - Fast accumulator - Google Patents
- Publication number
- CN113805840B (application CN202111365222.XA)
- Authority
- CN
- China
- Prior art keywords
- carry
- addition
- bit
- adder
- group
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/50—Adding; Subtracting
- G06F7/501—Half or full adders, i.e. basic adder cells for one denomination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
Abstract
An embodiment of the present application provides a fast accumulator comprising a plurality of groups of sequentially connected addition modules. Each group of addition modules comprises at least one addition unit; each addition unit comprises an adder and a register, and the adder is connected in a loop with the input end and the output end of the register. Except for the first group and the last group, each group of addition modules comprises a first carry, a second carry, a first output bit and a second output bit; the second carry is connected, through a register, to the second output bit of the previous group of addition modules. Inserting registers into the carry chain between groups significantly shortens the critical path and raises the upper limit of the system frequency.
Description
Technical Field
The present application relates to the field of digital signal processing technology, and more particularly, to a fast accumulator.
Background
To improve the throughput of a digital signal processing system, the critical path of the system is generally shortened by inserting pipeline stages, which raises the operating frequency and hence the throughput. The critical path is the logic path with the longest delay among all paths in the circuit that do not pass through a register. Pipelining, that is, inserting registers on a feed-forward cut-set of the circuit, is an effective way to shorten the critical path of a loop-free circuit. (A cut-set is a set of edges whose removal disconnects the graph; on a feed-forward cut-set every edge points forward, i.e., from the input toward the output.)
However, many digital signal processing systems contain the accumulator shown in FIG. 1. The iterative operation in the accumulator introduces a loop, and inserting a register directly into this loop may destroy the correctness of the computation. The path delay of the accumulator is therefore difficult to reduce and easily becomes the bottleneck when optimizing the critical path of the system.
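The effect of placing a register directly in the feedback loop can be illustrated with a small behavioral sketch. It assumes a plain software model (the function names are illustrative and do not come from the application): with one extra delay in the loop the circuit computes y[n] = x[n] + y[n-2], so even- and odd-indexed inputs are accumulated separately.

```python
def accumulate(xs):
    """Reference accumulator: y[n] = x[n] + y[n-1]."""
    acc = 0
    for x in xs:
        acc += x
    return acc


def accumulate_with_extra_loop_register(xs):
    """Same loop with one more register in the feedback path: y[n] = x[n] + y[n-2]."""
    reg_a, reg_b = 0, 0              # two registers now sit in the loop
    for x in xs:
        # The adder sees the value written two cycles ago (reg_b).
        reg_a, reg_b = x + reg_b, reg_a
    return reg_a                     # value most recently written by the adder


xs = [1, 2, 3, 4, 5]
print(accumulate(xs))                            # 15
print(accumulate_with_extra_loop_register(xs))   # 9, i.e. 1 + 3 + 5 only
```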
Disclosure of Invention
In order to meet the critical path delay requirement of the system and reduce the path delay of the accumulator, the present application provides a fast accumulator through the following embodiments.
A first aspect of the present application provides a fast accumulator, comprising: a plurality of groups of sequentially connected addition modules, wherein each group of addition modules comprises at least one addition unit, each addition unit comprises an adder and a register, and the adder is connected in a loop with the input end and the output end of the register;
except for the first group and the last group of the addition modules, each group of the addition modules comprises a first carry, a second carry, a first output bit and a second output bit; the first carry receives the corresponding bits of the data to be accumulated, and the second carry is connected, through a register, to the second output bit of the previous group of addition modules;
the first group of the addition modules comprises a first carry, a first output bit and a second output bit, and the last group of the addition modules comprises a first carry, a second carry and a first output bit;
for the first group of the addition modules, the first output bit is connected directly to an output port of the fast accumulator; for the other groups of the addition modules, the first output bit and the second carry are both connected to the input end of a merge adder, and the output end of the merge adder is connected to the output port of the fast accumulator.
Optionally, if each group of the addition modules includes two or more addition units, the adders in all the addition units are connected in sequence.
Optionally, if each group of the addition modules includes two or more addition units, the merge adder further includes a zero-insertion carry.
A second aspect of the present application provides a fast accumulator, comprising: a plurality of groups of sequentially connected addition modules, wherein each group of addition modules comprises at least one addition unit, each addition unit comprises a first adder, a second adder and a register connected in sequence, and the output end of the register is connected to the input end of the second adder;
except for the first group and the last group of the addition modules, each group of the addition modules comprises a first carry, a second carry, a third carry, a first output bit, a second output bit and a third output bit;
the first carry is the input end of the first adder in each addition unit and receives the corresponding bits of the data to be accumulated; the second carry is the input end of the first adder in the lowest-order addition unit and is connected, through a register, to the second output bit of the previous group of addition modules; the third carry is the input end of the second adder in the lowest-order addition unit and is connected directly to the first output bit of the previous group of addition modules; the first output bit is the output end of the first adder in the highest-order addition unit, the second output bit is the output end of the second adder in the highest-order addition unit, and the third output bit is the output end of the register in each addition unit;
the first group of the addition modules comprises a first carry, a first output bit, a second output bit and a third output bit, and the last group of the addition modules comprises a first carry, a second carry, a third carry and a third output bit;
for the first group of the addition modules, the third output bit is connected directly to an output port of the fast accumulator; for the other groups of the addition modules, the third output bit and the second carry are both connected to the input end of a merge adder, and the output end of the merge adder is connected to the output port of the fast accumulator.
Optionally, if each group of the addition modules includes two or more addition units, the first adders in all the addition units are connected in sequence, and the second adders are connected in sequence.
Optionally, if each group of the addition modules includes two or more addition units, the merge adder further includes a zero-insertion carry.
The embodiments of the present application provide a fast accumulator comprising a plurality of groups of sequentially connected addition modules. Each group of addition modules comprises at least one addition unit; each addition unit comprises an adder and a register, and the adder is connected in a loop with the input end and the output end of the register. Except for the first group and the last group, each group of addition modules comprises a first carry, a second carry, a first output bit and a second output bit; the second carry is connected, through a register, to the second output bit of the previous group of addition modules. Inserting registers into the carry chain between groups significantly shortens the critical path and raises the upper limit of the system frequency.
Drawings
FIG. 1 is a schematic structural diagram of a conventional accumulator, which introduces a loop;
FIG. 2 is a schematic diagram of an accumulator and its preceding and following input/output portions according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a fast accumulator disclosed in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of another fast accumulator disclosed in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of yet another fast accumulator disclosed in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of still another fast accumulator disclosed in an embodiment of the present application;
FIG. 7 is a diagram illustrating an example of three-stage pipelining of an 8-bit adder in the fast accumulator disclosed in an embodiment of the present application;
FIG. 8 is a schematic diagram of a low-bit-width adder employed in the fast accumulator disclosed in an embodiment of the present application;
FIG. 9 is a schematic diagram of a set of input feature maps with C channels being convolved with convolution kernels with C channels according to an embodiment of the present application.
Detailed Description
To facilitate understanding of the technical solution of the present application, some concepts related to the present application are described below.
In the digital signal processing system, the operations of the accumulator and its front and back input/output parts can be described as follows, referring to fig. 2:
The outputs from the rest of the system are first fed into an adder (Σ), and the addition result falls into one of two categories: a one-way temporary sum or a two-way temporary sum. Referring to FIG. 2(a), the one-way temporary sum is TS (temporary sum); referring to FIG. 2(b), the two-way temporary sums are TS_1 and TS_2. When there are more than two temporary sums, they can be compressed into two by, for example, a 4to2 compressor array, which reduces to the case shown in FIG. 2(b). Existing 4to2 compressors can be used in the present application.
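As a behavioral illustration of compressing four operands into two temporary sums, the following sketch chains two 3:2 carry-save steps on whole words. This is only one common way to obtain a 4-to-2 reduction and is an assumption here, not the specific compressor cell used in the application; the function names are illustrative.

```python
def csa(x, y, z):
    """3:2 carry-save step: returns (sum_word, carry_word) with x + y + z == sum_word + carry_word."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def compress_4to2(a, b, c, d):
    """Reduce four operands to two temporary sums without propagating carries along the word."""
    s, cy = csa(a, b, c)
    return csa(s, cy, d)


ts1, ts2 = compress_4to2(13, 7, 250, 99)
print(ts1 + ts2)  # 369
```

No carry ripples along the word in either step, which is why such a reduction can sit in front of the accumulator without lengthening its loop.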
Assume that the accumulated output of the accumulator does not exceed N bits and that there are T groups of inputs in total. Accumulating the T groups of inputs over T consecutive cycles yields two partial sums: the first is the partial sum held by the sum-bit registers (i.e., the registers holding the value on the sum line of each bit adder), S = s_{N-1}, ..., s_2, s_1, s_0; the second is the partial sum held by the carry registers (i.e., the registers holding the value on the carry chain of each bit adder), C. The two partial sums are then added by a merge adder to obtain the final output. How the two partial sums are generated in the accumulator, and how they are added, depends on the requirements of the system.
For a one-way temporary sum input TS, assume that its N bits are denoted I_{N-1}, ..., I_2, I_1, I_0. To meet the critical path delay requirement of the system and reduce the path delay of the accumulator, a first embodiment of the present application provides a fast accumulator.
Referring to fig. 3 and 4, a fast accumulator disclosed in a first embodiment of the present application includes: the system comprises a plurality of groups of adding modules which are connected in sequence, wherein each group of adding modules at least comprises an adding unit, each adding unit comprises an adder and a register, and the adders are circularly connected with the input end and the output end of the register.
Except for the first group and the last group of the addition modules, each group of the addition modules comprises a first carry I, a second carry II, a first output bit I' and a second output bit II'; the first carry I receives the corresponding bits of the data to be accumulated, and the second carry II is connected, through a register, to the second output bit II' of the previous group of addition modules.
The first group of the addition modules comprises a first carry I, a first output bit I' and a second output bit II', and the last group of the addition modules comprises a first carry I, a second carry II and a first output bit I'.
For the first group of the addition modules, the first output bit I' is connected directly to an output port of the fast accumulator; for the other groups of the addition modules, the first output bit I' and the second carry II are both connected to the input end of the merge adder, and the output end of the merge adder is connected to the output port of the fast accumulator.
In FIGS. 3 and 4, the labels for the carries and output bits are marked on only one of the addition modules, but they apply to every addition module. In the figures, "+" denotes an adder and "D" denotes a register.
If each group of the addition modules comprises two or more addition units, the adders in all the addition units are connected in sequence.
If each group of the addition modules comprises two or more addition units, the merge adder also comprises a zero insertion carry.
The number of add units included in each set of add modules is determined by the critical path requirements of the system.
In one implementation, if the critical path requirement of the system does not exceed 1 full-adder delay, each group of addition modules contains only one addition unit; that is, a register is inserted into the carry chain at every bit of the N-bit adder in the accumulator, as shown in FIG. 3. After T rounds of accumulation, the carry registers hold a partial sum C consisting of N-1 carry bits. C is added to the high N-1 bits of the partial sum S by the merge adder (which is usually specially designed to meet the overall critical path requirement of the accumulator, as described in detail below), and the result is concatenated with s'_0, the direct output of the lowest sum bit s_0, to obtain the accumulator output S' = s'_{N-1}, ..., s'_2, s'_1, s'_0.
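The one-unit-per-group case can be checked with a word-level behavioral sketch. It assumes N = 16 and models every bit position as a full adder fed by s_i, I_i and the registered carry c_i; the names `step`, `merge` and `fast_accumulate` are illustrative and not taken from the application.

```python
import random

N = 16
MASK = (1 << N) - 1


def step(S, C, x):
    """One clock cycle: each bit is a full adder on s_i, I_i and the registered carry c_i."""
    new_S = S ^ C ^ x
    new_C = (((S & C) | (S & x) | (C & x)) << 1) & MASK   # carry out of bit N-1 is dropped
    return new_S, new_C


def merge(S, C):
    """Add C to the high N-1 bits of S; s_0 passes straight to the output."""
    high = ((S >> 1) + (C >> 1)) & (MASK >> 1)
    return (high << 1) | (S & 1)


def fast_accumulate(inputs):
    S = C = 0
    for x in inputs:
        S, C = step(S, C, x & MASK)
    return merge(S, C)


random.seed(0)
data = [random.randrange(1 << 8) for _ in range(100)]
print(fast_accumulate(data) == (sum(data) & MASK))  # True
```

Inside the loop no carry travels further than one bit per cycle; the only carry propagation happens once, in `merge`, which lies outside the loop.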
In another implementation, if the critical path requirement of the system does not exceed 2 full-adder delays, each group of addition modules contains two addition units; that is, the N-bit adder in the accumulator is divided into groups of two bits, and a register is inserted into the carry chain between adjacent groups, as shown in FIG. 4. After T rounds of accumulation, only one carry is stored for every two bits of addition, so zeros must be inserted before C and S are added, giving C = c_{N-1}, ..., 0, c_4, 0, c_2. This C is added to the high N-2 bits of S by the merge adder, and the result is concatenated with s'_1, s'_0, the direct outputs of s_1, s_0, to obtain the final accumulator output S'.
If the system allows a longer critical path, the same approach extends by analogy: more bit adders are placed in each group, and a register is inserted on the carry line between adjacent groups. For bit positions without a register, a 0 is inserted into the corresponding bit of C before the partial sums are added.
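The grouping scheme can be sketched behaviorally for an arbitrary group size g. The sketch assumes N = 16 and models each group as a g-bit ripple addition with a single registered carry passed to the next group; `step_grouped`, `fast_accumulate_grouped` and the list `carries` are hypothetical names introduced for illustration.

```python
import random

N = 16
MASK = (1 << N) - 1


def step_grouped(S, carries, x, g):
    """One cycle with g-bit groups; carries[k] is the registered carry into group k."""
    new_S, new_carries = 0, [0] * len(carries)
    for k in range(len(carries)):
        lo = k * g
        width = min(g, N - lo)                    # the top group may be narrower
        gmask = (1 << width) - 1
        total = ((S >> lo) & gmask) + ((x >> lo) & gmask) + carries[k]
        new_S |= (total & gmask) << lo
        if k + 1 < len(carries):
            new_carries[k + 1] = total >> width   # a single carry bit, registered
    return new_S, new_carries


def fast_accumulate_grouped(inputs, g):
    groups = -(-N // g)                           # ceil(N / g)
    S, carries = 0, [0] * groups
    for x in inputs:
        S, carries = step_grouped(S, carries, x & MASK, g)
    # Zero insertion: C carries a bit only at each group boundary.
    C = sum(c << (k * g) for k, c in enumerate(carries))
    low = S & ((1 << g) - 1)                      # the lowest group is output directly
    high = ((S >> g) + (C >> g)) & (MASK >> g)
    return (high << g) | low


random.seed(1)
data = [random.randrange(1 << 10) for _ in range(50)]
print([fast_accumulate_grouped(data, g) == (sum(data) & MASK) for g in (1, 2, 4)])
# [True, True, True]
```

A larger g means fewer carry registers but a longer ripple path inside each group, which is the trade-off the paragraph above describes.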
For two-way temporary sum inputs TS_1 and TS_2, assume that their bits are denoted a_{N-1}, ..., a_2, a_1, a_0 and b_{N-1}, ..., b_2, b_1, b_0, respectively. To meet the critical path delay requirement of the system and reduce the path delay of the accumulator, a second embodiment of the present application provides a fast accumulator.
Referring to fig. 5 and 6, a fast accumulator disclosed in the second embodiment of the present application includes: the system comprises a plurality of groups of sequentially connected addition modules, wherein each group of addition modules at least comprises an addition unit, the addition unit comprises a first adder, a second adder and a register which are sequentially connected, and the output end of the register is connected to the input end of the second adder. It should be noted that "first" and "second" of the first adder and the second adder are defined according to the transmission direction of the input data, and the data is initially input into the first adder and then transmitted from the first adder to the second adder.
Except for the first group and the last group of the addition modules, each group of the addition modules comprises a first carry I, a second carry II, a third carry III, a first output bit I', a second output bit II' and a third output bit III'.
The first carry I is the input end of the first adder in each addition unit and receives the corresponding bits of the data to be accumulated; the second carry II is the input end of the first adder in the lowest-order addition unit and is connected, through a register, to the second output bit II' of the previous group of addition modules; the third carry III is the input end of the second adder in the lowest-order addition unit and is connected directly to the first output bit I' of the previous group of addition modules; the first output bit I' is the output end of the first adder in the highest-order addition unit, the second output bit II' is the output end of the second adder in the highest-order addition unit, and the third output bit III' is the output end of the register in each addition unit.
The first group of the addition modules comprises a first carry I, a first output bit I', a second output bit II' and a third output bit III', and the last group of the addition modules comprises a first carry I, a second carry II, a third carry III and a third output bit III'.
For the first group of the addition modules, the third output bit III' is connected directly to an output port of the fast accumulator; for the other groups of the addition modules, the third output bit III' and the second carry II are both connected to the input end of the merge adder, and the output end of the merge adder is connected to the output port of the fast accumulator.
In FIGS. 5 and 6, the labels for the carries and output bits are marked on only one of the addition modules, but they apply to every addition module. In the figures, "+" denotes an adder and "D" denotes a register.
If each group of the addition modules comprises two or more addition units, the first adders in all the addition units are connected in sequence, and the second adders are connected in sequence.
If each group of the addition modules comprises two or more addition units, the merge adder also comprises a zero insertion carry.
The number of add units included in each set of add modules is determined by the critical path requirements of the system.
In one implementation, if the critical path requirement of the system does not exceed 2 full-adder delays, each group of addition modules contains only one addition unit. That is, each bit has two adders (the first adder and the second adder, which respectively complete the accumulation of the local result and the addition of the temporary sums), and a register is inserted in the carry chain of the adder used for local accumulation (i.e., at the second carry corresponding to the first adder), as shown in FIG. 5. After T rounds of accumulation, the carry registers hold a partial sum C consisting of N-1 carry bits. C is added to the high N-1 bits of S by the merge adder, and the result is concatenated with s'_0, the direct output of the lowest sum bit s_0, to obtain the accumulator output S' = s'_{N-1}, ..., s'_2, s'_1, s'_0.
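A behavioral sketch of the two-way case with one addition unit per group follows. It is one possible reading of the connections described above, not a transcription of FIG. 5: the first adder row is assumed to take a_i, b_i and the registered second carry, its carry is assumed to pass directly to the next bit's second adder (the third carry), and the second adder row is assumed to take the stored sum bit from the register. N = 16 and all function names are illustrative.

```python
import random

N = 16
MASK = (1 << N) - 1


def maj(x, y, z):
    """Bitwise majority: the carry output of a full adder at every bit position."""
    return (x & y) | (x & z) | (y & z)


def step_two_way(S, C, a, b):
    """One clock cycle; the longest path is first adder then second adder (2 full adders)."""
    # First adder row (assumed): a_i + b_i + registered carry c_i.
    sum1 = a ^ b ^ C
    carry1 = (maj(a, b, C) << 1) & MASK          # goes directly to the next bit's second adder
    # Second adder row (assumed): combine with the stored sum bit s_i.
    new_S = sum1 ^ carry1 ^ S
    new_C = (maj(sum1, carry1, S) << 1) & MASK   # registered for the next cycle
    return new_S, new_C


def accumulate_two_way(pairs):
    S = C = 0
    for a, b in pairs:
        S, C = step_two_way(S, C, a & MASK, b & MASK)
    high = ((S >> 1) + (C >> 1)) & (MASK >> 1)   # merge adder on the high N-1 bits
    return (high << 1) | (S & 1)                 # s_0 is output directly


random.seed(2)
pairs = [(random.randrange(256), random.randrange(256)) for _ in range(60)]
print(accumulate_two_way(pairs) == (sum(a + b for a, b in pairs) & MASK))  # True
```

Under these assumptions S + C always equals the running total modulo 2^N, so the final merge addition recovers the accumulated value.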
If the critical path requirement of the system does not exceed 3 full-adder delays, the same circuit as shown in fig. 5 can meet the requirement of not more than 3 full-adder delays with the minimum hardware cost.
If the critical path requirement of the system does not exceed 4 full-adder delays, each group of addition modules contains two addition units; that is, every 2 bits of addition form one group, and a register is inserted between adjacent groups on the carry chain of the adder used to accumulate the result of the current bit (i.e., at the second carry corresponding to the first adder in the first addition unit), as shown in FIG. 6. After T rounds of accumulation, each 2-bit addition stores 1 carry, so a 0 must be inserted at each bit position that has no stored carry before C and S are added; C is then added to the high N-2 bits of S by the merge adder. The resulting sum is concatenated with s'_1, s'_0, the direct outputs of s_1, s_0, to obtain the final accumulator output S'.
In FIGS. 3 to 6, the s_i signals on the adder sum lines and the c_i signals on the carry lines carry the same labels after passing through a register as the corresponding input signals. In practice, each such pair of signals is equal in value only when the two are offset in time by one clock period.
In the fast accumulators disclosed in the first and second embodiments, the critical path of the adder that adds the partial sums C and S may be optimized according to the following schemes, so that the critical path of the accumulator module as a whole meets the system clock requirement.
One scheme exploits the fact that, in the optimized accumulator, the merge adder that adds the partial sums C and S lies outside the loop and involves only a pure feed-forward path. To satisfy the critical path delay constraint, multiple pipeline stages can be inserted into the merge adder as appropriate. For example, referring to FIG. 7, for an 8-bit partial sum addition a pipeline stage can be inserted every 3 bits of the adder, and the adder between two adjacent pipeline stages can use a fast low-bit-width addition implementation chosen according to timing, area, and other constraints.
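The pipelined merge adder can be modeled cycle by cycle. The sketch below assumes the 8-bit addition is split into chunks of 3, 3 and 2 bits (bits 0 to 2, 3 to 5, 6 to 7), one chunk per stage, with the carry handed from stage to stage through the pipeline registers; this split and the names `CHUNKS`, `stage` and `pipelined_add8` are assumptions made for illustration.

```python
CHUNKS = [(0, 3), (3, 3), (6, 2)]   # (lowest bit, width) handled by each stage


def stage(k, entry):
    """Add one chunk of C and S together with the carry from the previous stage."""
    if entry is None:
        return None
    c, s, partial, carry = entry
    lo, width = CHUNKS[k]
    m = (1 << width) - 1
    total = ((c >> lo) & m) + ((s >> lo) & m) + carry
    return c, s, partial | ((total & m) << lo), total >> width


def pipelined_add8(pairs):
    """Stream (C, S) pairs through the three stages; one result leaves per cycle once full."""
    regs = [None, None, None]                 # pipeline register after each stage
    results = []
    for item in list(pairs) + [None] * 2:     # two extra cycles drain the pipeline
        new_entry = None if item is None else (item[0], item[1], 0, 0)
        stage_inputs = [new_entry, regs[0], regs[1]]
        regs = [stage(k, stage_inputs[k]) for k in range(3)]
        if regs[2] is not None:
            results.append(regs[2][2])        # 8-bit sum; the final carry is dropped
    return results


print(pipelined_add8([(200, 100), (1, 2), (255, 255)]))  # [44, 3, 254]
```

Because the merge adder has no feedback, the extra registers add latency only; a new pair of partial sums can still enter every cycle.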
Alternatively, considering that the introduction of a new merge adder into the optimized accumulator will bring extra hardware consumption, for a circuit with strict area constraint, the hardware complexity can be reduced in the following two ways.
1) According to the fast accumulator structures of fig. 3 to 6, 0 is inserted into the partial sum C of the input merge adder, and accordingly, the output logic of the merge adder can be re-derived, thereby simplifying the design.
2) The merge adder may use a low-bit-width adder that processes the addends in groups and iterates several times to produce the final addition result. Assuming a 4-bit low-bit-width adder and 8-bit partial sums, the implementation is shown in FIG. 8, where T_sel is a selection signal that selects whether the upper four bits or the lower four bits of the addends are used, and C_out is the carry output signal from the previous instant.
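The iterative use of one 4-bit adder for an 8-bit merge addition can be sketched as follows. The encoding of the selection signal (0 for the lower four bits, 1 for the upper four bits) and the function names are assumptions for illustration; FIG. 8 may encode T_sel differently.

```python
def add4(x, y, carry_in):
    """The shared low-bit-width adder: 4-bit operands, returns (sum, carry_out)."""
    total = (x & 0xF) + (y & 0xF) + carry_in
    return total & 0xF, total >> 4


def iterative_add8(c, s):
    """Two passes through the same 4-bit adder produce an 8-bit sum."""
    result, c_out = 0, 0                 # c_out models the carry output kept from the previous pass
    for t_sel in (0, 1):                 # assumed encoding: 0 = lower four bits, 1 = upper four bits
        shift = 4 * t_sel
        part, c_out = add4(c >> shift, s >> shift, c_out)
        result |= part << shift
    return result


print(hex(iterative_add8(0x3C, 0xC7)))  # 0x3
```

The adder hardware is halved at the cost of taking two cycles per merge addition.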
The key of the present application is to introduce a new accumulator implementation into a digital signal processing system, so that the critical path is significantly shortened and the upper limit of the system frequency is raised. The main key points are as follows:
First, by inserting registers on the carry chain between groups after grouping, together with the bit adders and the registers on the sum lines, the generation of the accumulator output is divided into two steps. The first step, accumulation, produces two partial sums: S = s_{N-1}, ..., s_2, s_1, s_0, held by the sum-line registers, and C, held by the carry registers. The second step adds these two partial sums using the adder for partial sum addition.
Second, based on the first key point, the insertion scheme for the carry chain registers can be adjusted to the critical path requirement of the system, so that the critical path constraint of the accumulator module is met with the least additional hardware.
Third, based on the first key point, multiple pipeline stages can be inserted as appropriate into the adder for partial sum addition, and fast adders (such as carry-select adders and carry-lookahead adders) can be used between pipeline stages to meet the critical path constraint of the accumulator module.
Based on these key points, the critical path bottleneck caused by the loop in the accumulator can be fundamentally resolved. For the pure feed-forward paths outside the accumulator, the system frequency can be raised by conventional methods such as inserting pipeline stages.
The following describes, with reference to FIG. 9, a typical example: the convolution of a set of input feature maps with the corresponding convolution kernels in a convolutional neural network (CNN).
FIG. 9 is a schematic diagram of a set of input feature maps with C channels being convolved with convolution kernels with C channels; the dashed arrows show the correspondence between the system modules and the modules in the design of the present application. In FIG. 9, the multiplications are parallelized along the input channel dimension and, for simplicity, the parallelism is set to 4. After a group of four pixel values (a pixel value is the value of one point on an input feature map) has been multiplied by the corresponding weights, the results pass through a 4to2 compressor array to produce two temporary sums, which are fed into the improved accumulator for accumulation. With the common 3 × 3 convolution kernel configuration, one pixel value of the next set of feature maps is obtained after 9C/4 accumulations. With the accumulator input set to 32 bits, the accumulator was implemented according to the architectures of FIG. 4 and FIG. 5 above, and timing analysis was performed with the SMIC 55 nm library; the specific results are shown in the following table:
TABLE 1 Critical Path comparison before and after accumulator optimization
According to the analysis, the path delay of the prior-art 32-bit accumulator, after deep optimization by the EDA tool, is 2.17 ns, and the maximum frequency of the whole CNN accelerator system is 460.8 MHz. With the fast accumulator disclosed in the present application, the path delay of the module can be reduced to 1.10 ns; together with pipelining of the other modules of the system, the system frequency can be raised to at most 909.1 MHz, and the throughput can be increased by a factor of up to 1.97.
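The quoted frequencies follow from the path delays if the maximum clock frequency is taken as the reciprocal of the critical path delay, which is the assumption made in this short check (the helper name is illustrative):

```python
def max_frequency_mhz(delay_ns):
    """Maximum clock frequency in MHz for a critical path delay given in nanoseconds."""
    return 1000.0 / delay_ns


f_before = max_frequency_mhz(2.17)
f_after = max_frequency_mhz(1.10)
print(round(f_before, 1), round(f_after, 1), round(f_after / f_before, 2))
# 460.8 909.1 1.97
```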
Claims (6)
1. A fast accumulator, comprising: a plurality of groups of sequentially connected addition modules, wherein each group of addition modules comprises at least one addition unit, each addition unit comprises an adder and a register, and the adder is connected in a loop with the input end and the output end of the register;
except for the first group and the last group of the addition modules, each group of the addition modules comprises a first carry, a second carry, a first output bit and a second output bit; the first carry receives the corresponding bits of the data to be accumulated, and the second carry is connected, through a register, to the second output bit of the previous group of addition modules;
the first group of the addition modules comprises a first carry, a first output bit and a second output bit, and the last group of the addition modules comprises a first carry, a second carry and a first output bit;
for the first group of the addition modules, the first output bit is connected directly to an output port of the fast accumulator; for the other groups of the addition modules, the first output bit and the second carry are both connected to the input end of a merge adder, and the output end of the merge adder is connected to the output port of the fast accumulator.
2. The fast accumulator of claim 1, wherein if each group of the addition modules comprises two or more of the addition units, the adders in all the addition units are connected in sequence.
3. The fast accumulator of claim 1 or 2, wherein if each group of the addition modules comprises two or more of the addition units, the merge adder further comprises a zero-inserted carry.
4. A fast accumulator, comprising: a plurality of groups of sequentially connected addition modules, wherein each group of addition modules comprises at least one addition unit, the addition unit comprises a first adder, a second adder and a register which are sequentially connected, and the output end of the register is connected to the input end of the second adder;
except for the first group and the last group of the addition modules, each group of the addition modules comprises a first carry, a second carry, a third carry, a first carry-out bit, a second carry-out bit and a third carry-out bit;
the first carry is the input end of the first adder in each addition unit and is used for receiving the corresponding bits of the data to be accumulated; the second carry is the input end of the first adder in the lowest-bit addition unit, and is connected, through a register, directly to the second carry-out bit of the previous group of addition modules; the third carry is the input end of the second adder in the lowest-bit addition unit, and is connected directly to the first carry-out bit of the previous group of addition modules; the first carry-out bit is the output end of the first adder in the highest-bit addition unit, the second carry-out bit is the output end of the second adder in the highest-bit addition unit, and the third carry-out bit is the output end of the register in each addition unit;
the first group of the addition modules comprises a first carry, a first carry-out bit, a second carry-out bit and a third carry-out bit, and the last group of the addition modules comprises a first carry, a second carry, a third carry and a third carry-out bit;
for the first group of the addition modules, the third carry-out bit is connected directly to an output port of the fast accumulator; for the other groups of the addition modules, the third carry-out bit and the second carry-out bit are both connected to the input end of a merge adder, and the output end of the merge adder is connected to the output port of the fast accumulator.
5. The fast accumulator of claim 4, wherein if each group of the addition modules comprises two or more of the addition units, the first adders and the second adders of all the addition units are connected in sequence.
6. The fast accumulator of claim 4 or 5, wherein if each group of the addition modules comprises two or more of the addition units, the merge adder further comprises a zero-inserted carry.
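One way to read the structure of claim 1 is as a segmented accumulator in which each group adds its own slice of the input every cycle and hands its carry to the next group through a register, so that no carry ripples across the full width within a single cycle. The following Python model is a behavioral sketch under that reading, not the patented circuit: the class name `SegmentedAccumulator`, the 4 x 8-bit grouping, and the way `result()` folds pending carries into the group sums (standing in for the merge adder) are all assumptions made for illustration.

```python
class SegmentedAccumulator:
    """Behavioral model of an accumulator split into groups with registered carries."""

    def __init__(self, groups: int = 4, bits_per_group: int = 8) -> None:
        self.groups = groups
        self.k = bits_per_group
        self.mask = (1 << bits_per_group) - 1
        self.sums = [0] * groups     # per-group sum registers
        self.carries = [0] * groups  # carries[i]: registered carry waiting to enter group i

    def clock(self, value: int) -> None:
        """Advance one cycle: every group adds only its own slice of `value`."""
        new_sums = [0] * self.groups
        new_carries = [0] * self.groups  # group 0 never receives a carry
        for i in range(self.groups):
            data_slice = (value >> (i * self.k)) & self.mask
            # Short adder: group sum + input slice + carry registered last cycle.
            total = self.sums[i] + data_slice + self.carries[i]
            new_sums[i] = total & self.mask
            if i + 1 < self.groups:
                # The carry-out is stored and consumed by the next group next cycle.
                new_carries[i + 1] = total >> self.k
        self.sums, self.carries = new_sums, new_carries

    def result(self) -> int:
        """Merge step (assumed): resolve pending carries into the full-width sum."""
        total = 0
        for i in range(self.groups):
            total += (self.sums[i] + self.carries[i]) << (i * self.k)
        return total & ((1 << (self.groups * self.k)) - 1)


# Usage: accumulate a few 32-bit values and compare with a plain modular sum.
acc = SegmentedAccumulator()
values = [0xFFFFFFFF, 1, 0x00FF00FF, 0x12345678]
for v in values:
    acc.clock(v)
assert acc.result() == sum(values) % (1 << 32)
```

In this model the per-cycle work of each group is a single 8-bit addition, which is the intuition behind the shorter critical path. Timing, the zero-inserted carry of claims 3 and 6, and the two-adder units of claim 4 are not modeled.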
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111365222.XA CN113805840B (en) | 2021-11-18 | 2021-11-18 | Fast accumulator |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113805840A CN113805840A (en) | 2021-12-17 |
CN113805840B true CN113805840B (en) | 2022-05-03 |
Family ID: 78938294
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111365222.XA Active CN113805840B (en) | 2021-11-18 | 2021-11-18 | Fast accumulator |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113805840B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1570843A (en) * | 2003-07-15 | 2005-01-26 | 中国科学院微电子中心 | Circuit realization structure for concurrent low power dissipation software computing element |
CN1801163A (en) * | 2005-01-05 | 2006-07-12 | 美国博通公司 | Method and logic module of implementation of signal processing functions at integrate circuit chip |
CN101014932A (en) * | 2004-08-04 | 2007-08-08 | 英特尔公司 | Carry-skip adder having merged carry-skip cells with sum cells |
CN101521504A (en) * | 2009-04-13 | 2009-09-02 | 南通大学 | Implementation method for reversible logic unit used for low power consumption encryption system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1109990C (en) * | 1998-01-21 | 2003-05-28 | 松下电器产业株式会社 | Method and apparatus for arithmetic operation |
US6622153B1 (en) * | 1999-07-07 | 2003-09-16 | Agere Systems, Inc. | Virtual parallel multiplier-accumulator |
TW531952B (en) * | 2000-12-15 | 2003-05-11 | Asulab Sa | Numerically controlled oscillator in particular for a radiofrequency signal receiver |
US20040252829A1 (en) * | 2003-04-25 | 2004-12-16 | Hee-Kwan Son | Montgomery modular multiplier and method thereof using carry save addition |
CN110647308B (en) * | 2019-09-29 | 2021-12-28 | 京东方科技集团股份有限公司 | Accumulator and operation method thereof |
Also Published As
Publication number | Publication date |
---|---|
CN113805840A (en) | 2021-12-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||