CN112416260B - Data processing method and data processing device - Google Patents
- Publication number
- CN112416260B (application CN202011412811.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- jth
- shift
- invalid
- entry
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
A data processing method and a data processing apparatus. The data processing method comprises: acquiring the number of invalid data before the jth data of a storage unit, wherein the storage unit comprises N data and the jth data is valid data; and determining, based on the number of invalid data, the number of times the jth data needs to be shifted forward and the step size of each shift, so as to shift the invalid data to addresses after the jth data; j is an integer from 1 to N, and N is an integer greater than or equal to 1. The data processing method greatly reduces the specification (depth) of the multiplexer (MUX) in front of each storage unit, thereby optimizing timing, greatly reducing the number of signal lines, and avoiding signal-line congestion during back-end chip implementation.
Description
Technical Field
The embodiment of the disclosure relates to a data processing method and a data processing device.
Background
A logic circuit is a circuit that, operating on binary signals, transfers and processes discrete signals to realize logical and arithmetic operations on digital signals. Logic circuits are divided into combinational logic circuits and sequential logic circuits. A combinational logic circuit is built from the basic AND, OR, and NOT gate circuits; its output depends only on the current values of its input variables and is independent of their past values, i.e., it has no memory or storage function. A sequential logic circuit also contains the basic logic gates described above, but includes a feedback loop, so its output depends not only on the current values of the input variables but also on their past values. Because logic circuits use only high and low levels, they offer strong interference resistance, good precision, and good confidentiality, and are therefore widely applied in computers, digital control, communication, automation, instrumentation, and other fields.
Disclosure of Invention
At least one embodiment of the present disclosure provides a data processing method, including: acquiring the number of invalid data before the jth data of a storage unit, wherein the storage unit includes N data and the jth data is valid data; and determining, based on the number of invalid data, the number of times the jth data needs to be shifted forward and the step size of each shift, so as to shift the invalid data to addresses after the jth data; j is an integer from 1 to N, and N is an integer greater than or equal to 1.
For example, in a data processing method provided by at least one embodiment of the present disclosure, determining, based on the number of invalid data, the number of times that the jth data needs to be shifted forward and a step size of each shift, so as to shift the invalid data to an address subsequent to the jth data, includes: coding the number of invalid data before the jth data according to a coding rule; obtaining a step size of each shift according to the coding.
For example, in a data processing method provided in at least one embodiment of the present disclosure, the number of invalid data before the jth data is binary-coded; the step size of each shift is obtained according to the value represented by each bit of the binary code.
For example, in a data processing method provided in at least one embodiment of the present disclosure, the number of shifts is expressed by the following formula:
SN = Ceiling(log2(HoleCnt_(j-1)), 1)
wherein SN represents the number of times that the jth data needs to be shifted forward, and HoleCnt_(j-1) represents the number of invalid data before the jth data.
For example, in the data processing method provided in at least one embodiment of the present disclosure, the bit width of the binary-encoded number of invalid data before the jth data is SN bits.
For example, in the data processing method provided in at least one embodiment of the present disclosure, a step size of shifting at an ith shift is determined according to a corresponding bit at the ith shift; i is an integer of 1 to SN.
For example, the data processing method provided in at least one embodiment of the present disclosure further includes: and when the jth data is shifted, synchronously shifting the number of invalid data corresponding to the jth data.
For example, the data processing method provided in at least one embodiment of the present disclosure further includes: synchronously shifting, together with the jth data, the data located between the jth data and the invalid data.
For example, in a data processing method provided by at least one embodiment of the present disclosure, the storage unit includes N storage sub-units, and stores the N data, respectively, where the jth data is stored in the jth storage sub-unit, and the jth storage sub-unit is adjacent to the storage sub-unit storing the invalid data.
For example, in a data processing method provided in at least one embodiment of the present disclosure, when a plurality of invalid data exists before the jth data, a plurality of storage subunits respectively storing the plurality of invalid data are continuously distributed.
At least one embodiment of the present disclosure further provides a data processing apparatus, including: an acquisition unit configured to acquire the number of invalid data before the jth data of a storage unit, wherein the storage unit includes N data and the jth data is valid data; and a shifting unit configured to determine, based on the number of invalid data, the number of times the jth data needs to be shifted forward and the step size of each shift, so as to shift the invalid data to an address behind the jth data; j is an integer from 1 to N, and N is an integer greater than or equal to 1.
For example, in a data processing apparatus provided in at least one embodiment of the present disclosure, the storage unit includes N storage sub-units, and the N data are stored respectively, and the jth data is stored in the jth storage sub-unit.
For example, in a data processing apparatus provided in at least one embodiment of the present disclosure, the shift unit includes a latch, and the latch includes N latch sub-units corresponding to the N storage sub-units one to one, and is configured to store the number of invalid data corresponding to data in the N storage sub-units, respectively.
For example, in a data processing apparatus provided in at least one embodiment of the present disclosure, the shift unit further includes a logic circuit, and the logic circuit is connected to the storage unit and the latch, and configured to control the jth data to shift according to an encoding rule according to the number of invalid data before the jth data.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the encoding rule is binary encoding, and the logic circuit is configured to control the j-th data to shift according to a value represented by each bit of the binary encoding according to the number of invalid data before the j-th data.
For example, in a data processing apparatus provided in at least one embodiment of the present disclosure, the logic circuit includes an or gate and at least one and gate; the at least one AND gate is connected with the plurality of storage subunits and the latch subunits corresponding to the plurality of storage subunits one to one, and the OR gate is connected with the plurality of latch subunits and the at least one AND gate; the plurality of storage sub-units comprise a jth storage sub-unit and a storage sub-unit determined according to the binary code.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description only relate to some embodiments of the present disclosure and do not limit the present disclosure.
Fig. 1A is a schematic diagram of invalid data before being "logically removed" according to at least one embodiment of the disclosure;
FIG. 1B is a schematic diagram of the invalid data shown in FIG. 1A after being logically removed;
fig. 2A is a schematic diagram of another cache unit storing invalid data before performing a "logical remove" operation according to at least one embodiment of the disclosure;
FIG. 2B is a schematic diagram illustrating a process of "logically removing" invalid data Hole-0 shown in FIG. 2A;
FIG. 2C is a schematic diagram illustrating a process of "logically removing" invalid data Hole-1 shown in FIG. 2A;
fig. 3 is a flowchart of an example of a data processing method according to at least one embodiment of the present disclosure;
fig. 4 is a flowchart of a method for determining a step size per shift according to at least one embodiment of the present disclosure;
fig. 5A is a schematic diagram illustrating a data flow from a memory cell as a shift-out side according to at least one embodiment of the present disclosure;
FIG. 5B is a schematic diagram illustrating a data flow of a memory cell as a move-in side according to at least one embodiment of the present disclosure;
fig. 6A is a schematic diagram of data shifting provided by at least one embodiment of the present disclosure;
fig. 6B is a schematic diagram of another data shifting provided by at least one embodiment of the present disclosure;
fig. 7 is a flow chart of another data processing method according to at least one embodiment of the present disclosure;
fig. 8 is a flowchart of another data processing method according to at least one embodiment of the disclosure;
fig. 9A is a schematic block diagram of a data processing apparatus according to at least one embodiment of the present disclosure;
fig. 9B is a schematic diagram of a logic and gate according to at least one embodiment of the present disclosure;
fig. 10 is a schematic diagram of another data processing apparatus according to at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Also, the use of the terms "a," "an," or "the" and similar referents do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item preceding the word comprises the element or item listed after the word and its equivalent, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used only to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of some known functions and components have been omitted from the present disclosure.
A common design requirement in logic design is the following: for a cache with a depth of N (N being an integer greater than or equal to 1), the cache includes N cache units (Entry), each with a bit width of W bits, W being an integer greater than or equal to 1. Suppose m (0 <= m < N) cache units in the cache have been vacated and hold invalid data (Holes), where m is the number of cache units storing invalid data; these m cache units may be distributed continuously or discontinuously. Before the next operation is performed, the stored invalid data needs to be logically removed, so that the cache units storing valid data are packed together consecutively starting from the lowest address. Reference is made in particular to the description of Fig. 1A and Fig. 1B below.
Fig. 1A is a schematic diagram of invalid data before being "logically removed" according to at least one embodiment of the disclosure; FIG. 1B is a schematic diagram after "logical removal" of the invalid data shown in FIG. 1A.
For example, as shown in Fig. 1A, 0 to (N-1) indicate the addresses where data are stored, the N addresses correspond to the N cache units, Valid indicates valid data, and Hole indicates invalid data. As shown in Fig. 1A, the number of cache units storing invalid data Hole among the N cache units is 3, i.e., m = 3 in the example shown in Fig. 1A, and these cache units are discontinuously distributed.
For example, as shown in Fig. 1B, the cache units storing the invalid data Hole in Fig. 1A are shifted to the high-order addresses, so that the units storing Valid data are packed together consecutively from the lowest address, thereby realizing the logical removal.
For example, techniques that implement the above logic requirements may be referred to as "bubble squeezing," sorting, "or" Lazy-Shift. The circuit implementing the above logic removal may be, for example, a Mesh (Mesh) or "Cross Bar" circuit.
Currently, a widely used technical solution for logical removal implements the "bubble squeezing" function with a circuit colloquially referred to as "Lazy-Shift". "Lazy-Shift" fills the cache unit storing the invalid data Hole by shifting (Shift): the status of the filled cache unit becomes Valid, while the status of the cache unit whose contents were used for the filling becomes "Empty" if no other data is shifted into it. Since the step of each "Lazy-Shift" shift is 1 cache unit (1 shift squeezes out 1 cache unit storing an invalid data Hole), m cache units storing invalid data Holes require m shifts. The specific description is as follows.
The single Shift in "Lazy-Shift" refers to a single Shift to remove the invalid data Hole from 1 cache unit storing the invalid data Hole, that is, the cache units above the target cache unit (i.e., the cache unit storing the invalid data Hole) of the single Shift are all moved one cache unit downwards, and the cache units below the target cache unit are not moved. The following description will be given by taking m =2 as an example.
Fig. 2A is a schematic diagram of another cache unit storing invalid data before performing a "logical removal" operation according to at least one embodiment of the disclosure; FIG. 2B is a schematic diagram illustrating a process of "logically removing" invalid data Hole-0 shown in FIG. 2A; FIG. 2C is a schematic diagram illustrating a process of "logically removing" invalid data Hole-1 shown in FIG. 2A.
As shown in fig. 2A, address 1 and address 2 in the N buffer units are addresses of two consecutively distributed buffer units storing invalid data Hole, that is, m =2, and therefore, it is necessary to complete the removal of the two buffer units storing invalid data Hole by shifting twice.
As shown in Fig. 2B, the first shift removes the invalid data Hole-0: the cache units above the unit storing Hole-0 (including the unit storing the invalid data Hole-1) are all shifted downward by 1 cache unit; that is, the cache unit Entry_(N-1) with address (N-1) is shifted into the cache unit Entry_(N-2) with address (N-2), the cache unit Entry_2 with address 2 is shifted into the cache unit Entry_1 with address 1, and so on.
The positions of the cache units below the unit storing the invalid data Hole-0 remain unchanged; in this example, the cache unit Entry_0 with address 0 does not move.
As shown in Fig. 2B, as a result of the first shift, the cache unit that stored the invalid data Hole-0 is occupied by the invalid data Hole-1; and since all cache units above Hole-0 were shifted downward by 1 cache unit, the cache unit Entry_(N-1) with address (N-1) is vacated (released) and becomes Empty. At this point, the invalid data Hole-0 has been removed.
As shown in Fig. 2C, the second shift removes the invalid data Hole-1: the cache units above the unit storing Hole-1, including the cache unit Entry_(N-1) with address (N-1) whose state is Empty, all move downward by one cache unit; therefore, the cache unit Entry_1 that stored the invalid data Hole-1 is occupied by the data of the cache unit Entry_2 with address 2.
The cache units below the unit storing the invalid data Hole-1 do not move.
As shown in Fig. 2C, after the second shift, the invalid data Hole-1 has been removed; since the cache units above the unit that stored Hole-1 were all shifted downward by 1 cache unit, the cache unit Entry_(N-1) with address (N-1) is again vacated and becomes Empty.
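The two shifts walked through in Figs. 2A-2C can be modeled in a few lines of software (a behavioral sketch for illustration only; the patent itself concerns a hardware circuit, and names such as `lazy_shift` are introduced here):

```python
def lazy_shift(cells, hole):
    """One Lazy-Shift step: remove the hole at index `hole` by moving
    every cell above it down one position; the top cell becomes Empty."""
    removed = cells[hole]          # the invalid entry being squeezed out
    for a in range(hole, len(cells) - 1):
        cells[a] = cells[a + 1]    # shift one cell toward the low address
    cells[-1] = "Empty"            # Entry_(N-1) is released
    return removed

# Fig. 2A layout: holes at addresses 1 and 2 (m = 2), so two shifts are needed
cells = ["V0", "Hole-0", "Hole-1", "V3", "V4"]
lazy_shift(cells, 1)   # Fig. 2B: removes Hole-0, Hole-1 slides down to address 1
lazy_shift(cells, 1)   # Fig. 2C: removes Hole-1
print(cells)           # ['V0', 'V3', 'V4', 'Empty', 'Empty']
```

As the sketch shows, the classic scheme needs m separate step-1 shifts, which is exactly the cost the disclosed method reduces.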
Currently, in the circuit implementation of the "Lazy-Shift" technique, each cache unit corresponds to W multiplexers (MUX), i.e., 1 multiplexer per bit of data, so there are N × W multiplexers in total. The depth of the largest multiplexer is determined by the number m of invalid data; since the maximum value of m is N-1, the largest multiplexer is an N-to-1 MUX, while the smallest is a 2-to-1 MUX. The multiplexer specifications thus range from "2-to-1" to "N-to-1". The analysis below considers a single-bit multiplexer and assumes m = N-1 (only cache units storing valid data are shifted, so the maximum number of invalid data is N-1); the multiplexer specification required for each cache unit is shown in Table 1.
TABLE 1
Cache unit | MUX specification |
Entry_(N-1) | 2-to-1 |
Entry_(N-2) | 2-to-1 |
Entry_(N-3) | 3-to-1 |
… | … |
Entry_2 | (N-2)-to-1 |
Entry_1 | (N-1)-to-1 |
Entry_0 | N-to-1 |
For example, as shown in Table 1, the cache unit Entry_(N-1) with address (N-1) uses a 2-to-1 MUX: the value written to Entry_(N-1) is selected from two inputs (i.e., Entry_(N-1) or "zero");
for the cache unit Entry_(N-2) with address (N-2), which uses a 2-to-1 MUX, the value written to address (N-2) is selected from two inputs (i.e., Entry_(N-1) or Entry_(N-2));
for the cache unit Entry_(N-3) with address (N-3), which uses a 3-to-1 MUX, the value written to address (N-3) is selected from 3 inputs (i.e., Entry_(N-1), Entry_(N-2), Entry_(N-3));
and so on:
for the cache unit Entry_2 with address 2, which uses an (N-2)-to-1 MUX, the value written to address 2 is selected from N-2 inputs (i.e., Entry_(N-1), Entry_(N-2), …, Entry_2);
for the cache unit Entry_1 with address 1, which uses an (N-1)-to-1 MUX, the value written to address 1 is selected from N-1 inputs (i.e., Entry_(N-1), Entry_(N-2), …, Entry_1);
for the cache unit Entry_0 with address 0, which uses an N-to-1 MUX, the value written to address 0 is selected from N inputs (i.e., Entry_(N-1), Entry_(N-2), …, Entry_0).
As can be seen from Table 1, when the number m of invalid data is small and the bit width W (in bits) of the cache unit is small, the conventional logic implementation of "Lazy-Shift" does not cause major problems. However, if the number m of invalid data is large, and especially if the cache-unit bit width W is also large, serious problems arise in the circuit implementation. These problems mainly fall into two classes: timing problems and back-end wiring problems.
Regarding the timing problem: the largest multiplexer in the conventional "Lazy-Shift" technique is the N-to-1 MUX corresponding to Entry_0, because Entry_0 may receive input from any of the cache units Entry_(N-1) through Entry_0. When the number N of cache units is small, the maximum MUX depth is also small; but when N is large, the MUX becomes deep. For example, when N = 1024, the number m of invalid data is at most 1023, and the depth of the largest multiplexer is 1024; that is, the logic must complete a 1024-to-1 selection within at most one clock cycle. This makes timing closure very challenging, especially given today's trend toward ever-higher clock frequencies. If the logic is parallelized or low-delay devices are used instead, problems such as a sharp increase in area and power consumption arise.
Regarding the back-end wiring problem: the multiplexer corresponding to a single cache unit is actually implemented per bit of the cache unit's width (W bits), that is, each of the W bits in each cache unit corresponds to one multiplexer MUX. Taking the cache unit Entry_0 with address 0 as an example, Entry_0 corresponds to W N-to-1 multiplexers, so the number of multiplexer signal lines (in bits) at the input of Entry_0 is W × N; similarly, the number of signal lines corresponding to the cache unit Entry_1 with address 1 is W × (N-1). The total number SUM of multiplexer signal lines from the cache unit Entry_0 with address 0 to the cache unit Entry_(N-1) with address (N-1) can be expressed by the following formula:
SUM = W × (2 + 2 + 3 + … + (N-1) + N) = W × (N × (N+1) / 2 + 1)
where the leading "2 + 2" terms are the 2-to-1 multiplexers of Entry_(N-1) and Entry_(N-2).
as can be seen from the above, if the number N of the buffer units is very large (i.e., the depth of the buffer units is deep), the number of the signal lines at the outer side of the buffer units is very large, and the data of the corresponding signal lines calculated based on the above formula for different numbers of the buffer units is shown in table 2.
TABLE 2
N | W | SUM |
512 | 256 | 33620224 |
1024 | 256 | 134349056 |
2048 | 256 | 537133312 |
512 | 512 | 67240448 |
1024 | 512 | 268698112 |
2048 | 512 | 1074266624 |
As shown in table 2, when the number N of the buffer units and the bit width W of the buffer units are large, the total number of signal lines outside the buffer units increases sharply, which brings a serious layout problem to the back-end implementation and causes signal line congestion, and especially when the buffer units are implemented using non-RAM, the signal line congestion phenomenon is more serious.
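The totals in Table 2 can be checked with a short script (illustrative only; `mux_signal_lines` is a name introduced here, not from the patent):

```python
def mux_signal_lines(n, w):
    """Total multiplexer input signal lines across all n cache units of
    width w bits: Entry_0 needs an n-to-1 MUX, Entry_1 an (n-1)-to-1 MUX,
    ..., and Entry_(n-1) a 2-to-1 MUX (its second input is "zero")."""
    return w * (sum(range(2, n + 1)) + 2)

# Reproduce Table 2
for n, w in [(512, 256), (1024, 256), (2048, 256),
             (512, 512), (1024, 512), (2048, 512)]:
    print(n, w, mux_signal_lines(n, w))
```

The quadratic growth in N is visible directly in the closed form W × (N(N+1)/2 + 1), which is what makes deep caches so wiring-hungry under the conventional scheme.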
It can be seen from the above analysis that, in the current "Lazy-Shift", when implementing logic removal of invalid data of a cache unit with a deep depth, there are disadvantages that timing convergence is difficult and wiring risk is large.
At least one embodiment of the present disclosure provides a data processing method, including: acquiring the number of invalid data before the jth data of a storage unit, wherein the storage unit includes N data and the jth data is valid data; and determining, based on the number of invalid data, the number of times the jth data needs to be shifted forward and the step size of each shift, so as to shift the invalid data to addresses after the jth data; j is an integer from 1 to N, and N is an integer greater than or equal to 1.
At least one embodiment of the present disclosure further provides a data processing apparatus corresponding to the data processing method.
The data processing method provided by the embodiments of the present disclosure greatly reduces the specification (depth) of the multiplexer in front of each storage unit, thereby optimizing timing, greatly reducing the number of signal lines, and avoiding signal-line congestion during back-end chip implementation.
Embodiments of the present disclosure and examples thereof are described in detail below with reference to the accompanying drawings.
At least one embodiment of the present disclosure provides a data processing method, for example, suitable for use in a computer system, for performing "logical removal" on invalid data stored in a cache unit. For example, in some examples, the data processing method may be applied to various types of processors to implement data processing, but may also be applied to other fields, and the processor may be a general-purpose processor or a special-purpose processor, may be a Reduced Instruction Set (RISC) processor or a Complex Instruction Set (CISC) processor, may be an X86 processor, an ARM processor, or the like, and the embodiments of the present disclosure do not limit this.
Fig. 3 is a flowchart of an example of a data processing method according to at least one embodiment of the present disclosure. For example, the data processing method may be implemented in software, hardware, firmware, or any combination thereof, and may be loaded and executed by a processor in a device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, or a network server. When invalid data stored in a cache unit is logically removed, the specification (depth) of the multiplexer in front of each storage unit is greatly reduced, so that timing is optimized, the number of signal lines is greatly reduced, and signal-line congestion is avoided.
For example, the computer system includes any electronic device with computing capability, such as a mobile phone, a notebook computer, a tablet computer, a desktop computer, or a network server, which may load and execute the data processing method; embodiments of the present disclosure are not limited in this respect. For example, in some examples, the computer system may include a central processing unit (CPU), a graphics processing unit (GPU), or other processing units and memory units having data processing capability and/or instruction execution capability, and may implement the data processing method provided by the embodiments of the present disclosure by executing code or instructions, or in combination with logic circuitry.
As shown in fig. 3, the data processing method includes steps S110 to S120.
Step S110: and acquiring the number of invalid data before the jth data of the storage unit.
Step S120: and determining the number of times that the jth data needs to be shifted forward and the step size of each shift according to the number of the invalid data so as to shift the invalid data into the address after the jth data.
For step S110, for example, the storage unit includes N data (as shown in Figs. 1A to 2C, the N data are stored at addresses 0 to (N-1)), and the jth data is valid data; that is, the embodiments of the present disclosure describe the shifting of valid data (the same applies to the following embodiments and is not repeated). For example, the jth data is stored in the storage subunit Entry_(j-1) with address j-1, and the (j+1)th data is stored in the storage subunit Entry_j with address j. For example, j is an integer from 1 to N.
The following description takes the data processing of the storage unit Entry_j with address j (i.e., the (j+1)th data) as an example; the processing of the other storage units is similar and is not repeated here. Embodiments of the present disclosure are not limited in this regard.
For example, for the storage unit Entry_j at address j storing the (j+1)th data, count the number of storage units holding invalid data from Entry_(j-1) at address (j-1) down to Entry_0 at address 0; the resulting count of invalid data corresponding to Entry_j is HoleCnt_j (0 <= HoleCnt_j <= j). HoleCnt_j represents the number of invalid data before the storage unit Entry_j with address j, i.e., the total number of positions by which the (j+1)th data stored at address j needs to be shifted.
For example, since there is no invalid data before the storage unit Entry_0 with address 0, and since storage units holding invalid data are themselves never shifted, a storage unit holding invalid data does not need to count the invalid data before it; in particular, the number of invalid data HoleCnt_0 is 0. For clarity and simplicity, HoleCnt is used below to denote the number of invalid data.
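The per-unit hole count described above amounts to a prefix count of invalid entries; a minimal sketch (names here are illustrative, not from the patent):

```python
def hole_counts(valid):
    """HoleCnt_j for each storage unit: the number of invalid entries
    (Holes) at addresses strictly below address j.  `valid[j]` is True
    when Entry_j holds valid data."""
    counts, holes = [], 0
    for v in valid:
        counts.append(holes)       # holes seen before this address
        holes += 0 if v else 1
    return counts

# Fig. 2A layout: holes at addresses 1 and 2
print(hole_counts([True, False, False, True, True]))  # [0, 0, 1, 2, 2]
```

Note that HoleCnt_0 is always 0, matching the observation above.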
For step S120, for example, based on the number of invalid data before the jth data (the data stored in the storage unit Entry_(j-1) with address j-1), the number of shifts may be expressed by the following formula (1):
SN = Ceiling(log2(HoleCnt_(j-1)), 1)    Formula (1)
where SN represents the number of times that the jth data needs to be shifted forward, and HoleCnt_(j-1) represents the number of invalid data before the jth data (i.e., the data stored at address (j-1)).
For example, the number of invalid data, holeCnt _ (j-1), may be encoded to obtain a step size for each shift, such that the jth data is shifted by the step size accordingly.
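Under one plausible reading of formula (1), where Ceiling(x, 1) rounds x up to the nearest integer, the shift count can be sketched as follows (illustrative only; the handling of edge cases such as HoleCnt = 1 is not spelled out in the text, so the sketch is applied for HoleCnt >= 2):

```python
import math

def shift_count(hole_cnt):
    """Formula (1): SN = Ceiling(log2(HoleCnt_(j-1)), 1), reading
    Ceiling(x, 1) as "round x up to the nearest integer"."""
    return math.ceil(math.log2(hole_cnt))

# e.g. 5 holes: 5 = 101b, giving SN = 3 shift stages with candidate
# steps 1, 2 and 4; the bits of the code select which stages move data
print([(c, shift_count(c)) for c in [2, 3, 5, 1023]])
```

Compare with the classic "Lazy-Shift", which would need HoleCnt separate shifts: the shift count here grows only logarithmically in the number of holes.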
Fig. 4 is a flowchart of a method for determining the step size of each shift according to at least one embodiment of the present disclosure. That is, fig. 4 is a flowchart of some examples of step S120 shown in fig. 3. The method for determining the step size of each shift provided in at least one embodiment of the present disclosure is described in detail below with reference to fig. 4. For example, in the example shown in fig. 4, the method includes steps S121 to S122.
Step S121: encode the number of invalid data before the jth data according to an encoding rule.
For example, the number of invalid data before the jth data (the data stored in the memory cell Entry_(j-1) with address j-1) is binary-coded. For the specific binary encoding method, reference may be made to standard practice in the field, and details are not described herein.
Step S122: obtain the step size of each shift from the encoding.
For example, the step size of each shift is obtained from the value represented by each bit of the binary code.
For example, the bit width after binary coding of the number of invalid data before the jth data (the data stored in the memory cell Entry_(j-1) with address j-1) is SN bits. For example, if the number of shifts required for the jth data obtained according to the above formula is 3, the bit width of the binary code is 3. For details, reference may be made to the description of the example in fig. 6A, which is not repeated herein.
It should be noted that, when the logic circuit is actually designed, the bit widths obtained by binary encoding the number of invalid data before each valid data may all be set to a uniform bit width, namely the maximum possible bit width BW, so that the bit width suffices for the number of invalid data in any case.
For example, the bit width after binary encoding of the number of invalid data is Ceiling(log2(N-j), 1); when j = 1, the bit width is the largest.
For example, the step size of the ith shift is determined according to the corresponding bit for the ith shift, where i is an integer of 1 or more and SN or less. The bit value for the ith shift determines whether the ith shift is performed: if the bit value is 1, the ith shift is performed; if it is 0, it is not. The step size of the ith shift depends on the position of that bit. For example, the ith shift uses the (i-1)th bit of the binary code; when i = 1 (i.e., for the first shift), i-1 = 0 and the step size is 1, so in general the step size can be expressed as 2^(i-1).
For example, the step size of the shift at the ith shift can be expressed as:

t = 2^(i-1)    Formula (2)

where t represents the step size of the ith shift, and (i-1) represents the bit position, an integer of 0 or more.
As can be seen from formula (2), when binary coding is employed, the step size of each shift is a power of 2.
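A minimal sketch of steps S121-S122 (the function name is illustrative): the binary code of HoleCnt directly yields the per-round step sizes, with bit i contributing a step of 2^i in round i+1.

```python
def shift_steps(hole_cnt: int) -> list:
    # Round i+1 (i = 0, 1, ...) moves the data by 2**i if bit i of the
    # binary code of HoleCnt is 1, and by 0 (no shift) if it is 0.
    return [(1 << i) if (hole_cnt >> i) & 1 else 0
            for i in range(hole_cnt.bit_length())]

print(shift_steps(7))  # [1, 2, 4]: every round shifts; total displacement 7
print(shift_steps(5))  # [1, 0, 4]: round 2 is skipped; total displacement 5
```

Note that the step sizes sum to HoleCnt itself, so after all rounds the data has crossed exactly the invalid entries before it.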
For example, in the above steps S110 to S120, the memory cell Entry_j with address j counts, at its output stage, the number of invalid data from the memory cell Entry_(j-1) with address (j-1) down to the memory cell Entry_0 with address 0, so as to shift the data in Entry_j (i.e., the (j+1)th data). Since the valid data stored in each memory cell may be shifted, each memory cell has two data flow directions: "shift-out" and "shift-in".
Fig. 5A is a schematic diagram illustrating the data flow of a memory cell as the shift-out side according to at least one embodiment of the present disclosure; fig. 5B is a schematic diagram illustrating the data flow of a memory cell as the shift-in side according to at least one embodiment of the present disclosure.
For example, as shown in fig. 5A, when the memory cell Entry_j with address j serves as the shift-out side, the data stored in Entry_j (i.e., the (j+1)th data) may go to the memory cell Entry_j itself, Entry_(j-1), Entry_(j-2), Entry_(j-4), ..., Entry_(j-2^k), where 0 <= k < Ceiling(log2(j), 1).
For example, as shown in fig. 5B, when the memory cell Entry_j with address j serves as the shift-in side, the data stored in Entry_j may come from the memory cell Entry_j itself, Entry_(j+1), Entry_(j+2), Entry_(j+4), ..., Entry_(j+2^w), where 0 <= j < N and 0 <= w <= Ceiling(log2(N-1-j), 1).
As can be seen from figs. 5A and 5B, when the memory cell Entry_j serves as the shift-in side, the size of the multiplexer MUX on the input side of Entry_j is reduced from "((N-1)-j)+1 select 1" to "Ceiling(log2(N-1-j), 1)+1 select 1". This greatly reduces the specification (depth) of the multiplexer before each memory cell, thereby optimizing the timing, greatly reducing the number of signal lines, and avoiding signal line congestion.
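The shift-in and shift-out candidate sets of figs. 5A and 5B can be enumerated with a short sketch (the function names are illustrative; the loop bounds follow the power-of-two step rule directly rather than reproducing the Ceiling expressions literally):

```python
def shift_out_targets(j: int) -> list:
    # Entry_j may hold its data or send it to Entry_(j - 2**k), for every
    # power of two that stays at or above address 0 (cf. Fig. 5A).
    targets, k = [j], 0
    while (1 << k) <= j:
        targets.append(j - (1 << k))
        k += 1
    return targets

def shift_in_sources(j: int, n: int) -> list:
    # Entry_j may keep its data or receive from Entry_(j + 2**w), for every
    # power of two that stays below the buffer depth n (cf. Fig. 5B).
    sources, w = [j], 0
    while j + (1 << w) < n:
        sources.append(j + (1 << w))
        w += 1
    return sources

print(shift_in_sources(0, 8))  # [0, 1, 2, 4]: a 4-input MUX instead of 8-input
print(shift_out_targets(7))    # [7, 6, 5, 3]: destinations 2**k entries ahead
```

For N = 8 and j = 0, the input-side MUX shrinks from 8-select-1 to 4-select-1, which is the logarithmic-depth reduction described above.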
Fig. 6A is a schematic diagram of data shifting provided by at least one embodiment of the present disclosure; fig. 6B is a schematic diagram of another data shift according to at least one embodiment of the present disclosure.
As shown in fig. 6A and 6B, the following description takes a buffer including 8 memory cells (Entry_0 to Entry_7) as an example, that is, the depth of the buffer is N = 8. For example, in the example shown in fig. 6A, the data stored in the memory cell Entry_7 with address 7 (i.e., the 8th data) is valid data, and the case of shifting this valid data to the memory cell Entry_0 with address 0 is described; in the example shown in fig. 6B, the data stored in Entry_7 is valid data and is shifted to the memory cell Entry_2 with address 2, as described below.
For example, in the example shown in fig. 6A, it is assumed that the memory cells Entry_0 to Entry_6 before the memory cell Entry_7 all store invalid data; that is, the number of invalid data before Entry_7 is HoleCnt_7 = 7, whose binary code is 111.
For example, the number of times the 8th data (i.e., the data stored in the memory cell Entry_7 with address 7) needs to be shifted, calculated according to the above formula (1), is SN = Ceiling(log2(HoleCnt_7), 1) = 3; that is, the data stored in Entry_7 needs to be shifted 3 times.
For example, as can be seen from the above calculation, the bit width of the binary code (111) is 3 bits, and bits 0 to 2 are all 1; based on this, the step size of each shift of the data in the memory cell Entry_7 can be calculated.
For example, for the first shift, bit 0 of the binary code is 1, so based on the above formula (2) the step size is 2^0 = 1; that is, in the first shift, the data in the memory cell Entry_7 is moved from Entry_7 to Entry_6 (as shown by arrow a11 in fig. 6A).
For example, for the second shift, bit 1 of the binary code is 1, so based on the above formula (2) the step size is 2^1 = 2; that is, in the second shift, the data is moved from Entry_6 into Entry_4 (as shown by arrow a12 in fig. 6A).
For example, for the third shift, bit 2 of the binary code is 1, so based on the above formula (2) the step size is 2^2 = 4; that is, in the third shift, the data is moved from Entry_4 into Entry_0 (as shown by arrow a13 in fig. 6A).
Therefore, when the number of invalid data HoleCnt_7 is equal to 7, the data stored in the memory cell Entry_7 completes a shift across 7 memory cells through the 1st round (T_sft1), 2nd round (T_sft2), and 3rd round (T_sft3) of shifting.
For example, in the example shown in fig. 6B, it is assumed that the memory cells Entry_2 to Entry_6 before the memory cell Entry_7 all store invalid data; that is, the number of invalid data before Entry_7 is HoleCnt_7 = 5, whose binary code is 101.
For example, the number of times the 8th data needs to be shifted, calculated according to the above formula (1), is SN = Ceiling(log2(HoleCnt_7), 1) = 3; that is, the data stored in the memory cell Entry_7 needs to be shifted 3 times.
For example, as can be seen from the above calculation, the bit width of the binary code (101) is 3 bits; bit 0 and bit 2 are both 1, and bit 1 is 0. Based on this, the step size of each shift of the data in the memory cell Entry_7 can be calculated.
For example, for the first shift, bit 0 of the binary code is 1, so based on the above formula (2) the step size is 2^0 = 1; that is, in the first shift, the data in the memory cell Entry_7 is moved from Entry_7 to Entry_6 (as shown by arrow a21 in fig. 6B).
For example, for the second shift, bit 1 of the binary code is 0, so in the second shift the step size is 0; that is, no shift is made.
For example, for the third shift, bit 2 of the binary code is 1, so based on the above formula (2) the step size is 2^2 = 4; that is, in the third shift, the data is moved from Entry_6 into Entry_2 (as shown by arrow a23 in fig. 6B).
Therefore, when the number of invalid data HoleCnt_7 is equal to 5, the data stored in the memory cell Entry_7 completes a shift across 5 memory cells through the 1st round (T_sft1) and 3rd round (T_sft3) of shifting.
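The two worked examples of figs. 6A and 6B can be replayed with a small behavioral trace (the function name is illustrative):

```python
def trace_shifts(addr: int, hole_cnt: int) -> list:
    # Position of the data originally stored at `addr` after each round:
    # round i+1 moves it forward by 2**i iff bit i of HoleCnt is 1.
    pos, trace = addr, [addr]
    for i in range(hole_cnt.bit_length()):
        if (hole_cnt >> i) & 1:
            pos -= 1 << i
        trace.append(pos)
    return trace

print(trace_shifts(7, 7))  # [7, 6, 4, 0]: Fig. 6A, data ends in Entry_0
print(trace_shifts(7, 5))  # [7, 6, 6, 2]: Fig. 6B, round 2 skipped, ends in Entry_2
```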
For example, as can be seen from the above, the shifting of each memory cell can be divided into shift-in and shift-out. For the shift-in of a memory cell, the current memory cell selects which memory cell its input comes from; for the shift-out, it selects which memory cell its data goes to next. For the shift-in of memory cell Entry_j, its input comes from Entry_(j+2^w), w = 0, 1, ..., Ceiling(log2(N-1-j), 1).
For the shift-out of memory cell Entry_j, its output may go to Entry_(j-2^k), k = 0, 1, ..., Ceiling(log2(j), 1).
For example, in some examples, the storage unit includes N storage subunits (i.e., the above-described memory cells Entry_(0), Entry_(1), ..., Entry_(j), Entry_(j+1), Entry_(j+2), ..., Entry_(N-1)) respectively storing the N data; for example, the jth data is stored in the jth storage subunit Entry_(j-1) with address (j-1). For example, the jth storage subunit Entry_(j-1) is adjacent to a storage subunit storing invalid data.
Of course, the jth storage subunit Entry_(j-1) need not be adjacent to a storage subunit storing invalid data; it suffices that it is any one of a run of consecutive storage subunits storing valid data (for example, in fig. 2A, the storage subunits Entry_4 to Entry_(N-1) storing valid data are consecutively distributed, following the storage subunit Entry_3, which stores valid data and is adjacent to the storage subunits Entry_1 and Entry_2 storing consecutive invalid data). Embodiments of the present disclosure are not limited in this regard.
For example, in some examples, when a plurality of invalid data precede the jth data (i.e., the storage subunit Entry_(j-1)), the plurality of storage subunits respectively storing the plurality of invalid data are continuously distributed. Of course, these storage subunits may also be distributed discontinuously, and the embodiments of the disclosure are not limited in this respect.
For example, in some examples, the storage subunits storing invalid data before the jth data are consecutive, and the storage subunit Entry_(j-1) storing the jth data (i.e., the jth storage subunit, with address j-1) is adjacent to a storage subunit storing invalid data. For example, in the example shown in fig. 6A, the memory cells Entry_0 to Entry_6 storing invalid data are consecutive; in the example shown in fig. 6B, the memory cells Entry_2 to Entry_6 storing invalid data are consecutive; and in both figs. 6A and 6B, the memory cell Entry_7 is adjacent to the memory cell Entry_6 storing invalid data. Embodiments of the present disclosure are not limited in this regard.
Fig. 7 is a flowchart of another data processing method according to at least one embodiment of the disclosure. As shown in fig. 7, the data processing method further includes step S130 on the basis of the example shown in fig. 3.
Step S130: when the jth data is shifted, synchronously shift the number of invalid data corresponding to the jth data.
For example, the number of invalid data corresponding to the jth data is stored in another storage element, such as the latch shown in fig. 10. When the jth data is shifted, the corresponding number of invalid data HoleCnt_(j-1) is shifted synchronously in the latch 10, so that the step size of each shift can be determined in real time.
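Step S130 can be sketched behaviorally as follows (a sketch with illustrative names; holes are modeled as None with HoleCnt = 0): in each round, every entry that receives data also receives the sender's HoleCnt, so the gating bit for the later rounds is always available at the data's current position.

```python
def do_round(i, values, hole_cnt):
    # One shift round i (0-indexed). Entry_j takes the data of Entry_(j+2**i)
    # when bit i of that entry's (co-shifted) HoleCnt is 1; an entry whose own
    # HoleCnt has bit i set moves out and leaves a hole behind.
    n = len(values)
    new_v, new_h = [], []
    for j in range(n):
        src = j + (1 << i)
        take = src < n and (hole_cnt[src] >> i) & 1
        out = (hole_cnt[j] >> i) & 1
        new_v.append(values[src] if take else (None if out else values[j]))
        new_h.append(hole_cnt[src] if take else (0 if out else hole_cnt[j]))
    return new_v, new_h

def run(values, hole_cnt):
    # Apply all rounds implied by the widest HoleCnt code.
    for i in range(max((h.bit_length() for h in hole_cnt), default=0)):
        values, hole_cnt = do_round(i, values, hole_cnt)
    return values

print(run([None] * 7 + ['d'], [0] * 7 + [7]))  # 'd' lands in Entry_0 (Fig. 6A)
```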
Fig. 8 is a flowchart of another data processing method according to at least one embodiment of the disclosure. As shown in fig. 8, the data processing method further includes step S140 on the basis of the example shown in fig. 7.
Step S140: shift the data between the jth data and the invalid data synchronously with the jth data.
Because the number of invalid data before each of the memory cells from Entry_(j-1) down to Entry_0 is calculated, all the memory cells can in fact move as a whole, each according to the number of invalid data before it, and no two or more memory cells will be shifted into the same memory cell in the same time slot.
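The whole-buffer movement of step S140 can be illustrated with a sketch (hypothetical names; None marks invalid data): every valid entry moves by its own HoleCnt, decomposed into the same power-of-two rounds, so all valid data compacts toward address 0 without collisions.

```python
def compact(buffer):
    # HoleCnt for each position: number of invalid entries (None) before it.
    hole_cnt, holes = [], 0
    for v in buffer:
        hole_cnt.append(holes)
        holes += v is None
    # Each valid entry at j moves 2**i forward in round i+1 iff bit i of its
    # HoleCnt is set; tracking the positions shows no two entries collide.
    pos = {j: j for j, v in enumerate(buffer) if v is not None}
    for i in range(max((h.bit_length() for h in hole_cnt), default=0)):
        for j in pos:
            if (hole_cnt[j] >> i) & 1:
                pos[j] -= 1 << i
    out = [None] * len(buffer)
    for j, p in pos.items():
        out[p] = buffer[j]
    return out

print(compact([None, None, 'a', None, 'b', 'c', None, 'd']))
# ['a', 'b', 'c', 'd', None, None, None, None]
```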
It should be noted that, in the embodiments of the present disclosure, the flows of the data processing methods provided above may include more or fewer operations, and these operations may be executed sequentially or in parallel. Although the flow of the data processing method described above includes a plurality of operations occurring in a certain order, it should be clearly understood that the order of these operations is not limited. The data processing method described above may be executed once, or multiple times according to a predetermined condition.
The data processing method provided by the embodiment of the disclosure greatly reduces the specification (depth) of the multiplexer before each storage unit; the time sequence is optimized, the number of signal lines is greatly reduced, and the phenomenon of signal line congestion is avoided.
Fig. 9A is a schematic block diagram of a data processing apparatus according to at least one embodiment of the present disclosure. For example, in the example shown in fig. 9A, the data processing apparatus 100 includes an acquisition unit 110 and a shift unit 120. For example, these units may be implemented by hardware (e.g., circuit) modules, for example by logic circuits, which are not limited by the embodiments of the present disclosure. The same applies to the following examples and is not repeated.
For example, the acquisition unit 110 is configured to acquire the number of invalid data before the jth data of the storage unit. For example, the memory cell includes N data, and the jth data is valid data. For example, the obtaining unit 110 may implement the step S110, and a specific implementation method thereof may refer to the related description of the step S110, which is not described herein again.
For example, the shift unit 120 is configured to determine the number of times that the jth data needs to be shifted forward and the step size of each shift based on the number of invalid data, so as to shift the invalid data into the address subsequent to the jth data. For example, the shifting unit 120 may implement the step S120, and the specific implementation method thereof may refer to the related description of the step S120, which is not described herein again.
Fig. 10 is a schematic diagram of another data processing apparatus according to at least one embodiment of the present disclosure. As shown in fig. 10, the storage unit 20 includes N storage subunits Entry_(0), Entry_(1), ..., Entry_(j), Entry_(j+1), Entry_(j+2), ..., Entry_(N-1), which respectively store the N data, and the jth data is stored in the jth storage subunit Entry_(j-1). For example, the storage unit 20 may be the buffer shown in fig. 6A or fig. 6B, which is not limited by the embodiments of the disclosure.
For example, as shown in fig. 10, the shift unit 120 includes a latch 10. For example, the latch 10 includes N latch subunits corresponding one-to-one to the N storage subunits, and is configured to store the numbers of invalid data respectively corresponding to the data in the N storage subunits: HoleCnt_(0), HoleCnt_(1), ..., HoleCnt_(j), HoleCnt_(j+1), ..., HoleCnt_(j+2^w), ..., HoleCnt_(N-1).
For example, the shift unit 120 further includes a logic circuit for implementing a multiplexer MUX, connected to the storage unit 20 and the latch 10, and configured to control the jth data to shift according to the encoding rule according to the number of invalid data before the jth data.
For example, the encoding rule is binary encoding, and the logic circuit is configured to control the jth data to shift according to the value represented by each bit of the binary encoding according to the number of invalid data before the jth data.
For example, the (j+1)th data is stored in the storage subunit Entry_j with address j, and its corresponding number of invalid data is denoted HoleCnt_j. The connection of the storage subunit Entry_j with address j is described as an example; the connections of the other storage subunits are similar and are not repeated.
For example, the input multiplexer MUX (i.e., the logic circuit) of the storage subunit Entry_j may include two parts: AND gates and an OR gate.
For example, for the AND gates, the ith shift of the (j+1)th data corresponds to the ith bit of the invalid-data count HoleCnt_j, recorded as HoleCnt[i], and the AND gate determines whether the output of the current storage subunit is valid.
Fig. 9B is a logic diagram according to at least one embodiment of the disclosure.
For example, as shown in fig. 9B, when HoleCnt[i] is 1, the output terminal DataOut of the AND gate outputs the data presented at its input terminal DataIn. The input terminal DataIn is connected to the storage subunit from which data is shifted out, and the output terminal DataOut is connected to the storage subunit into which data is to be shifted, so the data is output to the corresponding storage subunit and the data shift is realized; otherwise no data is output, that is, no shift occurs.
For the OR gate, the storage subunit Entry_j, as the destination (data-receiving side), may have inputs from the storage subunits Entry_(j+1), Entry_(j+2), Entry_(j+4), ..., Entry_(j+2^w), and from Entry_j itself.
Therefore, on the basis of the AND gates, the n-select-1 MUX (0 <= n <= N) at the single-bit input of the storage subunit Entry_j is decomposed into one n-input OR gate and a small number of AND gates.
For example, as shown in fig. 10, the logic circuit includes an OR gate 11 and at least one AND gate 21, 22, 23, 24, 25, 26. BW represents the bit width after binary encoding of the number of invalid data before each valid data, e.g., the maximum bit width, which is equal to Ceiling(log2(N-1), 1). It should be noted that fig. 10 only schematically illustrates the connections of the storage subunit Entry_j as the shift-in side, and the embodiments of the disclosure are not limited to this.
For example, the at least one AND gate 21-26 is connected to a plurality of storage subunits and to the latch subunits corresponding one-to-one to those storage subunits. For example, the plurality of storage subunits includes the jth storage subunit Entry_j and the storage subunits Entry_(j+1), Entry_(j+2), Entry_(j+4), ..., Entry_(j+2^w) determined according to the binary code.
For example, the output of the AND gate 21 is connected to the storage subunit Entry_j, and its inputs are connected to the outputs of the AND gates 22-25. For example, the input of the AND gate 22 is connected to the storage subunit Entry_(j+1) and its corresponding latch, so that, following the principle in fig. 9B, whether to output the data of Entry_(j+1) to the storage subunit Entry_j is determined according to the corresponding bit of the invalid-data count HoleCnt_(j+1) before Entry_(j+1), thereby realizing the shifting of the data stored in Entry_(j+1).
Correspondingly, the input of the AND gate 23 is connected to the storage subunit Entry_(j+2) and its corresponding latch, so that, following the principle in fig. 9B, whether to output the data of Entry_(j+2) to the storage subunit Entry_j is determined according to the corresponding bit of the invalid-data count HoleCnt_(j+2) before Entry_(j+2), thereby realizing the shifting of the data stored in Entry_(j+2).
The input of the AND gate 24 is connected to the storage subunit Entry_(j+4) and its corresponding latch, so that, following the principle in fig. 9B, whether to output the data of Entry_(j+4) to the storage subunit Entry_j is determined according to the corresponding bit of the invalid-data count HoleCnt_(j+4) before Entry_(j+4), thereby realizing the shifting of the data stored in Entry_(j+4).
The input of the AND gate 25 is connected to the storage subunit Entry_(j+2^w) and its corresponding latch, so that, following the principle in fig. 9B, whether to output the data of Entry_(j+2^w) to the storage subunit Entry_j is determined according to the corresponding bit of the invalid-data count HoleCnt_(j+2^w) before Entry_(j+2^w), thereby realizing the shifting of the data stored in Entry_(j+2^w). For example, w = 0, 1, ..., Ceiling(log2(N-j), 1); that is, the AND gates connected to the AND gate 21 include Ceiling(log2(N-j), 1) AND gates.
For example, the OR gate 11 is connected to a plurality of latch subunits and to at least one AND gate (e.g., the AND gate 26). For example, the inputs of the OR gate 11 are connected to the latches corresponding one-to-one to the storage subunits Entry_(j+1), Entry_(j+2), Entry_(j+4), ..., Entry_(j+2^w) connected to the storage subunit Entry_j, and its output is connected to the AND gate 26.
For example, the input of the AND gate 26 is also connected to the storage subunit Entry_j itself, and its output is connected to the AND gate 21.
Therefore, the data processing method provided by the embodiments of the present disclosure reduces the specification of the MUX on the input side of the storage subunit Entry_j from "((N-1)-j)+1 select 1" to "Ceiling(log2(N-1-j), 1)+1 select 1", greatly reducing the specification (depth) of the multiplexer before each memory cell, thereby optimizing the timing, greatly reducing the number of signal lines, and avoiding signal line congestion. The effect is particularly remarkable for storage units with large depth and large bit width. A comparison between the conventional "Lazy-Shift" scheme and the scheme of the present disclosure is shown in Table 3 below:
TABLE 3
As can be seen from table 3, an exponential reduction is achieved in both the maximum MUX specification and the total number of shift signal lines; therefore, the benefits in timing convergence, congestion relief, area reduction, and power consumption reduction are all very significant.
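The exponential reduction noted above can be checked directly from the two MUX specifications quoted in the text (a sketch; the function names are illustrative, and Ceiling(log2(·), 1) is read as "round up to an integer, at least 1"):

```python
import math

def mux_traditional(n, j):
    # "((N-1)-j)+1 select 1": any later entry, plus the entry itself.
    return ((n - 1) - j) + 1

def mux_binary(n, j):
    # "Ceiling(log2(N-1-j), 1)+1 select 1": power-of-two sources plus itself.
    rem = n - 1 - j
    if rem <= 0:
        return 1
    return max(math.ceil(math.log2(rem)), 1) + 1

for n in (8, 64, 256):
    print(n, mux_traditional(n, 0), mux_binary(n, 0))
# 8 8 4 / 64 64 7 / 256 256 9 — the MUX depth drops from N to about log2(N)
```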
For the specific operation of the data processing apparatus shown in fig. 10, reference may be made to the description of fig. 3, which is not repeated herein.
It should be noted that in the embodiment of the present disclosure, the data processing apparatus 100 may include more or less circuits or units, and the connection relationship between the respective circuits or units is not limited and may be determined according to actual requirements. The specific configuration of each circuit is not limited, and may be configured by an analog device, a digital chip, or other suitable configurations according to the circuit principle.
For technical effects of the data processing apparatus 100, reference may be made to technical effects of the data processing method provided in the embodiments of the present disclosure, which are not described herein again.
The following points need to be explained:
(1) The drawings of the embodiments of the disclosure only relate to the structures related to the embodiments of the disclosure, and other structures can refer to the common design.
(2) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above description is intended to be merely exemplary embodiments of the present disclosure and is not intended to limit the scope of the present disclosure, which is defined by the claims appended hereto.
Claims (16)
1. A method of data processing, comprising:
acquiring the number of invalid data before the jth data of a storage unit, wherein the storage unit comprises N data, and the jth data is valid data;
determining the number of times that the jth data needs to be shifted forward and the step size of each shift based on the number of invalid data before the jth data, so as to shift the invalid data before the jth data to the address after the jth data;
wherein j is an integer of 1 or more and N or less, and N is an integer of 1 or more.
2. The data processing method according to claim 1, wherein determining the number of times the jth data needs to be shifted forward and the step size of each shift based on the number of invalid data to shift the invalid data into an address subsequent to the jth data comprises:
coding the number of invalid data before the jth data according to a coding rule;
obtaining a step size of each shift according to the coding.
3. The data processing method of claim 2,
carrying out binary coding on the number of invalid data before the jth data;
the step size of each shift is obtained according to the value represented by each bit of the binary code.
4. The data processing method according to claim 3, wherein the number of shifts is represented by the following formula:

SN = Ceiling(log2(HoleCnt_(j-1)), 1)

wherein SN represents the number of times that the jth data needs to be shifted forward, and HoleCnt_(j-1) represents the number of invalid data before the jth data.
5. The data processing method according to claim 4, wherein the bit width after binary coding of the number of invalid data before the jth data is SN bits.
6. The data processing method according to claim 4, wherein the step size of shifting at the ith shifting is determined according to the corresponding bit at the ith shifting;
wherein i is an integer of 1 or more and SN or less.
7. The data processing method of any of claims 1 to 6, further comprising:
and when the jth data is shifted, synchronously shifting the number of invalid data corresponding to the jth data.
8. The data processing method of any of claims 1 to 6, further comprising:
and synchronously shifting the data between the jth data and the invalid data and the jth data.
9. The data processing method according to any one of claims 1 to 6, wherein the storage unit includes N storage sub-units that respectively store the N data,
the jth data is stored in a jth storage subunit, and the jth storage subunit is adjacent to the storage subunit storing the invalid data.
10. The data processing method according to claim 9, wherein when a plurality of the invalid data precede the jth data, a plurality of storage sub-units that respectively store the plurality of the invalid data that precede the jth data are continuously distributed.
11. A data processing apparatus comprising:
an acquisition unit configured to acquire the number of invalid data before the jth data of a storage unit, wherein the storage unit comprises N data, and the jth data is valid data; and
a shift unit configured to determine, based on the number of invalid data before the jth data, the number of times the jth data needs to be shifted forward and the step size of each shift, so as to shift the invalid data before the jth data to the address after the jth data;
wherein j is an integer of 1 or more and N or less, and N is an integer of 1 or more.
12. The data processing apparatus according to claim 11, wherein the storage unit includes N storage sub-units that respectively store the N data,
the jth data is stored in a jth storage subunit.
13. The data processing apparatus according to claim 12, wherein the shift unit comprises a latch,
the latch comprises N latch subunits which are in one-to-one correspondence with the N storage subunits, and the latch is configured to store the number of invalid data corresponding to the data in the N storage subunits respectively.
14. The data processing apparatus of claim 13, wherein the shift unit further comprises a logic circuit,
the logic circuit is connected with the storage unit and the latch and is configured to control the j-th data to shift according to an encoding rule and the number of invalid data before the j-th data.
15. The data processing apparatus according to claim 14, wherein the encoding rule is a binary encoding,
the logic circuit is configured to control the jth data to shift according to the number of invalid data before the jth data and the value represented by each bit of the binary code.
16. The data processing apparatus according to claim 15, wherein the logic circuitry comprises an or gate and at least one and gate;
the at least one AND gate is connected with a plurality of storage subunits and a plurality of latch subunits corresponding to the storage subunits one by one,
the OR gate is connected with the plurality of latch subunits and the at least one AND gate;
the plurality of storage sub-units comprise a jth storage sub-unit and a storage sub-unit determined according to the binary coding.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011412811.4A CN112416260B (en) | 2020-12-03 | 2020-12-03 | Data processing method and data processing device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011412811.4A CN112416260B (en) | 2020-12-03 | 2020-12-03 | Data processing method and data processing device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112416260A CN112416260A (en) | 2021-02-26 |
CN112416260B true CN112416260B (en) | 2022-11-15 |
Family
ID=74774892
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011412811.4A Active CN112416260B (en) | 2020-12-03 | 2020-12-03 | Data processing method and data processing device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112416260B (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100488058C (en) * | 2006-04-05 | 2009-05-13 | 华为技术有限公司 | Method and system for realizing the second intersection and random access memory |
CN101056415B (en) * | 2007-05-10 | 2010-06-30 | 海信集团有限公司 | A method and device for converting the multiplication operation to the addition and shift operation |
CN101685388B (en) * | 2008-09-28 | 2013-08-07 | 北京大学深圳研究生院 | Method and device for executing comparison operation |
CN110212993B (en) * | 2019-06-04 | 2022-01-28 | 郑州大学 | Signal detection method, device, equipment and readable storage medium |
CN111159075B (en) * | 2019-12-31 | 2021-11-05 | 成都海光微电子技术有限公司 | Data transmission method and data transmission device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11995027B2 (en) | Neural processing accelerator | |
CN102708022B (en) | Performing a cyclic redundancy checksum operation responsive to a user-level instruction | |
US20210349692A1 (en) | Multiplier and multiplication method | |
CN107766935B (en) | Multilayer artificial neural network | |
KR20220114519A (en) | Quantum error correction decoding system and method, fault-tolerant quantum error correction system and chip | |
CN111210004B (en) | Convolution calculation method, convolution calculation device and terminal equipment | |
CN112639839A (en) | Arithmetic device of neural network and control method thereof | |
CN111507465A (en) | Configurable convolutional neural network processor circuit | |
US20080195915A1 (en) | Apparatus for pipelined cyclic redundancy check circuit with multiple intermediate outputs | |
CN110490308B (en) | Design method of acceleration library, terminal equipment and storage medium | |
CN112416260B (en) | Data processing method and data processing device | |
US8136010B2 (en) | Apparatus for pipelined cyclic redundancy check circuit with multiple intermediate outputs | |
CN109687877B (en) | Method and device for reducing cascade stage number of multistage cyclic shift network | |
CN111047037A (en) | Data processing method, device, equipment and storage medium | |
CN111404555A (en) | Cyclic shift network control method, system, storage medium and decoder | |
US20230196068A1 (en) | System and method for accelerating rnn network, and storage medium | |
US11435981B2 (en) | Arithmetic circuit, and neural processing unit and electronic apparatus including the same | |
CN113159302A (en) | Routing structure for reconfigurable neural network processor | |
Reddy et al. | ASIC implementation of distributed arithmetic in adaptive FIR filter | |
CN111124358B (en) | Operation method and device of sequence accumulator | |
EP1417768B1 (en) | High performance turbo and viterbi channel decoding in digital signal processors | |
KR102182299B1 (en) | Device and Method for Performing Shift Operation | |
CN109558638B (en) | FFT processor | |
CN114510217A (en) | Method, device and equipment for processing data | |
CN112230884B (en) | Target detection hardware accelerator and acceleration method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||