
CN113157636A - Coprocessor, near data processing device and method - Google Patents

Coprocessor, near data processing device and method

Info

Publication number
CN113157636A
CN113157636A (application CN202110358261.0A)
Authority
CN
China
Prior art keywords
instruction
register
calculation
decoding
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110358261.0A
Other languages
Chinese (zh)
Other versions
CN113157636B (en)
Inventor
山蕊
冯雅妮
蒋林
杨博文
高旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202110358261.0A
Publication of CN113157636A
Application granted
Publication of CN113157636B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867: Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781: On-chip cache; Off-chip memory
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57: Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F7/575: Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Advance Control (AREA)

Abstract

The present invention relates to the field of chip technology, and in particular to a coprocessor, a near data processing apparatus, and a method. The coprocessor comprises: an instruction register unit, which receives and caches a calculation instruction issued by the main processor and sends it to the decoding access unit upon receiving a decoding enable signal; a decoding access unit, which receives and decodes the calculation instruction, reads the operands the instruction requires from their storage locations according to the decoding result, and sends them to the calculation unit upon receiving a calculation enable signal; a calculation unit, which receives the operands sent by the decoding access unit, executes the corresponding calculation operation, and writes the result to the corresponding storage address; and an accumulation register, which stores the accumulated result of the data written by the calculation unit and the value previously held in the accumulation register. The invention reads main memory data directly through the coprocessor, performs the calculation, and writes the result back, thereby improving the operating efficiency of the main processor.

Description

Coprocessor, near data processing device and method
Technical Field
The present invention relates to the field of chip technologies, and in particular, to a coprocessor, a near data processing apparatus, and a method.
Background
In a reconfigurable system, the operands of the main processor's calculation instructions come from registers or from an immediate encoded in the instruction. When main memory data needs to be calculated, the instructions supported by the main processor must first fetch the data from main memory into a register via an LD instruction, or write a register value back to main memory via an ST instruction; main memory data cannot be operated on directly.
Therefore, when the main processor in the prior art calculates on main memory data, data reading is slow and operating efficiency is low.
Those skilled in the art are expected to overcome the above drawbacks.
Disclosure of Invention
Technical problem to be solved
To solve the above problems in the prior art, the present invention provides a coprocessor, a near data processing apparatus, and a method, addressing the slow data reading and low operating efficiency encountered when a prior-art main processor calculates on main memory data.
(II) technical scheme
In order to achieve the purpose, the invention adopts the main technical scheme that:
in a first aspect, an embodiment of the present application provides a coprocessor, including: the device comprises an instruction register unit, a decoding and fetching unit, a calculating unit and an accumulation register;
the instruction register unit is connected with the main processor and used for receiving and caching a calculation instruction issued by the main processor, and sending the calculation instruction to the decoding access unit when receiving a decoding enabling signal sent by the decoding access unit;
the decoding access unit is connected with the instruction register unit and used for receiving the calculation instruction, decoding the calculation instruction, reading an operand required by the execution of the calculation instruction from an operand storage position according to a decoding result, and sending the operand to the calculation unit when receiving a calculation enabling signal sent by the calculation unit;
the computing unit is connected with the decoding access unit and used for receiving the operand sent by the decoding access unit, executing corresponding computing operation and writing a computing result to a corresponding storage address;
the accumulation register is connected with the decoding access unit and the calculation unit, and is used for storing the accumulated result of the data written by the calculation unit and the original value held in the accumulation register.
Optionally, the operand storage locations include main memory, the general registers, and the accumulation register; when the operand storage location is main memory, the coprocessor accesses the data memory using a register indirect addressing mode.
Optionally, the communication flow between the instruction register unit and the main processor includes:
when the cached-instruction valid signal output by the instruction register unit is low, or the feedback signal from the decoding access unit is high, the enables of the valid bit register and the instruction register are asserted;
the main processor issues a calculation instruction; when the valid signal sent by the main processor and the register enable signal are high at the same time, the instruction register unit receives and caches the calculation instruction;
and the instruction register unit sends a feedback output signal to the main processor, indicating that the main processor may issue another calculation instruction to the coprocessor in the next clock cycle.
In a second aspect, an embodiment of the present application provides a near data processing apparatus based on a reconfigurable array processor, the apparatus including the coprocessor according to any one of the above first aspects;
the coprocessor is connected with a main processor, the main processor is a reconfigurable array processor and is used for carrying out first decoding on a currently issued calculation instruction and sending a first decoding result to the coprocessor according to an instruction operation code;
the coprocessor is used for carrying out second decoding according to the preset field of the calculation instruction and reading data to be processed from a memory according to a second decoding result, wherein the memory is one of a main memory, a general register and an accumulation register;
the coprocessor is used for calculating the data to be processed according to the second decoding result and storing the calculated result into a corresponding storage position, wherein the storage position is one of a main memory, a general register and an accumulation register determined based on the calculation instruction.
In a third aspect, an embodiment of the present application provides a near data processing method based on a reconfigurable array processor, where the method includes:
the coprocessor receives a first decoding result sent by the main processor after the main processor performs first decoding on the currently issued calculation instruction;
the coprocessor carries out second decoding according to the preset field of the calculation instruction and reads data to be processed from a memory according to a second decoding result, wherein the memory is one of a main memory, a general register and an accumulation register;
and the coprocessor calculates the data to be processed according to the second decoding result and stores the calculated result into a corresponding storage position, wherein the storage position is one of a main memory, a general register and an accumulation register determined based on the calculation instruction.
Optionally, the coprocessor operates as a three-stage pipeline; wherein,
the first stage comprises receiving and caching a calculation instruction issued by the main processor through an instruction registering unit;
the second stage comprises receiving and decoding the calculation instruction through a decoding and fetching unit, and reading an operand required by the execution of the calculation instruction according to a decoding result;
and the third stage comprises receiving the operand sent by the decoding and fetching unit through a computing unit, executing the computing instruction and writing the computing result to a corresponding storage position.
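The three stages above can be sketched as a behavioural model. This is a pure-Python illustration using a simplified tuple instruction encoding rather than the real 32-bit format; the function and operation names are ours, not from the patent.

```python
# Behavioural sketch of the three-stage flow: instruction register ->
# decode/fetch -> compute and write back. Instructions are hypothetical
# (op, src1, src2, dst) tuples over a register list.

def pipeline(program, regs):
    ops = {"ADD": lambda a, b: a + b, "AND": lambda a, b: a & b}
    for insn in program:
        ir = insn                                  # stage 1: cache the instruction
        op, s1, s2, dst = ir                       # stage 2: decode...
        fetched = (op, regs[s1], regs[s2], dst)    # ...and fetch the operands
        op, a, b, dst = fetched                    # stage 3: compute and write back
        regs[dst] = ops[op](a, b)
    return regs
```

In hardware the three stages overlap across instructions; this sequential model only shows what each stage contributes per instruction.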
Optionally, when the computation instruction is a convolution computation in a neural network,
the main memory is used for storing original image data and a calculation result of the convolutional neural network;
the general register is used for storing operands or intermediate calculation results used in the instruction execution process;
the accumulation register is used for storing the accumulated result of the data written by the calculation unit and the original value held in the accumulation register.
Optionally, receiving, by a computing unit, an operand sent by the decoding and fetching unit, executing the computing instruction, and writing a computing result to a corresponding storage address, includes:
performing convolution calculation of the current channel characteristic diagram and a convolution kernel through a calculating unit;
accumulating the convolution result with the result of the previous convolution held in the accumulation register, and storing the accumulated result in the accumulation register;
and when the convolution calculation for the current image is finished, reading the calculation result from the accumulation register through an STRM instruction, and writing it to main memory or a general register according to the highest bit of the general register field.
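The per-channel accumulation described above can be sketched as follows. This is an illustrative pure-Python model of the dataflow, not the hardware: each channel's convolution is accumulated into an Rm-like variable, and the STRM read-out clears it.

```python
# "Valid" 2-D convolution of one channel with one kernel.
def conv2d_valid(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

def multi_channel_conv(channels, kernels):
    rm = None                                  # accumulation register Rm
    for image, kernel in zip(channels, kernels):
        out = conv2d_valid(image, kernel)      # MAC over the current channel
        if rm is None:
            rm = out
        else:                                  # accumulate with the previous value
            rm = [[a + b for a, b in zip(ra, oa)]
                  for ra, oa in zip(rm, out)]
    result, rm = rm, None                      # STRM: read out and clear Rm
    return result
```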
Optionally, when a non-multiply-accumulate instruction is executed and the highest bit of the register Rd field is 0, the storage location is a general register;
when a non-multiply-accumulate instruction is executed and the highest bit of the register Rd field is 1, the storage location is main memory.
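The write-back rule above can be sketched directly. This is an illustrative model under the assumption, consistent with the encoding described later, that the 4-bit Rd field uses its MSB as the destination select and its low 3 bits as the register index.

```python
# Hypothetical write-back helper: the MSB of the Rd field selects the
# destination for a non-multiply-accumulate instruction.

def write_back(rd_field, result, regs, main_mem):
    idx = rd_field & 0b111
    if rd_field & 0b1000:          # MSB = 1: write to main memory at regs[idx]
        main_mem[regs[idx]] = result
    else:                          # MSB = 0: write to the general register
        regs[idx] = result
```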
Optionally, the preset field is the Func field. When the coprocessor reads the data to be processed from a memory according to the second decoding result and performs the calculation, it determines the type and operand source of the current calculation instruction according to the lowest bit of the Func field, as follows:
when the lowest bit of the Func field of the current calculation instruction is 0: if the highest bit of a register field is 0, the operand of the calculation instruction is read directly from a general register; if the highest bit of the register field is 1, the corresponding location in the data memory is read, using the address stored in the register, to obtain the operand required by the calculation instruction;
when the lowest bit of the Func field of the current calculation instruction is 1, one operand of the calculation instruction comes from the immediate, and the other operand is determined by the highest bit of the Rs field: if the highest bit of the Rs field is 0, the operand is read directly from the general register; if it is 1, the corresponding location in the data memory is read, using the address stored in the register, to obtain the operand;
when the STRM instruction is executed, the instruction operand comes directly from the accumulation register.
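The three operand-source cases can be sketched as follows. This is an illustrative model, not the hardware: the memory is a plain dict, and the assumption that a 4-bit register field carries the source select in its MSB and the register index in its low 3 bits follows the field descriptions in this document.

```python
# MSB of a 4-bit register field selects the source: 0 = general register,
# 1 = main memory at the address held in the register (register indirect
# addressing; the low 3 bits index the register).
def fetch_operand(reg_field, regs, main_mem):
    idx = reg_field & 0b111
    if reg_field & 0b1000:
        return main_mem[regs[idx]]
    return regs[idx]

def fetch_operands(fields, regs, main_mem, acc, is_strm=False):
    if is_strm:                        # STRM: operand from the accumulation register
        return (acc, None)
    if fields["func"] & 1:             # Func LSB = 1: right operand is the immediate
        return (fetch_operand(fields["rs"], regs, main_mem), fields["imm"])
    return (fetch_operand(fields["rs"], regs, main_mem),   # Func LSB = 0:
            fetch_operand(fields["rt"], regs, main_mem))   # both per register MSB
```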
(III) advantageous effects
The invention has the following beneficial effects: the coprocessor provided by the embodiment of the invention reads main memory data directly, performs the calculation, and writes the result back, which increases the access speed to main memory data and thereby improves the operating efficiency of the main processor.
Furthermore, the near data processing device based on the reconfigurable array processor provided by the embodiment of the invention not only improves the processing efficiency of the processor on the main memory data, but also further improves the operation speed and the data processing efficiency through the accumulation register arranged in the coprocessor.
Drawings
FIG. 1 is a block diagram of a coprocessor according to an embodiment of the present invention;
FIG. 2 is a block diagram of a coprocessor according to another embodiment of the present invention;
FIG. 3 is a diagram illustrating an encoding format of a coprocessor instruction according to another embodiment of the present invention;
FIG. 4 is a block diagram of an instruction register unit according to another embodiment of the present invention;
FIG. 5 is a schematic diagram of a decoding access unit according to another embodiment of the present invention;
FIG. 6 is a schematic diagram of a computing unit according to another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a near data processing apparatus based on a reconfigurable array processor according to another embodiment of the present invention;
fig. 8 is a flowchart illustrating a method for processing near data based on a reconfigurable array processor according to still another embodiment of the present invention.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.
All technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Fig. 1 is a coprocessor according to an embodiment of the present invention, and as shown in fig. 1, the coprocessor includes: the device comprises an instruction register unit, a decoding and fetching unit, a calculating unit and an accumulation register;
the instruction register unit is connected with the main processor and used for receiving and caching a calculation instruction issued by the main processor and sending the calculation instruction to the decoding access unit when receiving a decoding enabling signal sent by the decoding access unit;
the decoding access unit is connected with the instruction register unit and used for receiving and decoding the calculation instruction, reading an operand required by the calculation instruction execution from an operand storage position according to a decoding result, and sending the operand to the calculation unit when receiving a calculation enabling signal sent by the calculation unit;
the calculation unit is connected with the decoding access unit and used for receiving the operand sent by the decoding access unit, executing corresponding calculation operation and writing a calculation result to a corresponding storage address;
the accumulation register is connected with the decoding access unit and the calculation unit, and is used for storing the accumulated result of the data written by the calculation unit and the original value held in the accumulation register.
The coprocessor shown in fig. 1 also includes a register file as a general register of the coprocessor.
In the technical solution provided by the embodiment of the present invention shown in fig. 1, the access speed of the main memory data is increased by directly reading the main memory data for calculation and then writing back the main memory data, thereby increasing the operation efficiency of the main processor.
Fig. 2 is a schematic diagram of an architecture of a coprocessor according to another embodiment of the present invention, and the embodiment shown in fig. 2 is described in detail below:
in this embodiment, the co-processing includes an instruction register unit (IR), a decode-fetch unit (dec-fet), a compute unit (ALU), an accumulation register (Rm), and a register file (register file).
The operand storage locations include main memory, the general registers, and the accumulation register. The main memory is used for storing large amounts of original data and calculation results: when the highest bit of the Rs or Rt register field is 1, a source operand is read from main memory, and when the highest bit of the Rd register field is 1, the instruction's calculation result is written back to main memory;
the general registers are used for storing operands and intermediate results used during instruction execution: when the highest bit of the Rs or Rt register field is 0, the source operand is read from a general register, and when the highest bit of the Rd register field is 0, the instruction's calculation result is written back to a general register;
the accumulation register is used for storing the execution result of the multiply-accumulate instruction MAC, accumulated with the data already held in the accumulation register; when the STRM instruction is executed to read the accumulation register, the stored data is read out and the accumulation register is then cleared.
When the operand storage location is main memory, the storage space is accessed using register indirect addressing, because the main memory capacity is large: the address at which the target data is stored in main memory is held in a general register. The accumulation register is used for storing convolution results. The instruction interactions among the units in the figure are described in the introductions of the unit modules below.
FIG. 3 is a schematic diagram of the encoding format of a coprocessor instruction according to another embodiment of the present invention. As shown in FIG. 3, type I is an immediate type instruction and type II is a non-immediate type instruction. Six bits of the 32-bit instruction form the operation code (OP) of the calculation instruction, which serves as the distinguishing mark between instructions supported by the main processor and those supported by the coprocessor; when the operation code is 6'b100111, the calculation instruction is issued to the coprocessor for execution.
Bits [25:22] and [21:18] (the numbers in brackets indicate bit positions) are the general register Rd field and the general register Rs field, respectively, and the [5:0] Func field indicates the calculation type of the instruction. An immediate type instruction differs from a non-immediate type instruction in whether an immediate is used as a source operand. In an immediate type instruction, [17:6] holds a 12-bit immediate used as a source operand, and the lowest bit of the Func field is 1; a non-immediate type instruction uses [17:14] as the general register Rt field, and the lowest bit of its Func field is 0. The remaining 8 bits of a non-immediate type instruction are unused and do not participate in calculation when a non-type-I instruction is executed.
Apart from the lowest bit of the Func field, which determines whether an immediate is used as a source operand, the sources of the other source operands are determined by the highest bit of the corresponding register field: when the highest bit of a register field is 0, the source operand comes from the register; when it is 1, the source operand comes from the main memory address to which the register points. The result of executing an instruction may be written back to a register or may update main memory.
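The field layout described above can be sketched as bit-slicing helpers. The helper names and the constant are ours; the field positions follow the text: OP [31:26], Rd [25:22], Rs [21:18], Rt [17:14] (type II) or a 12-bit immediate [17:6] (type I), Func [5:0].

```python
COPROC_OP = 0b100111  # opcode marking a coprocessor instruction

def bits(word, hi, lo):
    """Extract bits [hi:lo] of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_fields(insn):
    func = bits(insn, 5, 0)
    fields = {
        "op":   bits(insn, 31, 26),
        "rd":   bits(insn, 25, 22),
        "rs":   bits(insn, 21, 18),
        "func": func,
    }
    if func & 1:                 # type I: lowest Func bit set, immediate operand
        fields["imm"] = bits(insn, 17, 6)
    else:                        # type II: Rt register field
        fields["rt"] = bits(insn, 17, 14)
    return fields
```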
Fig. 4 is a schematic structural diagram of an instruction register unit according to another embodiment of the present invention. As shown in fig. 4, the instruction register unit consists of two parts: a valid bit register, which receives the valid signal issued by the main processor, and an instruction register, which receives the instruction issued by the main processor. When the valid bit register and the instruction register are both empty and the valid signal issued by the main processor is asserted, the instruction received by the instruction register is valid and can be used by the next-stage decoding and fetching unit.
The main function of the instruction register unit IR is to cache instructions issued by the main processor to the coprocessor. After the calculation instruction is decoded in the main processor, the lower 26 bits of any instruction whose upper 6-bit operation code is 6'b100111, together with the instruction's valid signal, are sent to the coprocessor; the instruction register unit IR in the coprocessor caches the instruction and valid signal transmitted by the main processor.
When the IR_valid signal is low, or the decode feedback signal ready_in is high, the enables of the valid bit register and the instruction register are asserted, and the instruction and valid signal issued by the main processor can be cached. When the valid signal sent by the main processor and the register enable signal are high at the same time, the instruction register unit has received the instruction sent by the main processor and can finish caching it in the next clock cycle; therefore, at that moment a feedback signal ready_out is sent to the main processor, indicating that the main processor may issue another calculation instruction to the coprocessor in the next clock cycle. Table 1 is the instruction register unit interface information specification table.
TABLE 1
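The IR handshake described above can be modelled cycle by cycle. This is a toy behavioural model, not the RTL: the class and method names are ours, and it only captures the enable condition (IR empty or decode ready) and the ready_out acknowledgement.

```python
# Toy model of the instruction register handshake: accept a new instruction
# when IR_valid is low or the decode stage's ready_in feedback is high;
# ready_out tells the main processor it may issue again next cycle.

class InstructionRegister:
    def __init__(self):
        self.ir_valid = 0
        self.insn = None

    def cycle(self, cpu_valid, cpu_insn, ready_in):
        enable = (self.ir_valid == 0) or (ready_in == 1)
        ready_out = 1 if (cpu_valid and enable) else 0
        if cpu_valid and enable:       # cache the instruction and its valid bit
            self.insn, self.ir_valid = cpu_insn, 1
        elif ready_in:                 # decode stage consumed the instruction
            self.ir_valid = 0
        return ready_out
```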
Fig. 5 is a schematic structural diagram of a decoding access unit according to another embodiment of the present invention. As shown in fig. 5, the decoding access unit includes a decoding control unit, selectors (MUX), a register, and a feedback circuit. The decoding control unit decodes each field of the instruction; the two selectors use the decoded result as their selection control signals to select the two operands required by the instruction calculation, send them to the register, and output them, registered, to the next-stage calculation unit; the feedback circuit controls the whole pipeline.
The main function of the decode fetch unit dec_fet is to decode an instruction and obtain the source operands required for its execution according to the decoding result. When the decode fetch unit is idle or has completed the decode-and-fetch operation for the current calculation instruction, the instruction register unit IR sends the next instruction to it for secondary decoding, and the unit reads all source operands required for instruction execution from the corresponding addresses according to the decoding result. Once all the operands required by the calculation instruction are ready, i.e. the decode-and-fetch operation is complete, it sends a ready feedback signal to the preceding instruction register unit, indicating that decoding and fetching of the current instruction is finished and the next instruction can be processed. At the same time, the decode fetch unit sends the next-stage calculation unit ALU a valid signal indicating that the source operands required for instruction execution are ready, together with the source operands themselves.
The type and operand source of the current calculation instruction are determined by the lowest bit of the Func field, in the following three cases:
firstly, when the lowest bit of a Func field of an instruction is 0, if the highest bit of a register field is 0, directly reading an operand of a calculation instruction from a general register; if the highest bit of the register field is 1, reading a corresponding position in the data storage according to the address stored in the register to obtain the operand of the calculation instruction;
when the lowest bit of the Func field of the instruction is 1, one operand of the calculation instruction comes from the immediate, and the other operand source is determined by the highest bit of the Rs field of the register, which is the same as the first case;
③ when executing the STRM instruction, the instruction operands come directly from the accumulation register Rm.
In the decoding access unit, the En enable signal indicates that the unit is idle or has completed decoding and fetching for an instruction, so the next calculation instruction can be decoded. When all operand sources of the instruction are general registers, the accumulation register, or the immediate in the instruction, the decoding access unit can finish fetching immediately: the En enable signal is pulled high, and the next calculation instruction is received and decoded in the next clock cycle. When one or both operands of the instruction come from main memory, the corresponding register is first accessed according to the low 3 bits of the register Rs or Rt field, and the data stored in that register is used as the address for accessing main memory; an enable signal for accessing main memory is issued at the same time, and the En enable signal is pulled low. The En enable signal is pulled high again once the target data and the valid feedback signal returned by main memory are received, completing the fetch operation for the instruction.
When the En enable signal is asserted, the two operands required by the instruction are assigned to the left data L_Data and the right data R_Data respectively, and the valid signals L_V (left data fetched) and R_V (right data fetched) are asserted. The ALU_O_V signal indicates that all operands of the instruction are ready. When an instruction requiring only left data is executed, such as a NOT instruction, the left data valid signal L_V is assigned directly to ALU_O_V; when an instruction requiring only right data is executed, such as an LUI instruction, the right data valid signal R_V is assigned directly to ALU_O_V; when an instruction requires both operands, ALU_O_V is asserted only when L_V and R_V are asserted simultaneously. After decoding and fetching are complete, the left data L_Data, the right data R_Data, and the operands-ready signal ALU_O_V are output to the ALU calculation unit. When the instruction valid signal IR_valid transmitted from the preceding instruction register unit and the enable signal En are valid at the same time, a feedback signal rdy_out_fetch is sent to the preceding instruction register unit, indicating that the current calculation instruction has completed the decode-and-fetch operation, so a new calculation instruction can be received for decoding and fetching. Table 2 is the decode fetch unit interface information description table.
TABLE 2 (decoding and fetching unit interface description; reproduced as an image in the original publication)
Some calculation instructions need two operands from the main memory at the same time, but the main memory has only one set of read/write access ports. The top level therefore arbitrates between two simultaneous read requests: one request is answered first, its feedback signal and target data are buffered, and then the other request is answered. After both access requests have been served, the target data and feedback signals of the two accesses are returned to the decoding and fetching unit. When a read of the main memory, a general register, or the accumulation register conflicts with a write, the data being written is forwarded directly to the decoding and fetching unit as the read data, which improves access speed while guaranteeing access correctness.
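The single-port arbitration and write-forwarding behavior above can be modeled as a small function. This is a behavioral sketch under assumed names; the patent does not specify the arbitration priority beyond "one request is responded to first", so fixed first-request priority is an assumption.

```python
# Hypothetical model of the top-level arbitration described above: one read
# port, two simultaneous read requests served back to back, and a read that
# collides with a write to the same address satisfied by forwarding.

def arbitrate_reads(mem, addr_a, addr_b, write=None):
    """Serve two reads over one port; `write` is (addr, data) or None."""
    def read(addr):
        if write is not None and write[0] == addr:
            return write[1]                   # forward the data being written
        return mem[addr]

    data_a = read(addr_a)                     # first request wins arbitration
    buffered = data_a                         # its response is buffered ...
    data_b = read(addr_b)                     # ... while the second is served
    return buffered, data_b                   # both returned to the decode unit
```

With a concurrent write to `addr_b`, the second read sees the forwarded write data rather than the stale memory contents.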
Fig. 6 is a schematic diagram of a calculation unit according to another embodiment of the present invention. As shown in Fig. 6, the calculation unit includes a register, a selector MUX, an arithmetic logic unit ALU, and a feedback circuit. The register receives and delays by one clock cycle the destination register field Rd and the function field Func transmitted by the decoding unit; these serve as control signals for the selector to choose the destination address to which the ALU calculation result is written back. The feedback circuit controls the pipeline.
The main function of the calculation unit ALU is to execute an instruction and write the execution result back to the destination address. When the calculation unit is idle or has finished the calculation of the previous instruction, and it receives from the preceding decoding and fetching unit (dec_fet) the valid signal indicating that all source operands of the instruction are ready, it performs the corresponding calculation according to the high 5 bits of the instruction's Func field. After the calculation completes, a ready feedback signal is sent to the decoding and fetching unit, indicating that the current instruction has finished and the next instruction can be executed. Table 3 lists the calculation types corresponding to different Func field values.
TABLE 3
Calculation type        Func[5:1]   Calculation type                  Func[5:1]
Addition ADD            00001       Shift right SRL                   00111
Subtraction SUB         00010       Multiplication MUL                01000
AND                     00011       Multiply-accumulate MAC           01001
OR                      00100       Immediate left shift LUI          01010
NOT                     00101       Set less than SLT                 01011
Shift left SLL          00110       Read accumulation register STRM   01100
After the calculation completes, the calculation result may be written back to a general register, the accumulation register, or the main memory, as determined by the highest bit of the Rd field of the executed calculation instruction. When the multiply-accumulate instruction MAC is executed, the result is written back to the accumulation register Rm. When a non-MAC instruction is executed and the highest bit of the Rd field is 0, the result is written back to the Rd register; when a non-MAC instruction is executed and the highest bit of the Rd field is 1, the result is written back to the main-memory location addressed by the contents of the Rd register.
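The write-back rule above can be sketched as a small dispatcher. This is an illustrative model: the 4-bit Rd layout (1 select bit plus 3 index bits) follows the "highest bit selects, low 3 bits index" description, but the helper's name, the Func[5:1] encoding taken from Table 3, and the accumulator representation are assumptions.

```python
# Hypothetical sketch of the write-back rule above: MAC results accumulate
# into Rm; otherwise Rd's highest bit selects a general register (0) or a
# register-indirect main-memory location (1).

def write_back(func, result, rd, regs, mem, acc):
    MAC = 0b01001                             # Func[5:1] for multiply-accumulate
    if func == MAC:
        acc[0] += result                      # accumulate into Rm
    elif (rd >> 3) & 1 == 0:                  # Rd msb 0 -> general register
        regs[rd & 0b111] = result
    else:                                     # Rd msb 1 -> main memory,
        mem[regs[rd & 0b111]] = result        #   address held in the Rd register
```

Note the indirect case: the Rd register supplies the memory address, matching the register-indirect addressing used elsewhere in the design.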
In the calculation unit, the enable signal En indicates that the unit is idle or has finished calculating the current instruction and writing back its result; when En is asserted, the unit can accept new operands from the preceding decoding and fetching unit and perform calculation and write-back. Upon receiving the left data L_Data, the right data R_Data, and the operands-ready signal ALU_O_V from the decoding and fetching unit, the calculation unit immediately performs the calculation selected by the one-cycle-delayed Func field, outputs the result ALU_R, and generates an enable signal for updating a general register, the accumulation register, or the main memory. When the result updates a general register or the accumulation register, the write completes in the clock cycle following the update enable. When the result updates the main memory, the unit first accesses the register selected by the low 3 bits of the one-cycle-delayed Rd field, uses the data stored in that register as the destination main-memory address, issues the main-memory update enable together with the calculation result, and pulls En low; En is asserted again only after the main memory returns a feedback signal confirming that the update is complete. When the operands-ready signal ALU_O_V from the decoding and fetching unit and the enable signal En are asserted simultaneously, a feedback signal rdy_out_ALU is sent to the decoding and fetching unit, indicating that the current instruction has completed its calculation and write-back operations.
The specific calculation unit interface information is shown in Table 4.
In convolution calculation, the multiply-accumulate operation stores and accumulates the products of each pixel in a region with the convolution kernel. The multiply-accumulate instruction MAC, designed around this characteristic, multiplies a pixel of the input image by a convolution kernel element, writes the result into the accumulation register Rm, and automatically accumulates it with the result of the next MAC instruction. When the convolution of a region of the input image is complete and the result is needed as input to the next convolution or pooling layer, the accumulated convolution result of the region is output by executing the STRM instruction. Executing STRM reads the data out of the accumulation register, outputs it to a general register or the main memory according to the highest bit of the Rd field, and clears the accumulation register in the clock cycle after the result is read out, so that the convolution accumulation of other regions of the input image is not affected by the previous region's result.
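The MAC/STRM pattern above can be illustrated with a minimal sketch. The function is an assumption for exposition (the hardware issues one MAC per pixel-kernel pair, then one STRM); only the accumulate-then-read-and-clear semantics comes from the text.

```python
# Hypothetical illustration of the MAC/STRM pattern: each pixel-kernel
# product accumulates in the accumulation register Rm, and STRM reads the
# region's result out and clears the register for the next region.

def conv_region(pixels, kernel, acc):
    for p, k in zip(pixels, kernel):
        acc[0] += p * k                       # MAC: multiply and accumulate in Rm
    result = acc[0]                           # STRM: read the accumulated result
    acc[0] = 0                                #       and clear for the next region
    return result
```

Because STRM clears the register, back-to-back regions never see each other's partial sums, which is the correctness property the paragraph above emphasizes.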
Table 4 is a calculation unit interface information description table.
TABLE 4 (calculation unit interface description; reproduced as an image in the original publication)
FIG. 7 is a schematic diagram of a near data processing apparatus based on a reconfigurable array processor according to another embodiment of the present invention. As shown in FIG. 7, the apparatus can be used for any data access to, and computation on, a main memory, and includes the coprocessor described in any of the above embodiments;
the coprocessor is connected with the main processor, the main processor is a reconfigurable array processor and is used for carrying out first decoding on a currently issued calculation instruction and sending a first decoding result to the coprocessor according to an instruction operation code;
the coprocessor is used for carrying out secondary decoding according to the preset field of the calculation instruction and reading data to be processed from the main memory according to a second decoding result;
the coprocessor is used for calculating the data to be processed according to the second decoding result and storing the calculated result into the main memory.
It should be noted that the storage location of the calculation result is determined by the calculation instruction; the data to be processed may reside in the main memory, a general register, or the accumulation register, and likewise the calculation result may be stored in the main memory, a general register, or the accumulation register, which is not repeated here.
In this embodiment, the coprocessor is hierarchically controlled by the main processor. When the coprocessor is to execute an instruction, the main processor PE first performs a first decoding and determines, from the instruction operation code, whether the currently issued instruction is a calculation instruction to be executed in the coprocessor. After the calculation instruction is issued to the coprocessor, a second decoding is performed according to the instruction's Func field, and the corresponding control signals are generated from the decoding result to complete the calculation.
This embodiment can be applied to convolution calculation in neural networks. Exploiting the high parallelism of the array processor, neural network calculations are mapped onto the array structure in parallel, which raises calculation speed, but data still has to be read from the main memory into registers for processing. To further raise the network's calculation speed, the coprocessor is combined with the array processor. Targeting the calculation characteristics of convolution in neural networks, a multiply-accumulate instruction and an accumulation register are designed into the coprocessor: pixels of the input feature map from the main memory can be multiplied directly by the elements of the convolution kernel and accumulated directly in the accumulation register with the result of the previous convolution, and finally the result in the accumulation register is read out via the STRM instruction.
The near data processing device based on a reconfigurable array processor provided by this embodiment of the invention not only improves the access speed of main-memory data, but also, through the accumulation register built into the coprocessor, raises the operation speed of the convolutional neural network and improves image processing efficiency.
Fig. 8 is a flowchart illustrating a method for processing near data based on a reconfigurable array processor according to still another embodiment of the present invention, as shown in fig. 8, the method includes:
the coprocessor receives a first decoding result sent by the main processor after the main processor performs first decoding on the currently issued calculation instruction;
the coprocessor carries out second decoding according to the preset field of the calculation instruction and reads data to be processed from a memory according to a second decoding result, wherein the memory is one of a main memory, a general register and an accumulation register;
and the coprocessor calculates the data to be processed according to the second decoding result and stores the calculated result into a corresponding storage position, wherein the storage position is one of a main memory, a general register and an accumulation register determined based on the calculation instruction.
In this embodiment, the coprocessor may adopt a three-stage pipeline; wherein,
the first stage comprises receiving and caching a calculation instruction issued by the main processor through an instruction registering unit;
the second stage comprises receiving and decoding the calculation instruction by a decoding and fetching unit, and reading an operand required by the calculation instruction execution according to a decoding result; the operands may be stored in main memory, general purpose registers, or dedicated accumulator registers; when the calculation instruction is convolution calculation in a neural network, the operation number comprises a convolution kernel, a channel characteristic diagram and a result of the convolution calculation of the previous step, wherein the result of the convolution calculation of the previous step is stored in an accumulation register.
The third stage comprises receiving operands sent by the decoding and fetching unit through the computing unit, executing a computing instruction, and writing a computing result to a corresponding storage position; wherein the storage location comprises a main memory, a general purpose register, or an accumulator register.
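The three stages above can be summarized in a behavioral (not cycle-accurate) sketch: it processes instructions sequentially rather than overlapping them, and the instruction format, register count, and opcode names are illustrative assumptions.

```python
# Hypothetical behavioral model of the three-stage flow: register the
# instruction, decode and fetch operands, then execute and write back.
# Not cycle-accurate; instruction encoding is an assumed dict format.

def run_pipeline(program, regs, acc):
    for inst in program:                      # stage 1: buffered instruction
        a = regs[inst["rs"]]                  # stage 2: decode & fetch operands
        b = regs[inst["rt"]]
        op = inst["op"]                       # stage 3: execute & write back
        if op == "ADD":
            regs[inst["rd"]] = a + b
        elif op == "MUL":
            regs[inst["rd"]] = a * b
        elif op == "MAC":
            acc[0] += a * b                   # accumulate into Rm
        elif op == "STRM":
            regs[inst["rd"]], acc[0] = acc[0], 0   # read out Rm and clear it
```

A short MAC/MAC/STRM sequence exercises exactly the convolution pattern the method describes.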
In this embodiment, when the calculation instruction is convolution calculation in a neural network, receiving, by a calculation unit, an operand sent by the decode and fetch unit, executing the calculation instruction, and writing a calculation result to a corresponding storage address, includes:
performing convolution calculation of the current channel characteristic diagram and a convolution kernel through a calculating unit;
accumulating the convolution calculation result and the calculation result of the last convolution in the accumulation register, and storing the accumulated result in the accumulation register;
and when the current image convolution calculation is finished, reading a calculation result in the accumulation register through an STRM instruction, and writing the calculation result to a corresponding storage address according to the highest bit of the register field.
In this embodiment, when a non-multiply-accumulate instruction is executed and the highest bit of the Rd field is 0, the storage location is a general register;
when a non-multiply-accumulate instruction is executed and the highest bit of the Rd field is 1, the storage location is the main memory.
In this embodiment, the preset field is the Func field. The coprocessor reads the data to be processed from the corresponding storage location according to the second decoding result and, when performing the calculation, determines the type and operand source of the current calculation instruction from the lowest bit of the Func field, as follows:
When the lowest bit of the Func field of the current calculation instruction is 0: if the highest bit of a register field is 0, the operand of the calculation instruction is read directly from the general register; if the highest bit of the register field is 1, the operand is obtained by reading the main-memory location addressed by the contents of that register;
When the lowest bit of the Func field of the current calculation instruction is 1: one operand comes from an immediate, and the other is determined by the highest bit of the Rs field; if the highest bit of the Rs field is 0, the operand is read directly from the general register; if the highest bit of the Rs field is 1, the operand is obtained by reading the main-memory location addressed by the contents of the register;
when executing the STRM instruction, the instruction operands come directly from the accumulation register Rm.
The method of the embodiment adopts the near data processing device based on the reconfigurable array processor, thereby improving the operation speed and the data processing efficiency.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A coprocessor, the coprocessor comprising: the device comprises an instruction register unit, a decoding and fetching unit, a calculating unit and an accumulation register;
the instruction register unit is connected with the main processor and used for receiving and caching a calculation instruction issued by the main processor, and sending the calculation instruction to the decoding access unit when receiving a decoding enabling signal sent by the decoding access unit;
the decoding access unit is connected with the instruction register unit and used for receiving the calculation instruction, decoding the calculation instruction, reading an operand required by the execution of the calculation instruction from an operand storage position according to a decoding result, and sending the operand to the calculation unit when receiving a calculation enabling signal sent by the calculation unit;
the computing unit is connected with the decoding access unit and used for receiving the operand sent by the decoding access unit, executing corresponding computing operation and writing a computing result to a corresponding storage address;
the accumulation register is connected with the decoding access unit and the calculation unit and is used for storing the data written in the calculation unit and the accumulation result of the original value in the accumulation register.
2. The coprocessor of claim 1, wherein the locations where the operands reside comprise the main memory, the general registers, and the accumulation register; and when the operand storage location is the main memory, the coprocessor accesses the data storage device in a register-indirect addressing mode.
3. The coprocessor of claim 1, wherein the instruction registering unit's communication flow with the main processor comprises:
when the valid signal for the cached calculation instruction output by the instruction registering unit is at a low level, or the feedback signal of the decoding access unit is at a high level, the enables of the valid-bit register and the instruction register are asserted;
the main processor sends a calculation instruction; when the valid signal sent by the main processor and the register enable signal are high simultaneously, the instruction registering unit receives and caches the calculation instruction;
and the instruction registering unit sends a feedback output signal to the main processor, the feedback output signal indicating that the main processor may send another calculation instruction to the coprocessor when the next clock cycle arrives.
4. A near data processing device based on a reconfigurable array processor, characterized in that it comprises a coprocessor according to any of the preceding claims 1 to 3;
the coprocessor is connected with a main processor, the main processor is a reconfigurable array processor and is used for carrying out first decoding on a currently issued calculation instruction and sending a first decoding result to the coprocessor according to an instruction operation code;
the coprocessor is used for carrying out second decoding according to the preset field of the calculation instruction and reading data to be processed from a memory according to a second decoding result, wherein the memory is one of a main memory, a general register and an accumulation register;
the coprocessor is used for calculating the data to be processed according to the second decoding result and storing the calculated result into a corresponding storage position, wherein the storage position is one of a main memory, a general register and an accumulation register determined based on the calculation instruction.
5. A method for near data processing based on a reconfigurable array processor, the method comprising:
the coprocessor receives a first decoding result sent by the main processor after the main processor performs first decoding on the currently issued calculation instruction;
the coprocessor carries out second decoding according to the preset field of the calculation instruction and reads data to be processed from a memory according to a second decoding result, wherein the memory is one of a main memory, a general register and an accumulation register;
and the coprocessor calculates the data to be processed according to the second decoding result and stores the calculated result into a corresponding storage position, wherein the storage position is one of a main memory, a general register and an accumulation register determined based on the calculation instruction.
6. The near data processing method of claim 5, wherein the coprocessor employs a three-stage pipeline approach; wherein,
the first stage comprises receiving and caching a calculation instruction issued by the main processor through an instruction registering unit;
the second stage comprises receiving and decoding the calculation instruction through a decoding and fetching unit, and reading an operand required by the execution of the calculation instruction according to a decoding result;
and the third stage comprises receiving the operand sent by the decoding and fetching unit through a computing unit, executing the computing instruction and writing the computing result to a corresponding storage position.
7. The near data processing method of claim 6, wherein when the computation instruction is a convolution computation in a neural network,
the main memory is used for storing original image data and a calculation result of the convolutional neural network;
the general register is used for storing operands or intermediate calculation results used in the instruction execution process;
the accumulation register is used for storing the data written in by the calculation unit and the accumulation result of the original value in the accumulation register.
8. The method of claim 7, wherein receiving, by a computing unit, an operand sent by the decode fetch unit and executing the computation instruction, and writing a computation result to a corresponding memory address comprises:
performing convolution calculation of the current channel characteristic diagram and a convolution kernel through a calculating unit;
accumulating the convolution calculation result and the calculation result of the last convolution in the accumulation register, and storing the accumulated result in the accumulation register;
and when the current image convolution calculation is finished, reading a calculation result in the accumulation register through an STRM instruction, and writing the calculation result into a main memory or a general register according to the highest bit of a field of the general register.
9. The near data processing method of claim 8,
when the non-multiply-accumulate instruction is executed and the highest bit of the Rd field of the register is 0, the storage location is a general purpose register,
when the non-multiply-accumulate instruction is executed and the highest bit of the register Rd field is 1, the location is main memory.
10. The method of claim 5, wherein the predetermined field is a Func field, and the determining, by the coprocessor, the type and operand source of the current computing instruction according to the lowest bit of the Func field during the computing by reading the data to be processed from the memory according to the second decoding result comprises:
when the lowest bit of the Func field of the current calculation instruction is 0, if the highest bit of the field of the general register is 0, directly reading the operand of the calculation instruction from the general register; if the highest bit of the register field is 1, reading a corresponding position in the data storage according to the address stored in the register to obtain an operand required by the calculation instruction;
when the lowest bit of the Func field of the current calculation instruction is 1, one operand of the calculation instruction is from an immediate number, the other operand is determined by the highest bit of the Rs field, and if the highest bit of the Rs field is 0, the operand of the calculation instruction is directly read from the general register; if the highest bit of the register field is 1, reading a corresponding position in the data storage according to the address stored in the register to obtain the operand of the calculation instruction;
when executing the STRM instruction, the instruction operands come directly from the accumulator register.
CN202110358261.0A 2021-04-01 2021-04-01 Coprocessor, near data processing device and method Active CN113157636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110358261.0A CN113157636B (en) 2021-04-01 2021-04-01 Coprocessor, near data processing device and method


Publications (2)

Publication Number Publication Date
CN113157636A true CN113157636A (en) 2021-07-23
CN113157636B CN113157636B (en) 2023-07-18

Family

ID=76886130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110358261.0A Active CN113157636B (en) 2021-04-01 2021-04-01 Coprocessor, near data processing device and method

Country Status (1)

Country Link
CN (1) CN113157636B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113702700A (en) * 2021-09-01 2021-11-26 上海交通大学 Special integrated circuit for calculating electric energy and electric energy quality parameters
CN113743599A (en) * 2021-08-08 2021-12-03 苏州浪潮智能科技有限公司 Operation device and server of convolutional neural network
CN116610362A (en) * 2023-04-27 2023-08-18 合芯科技(苏州)有限公司 Method, system, equipment and storage medium for decoding instruction set of processor

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62145340A (en) * 1985-12-20 1987-06-29 Toshiba Corp Cache memory control system
JPH01134657A (en) * 1987-11-20 1989-05-26 Hitachi Ltd Co-processor system
JPH06309270A (en) * 1993-04-22 1994-11-04 Fujitsu Ltd Interruption control circuit built in dpram
JPH08305568A (en) * 1995-04-28 1996-11-22 Matsushita Electric Ind Co Ltd Information processing unit
WO2000068783A2 (en) * 1999-05-12 2000-11-16 Analog Devices, Inc. Digital signal processor computation core
JP2004005738A (en) * 1998-03-11 2004-01-08 Matsushita Electric Ind Co Ltd Data processor, and instruction set expansion method
CN102109979A (en) * 2009-12-28 2011-06-29 索尼公司 Processor, co-processor, information processing system, and method thereof
JP2014194660A (en) * 2013-03-28 2014-10-09 Fujitsu Ltd Calculation method, calculation program and calculation device
CN109144573A (en) * 2018-08-16 2019-01-04 胡振波 Two-level pipeline framework based on RISC-V instruction set
CN111159094A (en) * 2019-12-05 2020-05-15 天津芯海创科技有限公司 RISC-V based near data stream type calculation acceleration array
CN111930426A (en) * 2020-08-14 2020-11-13 西安邮电大学 Reconfigurable computing dual-mode instruction set architecture and application method thereof
CN112099762A (en) * 2020-09-10 2020-12-18 上海交通大学 Co-processing system and method for quickly realizing SM2 cryptographic algorithm
CN112181496A (en) * 2020-09-30 2021-01-05 中国电力科学研究院有限公司 AI extended instruction execution method and device based on open source instruction set processor, storage medium and electronic equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王士元: "TMS32010高速处理器及应用简介", 小型微型计算机系统, no. 03 *
蒋林,王杏军,刘镇弢,宋辉: "基于SystemC的可重构阵列处理器模型", 《西安邮电大学学报》, vol. 21, no. 3, pages 73 - 78 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743599A (en) * 2021-08-08 2021-12-03 苏州浪潮智能科技有限公司 Operation device and server of convolutional neural network
CN113743599B (en) * 2021-08-08 2023-08-25 苏州浪潮智能科技有限公司 Computing device and server of convolutional neural network
CN113702700A (en) * 2021-09-01 2021-11-26 上海交通大学 Special integrated circuit for calculating electric energy and electric energy quality parameters
CN113702700B (en) * 2021-09-01 2022-08-19 上海交通大学 Special integrated circuit for calculating electric energy and electric energy quality parameters
CN116610362A (en) * 2023-04-27 2023-08-18 合芯科技(苏州)有限公司 Method, system, equipment and storage medium for decoding instruction set of processor
CN116610362B (en) * 2023-04-27 2024-02-23 合芯科技(苏州)有限公司 Method, system, equipment and storage medium for decoding instruction set of processor

Also Published As

Publication number Publication date
CN113157636B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN113157636B (en) Coprocessor, near data processing device and method
CN109522254B (en) Arithmetic device and method
JP3871883B2 (en) Method for calculating indirect branch targets
US7707397B2 (en) Variable group associativity branch target address cache delivering multiple target addresses per cache line
US6151662A (en) Data transaction typing for improved caching and prefetching characteristics
US6216219B1 (en) Microprocessor circuits, systems, and methods implementing a load target buffer with entries relating to prefetch desirability
JP4195006B2 (en) Instruction cache way prediction for jump targets
US8943300B2 (en) Method and apparatus for generating return address predictions for implicit and explicit subroutine calls using predecode information
CN102750133B (en) 32-Bit triple-emission digital signal processor supporting SIMD
CN104657110B (en) Instruction cache with fixed number of variable length instructions
US20030154349A1 (en) Program-directed cache prefetching for media processors
JPH10124391A (en) Processor and method for executing store convergence by merged store operation
JPH08234980A (en) Branch estimation system using branching destination buffer
US20150370569A1 (en) Instruction processing system and method
US11301250B2 (en) Data prefetching auxiliary circuit, data prefetching method, and microprocessor
US5898852A (en) Load instruction steering in a dual data cache microarchitecture
US6460132B1 (en) Massively parallel instruction predecoding
US9569219B2 (en) Low-miss-rate and low-miss-penalty cache system and method
US5900012A (en) Storage device having varying access times and a superscalar microprocessor employing the same
US5951671A (en) Sharing instruction predecode information in a multiprocessor system
US5678016A (en) Processor and method for managing execution of an instruction which determine subsequent to dispatch if an instruction is subject to serialization
US10922082B2 (en) Branch predictor
JP2001522082A (en) Approximately larger number of branch predictions using smaller number of branch predictions and alternative targets
US20050149708A1 (en) System, method and device for queuing branch predictions
CN113886286B (en) Two-dimensional structure compatible data reading and writing system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant