
CN113468096A - Data sharing system and data sharing method thereof - Google Patents

Data sharing system and data sharing method thereof

Info

Publication number
CN113468096A
Authority
CN
China
Prior art keywords
module
data
data sharing
unit
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110668344.XA
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110668344.XA priority Critical patent/CN113468096A/en
Publication of CN113468096A publication Critical patent/CN113468096A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/167Interprocessor communication using a common memory, e.g. mailbox
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A data sharing system includes a storage module and at least two processing modules, wherein the at least two processing modules share the storage module and communicate with each other to realize data sharing. A data sharing method of the data sharing system is also provided. The system and method can reduce storage communication overhead and effectively reduce data access latency.

Description

Data sharing system and data sharing method thereof
Technical Field
The present disclosure relates to a sharing system, and more particularly, to a data sharing system and a data sharing method thereof.
Background
With the continuous development of artificial intelligence, machine learning and deep neural network techniques are widely applied, for example in speech recognition, image processing, data analysis, advertisement recommendation systems, autonomous driving, and so on. The wide applicability of these techniques is inseparable from their ability to handle large amounts of data well. However, as the amount of data grows, the amount of computation grows with it, so how to organize and store data efficiently becomes a problem that must be faced when designing a system on chip (SoC).
As shown in fig. 1, in an existing SoC, when an application-specific integrated circuit (ASIC) module performs machine learning (deep learning or other) computations, its data is usually stored in a private static random access memory (SRAM); the data is placed into an off-chip dynamic random access memory (DRAM) or an on-chip SRAM (cache-like) through an advanced extensible interface (AXI) bus and only then, indirectly, exchanged with other modules. This increases system overhead, data read latency, and the energy consumed by data sharing and interaction.
Disclosure of Invention
Based on the above problems, a primary objective of the present disclosure is to provide a data sharing system and a data sharing method thereof, which are used to solve at least one of the above technical problems.
In order to achieve the above object, as one aspect of the present disclosure, the present disclosure proposes a data sharing system including a storage module and at least two processing modules, wherein:
at least two processing modules share a storage module;
at least two processing modules communicate through preset rules to realize data sharing.
In some embodiments of the present disclosure, the preset rules include a communication protocol, a transfer protocol, a handshake protocol, and/or a bus protocol.
In some embodiments of the present disclosure, the communicating via the preset rule includes: the at least two processing modules comprise a first processing module and a second processing module, the first processing module sends a request signal and a corresponding data address to the second processing module, and the second processing module replies an effective signal and data to the first processing module according to the request signal and the corresponding data address to realize data sharing.
In some embodiments of the disclosure, the at least two processing modules comprise physical processors.
In some embodiments of the disclosure, the physical processor comprises a neural network processor.
In some embodiments of the present disclosure, the neural network processor comprises means for performing an artificial neural network forward operation.
In some embodiments of the present disclosure, the apparatus for performing artificial neural network forward operation includes an instruction cache unit and a direct memory access unit, wherein:
the instruction cache unit is used for reading in the instructions through the direct memory access unit and caching the read instructions.
In some embodiments of the disclosure, the above apparatus for performing artificial neural network forward operation further includes:
the controller unit is used for reading the instruction from the instruction cache unit and decoding the instruction into the microinstruction.
In some embodiments of the present disclosure, the apparatus for performing artificial neural network forward operation further includes an H-tree module, a master operation module, and a plurality of slave operation modules, wherein:
the H-tree module is used for transmitting the input neuron vectors of the layer to all the slave operation modules at the stage at which the calculation of reverse training of each layer of the neural network starts, and for splicing the output neuron values of all the slave operation modules step by step into an intermediate result vector after the calculation process of the slave operation modules is completed;
and the main operation module is used for finishing subsequent calculation by utilizing the intermediate result vector.
In some embodiments of the present disclosure, the direct memory access unit is further configured to write data from an external address space into corresponding data cache units of the master operation module and each slave operation module, or read data from the data cache units to the external address space.
In some embodiments of the present disclosure, the at least two processing modules include two processors with different structures; one of the two processors of the mutually different structure is a neural network processor.
In some embodiments of the disclosure, the at least two processing modules comprise at least two processor cores of a processor; the at least two processor cores are of the same/different structure.
In some embodiments of the present disclosure, the at least two processing modules include at least two arithmetic units of a processor core; the at least two arithmetic units are arithmetic units with the same/different structures.
In some embodiments of the present disclosure, the sharing system further includes:
at least two storage units respectively connected with at least one of the at least two operation units, wherein any one of the at least two operation units is connected with one or more storage units; and the at least two storage units share the storage module.
In some embodiments of the disclosure, the at least two operation units share one storage unit, or each exclusively uses its own storage unit, or some of them share a storage unit while others exclusively use their own storage units.
In some embodiments of the disclosure, the at least two processing modules include three arithmetic units of a processor core, and the number of the at least two storage units is two, wherein two of the arithmetic units are simultaneously connected to one of the storage units, and the remaining arithmetic unit is connected to the other storage unit.
In order to achieve the above object, as another aspect of the present disclosure, the present disclosure proposes a data sharing method including the steps of:
the at least two processing modules communicate through a preset rule to realize data sharing;
wherein the at least two processing modules share the storage module.
In some embodiments of the present disclosure, the preset rules include a communication protocol, a transfer protocol, a handshake protocol, and/or a bus protocol.
In some embodiments of the present disclosure, the communicating via the preset rule includes: the at least two processing modules comprise a first processing module and a second processing module, the first processing module sends a request signal and a corresponding data address to the second processing module, and the second processing module replies an effective signal and data to the first processing module according to the request signal and the corresponding data address to realize data sharing.
In some embodiments of the disclosure, the at least two processing modules comprise physical processors.
In some embodiments of the disclosure, the physical processor comprises a neural network processor.
In some embodiments of the present disclosure, the neural network processor comprises means for performing an artificial neural network forward operation.
In some embodiments of the present disclosure, the apparatus for performing artificial neural network forward operation includes an instruction cache unit and a direct memory access unit, wherein:
the instruction cache unit reads in the instruction through the direct memory access unit and caches the read-in instruction.
In some embodiments of the disclosure, the apparatus for performing artificial neural network forward operations described above further includes a controller unit that reads an instruction from the instruction cache unit and decodes the instruction to generate a microinstruction.
In some embodiments of the present disclosure, the apparatus for performing artificial neural network forward operation further includes an H-tree module, a master operation module, and a plurality of slave operation modules, wherein:
the H-tree module transmits the input neuron vectors of the layer to all the slave operation modules at the stage at which the calculation of reverse training of each layer of the neural network starts, and splices the output neuron values of all the slave operation modules into an intermediate result vector step by step after the calculation process of the slave operation modules is completed;
and the main operation module completes subsequent calculation by using the intermediate result vector.
In some embodiments of the present disclosure, the direct memory access unit further writes data from the external address space to the corresponding data cache units of the master operation module and each of the slave operation modules, or reads data from the data cache units to the external address space.
In some embodiments of the present disclosure, the at least two processing modules include two processors with different structures; one of the two processors of the mutually different structure is a neural network processor.
In some embodiments of the disclosure, the at least two processing modules comprise at least two processor cores of a processor; the at least two processor cores are of the same/different structure.
In some embodiments of the present disclosure, the at least two processing modules include at least two arithmetic units of a processor core; the at least two arithmetic units are arithmetic units with the same/different structures.
In some embodiments of the present disclosure, the data sharing method further includes:
at least two storage units respectively connected with at least one of the at least two operation units, wherein any one of the at least two operation units is connected with one or more storage units; and the at least two storage units share the storage module.
In some embodiments of the disclosure, the at least two operation units share one storage unit, or each exclusively uses its own storage unit, or some of them share a storage unit while others exclusively use their own storage units.
In some embodiments of the disclosure, the at least two processing modules include three arithmetic units of a processor core, and the number of the at least two storage units is two, wherein two of the arithmetic units are simultaneously connected to one of the storage units, and the remaining arithmetic unit is connected to the other storage unit.
The data sharing system and the data sharing method thereof have the following beneficial effects:
1. at least two processing modules in the system can communicate directly through a preset rule to realize data sharing, so that the shared data does not have to pass through the shared storage module; this reduces storage communication overhead and effectively reduces data access latency;
2. the at least two processing modules of the present disclosure may include processors with different structures and cores within those processors, so that a storage module shared outside the processors (of the same or different structures) and a core-external storage module corresponding to the cores can both be maintained;
3. with the storage units of the present disclosure, each storage unit can be directly accessed by one or more operation units without reducing the original storage efficiency or increasing the original storage cost; the number of operation units does not need to be fixed or agreed in advance, asymmetric structures are supported, and the configuration can be adjusted as required, which reduces the number of on-chip/off-chip memory access interactions and reduces power consumption;
4. the present disclosure allows a private storage module, exclusively used by one arithmetic unit, to transfer data to other arithmetic units. This protects data privacy while still allowing fast data interaction, improves data utilization, avoids both the resource waste of storing multiple copies of the same data on chip and the memory access overhead of repeatedly reading the same data, and thereby further increases memory access speed and reduces memory access power consumption.
Drawings
FIG. 1 is a block diagram of a prior art data processing system;
fig. 2 is a schematic structural diagram of a data sharing system according to an embodiment of the present disclosure;
FIG. 3 is a block diagram of a processor in the system of FIG. 2;
FIG. 4 is a schematic diagram of the structure of the H-tree module of FIG. 3;
FIG. 5 is a schematic diagram of the main operation module shown in FIG. 3;
FIG. 6 is a schematic diagram of the slave computing module of FIG. 3;
fig. 7 is a schematic structural diagram of a data sharing system according to another embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a data sharing system according to yet another embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a data sharing system according to an embodiment of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
The present disclosure provides a method for realizing rapid data interaction between a machine learning ASIC arithmetic unit and the other modules of an SoC. The method can effectively improve data interaction efficiency and greatly reduce interaction delay. Storage modules that are public at a given level can be accessed by the access units that have been granted permission; for private storage modules, data interaction and access can be completed between access units directly or through a certain rule or protocol.
The present disclosure provides a data sharing system, including a storage module and at least two processing modules, wherein:
at least two processing modules share a storage module;
at least two processing modules communicate through preset rules to realize data sharing.
The data sharing system of the present disclosure supports heterogeneous multiprocessor scenarios. The external storage module is a storage module common to multiple processors, and those processors may all be the same, all be different, or be partially the same.
In some embodiments of the disclosure, the at least two processing modules may be processors with the same/different structures, processor cores with the same/different structures, and arithmetic units with the same/different structures in the processor cores with the same/different structures.
In some embodiments of the present disclosure, the preset rules include a communication protocol, a transfer protocol, a handshake protocol, and/or a bus protocol.
In some embodiments of the present disclosure, the communicating via the preset rule includes: the at least two processing modules comprise a first processing module and a second processing module, the first processing module sends a request signal and a corresponding data address to the second processing module, and the second processing module replies an effective signal and data to the first processing module according to the request signal and the corresponding data address to realize data sharing. It should be noted that the at least two processing modules herein are not limited to include the first processing module and the second processing module, and for example, the at least two processing modules may further include a third processing module, and any two of the three modules may perform communication by using the preset rule.
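For illustration only (this sketch is not part of the claimed system), the request/valid handshake described above can be modelled in a few lines of Python; the class and method names below are assumptions made for the example, not the disclosure's actual interface.

    # Minimal sketch of the request/reply handshake described above (all names
    # are illustrative assumptions, not the patent's actual interface).

    class ProcessingModule:
        def __init__(self, name):
            self.name = name
            self.local_store = {}          # module-private storage (address -> data)

        def handle_request(self, address):
            """Reply to a peer's request with a (valid, data) pair."""
            if address in self.local_store:
                return True, self.local_store[address]   # valid signal + data
            return False, None                            # address not held locally

        def read_shared(self, peer, address):
            """Send a request signal and data address, then wait for the peer's reply."""
            valid, data = peer.handle_request(address)
            if valid:
                return data
            raise KeyError(f"{peer.name} holds no data at address {address:#x}")

    # Usage: module 2 fetches data held privately by module 1 without going
    # through the shared storage module.
    m1, m2 = ProcessingModule("module1"), ProcessingModule("module2")
    m1.local_store[0x40] = [1.0, 2.0, 3.0]
    print(m2.read_shared(m1, 0x40))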
The present disclosure also provides a data sharing method, including the following steps:
the at least two processing modules communicate through a preset rule to realize data sharing;
wherein the at least two processing modules share one storage module.
As shown in fig. 2, in some embodiments of the present disclosure the at least two processing modules are two processors, processor 1 and processor 2, and communication between the two processors refers to communication between the internal storage modules inside the processors. The external storage module allows processor 1 and processor 2 to access it directly and to read data into the required positions of internal storage module 1 and internal storage module 2, respectively. The consistency between the data in the external storage module and in the processors' internal storage modules is maintained by a certain consistency protocol. In the prior art, when processor 1 changes data in its internal storage module, it uses a "write through" mode: it changes the data at the corresponding position in internal storage module 1 and changes the corresponding position of that data in the external storage module, while the external storage module simultaneously sends an invalidation signal for the corresponding data in internal storage module 2. When processor 2 later uses this data and finds the invalidation signal, it reads the new value from the external storage module and writes it to the corresponding position in internal storage module 2. In this embodiment, for data in internal storage module 1, processor 2 may instead send a request signal and the corresponding data address to processor 1 through a certain preset rule; after receiving the request signal, processor 1 replies with a valid signal and the data to complete the data interaction. Therefore, for a structure with multiple processors, the same storage space can be maintained while direct communication among the processors is realized through a certain defined rule, which reduces storage communication overhead and reduces data access latency.
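As an aside, the write-through/invalidate behaviour referred to above as prior art can be sketched as follows; ExternalMemory, InternalCache and all method names are assumptions made only to make the sequence of events concrete.

    # Hedged sketch of the write-through / invalidate / refill-on-read sequence.

    class ExternalMemory:
        def __init__(self):
            self.data = {}
            self.caches = []

        def write_through(self, writer_cache, address, value):
            self.data[address] = value
            for cache in self.caches:                 # invalidate every other copy
                if cache is not writer_cache:
                    cache.invalid.add(address)

    class InternalCache:
        def __init__(self, external):
            self.lines = {}
            self.invalid = set()
            self.external = external
            external.caches.append(self)

        def write(self, address, value):
            self.lines[address] = value               # update the local copy
            self.external.write_through(self, address, value)

        def read(self, address):
            if address in self.invalid:               # stale: refill from external memory
                self.lines[address] = self.external.data[address]
                self.invalid.discard(address)
            return self.lines[address]

    ext = ExternalMemory()
    c1, c2 = InternalCache(ext), InternalCache(ext)
    c2.lines[0x10] = 7        # both caches start with a copy
    c1.write(0x10, 42)        # processor 1 updates the value
    print(c2.read(0x10))      # processor 2 sees the invalidation and reads 42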
Processor 1, processor 2, and so on in this embodiment may be the same processor or different processors. The system is particularly suitable for cooperation between a novel artificial neural network processor and a traditional general-purpose processor. For example, it can be assumed that processor 1 is a general-purpose CPU and processor 2 is an artificial neural network processor.
Specifically, as shown in fig. 3, the artificial neural network processor may have a structure for executing a forward operation of an artificial neural network, and includes an instruction cache unit 1, a controller unit 2, a direct memory access unit 3, an H-tree module 4, a master operation module 5, and a plurality of slave operation modules 6. The instruction cache unit 1, the controller unit 2, the direct memory access unit 3, the H-tree module 4, the master operation module 5, and the slave operation modules 6 may be implemented by hardware circuits (e.g., application-specific integrated circuits, ASICs).
The instruction cache unit 1 reads in the instruction through the direct memory access unit 3 and caches the read instruction; the controller unit 2 reads the instruction from the instruction cache unit 1, and translates the instruction into a microinstruction for controlling the behavior of other modules, such as a direct memory access unit 3, a master operation module 5, a slave operation module 6, and the like; the direct memory access unit 3 can access and store an external address space, and directly read and write data to each cache unit in the processor to complete loading and storing of the data.
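To make the fetch/decode flow above concrete, the sketch below models the instruction cache being filled through the DMA unit and the controller decoding each instruction into micro-instructions that steer the other modules; the instruction strings, targets, and helper names are purely illustrative assumptions.

    # Minimal sketch of the fetch/decode flow described above.

    def dma_read(external_space, base, count):
        """Direct memory access: copy `count` instructions from the external address space."""
        return external_space[base:base + count]

    def decode(instruction):
        """Controller unit: translate one instruction into micro-instructions."""
        opcode, *operands = instruction.split()
        return [{"target": "dma", "op": opcode, "args": operands},
                {"target": "master_module", "op": opcode},
                {"target": "slave_modules", "op": opcode}]

    external_space = ["LOAD in0 64", "FC w0 in0 out0", "STORE out0 64"]
    instruction_cache = dma_read(external_space, 0, 3)     # fill the instruction cache
    for inst in instruction_cache:
        micro_ops = decode(inst)                            # dispatch micro-instructions
        print(inst, "->", [m["target"] for m in micro_ops])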
As shown in fig. 4, the H-tree module 4 has an H-tree structure and forms a data path between the master operation module 5 and the plurality of slave operation modules 6. The H-tree is a binary tree path formed by a plurality of nodes: each node sends the upstream data to its two downstream nodes unchanged, and combines the data returned by the two downstream nodes before returning it to the upstream node. For example, in the initial calculation stage of each layer of the artificial neural network, neuron data in the master operation module 5 is sent to each slave operation module 6 through the H-tree module 4; after the calculation of the slave operation modules 6 is completed, the neuron values output by the slave operation modules are pieced together step by step in the H-tree into a complete vector of neurons that serves as the intermediate result vector. Taking a fully-connected layer of the neural network as an example, and assuming the processor contains N slave operation modules in total, the intermediate result vector is segmented by N, that is, each segment has N elements, and the i-th slave operation module calculates the i-th element of each segment. The N elements are spliced by the H-tree module into a vector of length N and returned to the master operation module. Therefore, if the network has only N output neurons, each slave operation unit only needs to output the value of a single neuron, and if the network has m × N output neurons, each slave operation unit needs to output m neuron values.
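The segmentation and splicing just described can be illustrated with a short sketch; the values of N and m and the helper names are assumptions chosen only for the example.

    # Illustrative sketch of segmenting an intermediate result vector across N slave
    # modules and splicing it back, following the fully-connected-layer example above.

    N = 4                       # number of slave operation modules (assumed)
    m = 3                       # segments: the layer has m * N output neurons (assumed)

    def slave_outputs(i, m):
        """Slave module i produces the i-th element of each of the m segments."""
        return [f"seg{s}_neuron{i}" for s in range(m)]

    def h_tree_splice(per_slave, N, m):
        """Splice the per-slave outputs back into the full intermediate result vector."""
        result = []
        for s in range(m):                      # segment by segment
            result.extend(per_slave[i][s] for i in range(N))
        return result

    per_slave = [slave_outputs(i, m) for i in range(N)]
    print(h_tree_splice(per_slave, N, m))       # m * N output neuron values, in order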
As shown in fig. 5, which is a block diagram illustrating a structure example of the main operation module 5, the main operation module 5 includes an operation unit 51, a data dependency relationship determination unit 52, and a neuron cache unit 53. The neuron cache unit 53 is used for caching input data and output data used by the main operation module 5 in a calculation process, the operation unit 51 completes various operation functions of the main operation module 5, and the data dependency relationship judgment unit 52 is a port of the operation unit 51 for reading and writing the neuron cache unit 53, and can ensure the reading and writing consistency of data in the neuron cache unit. Meanwhile, the data dependency relationship determination unit 52 is also responsible for sending the read data to the slave computation module 6 through the H-tree module 4, and the output data of the slave computation module 6 is directly sent to the operation unit 51 through the H-tree module 4. The instruction output by the controller unit 2 is sent to the calculation unit 51 and the data dependency relationship judgment unit 52 to control the behavior thereof.
Fig. 6 is a block diagram illustrating an example structure of the slave operation module 6. Each slave operation module 6 includes an operation unit 61, a data dependency relationship determination unit 62, a neuron cache unit 63, and a weight cache unit 64. The operation unit 61 receives the microinstructions sent by the controller unit 2 and performs arithmetic-logic operations; the data dependency relationship determination unit 62 is responsible for the read and write operations on the neuron cache unit 63 during the calculation. Before performing a read or write, the data dependency relationship determination unit 62 first ensures that there is no read/write consistency conflict for the data used by the instructions: for example, all microinstructions sent to the data dependency unit 62 are stored in an instruction queue inside the unit, and within that queue, if the read range of a read instruction conflicts with the write range of a write instruction ahead of it, the read instruction must wait until the write instruction on which it depends has been executed. The neuron cache unit 63 caches the input neuron vector data and the output neuron value data of the slave operation module 6. The weight cache unit 64 caches the weight data required by the slave operation module 6 during calculation. Each slave operation module 6 stores only the weights between all input neurons and a part of the output neurons. Taking the fully-connected layer as an example, the output neurons are segmented according to the number N of slave operation units, and the weight corresponding to the n-th output neuron of each segment is stored in the n-th slave operation unit.
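The read-after-write check performed by the data dependency unit can be expressed as a small sketch; the Instruction fields and the flat address/length representation of the read and write ranges are illustrative assumptions.

    # Sketch of the dependency check described above: an instruction whose read range
    # overlaps the write range of an earlier queued instruction must wait.

    from collections import namedtuple

    Instruction = namedtuple("Instruction", ["op", "addr", "length"])

    def ranges_overlap(a, b):
        return a.addr < b.addr + b.length and b.addr < a.addr + a.length

    def may_issue(read_inst, queue):
        """A read may issue only if no earlier write in the queue overlaps its range."""
        return not any(prev.op == "write" and ranges_overlap(read_inst, prev)
                       for prev in queue)

    queue = [Instruction("write", addr=0x100, length=64)]
    print(may_issue(Instruction("read", addr=0x120, length=16), queue))   # False: must wait
    print(may_issue(Instruction("read", addr=0x200, length=16), queue))   # True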
The slave operation modules 6 perform the parallel arithmetic-logic operations in the forward operation of each layer of the artificial neural network. Taking an artificial neural network fully-connected layer (MLP) as an example, the process is y = f(wx + b), where the multiplication of the weight matrix w by the input neuron vector x can be divided into independent parallel computing sub-tasks: since y and x are column vectors, each slave operation module 6 only computes the product of the corresponding subset of scalar elements of x with the corresponding columns of the weight matrix w, each resulting output vector is a partial sum of the final result, and these partial sums are added pairwise in the H-tree module 4 to obtain the final result. The computation thus becomes a parallel phase of computing partial sums followed by an accumulation phase. Each slave operation module 6 calculates output neuron values, and all the output neuron values are spliced into the final intermediate result vector in the H-tree module 4. Each slave operation module 6 therefore only needs to calculate the output neuron values of the intermediate result vector y that correspond to that module. The H-tree module 4 sums all the neuron values output by the slave operation modules 6 to obtain the final intermediate result vector y. The master operation module 5 performs subsequent calculations based on the intermediate result vector y, such as adding a bias, pooling (e.g., max pooling (MAXPOOLING) or average pooling (AVGPOOLING)), activation, sampling, and so on.
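A numerical sketch of this column-wise partitioning of y = f(wx + b) is given below. The matrix sizes, the choice of four slave modules, and the use of ReLU for f are assumptions made only so the sketch can be checked against a direct computation.

    # Sketch: partition y = f(Wx + b) column-wise across slave modules and
    # accumulate the partial sums pairwise, as in the H-tree data path.

    import numpy as np

    def slave_partial_sum(W, x, cols):
        """One slave module: multiply its slice of x by the matching columns of W."""
        return W[:, cols] @ x[cols]

    def h_tree_accumulate(partials):
        """Pairwise accumulation of partial sums (assumes a power-of-two count)."""
        while len(partials) > 1:
            partials = [partials[i] + partials[i + 1] for i in range(0, len(partials), 2)]
        return partials[0]

    rng = np.random.default_rng(0)
    W, x, b = rng.random((8, 16)), rng.random(16), rng.random(8)
    col_slices = np.array_split(np.arange(16), 4)            # 4 slave modules (assumed)
    partials = [slave_partial_sum(W, x, cols) for cols in col_slices]
    y = np.maximum(h_tree_accumulate(partials) + b, 0.0)     # f taken as ReLU here
    assert np.allclose(y, np.maximum(W @ x + b, 0.0))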
In this structure, a storage module shared by the CPU and the artificial neural network processor is provided, and the two processors are allowed to access it directly and to read data into the CPU's cache and the artificial neural network processor's cache unit, respectively. When the CPU is about to change data in its cache, it uses a write-through mode: the corresponding position of the data in the cache is changed, the corresponding position of the data in the external storage module is changed, and at the same time an invalidation signal is sent for the corresponding data in the artificial neural network processor. When the artificial neural network processor later uses the data and finds the invalidation signal, it reads the new value from the external storage module and writes it to the corresponding position of its internal cache unit. In addition, for data held by the CPU, the artificial neural network processor can complete the data interaction through a defined rule: it sends a request signal and the corresponding data address to the CPU, and the CPU replies with a valid signal and the data after receiving the request. Therefore, for a heterogeneous multiprocessor structure, the data sharing system provided by this embodiment can reduce storage communication overhead and reduce data access latency by maintaining the same storage space.
Each processor may contain multiple cores, core-internal storage modules, and a core-external storage module, and the data in the core-external storage module can be accessed directly by some or all of the cores. In some embodiments of the present disclosure, as shown in fig. 7, a data sharing system is proposed in which the at least two processing modules are two processor cores, data sharing between the two processing modules is realized through the core-internal storage modules, and the storage module refers to the core-external storage module. In this embodiment, when one core wants to access the core-internal storage module of another core, it may do so via a communication protocol. The core-external storage module allows core 1 and core 2 to access it, and core 1 and core 2 read the required data into the corresponding positions of core-internal storage module 1 and core-internal storage module 2, respectively. The consistency between the data in the core-external storage module and in the core-internal storage modules is maintained by some consistency protocol. In the prior art, when core 1 changes data in its core-internal storage module, only the data at the corresponding position in core-internal storage module 1 is changed, in a "write back" manner, while the core-external storage module sends an invalidation signal to core-internal storage module 2. When that data in core-internal storage module 1 is swapped out, or when core 2 uses the data and finds the invalidation signal, the new value is read from the core-external storage module and written to the corresponding position in core-internal storage module 2. In this embodiment, however, for data in core-internal storage module 1, core 2 may also complete the data interaction through a certain defined rule by first sending a request signal and the corresponding data address to core 1, with core 1 replying with a valid signal and the data after receiving the request. The cores may be the same, for example two neural network cores, or different, for example a neural network core and a CPU core. In this way the data is protected to a certain extent, cores of the same or different structures are allowed to access the stored data, and data consistency is maintained, while memory access overhead and memory access latency are reduced.
Each neural network core may include a plurality of neural network arithmetic units. Thus, as shown in fig. 8, in some embodiments of the present disclosure a data sharing system is proposed in which the at least two processing modules are three arithmetic units; the three arithmetic units can directly access the in-core storage module and can also transmit related data directly in a certain direction, which helps reduce the number of storage-module accesses caused by data transfer between the arithmetic units, thereby reducing power consumption and access latency. Suppose, for a completed neural network operation, that arithmetic unit 1 computes output value out1 with neurons n = (n1, n2, ..., nk) and synapse values w = (w1, w2, ..., wk), so that out1 = n1*w1 + n2*w2 + ... + nk*wk. Similarly, the output of arithmetic unit 2 is out2, with corresponding neurons m = (m1, m2, ..., mk) and synapse values w = (w1, w2, ..., wk), so that out2 = m1*w1 + m2*w2 + ... + mk*wk. The output of arithmetic unit 3 is out3, with corresponding neurons q = (q1, q2, ..., qk) and synapse values w = (w1, w2, ..., wk), so that out3 = q1*w1 + q2*w2 + ... + qk*wk. Specifically, arithmetic unit 1 first reads n and w from the in-core storage module and directly performs the operation to obtain out1; arithmetic unit 2 reads m from the in-core storage module, receives the synapse values w forwarded from arithmetic unit 1, and performs the corresponding operation to obtain out2; arithmetic unit 3 reads q from the in-core storage module, receives the synapse values w from arithmetic unit 1, and performs the corresponding operation to obtain out3. This reduces the number of accesses to the in-core storage module, reduces latency and power consumption, increases operation speed, and saves operation energy.
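The weight-forwarding idea above can be checked with a tiny numerical sketch: the synapse vector w is read from the in-core storage module only once (by unit 1) and then reused by units 2 and 3. The vector values are arbitrary assumptions.

    # Numerical sketch of the weight-forwarding example above.

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    core_memory = {
        "n": [1.0, 2.0, 3.0],
        "m": [0.5, 0.5, 0.5],
        "q": [2.0, 0.0, 1.0],
        "w": [0.1, 0.2, 0.3],
    }

    w = core_memory["w"]                     # unit 1 is the only unit that reads w
    out1 = dot(core_memory["n"], w)          # unit 1: n1*w1 + n2*w2 + ... + nk*wk
    out2 = dot(core_memory["m"], w)          # unit 2 reuses the forwarded w
    out3 = dot(core_memory["q"], w)          # unit 3 reuses the forwarded w
    print(out1, out2, out3)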
In some embodiments of the present disclosure, in the data sharing system of the previous embodiment, one or more layers of storage units may further be added inside the core, allowing one storage unit to be shared by several arithmetic units or to be private to a single arithmetic unit. As shown in fig. 9, assume that the sharing system includes two storage units: storage unit 1 is shared by arithmetic unit 1 and arithmetic unit 2, so arithmetic unit 1 and arithmetic unit 2 can access it directly while arithmetic unit 3 cannot; storage unit 2 is private to arithmetic unit 3, so arithmetic unit 3 can access it directly while arithmetic unit 1 and arithmetic unit 2 cannot. Therefore, if arithmetic unit 1 wants to access a result computed by arithmetic unit 3, the result can be obtained directly through arithmetic unit 3, without the long path in which storage unit 2 first updates the in-core storage module, the in-core storage module is then read through storage unit 1, and only then arithmetic unit 1 is allowed to access the data. In this way the data is effectively protected, since arithmetic units without permission (such as arithmetic unit 1) cannot arbitrarily change a storage unit (such as storage unit 2); at the same time the number of memory accesses can be greatly reduced, the waste of on-chip storage resources caused by storing multiple copies of the same data on chip is avoided, latency and power consumption are reduced, and operation speed and operation energy efficiency are further improved.
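For illustration, the access-permission arrangement above can be sketched as follows: storage unit 1 is shared by units 1 and 2, storage unit 2 is private to unit 3, and a unit without permission obtains a result directly from the owning unit instead of through the storage hierarchy. All names and the PermissionError-based modelling are assumptions for the example.

    # Sketch of shared vs. private storage units with access permissions.

    class StorageUnit:
        def __init__(self, allowed):
            self.allowed = set(allowed)      # arithmetic units permitted to access
            self.data = {}

        def access(self, unit, key):
            if unit not in self.allowed:
                raise PermissionError(f"{unit} may not access this storage unit")
            return self.data[key]

    storage1 = StorageUnit(allowed={"unit1", "unit2"})
    storage2 = StorageUnit(allowed={"unit3"})
    storage2.data["result"] = 3.14

    def fetch_result(requester):
        try:
            return storage2.access(requester, "result")
        except PermissionError:
            # No permission: ask unit 3 directly (request/valid handshake) instead
            # of routing through the in-core storage module and storage unit 1.
            return storage2.access("unit3", "result")

    print(fetch_result("unit1"))             # obtained via unit 3, not via storage unit 2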
The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present disclosure in further detail, and it should be understood that the above-mentioned embodiments are only illustrative of the present disclosure and are not intended to limit the present disclosure, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (10)

1. A data sharing system comprising a storage module and at least two processing modules, wherein:
the at least two processing modules share the storage module;
the at least two processing modules communicate with each other through a preset rule to realize data sharing,
the at least two processing modules are one of: processors with the same/different structures, processor cores with the same/different structures, and arithmetic units with the same/different structures within processor cores with the same/different structures.
2. The data sharing system of claim 1, wherein the preset rules include a communication protocol, a transfer protocol, a handshake protocol, and/or a bus protocol.
3. The data sharing system of any one of claims 1 to 2, wherein the communicating by preset rules comprises: the at least two processing modules comprise a first processing module and a second processing module, the first processing module sends a request signal and a corresponding data address to the second processing module, and the second processing module replies an effective signal and data to the first processing module according to the request signal and the corresponding data address to realize data sharing.
4. The data sharing system of any one of claims 1 to 3, wherein the at least two processing modules comprise physical processors.
5. The data sharing system of claim 4, wherein the physical processor comprises a neural network processor.
6. The data sharing system of claim 5 wherein the neural network processor comprises means for performing an artificial neural network forward operation.
7. The data sharing system of claim 6, wherein the means for performing artificial neural network forward operations comprises an instruction cache unit and a direct memory access unit, wherein:
the instruction cache unit is used for reading in the instructions through the direct memory access unit and caching the read instructions.
8. The data sharing system of claim 7, wherein the means for performing artificial neural network forward operations further comprises:
the controller unit is used for reading the instruction from the instruction cache unit and decoding the instruction into the microinstruction.
9. The data sharing system of any one of claims 7 to 8, wherein the means for performing artificial neural network forward operations further comprises an H-tree module, a master operation module, and a plurality of slave operation modules, wherein:
the H-tree module is used for transmitting the input neuron vectors of the layer to all the slave operation modules at the stage at which the calculation of reverse training of each layer of the neural network starts, and for splicing the output neuron values of all the slave operation modules step by step into an intermediate result vector;
and the main operation module is used for finishing subsequent calculation by utilizing the intermediate result vector.
10. The data sharing system of claim 9, wherein the direct memory access unit is further configured to write data from an external address space to the corresponding data cache unit of the master computing module and each slave computing module, or read data from the data cache unit to the external address space.
CN202110668344.XA 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof Pending CN113468096A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110668344.XA CN113468096A (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110668344.XA CN113468096A (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof
CN201710497394.XA CN109117415B (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201710497394.XA Division CN109117415B (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof

Publications (1)

Publication Number Publication Date
CN113468096A true CN113468096A (en) 2021-10-01

Family

ID=64822743

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110668344.XA Pending CN113468096A (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof
CN201710497394.XA Active CN109117415B (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201710497394.XA Active CN109117415B (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof

Country Status (1)

Country Link
CN (2) CN113468096A (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110058884B (en) * 2019-03-15 2021-06-01 佛山市顺德区中山大学研究院 Optimization method, system and storage medium for computational storage instruction set operation
CN111949317B (en) * 2019-05-17 2023-04-07 上海寒武纪信息科技有限公司 Instruction processing method and device and related product
CN110265029A (en) * 2019-06-21 2019-09-20 百度在线网络技术(北京)有限公司 Speech chip and electronic equipment
CN110889500A (en) * 2019-12-09 2020-03-17 Oppo广东移动通信有限公司 Shared data storage module, neural network processor and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246466A (en) * 2007-11-29 2008-08-20 华为技术有限公司 Management method and device for sharing internal memory in multi-core system
US20100125717A1 (en) * 2008-11-17 2010-05-20 Mois Navon Synchronization Controller For Multiple Multi-Threaded Processors
CN106164874A (en) * 2015-02-16 2016-11-23 华为技术有限公司 The access method of data access person catalogue and equipment in multiple nucleus system

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1992005490A1 (en) * 1990-09-18 1992-04-02 Fujitsu Limited Exclusive control method for shared memory
KR100230454B1 (en) * 1997-05-28 1999-11-15 윤종용 Cache memory testing method in multiprocessor system
CN1531684A (en) * 2001-06-29 2004-09-22 �ʼҷ����ֵ������޹�˾ Data processing apparatus and method for operating data processing apparatus
US20060041715A1 (en) * 2004-05-28 2006-02-23 Chrysos George Z Multiprocessor chip having bidirectional ring interconnect
KR100725100B1 (en) * 2005-12-22 2007-06-04 삼성전자주식회사 Multi-path accessible semiconductor memory device having data transfer mode between ports
US8677075B2 (en) * 2010-05-18 2014-03-18 Lsi Corporation Memory manager for a network communications processor architecture
CN102741828B (en) * 2009-10-30 2015-12-09 英特尔公司 To the two-way communication support of the heterogeneous processor of computer platform
CN101980149B (en) * 2010-10-15 2013-09-18 无锡中星微电子有限公司 Main processor and coprocessor communication system and communication method
CN102184157B (en) * 2011-05-19 2012-10-10 华东师范大学 Information display device based on dual processor cooperation
CN103347037A (en) * 2013-05-29 2013-10-09 成都瑞科电气有限公司 WCF realization-based communication front-end processor system and communicating method
US20150012711A1 (en) * 2013-07-04 2015-01-08 Vakul Garg System and method for atomically updating shared memory in multiprocessor system
US10915468B2 (en) * 2013-12-26 2021-02-09 Intel Corporation Sharing memory and I/O services between nodes
US9971397B2 (en) * 2014-10-08 2018-05-15 Apple Inc. Methods and apparatus for managing power with an inter-processor communication link between independently operable processors
CN104699631B (en) * 2015-03-26 2018-02-02 中国人民解放军国防科学技术大学 It is multi-level in GPDSP to cooperate with and shared storage device and access method
CN106407145A (en) * 2015-08-03 2017-02-15 联想(北京)有限公司 An interface access method and system and a memory card
CN106502806B (en) * 2016-10-31 2020-02-14 华为技术有限公司 Bus protocol command processing device and related method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246466A (en) * 2007-11-29 2008-08-20 华为技术有限公司 Management method and device for sharing internal memory in multi-core system
US20100125717A1 (en) * 2008-11-17 2010-05-20 Mois Navon Synchronization Controller For Multiple Multi-Threaded Processors
CN106164874A (en) * 2015-02-16 2016-11-23 华为技术有限公司 The access method of data access person catalogue and equipment in multiple nucleus system

Also Published As

Publication number Publication date
CN109117415A (en) 2019-01-01
CN109117415B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
US11580367B2 (en) Method and system for processing neural network
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
KR102175044B1 (en) Apparatus and method for running artificial neural network reverse training
US20190057302A1 (en) Memory device including neural network processor and memory system including the memory device
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN109993285B (en) Apparatus and method for performing artificial neural network forward operations
CN107341542B (en) Apparatus and method for performing recurrent neural networks and LSTM operations
US10452538B2 (en) Determining task scores reflective of memory access statistics in NUMA systems
JP7451614B2 (en) On-chip computational network
CN109117415B (en) Data sharing system and data sharing method thereof
US10922258B2 (en) Centralized-distributed mixed organization of shared memory for neural network processing
US20240320045A1 (en) Memory sharing for machine learning processing
Pinkham et al. Near-sensor distributed DNN processing for augmented and virtual reality
US10915445B2 (en) Coherent caching of data for high bandwidth scaling
CN111752879B (en) Acceleration system, method and storage medium based on convolutional neural network
KR20220160637A (en) Distributed Graphics Processor Unit Architecture
Chang et al. A reconfigurable neural network processor with tile-grained multicore pipeline for object detection on FPGA
KR20230063791A (en) AI core, AI core system and load/store method of AI core system
JP7413549B2 (en) Shared scratchpad memory with parallel load stores
US7594080B2 (en) Temporary storage of memory line while waiting for cache eviction
KR20210081663A (en) Interconnect device, operation method of interconnect device, and artificial intelligence(ai) accelerator system
Igual et al. Scheduling algorithms‐by‐blocks on small clusters
CN115205092A (en) Graphical execution of dynamic batch components using access request response
US10620958B1 (en) Crossbar between clients and a cache
CN113434813A (en) Matrix multiplication method based on neural network and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination