
CN113468096A - Data sharing system and data sharing method thereof - Google Patents

Data sharing system and data sharing method thereof

Info

Publication number
CN113468096A
Authority
CN
China
Prior art keywords
module
data
data sharing
unit
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110668344.XA
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110668344.XA priority Critical patent/CN113468096A/en
Publication of CN113468096A publication Critical patent/CN113468096A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/167Interprocessor communication using a common memory, e.g. mailbox
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A data sharing system includes a storage module and at least two processing modules, wherein the at least two processing modules share the storage module and communicate with each other to realize data sharing. A data sharing method of the data sharing system is also provided. The system and method can reduce storage communication overhead and effectively reduce data access latency.

Description

Data sharing system and data sharing method thereof
Technical Field
The present disclosure relates to a sharing system, and more particularly, to a data sharing system and a data sharing method thereof.
Background
With the continuous development of artificial intelligence, machine learning and deep neural network techniques are widely applied, for example in speech recognition, image processing, data analysis, advertisement recommendation systems, autonomous driving, and so on. The wide applicability of these techniques is inseparable from their ability to handle large amounts of data well. However, as the amount of data grows, the amount of computation grows with it, so how to organize and store data efficiently becomes a problem that must be faced when designing a system on chip (SoC).
As shown in fig. 1, in an existing SoC, when an application-specific integrated circuit (ASIC) module performs machine learning (deep learning or other) computations, its data is usually stored in a private static random access memory (SRAM); the data is placed into an off-chip dynamic random access memory (DRAM) or an on-chip SRAM (cache-like) through an advanced extensible interface (AXI) bus and only then, indirectly, exchanged with other modules. This increases system overhead, data read latency, and the energy consumed by data sharing and interaction.
Disclosure of Invention
Based on the above problems, a primary objective of the present disclosure is to provide a data sharing system and a data sharing method thereof, which are used to solve at least one of the above technical problems.
In order to achieve the above object, as one aspect of the present disclosure, the present disclosure proposes a data sharing system including a storage module and at least two processing modules, wherein:
at least two processing modules share a storage module;
at least two processing modules communicate through preset rules to realize data sharing.
In some embodiments of the present disclosure, the preset rules include a communication protocol, a transfer protocol, a handshake protocol, and/or a bus protocol.
In some embodiments of the present disclosure, the communicating via the preset rule includes: the at least two processing modules comprise a first processing module and a second processing module, the first processing module sends a request signal and a corresponding data address to the second processing module, and the second processing module replies an effective signal and data to the first processing module according to the request signal and the corresponding data address to realize data sharing.
In some embodiments of the disclosure, the at least two processing modules comprise physical processors.
In some embodiments of the disclosure, the physical processor comprises a neural network processor.
In some embodiments of the present disclosure, the neural network processor comprises means for performing an artificial neural network forward operation.
In some embodiments of the present disclosure, the apparatus for performing artificial neural network forward operation includes an instruction cache unit and a direct memory access unit, wherein:
the instruction cache unit is used for reading in the instructions through the direct memory access unit and caching the read instructions.
In some embodiments of the disclosure, the above apparatus for performing artificial neural network forward operation further includes:
the controller unit is used for reading the instruction from the instruction cache unit and decoding the instruction into the microinstruction.
In some embodiments of the present disclosure, the apparatus for performing artificial neural network forward operation further includes an H-tree module, a master operation module, and a plurality of slave operation modules, wherein:
the H-tree module is used for transmitting the input neuron vectors of the layer to all the slave operation modules at the stage at which the calculation of reverse training of each layer of the neural network starts, and for splicing the output neuron values of all the slave operation modules step by step into an intermediate result vector after the calculation process of the slave operation modules is completed;
and the main operation module is used for finishing subsequent calculation by utilizing the intermediate result vector.
In some embodiments of the present disclosure, the direct memory access unit is further configured to write data from an external address space into corresponding data cache units of the master operation module and each slave operation module, or read data from the data cache units to the external address space.
In some embodiments of the present disclosure, the at least two processing modules include two processors with different structures; one of the two processors of the mutually different structure is a neural network processor.
In some embodiments of the disclosure, the at least two processing modules comprise at least two processor cores of a processor; the at least two processor cores are of the same/different structure.
In some embodiments of the present disclosure, the at least two processing modules include at least two arithmetic units of a processor core; the at least two arithmetic units are arithmetic units with the same/different structures.
In some embodiments of the present disclosure, the sharing system further includes:
at least two storage units respectively connected with at least one of the at least two operation units, wherein any one of the at least two operation units is connected with one or more storage units; and the at least two storage units share the storage module.
In some embodiments of the disclosure, the at least two operation units share one storage unit, or each exclusively uses its own storage unit, or some of them share a storage unit while others exclusively use their own storage units.
In some embodiments of the disclosure, the at least two processing modules include three arithmetic units of a processor core, and the number of the at least two storage units is two, wherein two of the arithmetic units are simultaneously connected to one of the storage units, and the remaining arithmetic unit is connected to the other storage unit.
In order to achieve the above object, as another aspect of the present disclosure, the present disclosure proposes a data sharing method including the steps of:
the at least two processing modules communicate through a preset rule to realize data sharing;
wherein the at least two processing modules share the storage module.
In some embodiments of the present disclosure, the preset rules include a communication protocol, a transfer protocol, a handshake protocol, and/or a bus protocol.
In some embodiments of the present disclosure, the communicating via the preset rule includes: the at least two processing modules comprise a first processing module and a second processing module, the first processing module sends a request signal and a corresponding data address to the second processing module, and the second processing module replies an effective signal and data to the first processing module according to the request signal and the corresponding data address to realize data sharing.
In some embodiments of the disclosure, the at least two processing modules comprise physical processors.
In some embodiments of the disclosure, the physical processor comprises a neural network processor.
In some embodiments of the present disclosure, the neural network processor comprises means for performing an artificial neural network forward operation.
In some embodiments of the present disclosure, the apparatus for performing artificial neural network forward operation includes an instruction cache unit and a direct memory access unit, wherein:
the instruction cache unit reads in the instruction through the direct memory access unit and caches the read-in instruction.
In some embodiments of the disclosure, the apparatus for performing artificial neural network forward operations described above further includes a controller unit that reads an instruction from the instruction cache unit and decodes the instruction to generate a microinstruction.
In some embodiments of the present disclosure, the apparatus for performing artificial neural network forward operation further includes an H-tree module, a master operation module, and a plurality of slave operation modules, wherein:
the H-tree module transmits the input neuron vectors of the layer to all the slave operation modules at the stage at which the calculation of reverse training of each layer of the neural network starts, and splices the output neuron values of all the slave operation modules into an intermediate result vector step by step after the calculation process of the slave operation modules is completed;
and the main operation module completes subsequent calculation by using the intermediate result vector.
In some embodiments of the present disclosure, the direct memory access unit further writes data from the external address space to the corresponding data cache units of the master operation module and each of the slave operation modules, or reads data from the data cache units to the external address space.
In some embodiments of the present disclosure, the at least two processing modules include two processors with different structures; one of the two processors of the mutually different structure is a neural network processor.
In some embodiments of the disclosure, the at least two processing modules comprise at least two processor cores of a processor; the at least two processor cores are of the same/different structure.
In some embodiments of the present disclosure, the at least two processing modules include at least two arithmetic units of a processor core; the at least two arithmetic units are arithmetic units with the same/different structures.
In some embodiments of the present disclosure, the data sharing method further includes:
at least two storage units respectively connected with at least one of the at least two operation units, wherein any one of the at least two operation units is connected with one or more storage units; and the at least two storage units share the storage module.
In some embodiments of the disclosure, the at least two operation units share one storage unit, or each exclusively uses its own storage unit, or some of them share a storage unit while others exclusively use their own storage units.
In some embodiments of the disclosure, the at least two processing modules include three arithmetic units of a processor core, and the number of the at least two storage units is two, wherein two of the arithmetic units are simultaneously connected to one of the storage units, and the remaining arithmetic unit is connected to the other storage unit.
The data sharing system and the data sharing method thereof have the following beneficial effects:
1. at least two processing modules in the system can communicate directly through a preset rule to realize data sharing, so that the shared data does not have to pass through the shared storage module; this reduces storage communication overhead and effectively reduces data access latency;
2. the at least two processing modules of the present disclosure may include processors with different structures and cores within those processors, so that a storage module shared outside the processors (of the same or different structures) and a core-external storage module corresponding to the cores can both be maintained;
3. with the storage units of the present disclosure, each storage unit can be directly accessed by one or more operation units without reducing the original storage efficiency or increasing the original storage cost; the number of operation units does not need to be fixed or agreed in advance, asymmetric structures are supported, and the configuration can be adjusted as required, which reduces the number of on-chip/off-chip memory access interactions and reduces power consumption;
4. the present disclosure allows a private storage module, exclusively used by one arithmetic unit, to transfer data to other arithmetic units. This protects data privacy while still allowing fast data interaction, improves data utilization, avoids both the resource waste of storing multiple copies of the same data on chip and the memory access overhead of repeatedly reading the same data, and thereby further increases memory access speed and reduces memory access power consumption.
Drawings
FIG. 1 is a block diagram of a prior art data processing system;
fig. 2 is a schematic structural diagram of a data sharing system according to an embodiment of the present disclosure;
FIG. 3 is a block diagram of a processor in the system of FIG. 2;
FIG. 4 is a schematic diagram of the structure of the H-tree module of FIG. 3;
FIG. 5 is a schematic diagram of the main operation module shown in FIG. 3;
FIG. 6 is a schematic diagram of the slave computing module of FIG. 3;
fig. 7 is a schematic structural diagram of a data sharing system according to another embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a data sharing system according to yet another embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a data sharing system according to an embodiment of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
The present disclosure provides a method for realizing rapid data interaction between a machine learning ASIC arithmetic unit and the other modules of an SoC. The method can effectively improve data interaction efficiency and greatly reduce interaction delay. Storage modules that are public at a given level can be accessed by the access units that have been granted permission; for private storage modules, data interaction and access can be completed between access units directly or through a certain rule or protocol.
The present disclosure provides a data sharing system, including a storage module and at least two processing modules, wherein:
at least two processing modules share a storage module;
at least two processing modules communicate through preset rules to realize data sharing.
The data sharing system of the present disclosure supports heterogeneous multiprocessor scenarios. The external storage module is a storage module common to multiple processors, and those processors may all be the same, all be different, or be partially the same.
In some embodiments of the disclosure, the at least two processing modules may be processors with the same/different structures, processor cores with the same/different structures, and arithmetic units with the same/different structures in the processor cores with the same/different structures.
In some embodiments of the present disclosure, the preset rules include a communication protocol, a transfer protocol, a handshake protocol, and/or a bus protocol.
In some embodiments of the present disclosure, the communicating via the preset rule includes: the at least two processing modules comprise a first processing module and a second processing module, the first processing module sends a request signal and a corresponding data address to the second processing module, and the second processing module replies an effective signal and data to the first processing module according to the request signal and the corresponding data address to realize data sharing. It should be noted that the at least two processing modules herein are not limited to include the first processing module and the second processing module, and for example, the at least two processing modules may further include a third processing module, and any two of the three modules may perform communication by using the preset rule.
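For illustration only (this sketch is not part of the claimed system), the request/valid handshake described above can be modelled in a few lines of Python; the class and method names below are assumptions made for the example, not the disclosure's actual interface.

    # Minimal sketch of the request/reply handshake described above (all names
    # are illustrative assumptions, not the patent's actual interface).

    class ProcessingModule:
        def __init__(self, name):
            self.name = name
            self.local_store = {}          # module-private storage (address -> data)

        def handle_request(self, address):
            """Reply to a peer's request with a (valid, data) pair."""
            if address in self.local_store:
                return True, self.local_store[address]   # valid signal + data
            return False, None                            # address not held locally

        def read_shared(self, peer, address):
            """Send a request signal and data address, then wait for the peer's reply."""
            valid, data = peer.handle_request(address)
            if valid:
                return data
            raise KeyError(f"{peer.name} holds no data at address {address:#x}")

    # Usage: module 2 fetches data held privately by module 1 without going
    # through the shared storage module.
    m1, m2 = ProcessingModule("module1"), ProcessingModule("module2")
    m1.local_store[0x40] = [1.0, 2.0, 3.0]
    print(m2.read_shared(m1, 0x40))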
The present disclosure also provides a data sharing method, including the following steps:
the at least two processing modules communicate through a preset rule to realize data sharing;
wherein the at least two processing modules share one storage module.
As shown in fig. 2, in some embodiments of the present disclosure the at least two processing modules are two processors, processor 1 and processor 2, and communication between the two processors refers to communication between the internal storage modules inside the processors. The external storage module allows processor 1 and processor 2 to access it directly and to read data into the required positions of internal storage module 1 and internal storage module 2, respectively. The consistency between the data in the external storage module and in the processors' internal storage modules is maintained by a certain consistency protocol. In the prior art, when processor 1 changes data in its internal storage module, it uses a "write through" mode: it changes the data at the corresponding position in internal storage module 1 and changes the corresponding position of that data in the external storage module, while the external storage module simultaneously sends an invalidation signal for the corresponding data in internal storage module 2. When processor 2 later uses this data and finds the invalidation signal, it reads the new value from the external storage module and writes it to the corresponding position in internal storage module 2. In this embodiment, for data in internal storage module 1, processor 2 may instead send a request signal and the corresponding data address to processor 1 through a certain preset rule; after receiving the request signal, processor 1 replies with a valid signal and the data to complete the data interaction. Therefore, for a structure with multiple processors, the same storage space can be maintained while direct communication among the processors is realized through a certain defined rule, which reduces storage communication overhead and reduces data access latency.
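As an aside, the write-through/invalidate behaviour referred to above as prior art can be sketched as follows; ExternalMemory, InternalCache and all method names are assumptions made only to make the sequence of events concrete.

    # Hedged sketch of the write-through / invalidate / refill-on-read sequence.

    class ExternalMemory:
        def __init__(self):
            self.data = {}
            self.caches = []

        def write_through(self, writer_cache, address, value):
            self.data[address] = value
            for cache in self.caches:                 # invalidate every other copy
                if cache is not writer_cache:
                    cache.invalid.add(address)

    class InternalCache:
        def __init__(self, external):
            self.lines = {}
            self.invalid = set()
            self.external = external
            external.caches.append(self)

        def write(self, address, value):
            self.lines[address] = value               # update the local copy
            self.external.write_through(self, address, value)

        def read(self, address):
            if address in self.invalid:               # stale: refill from external memory
                self.lines[address] = self.external.data[address]
                self.invalid.discard(address)
            return self.lines[address]

    ext = ExternalMemory()
    c1, c2 = InternalCache(ext), InternalCache(ext)
    c2.lines[0x10] = 7        # both caches start with a copy
    c1.write(0x10, 42)        # processor 1 updates the value
    print(c2.read(0x10))      # processor 2 sees the invalidation and reads 42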
Processor 1, processor 2, and so on in this embodiment may be the same processor or different processors. The system is particularly suitable for cooperation between a novel artificial neural network processor and a traditional general-purpose processor. For example, it can be assumed that processor 1 is a general-purpose CPU and processor 2 is an artificial neural network processor.
Specifically, as shown in fig. 3, the artificial neural network processor may have a structure for executing a forward operation of an artificial neural network, and includes an instruction cache unit 1, a controller unit 2, a direct memory access unit 3, an H-tree module 4, a master operation module 5, and a plurality of slave operation modules 6. The instruction cache unit 1, the controller unit 2, the direct memory access unit 3, the H-tree module 4, the master operation module 5, and the slave operation modules 6 may be implemented by hardware circuits (e.g., application-specific integrated circuits, ASICs).
The instruction cache unit 1 reads in the instruction through the direct memory access unit 3 and caches the read instruction; the controller unit 2 reads the instruction from the instruction cache unit 1, and translates the instruction into a microinstruction for controlling the behavior of other modules, such as a direct memory access unit 3, a master operation module 5, a slave operation module 6, and the like; the direct memory access unit 3 can access and store an external address space, and directly read and write data to each cache unit in the processor to complete loading and storing of the data.
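To make the fetch/decode flow above concrete, the sketch below models the instruction cache being filled through the DMA unit and the controller decoding each instruction into micro-instructions that steer the other modules; the instruction strings, targets, and helper names are purely illustrative assumptions.

    # Minimal sketch of the fetch/decode flow described above.

    def dma_read(external_space, base, count):
        """Direct memory access: copy `count` instructions from the external address space."""
        return external_space[base:base + count]

    def decode(instruction):
        """Controller unit: translate one instruction into micro-instructions."""
        opcode, *operands = instruction.split()
        return [{"target": "dma", "op": opcode, "args": operands},
                {"target": "master_module", "op": opcode},
                {"target": "slave_modules", "op": opcode}]

    external_space = ["LOAD in0 64", "FC w0 in0 out0", "STORE out0 64"]
    instruction_cache = dma_read(external_space, 0, 3)     # fill the instruction cache
    for inst in instruction_cache:
        micro_ops = decode(inst)                            # dispatch micro-instructions
        print(inst, "->", [m["target"] for m in micro_ops])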
As shown in fig. 4, the H-tree module 4 has an H-tree structure and forms a data path between the master operation module 5 and the plurality of slave operation modules 6. The H-tree is a binary tree path formed by a plurality of nodes: each node sends the upstream data to its two downstream nodes unchanged, and combines the data returned by the two downstream nodes before returning it to the upstream node. For example, in the initial calculation stage of each layer of the artificial neural network, neuron data in the master operation module 5 is sent to each slave operation module 6 through the H-tree module 4; after the calculation of the slave operation modules 6 is completed, the neuron values output by the slave operation modules are pieced together step by step in the H-tree into a complete vector of neurons that serves as the intermediate result vector. Taking a fully-connected layer of the neural network as an example, and assuming the processor contains N slave operation modules in total, the intermediate result vector is segmented by N, that is, each segment has N elements, and the i-th slave operation module calculates the i-th element of each segment. The N elements are spliced by the H-tree module into a vector of length N and returned to the master operation module. Therefore, if the network has only N output neurons, each slave operation unit only needs to output the value of a single neuron, and if the network has m × N output neurons, each slave operation unit needs to output m neuron values.
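The segmentation and splicing just described can be illustrated with a short sketch; the values of N and m and the helper names are assumptions chosen only for the example.

    # Illustrative sketch of segmenting an intermediate result vector across N slave
    # modules and splicing it back, following the fully-connected-layer example above.

    N = 4                       # number of slave operation modules (assumed)
    m = 3                       # segments: the layer has m * N output neurons (assumed)

    def slave_outputs(i, m):
        """Slave module i produces the i-th element of each of the m segments."""
        return [f"seg{s}_neuron{i}" for s in range(m)]

    def h_tree_splice(per_slave, N, m):
        """Splice the per-slave outputs back into the full intermediate result vector."""
        result = []
        for s in range(m):                      # segment by segment
            result.extend(per_slave[i][s] for i in range(N))
        return result

    per_slave = [slave_outputs(i, m) for i in range(N)]
    print(h_tree_splice(per_slave, N, m))       # m * N output neuron values, in order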
As shown in fig. 5, which is a block diagram illustrating a structure example of the main operation module 5, the main operation module 5 includes an operation unit 51, a data dependency relationship determination unit 52, and a neuron cache unit 53. The neuron cache unit 53 is used for caching input data and output data used by the main operation module 5 in a calculation process, the operation unit 51 completes various operation functions of the main operation module 5, and the data dependency relationship judgment unit 52 is a port of the operation unit 51 for reading and writing the neuron cache unit 53, and can ensure the reading and writing consistency of data in the neuron cache unit. Meanwhile, the data dependency relationship determination unit 52 is also responsible for sending the read data to the slave computation module 6 through the H-tree module 4, and the output data of the slave computation module 6 is directly sent to the operation unit 51 through the H-tree module 4. The instruction output by the controller unit 2 is sent to the calculation unit 51 and the data dependency relationship judgment unit 52 to control the behavior thereof.
Fig. 6 is a block diagram illustrating an example structure of the slave operation module 6. Each slave operation module 6 includes an operation unit 61, a data dependency relationship determination unit 62, a neuron cache unit 63, and a weight cache unit 64. The operation unit 61 receives the microinstructions sent by the controller unit 2 and performs arithmetic-logic operations; the data dependency relationship determination unit 62 is responsible for the read and write operations on the neuron cache unit 63 during the calculation. Before performing a read or write, the data dependency relationship determination unit 62 first ensures that there is no read/write consistency conflict for the data used by the instructions: for example, all microinstructions sent to the data dependency unit 62 are stored in an instruction queue inside the unit, and within that queue, if the read range of a read instruction conflicts with the write range of a write instruction ahead of it, the read instruction must wait until the write instruction on which it depends has been executed. The neuron cache unit 63 caches the input neuron vector data and the output neuron value data of the slave operation module 6. The weight cache unit 64 caches the weight data required by the slave operation module 6 during calculation. Each slave operation module 6 stores only the weights between all input neurons and a part of the output neurons. Taking the fully-connected layer as an example, the output neurons are segmented according to the number N of slave operation units, and the weight corresponding to the n-th output neuron of each segment is stored in the n-th slave operation unit.
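The read-after-write check performed by the data dependency unit can be expressed as a small sketch; the Instruction fields and the flat address/length representation of the read and write ranges are illustrative assumptions.

    # Sketch of the dependency check described above: an instruction whose read range
    # overlaps the write range of an earlier queued instruction must wait.

    from collections import namedtuple

    Instruction = namedtuple("Instruction", ["op", "addr", "length"])

    def ranges_overlap(a, b):
        return a.addr < b.addr + b.length and b.addr < a.addr + a.length

    def may_issue(read_inst, queue):
        """A read may issue only if no earlier write in the queue overlaps its range."""
        return not any(prev.op == "write" and ranges_overlap(read_inst, prev)
                       for prev in queue)

    queue = [Instruction("write", addr=0x100, length=64)]
    print(may_issue(Instruction("read", addr=0x120, length=16), queue))   # False: must wait
    print(may_issue(Instruction("read", addr=0x200, length=16), queue))   # True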
The slave operation modules 6 perform the parallel arithmetic-logic operations in the forward operation of each layer of the artificial neural network. Taking an artificial neural network fully-connected layer (MLP) as an example, the process is y = f(wx + b), where the multiplication of the weight matrix w by the input neuron vector x can be divided into independent parallel computing sub-tasks: since y and x are column vectors, each slave operation module 6 only computes the product of the corresponding subset of scalar elements of x with the corresponding columns of the weight matrix w, each resulting output vector is a partial sum of the final result, and these partial sums are added pairwise in the H-tree module 4 to obtain the final result. The computation thus becomes a parallel phase of computing partial sums followed by an accumulation phase. Each slave operation module 6 calculates output neuron values, and all the output neuron values are spliced into the final intermediate result vector in the H-tree module 4. Each slave operation module 6 therefore only needs to calculate the output neuron values of the intermediate result vector y that correspond to that module. The H-tree module 4 sums all the neuron values output by the slave operation modules 6 to obtain the final intermediate result vector y. The master operation module 5 performs subsequent calculations based on the intermediate result vector y, such as adding a bias, pooling (e.g., max pooling (MAXPOOLING) or average pooling (AVGPOOLING)), activation, sampling, and so on.
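A numerical sketch of this column-wise partitioning of y = f(wx + b) is given below. The matrix sizes, the choice of four slave modules, and the use of ReLU for f are assumptions made only so the sketch can be checked against a direct computation.

    # Sketch: partition y = f(Wx + b) column-wise across slave modules and
    # accumulate the partial sums pairwise, as in the H-tree data path.

    import numpy as np

    def slave_partial_sum(W, x, cols):
        """One slave module: multiply its slice of x by the matching columns of W."""
        return W[:, cols] @ x[cols]

    def h_tree_accumulate(partials):
        """Pairwise accumulation of partial sums (assumes a power-of-two count)."""
        while len(partials) > 1:
            partials = [partials[i] + partials[i + 1] for i in range(0, len(partials), 2)]
        return partials[0]

    rng = np.random.default_rng(0)
    W, x, b = rng.random((8, 16)), rng.random(16), rng.random(8)
    col_slices = np.array_split(np.arange(16), 4)            # 4 slave modules (assumed)
    partials = [slave_partial_sum(W, x, cols) for cols in col_slices]
    y = np.maximum(h_tree_accumulate(partials) + b, 0.0)     # f taken as ReLU here
    assert np.allclose(y, np.maximum(W @ x + b, 0.0))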
In this structure, a storage module shared by the CPU and the artificial neural network processor is provided, and the two processors are allowed to access it directly and to read data into the CPU's cache and the artificial neural network processor's cache unit, respectively. When the CPU is about to change data in its cache, it uses a write-through mode: the corresponding position of the data in the cache is changed, the corresponding position of the data in the external storage module is changed, and at the same time an invalidation signal is sent for the corresponding data in the artificial neural network processor. When the artificial neural network processor later uses the data and finds the invalidation signal, it reads the new value from the external storage module and writes it to the corresponding position of its internal cache unit. In addition, for data held by the CPU, the artificial neural network processor can complete the data interaction through a defined rule: it sends a request signal and the corresponding data address to the CPU, and the CPU replies with a valid signal and the data after receiving the request. Therefore, for a heterogeneous multiprocessor structure, the data sharing system provided by this embodiment can reduce storage communication overhead and reduce data access latency by maintaining the same storage space.
Each processor may contain multiple cores, core-internal storage modules, and a core-external storage module, and the data in the core-external storage module can be accessed directly by some or all of the cores. In some embodiments of the present disclosure, as shown in fig. 7, a data sharing system is proposed in which the at least two processing modules are two processor cores, data sharing between the two processing modules is realized through the core-internal storage modules, and the storage module refers to the core-external storage module. In this embodiment, when one core wants to access the core-internal storage module of another core, it may do so via a communication protocol. The core-external storage module allows core 1 and core 2 to access it, and core 1 and core 2 read the required data into the corresponding positions of core-internal storage module 1 and core-internal storage module 2, respectively. The consistency between the data in the core-external storage module and in the core-internal storage modules is maintained by some consistency protocol. In the prior art, when core 1 changes data in its core-internal storage module, only the data at the corresponding position in core-internal storage module 1 is changed, in a "write back" manner, while the core-external storage module sends an invalidation signal to core-internal storage module 2. When that data in core-internal storage module 1 is swapped out, or when core 2 uses the data and finds the invalidation signal, the new value is read from the core-external storage module and written to the corresponding position in core-internal storage module 2. In this embodiment, however, for data in core-internal storage module 1, core 2 may also complete the data interaction through a certain defined rule by first sending a request signal and the corresponding data address to core 1, with core 1 replying with a valid signal and the data after receiving the request. The cores may be the same, for example two neural network cores, or different, for example a neural network core and a CPU core. In this way the data is protected to a certain extent, cores of the same or different structures are allowed to access the stored data, and data consistency is maintained, while memory access overhead and memory access latency are reduced.
Each neural network core may include a plurality of neural network arithmetic units. Thus, as shown in fig. 8, in some embodiments of the present disclosure a data sharing system is proposed in which the at least two processing modules are three arithmetic units; the three arithmetic units can directly access the in-core storage module and can also transmit related data directly in a certain direction, which helps reduce the number of storage-module accesses caused by data transfer between the arithmetic units, thereby reducing power consumption and access latency. Suppose, for a completed neural network operation, that arithmetic unit 1 computes output value out1 with neurons n = (n1, n2, ..., nk) and synapse values w = (w1, w2, ..., wk), so that out1 = n1*w1 + n2*w2 + ... + nk*wk. Similarly, the output of arithmetic unit 2 is out2, with corresponding neurons m = (m1, m2, ..., mk) and synapse values w = (w1, w2, ..., wk), so that out2 = m1*w1 + m2*w2 + ... + mk*wk. The output of arithmetic unit 3 is out3, with corresponding neurons q = (q1, q2, ..., qk) and synapse values w = (w1, w2, ..., wk), so that out3 = q1*w1 + q2*w2 + ... + qk*wk. Specifically, arithmetic unit 1 first reads n and w from the in-core storage module and directly performs the operation to obtain out1; arithmetic unit 2 reads m from the in-core storage module, receives the synapse values w forwarded from arithmetic unit 1, and performs the corresponding operation to obtain out2; arithmetic unit 3 reads q from the in-core storage module, receives the synapse values w from arithmetic unit 1, and performs the corresponding operation to obtain out3. This reduces the number of accesses to the in-core storage module, reduces latency and power consumption, increases operation speed, and saves operation energy.
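The weight-forwarding idea above can be checked with a tiny numerical sketch: the synapse vector w is read from the in-core storage module only once (by unit 1) and then reused by units 2 and 3. The vector values are arbitrary assumptions.

    # Numerical sketch of the weight-forwarding example above.

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    core_memory = {
        "n": [1.0, 2.0, 3.0],
        "m": [0.5, 0.5, 0.5],
        "q": [2.0, 0.0, 1.0],
        "w": [0.1, 0.2, 0.3],
    }

    w = core_memory["w"]                     # unit 1 is the only unit that reads w
    out1 = dot(core_memory["n"], w)          # unit 1: n1*w1 + n2*w2 + ... + nk*wk
    out2 = dot(core_memory["m"], w)          # unit 2 reuses the forwarded w
    out3 = dot(core_memory["q"], w)          # unit 3 reuses the forwarded w
    print(out1, out2, out3)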
In some embodiments of the present disclosure, in the data sharing system of the previous embodiment, one or more layers of storage units may further be added inside the core, allowing one storage unit to be shared by several arithmetic units or to be private to a single arithmetic unit. As shown in fig. 9, assume that the sharing system includes two storage units: storage unit 1 is shared by arithmetic unit 1 and arithmetic unit 2, so arithmetic unit 1 and arithmetic unit 2 can access it directly while arithmetic unit 3 cannot; storage unit 2 is private to arithmetic unit 3, so arithmetic unit 3 can access it directly while arithmetic unit 1 and arithmetic unit 2 cannot. Therefore, if arithmetic unit 1 wants to access a result computed by arithmetic unit 3, the result can be obtained directly through arithmetic unit 3, without the long path in which storage unit 2 first updates the in-core storage module, the in-core storage module is then read through storage unit 1, and only then arithmetic unit 1 is allowed to access the data. In this way the data is effectively protected, since arithmetic units without permission (such as arithmetic unit 1) cannot arbitrarily change a storage unit (such as storage unit 2); at the same time the number of memory accesses can be greatly reduced, the waste of on-chip storage resources caused by storing multiple copies of the same data on chip is avoided, latency and power consumption are reduced, and operation speed and operation energy efficiency are further improved.
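For illustration, the access-permission arrangement above can be sketched as follows: storage unit 1 is shared by units 1 and 2, storage unit 2 is private to unit 3, and a unit without permission obtains a result directly from the owning unit instead of through the storage hierarchy. All names and the PermissionError-based modelling are assumptions for the example.

    # Sketch of shared vs. private storage units with access permissions.

    class StorageUnit:
        def __init__(self, allowed):
            self.allowed = set(allowed)      # arithmetic units permitted to access
            self.data = {}

        def access(self, unit, key):
            if unit not in self.allowed:
                raise PermissionError(f"{unit} may not access this storage unit")
            return self.data[key]

    storage1 = StorageUnit(allowed={"unit1", "unit2"})
    storage2 = StorageUnit(allowed={"unit3"})
    storage2.data["result"] = 3.14

    def fetch_result(requester):
        try:
            return storage2.access(requester, "result")
        except PermissionError:
            # No permission: ask unit 3 directly (request/valid handshake) instead
            # of routing through the in-core storage module and storage unit 1.
            return storage2.access("unit3", "result")

    print(fetch_result("unit1"))             # obtained via unit 3, not via storage unit 2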
The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present disclosure in further detail, and it should be understood that the above-mentioned embodiments are only illustrative of the present disclosure and are not intended to limit the present disclosure, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (10)

1. A data sharing system comprising a storage module and at least two processing modules, wherein:
the at least two processing modules share the storage module;
the at least two processing modules communicate with each other through a preset rule to realize data sharing,
the at least two processing modules are one of: processors with the same/different structures, processor cores with the same/different structures, and arithmetic units with the same/different structures within processor cores with the same/different structures.
2. The data sharing system of claim 1, wherein the preset rules include a communication protocol, a transfer protocol, a handshake protocol, and/or a bus protocol.
3. The data sharing system of any one of claims 1 to 2, wherein the communicating by preset rules comprises: the at least two processing modules comprise a first processing module and a second processing module, the first processing module sends a request signal and a corresponding data address to the second processing module, and the second processing module replies an effective signal and data to the first processing module according to the request signal and the corresponding data address to realize data sharing.
4. The data sharing system of any one of claims 1 to 3, wherein the at least two processing modules comprise physical processors.
5. The data sharing system of claim 4, wherein the physical processor comprises a neural network processor.
6. The data sharing system of claim 5 wherein the neural network processor comprises means for performing an artificial neural network forward operation.
7. The data sharing system of claim 6, wherein the means for performing artificial neural network forward operations comprises an instruction cache unit and a direct memory access unit, wherein:
the instruction cache unit is used for reading in the instructions through the direct memory access unit and caching the read instructions.
8. The data sharing system of claim 7, wherein the means for performing artificial neural network forward operations further comprises:
the controller unit is used for reading the instruction from the instruction cache unit and decoding the instruction into the microinstruction.
9. The data sharing system of any one of claims 7 to 8, wherein the means for performing artificial neural network forward operations further comprises an H-tree module, a master operation module, and a plurality of slave operation modules, wherein:
the H-tree module is used for transmitting the input neuron vectors of the layer to all the slave operation modules at the stage at which the calculation of reverse training of each layer of the neural network starts, and for splicing the output neuron values of all the slave operation modules step by step into an intermediate result vector;
and the main operation module is used for finishing subsequent calculation by utilizing the intermediate result vector.
10. The data sharing system of claim 9, wherein the direct memory access unit is further configured to write data from an external address space to the corresponding data cache unit of the master computing module and each slave computing module, or read data from the data cache unit to the external address space.
CN202110668344.XA 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof Pending CN113468096A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110668344.XA CN113468096A (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110668344.XA CN113468096A (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof
CN201710497394.XA CN109117415B (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201710497394.XA Division CN109117415B (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof

Publications (1)

Publication Number Publication Date
CN113468096A true CN113468096A (en) 2021-10-01

Family

ID=64822743

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110668344.XA Pending CN113468096A (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof
CN201710497394.XA Active CN109117415B (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201710497394.XA Active CN109117415B (en) 2017-06-26 2017-06-26 Data sharing system and data sharing method thereof

Country Status (1)

Country Link
CN (2) CN113468096A (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110058884B (en) * 2019-03-15 2021-06-01 佛山市顺德区中山大学研究院 Optimization method, system and storage medium for computational storage instruction set operation
CN111949317B (en) * 2019-05-17 2023-04-07 上海寒武纪信息科技有限公司 Instruction processing method and device and related product
CN110265029A (en) * 2019-06-21 2019-09-20 百度在线网络技术(北京)有限公司 Speech chip and electronic equipment
CN110889500A (en) * 2019-12-09 2020-03-17 Oppo广东移动通信有限公司 Shared data storage module, neural network processor and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246466A (en) * 2007-11-29 2008-08-20 华为技术有限公司 Management method and device for sharing internal memory in multi-core system
US20100125717A1 (en) * 2008-11-17 2010-05-20 Mois Navon Synchronization Controller For Multiple Multi-Threaded Processors
CN106164874A (en) * 2015-02-16 2016-11-23 华为技术有限公司 The access method of data access person catalogue and equipment in multiple nucleus system

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1992005490A1 (en) * 1990-09-18 1992-04-02 Fujitsu Limited Exclusive control method for shared memory
KR100230454B1 (en) * 1997-05-28 1999-11-15 윤종용 Cache memory testing method in multiprocessor system
CN1531684A (en) * 2001-06-29 2004-09-22 �ʼҷ����ֵ������޹�˾ Data processing apparatus and method for operating data processing apparatus
US20060041715A1 (en) * 2004-05-28 2006-02-23 Chrysos George Z Multiprocessor chip having bidirectional ring interconnect
KR100725100B1 (en) * 2005-12-22 2007-06-04 삼성전자주식회사 Multi-path accessible semiconductor memory device having data transfer mode between ports
US8677075B2 (en) * 2010-05-18 2014-03-18 Lsi Corporation Memory manager for a network communications processor architecture
CN102741828B (en) * 2009-10-30 2015-12-09 英特尔公司 To the two-way communication support of the heterogeneous processor of computer platform
CN101980149B (en) * 2010-10-15 2013-09-18 无锡中星微电子有限公司 Main processor and coprocessor communication system and communication method
CN102184157B (en) * 2011-05-19 2012-10-10 华东师范大学 Information display device based on dual processor cooperation
CN103347037A (en) * 2013-05-29 2013-10-09 成都瑞科电气有限公司 WCF realization-based communication front-end processor system and communicating method
US20150012711A1 (en) * 2013-07-04 2015-01-08 Vakul Garg System and method for atomically updating shared memory in multiprocessor system
US10915468B2 (en) * 2013-12-26 2021-02-09 Intel Corporation Sharing memory and I/O services between nodes
US9971397B2 (en) * 2014-10-08 2018-05-15 Apple Inc. Methods and apparatus for managing power with an inter-processor communication link between independently operable processors
CN104699631B (en) * 2015-03-26 2018-02-02 中国人民解放军国防科学技术大学 It is multi-level in GPDSP to cooperate with and shared storage device and access method
CN106407145A (en) * 2015-08-03 2017-02-15 联想(北京)有限公司 An interface access method and system and a memory card
CN106502806B (en) * 2016-10-31 2020-02-14 华为技术有限公司 Bus protocol command processing device and related method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246466A (en) * 2007-11-29 2008-08-20 华为技术有限公司 Management method and device for sharing internal memory in multi-core system
US20100125717A1 (en) * 2008-11-17 2010-05-20 Mois Navon Synchronization Controller For Multiple Multi-Threaded Processors
CN106164874A (en) * 2015-02-16 2016-11-23 华为技术有限公司 The access method of data access person catalogue and equipment in multiple nucleus system

Also Published As

Publication number Publication date
CN109117415A (en) 2019-01-01
CN109117415B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
US11580367B2 (en) Method and system for processing neural network
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
KR102175044B1 (en) Apparatus and method for running artificial neural network reverse training
US20190057302A1 (en) Memory device including neural network processor and memory system including the memory device
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN109993285B (en) Apparatus and method for performing artificial neural network forward operations
CN107341542B (en) Apparatus and method for performing recurrent neural networks and LSTM operations
US10452538B2 (en) Determining task scores reflective of memory access statistics in NUMA systems
JP7451614B2 (en) On-chip computational network
CN109117415B (en) Data sharing system and data sharing method thereof
US10922258B2 (en) Centralized-distributed mixed organization of shared memory for neural network processing
US20240320045A1 (en) Memory sharing for machine learning processing
Pinkham et al. Near-sensor distributed DNN processing for augmented and virtual reality
US10915445B2 (en) Coherent caching of data for high bandwidth scaling
CN111752879B (en) Acceleration system, method and storage medium based on convolutional neural network
KR20220160637A (en) Distributed Graphics Processor Unit Architecture
Chang et al. A reconfigurable neural network processor with tile-grained multicore pipeline for object detection on FPGA
KR20230063791A (en) AI core, AI core system and load/store method of AI core system
JP7413549B2 (en) Shared scratchpad memory with parallel load stores
US7594080B2 (en) Temporary storage of memory line while waiting for cache eviction
KR20210081663A (en) Interconnect device, operation method of interconnect device, and artificial intelligence(ai) accelerator system
Igual et al. Scheduling algorithms‐by‐blocks on small clusters
CN115205092A (en) Graphical execution of dynamic batch components using access request response
US10620958B1 (en) Crossbar between clients and a cache
CN113434813A (en) Matrix multiplication method based on neural network and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination