CN108234147A

CN108234147A - DMA broadcast data transmission method based on host counting in GPDSP

Info

Publication number: CN108234147A
Application number: CN201711480231.7A
Authority: CN
Inventors: 马胜; 雷元武; 张美迪; 万江华; 陈胜刚; 李勇; 彭元喜; 孙书为
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2018-06-29
Anticipated expiration: 2037-12-29
Also published as: CN108234147B

Abstract

The invention discloses a host-counting-based DMA broadcast data transmission method in a GPDSP, comprising the following steps: a host DMA starts a DMA broadcast data transmission, generates a broadcast read request, and sends it off-core through the on-chip network; the host DMA receives the read return data for each slave DMA and counts it to confirm whether the data transmission is complete; once the transmission is confirmed complete, the host DMA issues a buffer-clear command to all slave DMAs, and each slave DMA, on receiving the command, performs the buffer-clear operation to end the broadcast transmission. The invention implements DMA broadcast data transmission with a single DMA transfer transaction, and offers a simple implementation principle, low cost, low DMA transfer power consumption and start-up overhead, high data transmission efficiency and DDR read efficiency, and high transmission bandwidth.

Description

DMA broadcast data transmission method based on host count in GPDSP

Technical Field

The invention relates to the technical field of GPDSPs (General Purpose Digital Signal Processors), and in particular to a host-counting-based DMA (Direct Memory Access) broadcast data transmission method in a GPDSP.

Background

A GPDSP is a new architecture that retains the basic characteristics of an embedded DSP and its advantages of high performance and low power consumption while efficiently supporting general-purpose scientific computing. It overcomes the problems that conventional DSPs face in scientific computing and can efficiently support both 64-bit high-performance computers and embedded high-precision signal processing. The architecture has the following features: ① direct representation of double-precision floating-point and 64-bit fixed-point data, with general-purpose registers, data buses, and instruction widths of 64 bits or more and an address bus of 40 bits or more; ② tightly coupled heterogeneous CPU and DSP cores, where the CPU core supports a full operating system and the scalar unit of the DSP core supports an operating-system microkernel; ③ a unified programming model covering the CPU core, the DSP core, and the vector array structure inside the DSP core; ④ support for cross-emulation debugging from another machine, together with a local CPU-hosted debugging mode; ⑤ retention of the basic characteristics of an ordinary DSP apart from word width.

A GPDSP usually combines multiple homogeneous 64-bit processing units into a processing array to obtain high floating-point performance. However, because the volume of data a GPDSP must process is enormous, large amounts of data have to be exchanged between its in-core and off-core storage components. Data held in off-core storage space must first be moved into in-core storage space so that the core can compute on it, and the results must then be moved back to off-core storage space to be saved. The data transfer rate between in-core and off-core storage components therefore becomes the key factor limiting GPDSP processing speed; like general-purpose processors, GPDSPs face the "memory wall" problem.

DMA can move data at high speed in the background while the processing core computes, without any involvement of the processing core, and so alleviates the "memory wall" problem. Because DMA overlaps the core's computation with data movement between storage components, it reduces, to some extent, the impact of the in-core/off-core transfer rate on GPDSP processing performance. However, as the number of processing cores integrated in a GPDSP keeps growing, existing DMA transfer schemes can no longer meet the data demands of multi-core parallel processing. An efficient multi-core DMA design must take into account both the memory-access requirements of applications and the hardware structure of the multi-core GPDSP.

When common algorithms and applications such as matrix multiplication, the fast Fourier transform, and HPL (High Performance Linpack) are parallelized on a multi-core GPDSP, all cores may access the same region of storage over some period of time. For example, in a GEMM matrix multiplication (C += AB), A is a shared matrix needed by every DSP core. With conventional DMA transfers, every DSP core initiates a point-to-point transfer to read the data block at the same DDR location. Because each core sits at a different distance from the DDR, the data being read by different cores may lie on different DDR pages, which causes DDR page misses, more DDR page switches, and longer memory-access latency, greatly reducing DDR read efficiency. If many or all cores start DMA transfer transactions, this not only consumes a large amount of power and loads the network, but also leads to contention or page misses when accessing the off-core DDR storage space.

Summary of the Invention

The technical problem to be solved by the invention is as follows: in view of the problems in the prior art, the invention provides a host-counting-based DMA broadcast data transmission method in a GPDSP that has a simple implementation principle, low cost, low DMA transfer power consumption and start-up overhead, high data transmission efficiency and DDR read efficiency, and high transmission bandwidth.

To solve the above technical problem, the invention proposes the following technical solution:

A host-counting-based DMA broadcast data transmission method in a GPDSP, comprising: a host DMA starts a DMA broadcast data transmission, generates a broadcast read request, and sends it through the on-chip network to off-core storage space; the off-core storage space sends the read return data to the on-chip network according to the broadcast read request; each core in the GPDSP receives the read return data from the on-chip network and writes it into its in-core storage space; and the host DMA receives the read return data and counts it to confirm whether the data transmission is complete.

As a further improvement of the invention, confirming whether the transmission is complete specifically comprises: presetting broadcast transmission parameters comprising a source frame count SrcArrCnt, a source-frame remaining unit count SrcEleCnt, a destination frame count DstArrCnt, and a destination-frame remaining unit count DstEleCnt, where SrcArrCnt configures the number of frames to be moved from off-core storage, SrcEleCnt tracks the number of data units not yet read in the current source frame, DstArrCnt configures the number of data frames to be written into in-core storage space, and DstEleCnt tracks the number of data units not yet written in the current destination frame; and confirming whether the data transmission is complete according to the values of the broadcast transmission parameters.

As a further improvement of the invention, the source frame count SrcArrCnt, the source-frame remaining unit count SrcEleCnt, the destination frame count DstArrCnt, and the destination-frame remaining unit count DstEleCnt satisfy the following relation:

(SrcArrCnt+1)*SrcEleCnt == (DstArrCnt+1)*DstEleCnt;

where SrcArrCnt+1 is the number of frames to be moved from off-core storage, SrcEleCnt is the number of data units not yet read in the current source frame, DstArrCnt+1 is the number of data frames to be written into in-core storage space, and DstEleCnt is the number of data units not yet written in the current destination frame.
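The relation above says that the source side and the destination side must describe the same total number of data units, even if they split it into frames differently. A minimal Python sketch of that check follows; the function name and the example values are illustrative assumptions, not taken from the patent.

```python
def broadcast_params_consistent(src_arr_cnt, src_ele_cnt, dst_arr_cnt, dst_ele_cnt):
    """Return True when source and destination describe the same number of data units."""
    src_units = (src_arr_cnt + 1) * src_ele_cnt   # frames * units per frame on the source side
    dst_units = (dst_arr_cnt + 1) * dst_ele_cnt   # frames * units per frame on the destination side
    return src_units == dst_units


# 4 source frames of 96 units, written as 8 destination frames of 48 units: consistent.
print(broadcast_params_consistent(3, 96, 7, 48))   # True
# Same source, but 8 destination frames of 50 units: inconsistent.
print(broadcast_params_consistent(3, 96, 7, 50))   # False
```

A configuration routine could run such a check before starting the transfer and reject parameter sets that fail it.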

As a further improvement of the invention, the method further includes a transfer mode parameter TMODE for configuring the DMA transfer mode; when TMODE is valid, that is, when it selects broadcast mode, DMA broadcast data transmission is started.

As a further improvement of the invention, the broadcast read request includes a read return selection vector RetVec that identifies which cores the data is to be returned to; the destination cores of the read return data are determined from RetVec.

As a further improvement of the invention, the read return selection vector RetVec has multiple bits, each of which indicates whether read return data is to be returned to one of the cores participating in the transfer.

As a further improvement of the invention, the broadcast read request further includes one or more of a read address, a read mask, and a read return address.

As a further improvement of the invention, the method further comprises a buffer-clearing step once the data transmission is confirmed complete: the host DMA issues a buffer-clear command to all slave DMAs, and each slave DMA, on receiving the buffer-clear command, performs the buffer-clear operation, ending the broadcast transmission.

As a further improvement of the invention, the specific steps of the method are:

S1. Preset the broadcast transmission parameters, comprising the source frame count SrcArrCnt, the source-frame remaining unit count SrcEleCnt, the destination frame count DstArrCnt, and the destination-frame remaining unit count DstEleCnt;

S2. After the broadcast transmission parameters are configured, the host DMA starts the DMA broadcast data transmission, generates a broadcast read request according to the broadcast transmission parameters, and sends it through the on-chip network to the off-core storage space;

S3. The off-core storage space sends the read return data to the on-chip network according to the broadcast read request; each core in the GPDSP receives the read return data from the on-chip network and writes it into its in-core storage space; the host DMA receives the read return data and updates the broadcast transmission parameters to keep count;

S4. When the host DMA receives the last block of data, counting is complete. The host DMA issues a buffer-clear command to all DSP cores; each slave DMA performs the clear operation on receiving the command and, once clearing is complete, raises an interrupt request according to its interrupt enable bit. After the slave DMA buffer has been cleared, the preset internal broadcast-end register BOR is set, and the broadcast transfer transaction ends.
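Steps S1 to S4 can be sketched as a small software model. This is only an illustration under stated assumptions: the class `CoreDMA`, the function `broadcast`, and the idea that "clearing" drains a receive buffer into in-core storage are all hypothetical modelling choices, and interrupts are omitted. The point it shows is that the DDR is read once per data unit while every core ends up with the full block.

```python
from dataclasses import dataclass, field


@dataclass
class CoreDMA:
    """One DSP core's DMA: a receive buffer in front of in-core storage (hypothetical model)."""
    buffer: list = field(default_factory=list)
    local_mem: list = field(default_factory=list)
    bor: bool = False  # broadcast-end register BOR

    def receive(self, unit):
        self.buffer.append(unit)

    def clear_buffer(self):
        # Assumption: clearing drains the buffered units into in-core storage.
        self.local_mem.extend(self.buffer)
        self.buffer.clear()
        self.bor = True  # S4: BOR is set once the buffer has been cleared


def broadcast(ddr, src_arr_cnt, src_ele_cnt, cores):
    """Run S2-S4 for counters configured in S1; return the number of DDR reads."""
    remaining = (src_arr_cnt + 1) * src_ele_cnt  # host-side count of units still expected
    ddr_reads = 0
    for addr in range(remaining):
        unit = ddr[addr]        # S3: off-core storage serves each unit once...
        ddr_reads += 1
        for core in cores:      # ...and the on-chip network delivers it to every core
            core.receive(unit)
        remaining -= 1          # the host counts the returned unit
    if remaining != 0:
        raise RuntimeError("transfer incomplete")
    for core in cores:          # S4: the host tells every core to clear its buffer
        core.clear_buffer()
    return ddr_reads


ddr = list(range(1000))
cores = [CoreDMA() for _ in range(12)]
reads = broadcast(ddr, src_arr_cnt=3, src_ele_cnt=96, cores=cores)
print(reads, all(c.local_mem == ddr[:384] for c in cores))  # 384 True
```

With 12 point-to-point transfers the same model would read each DDR unit 12 times, which is the overhead the broadcast scheme is meant to avoid.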

As a further improvement of the invention, when the host DMA receives read return data in step S3, the method further comprises a data validity check: determine whether the data is valid; if it is valid, forward the data to the in-core storage space and trigger the host DMA to count; if it is invalid, trigger the host DMA to count directly.
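The validity check means that the count advances for every returned unit, while only valid units reach in-core storage. A minimal sketch, assuming a hypothetical handler `on_read_return` and a simple down-counter for the units still expected:

```python
def on_read_return(data, valid, local_mem, remaining):
    """Handle one returned unit; return the updated count of units still expected."""
    if valid:
        local_mem.append(data)   # valid data is forwarded to in-core storage
    return remaining - 1         # the host counts the return either way


mem = []
left = 3
for data, valid in [(10, True), (None, False), (30, True)]:
    left = on_read_return(data, valid, mem, left)
print(mem, left)  # [10, 30] 0
```

Counting invalid returns as well keeps the completion test independent of how many units were actually written.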

Compared with the prior art, the invention has the following advantages:

1) The host-counting-based DMA broadcast data transmission method in a GPDSP moves the same data block from off-core storage space into the in-core storage space of every DSP core on the chip in a single DMA transfer transaction. The host DMA generates the read requests and counts the transferred data blocks to confirm completion, so only one DSP core needs to start a DMA broadcast transfer transaction to broadcast a data block from the GPDSP's off-core storage space to all DSP cores on the chip. This transfer mode meets every DSP core's need for the data without all cores starting DMA transfers at the same time, which effectively reduces DMA transfer power consumption and start-up overhead and relieves congestion on the on-chip network.

2) The method can implement broadcasts such as that of matrix A in a GEMM matrix multiplication (C += AB). Because a single DMA transfer transaction satisfies every DSP core's demand for the off-core data, it greatly reduces the number of page switches in the off-core DDR storage space and the number of DDR accesses, which greatly improves DDR read efficiency and the DDR row hit rate and effectively increases transmission bandwidth.

3) The method further sets the broadcast transmission parameters SrcArrCnt (source frame count), SrcEleCnt (source-frame remaining unit count), DstArrCnt (destination frame count), and DstEleCnt (destination-frame remaining unit count), so that DMA broadcast data transmission is controlled simply by configuring these parameters. Starting a DMA broadcast transfer transaction on one DSP core is thus enough to broadcast a data block from the GPDSP's off-core storage space to all DSP cores on the chip, simply and efficiently, and the parameters allow flexible configuration.

Description of the Drawings

Fig. 1 is a schematic diagram of the GPDSP architecture used in this embodiment.

Fig. 2 is a schematic diagram of the position and working principle of the DMA within the GPDSP in this embodiment.

Fig. 3 is a schematic diagram of how DMA broadcast data transmission is implemented in a specific embodiment of the invention.

Fig. 4 is a schematic diagram of the transfer parameter words for DMA broadcast data transmission in a specific embodiment of the invention.

Fig. 5 is a flowchart of the implementation of DMA broadcast data transmission in this embodiment.

Detailed Description

The invention is further described below with reference to the accompanying drawings and specific preferred embodiments, which do not limit the scope of protection of the invention.

As shown in Figs. 1 to 5, the host-counting-based DMA broadcast data transmission method in a GPDSP of this embodiment comprises: a host DMA starts a DMA broadcast data transmission, generates a broadcast read request, and sends it through the on-chip network to off-core storage space; the off-core storage space sends the read return data to the on-chip network according to the broadcast read request; each core in the GPDSP receives the read return data from the on-chip network and writes it into its in-core storage space; and the host DMA receives the read return data and counts it to confirm whether the data transmission is complete. In other words, the DSP core that initiates the DMA transfer transaction acts as the host: it generates the read requests for the broadcast and also counts the transferred data on behalf of all the other cores to confirm that the transfer has finished.

The method of this embodiment moves the same data block from off-core storage space into the in-core storage space of every DSP core on the chip in a single DMA transfer transaction. The host DMA generates the read requests and counts the transferred data blocks to confirm completion, so only one DSP core needs to start a DMA broadcast transfer transaction to broadcast a data block from the GPDSP's off-core storage space to all DSP cores on the chip. This transfer mode meets every DSP core's need for the data without all cores starting DMA transfers at the same time, which effectively reduces DMA transfer power consumption and start-up overhead and relieves congestion on the on-chip network.

The method of this embodiment can implement broadcasts such as that of matrix A in a GEMM matrix multiplication (C += AB). Because a single DMA transfer transaction satisfies every DSP core's demand for the off-core data, it greatly reduces the number of page switches in the off-core DDR storage space and the number of DDR accesses, which greatly improves DDR read efficiency and the DDR row hit rate and effectively increases transmission bandwidth.

The GPDSP architecture used in this embodiment is shown in Fig. 1. The multi-core GPDSP consists of core nodes, IO nodes, an on-chip network, DDR controllers, and the off-core DDR storage components. Each core node contains two DSP cores, the DDR controllers control the movement of DDR data, and the on-chip network carries data between DSP cores and between the DSP cores and off-core storage space.

As shown in Fig. 2, in this embodiment the DMA inside a DSP core is connected to the SPU through the configuration bus PBUS, to the in-core storage space (the vector memory AM and the scalar memory SM) through the data bus, and to the off-core DDR storage space through the off-core bus interface. The SPU (scalar processing unit) generates the transfer parameter words for the DMA, so that the DMA can actively move data from in-core to off-core storage space or from off-core to in-core storage space; the DMA can also passively receive read and write requests from the on-chip network.

In this embodiment, confirming whether the transmission is complete specifically comprises presetting broadcast transmission parameters comprising the source frame count SrcArrCnt, the source-frame remaining unit count SrcEleCnt, the destination frame count DstArrCnt, and the destination-frame remaining unit count DstEleCnt. SrcArrCnt configures the number of frames to be moved from off-core storage: the number of data frames to move is SrcArrCnt+1. SrcEleCnt tracks the number of data units not yet read in the current source frame, a data unit being the minimum granularity of a DMA transfer in the GPDSP. DstArrCnt configures the number of data frames written into in-core storage space: the number of frames written is DstArrCnt+1. DstEleCnt tracks the number of data units not yet written in the current destination frame. Whether the data transmission is complete is confirmed from the values of these broadcast transmission parameters.

By setting the broadcast transmission parameters SrcArrCnt, SrcEleCnt, DstArrCnt, and DstEleCnt, this embodiment makes DMA broadcast data transmission controllable simply by configuring those parameters, so that starting a DMA broadcast transfer transaction on one DSP core is enough to broadcast a data block from the GPDSP's off-core storage space to all DSP cores on the chip, simply and efficiently.

In this embodiment, the source frame count SrcArrCnt, the source-frame remaining unit count SrcEleCnt, the destination frame count DstArrCnt, and the destination-frame remaining unit count DstEleCnt satisfy the following relation:

(SrcArrCnt+1)*SrcEleCnt == (DstArrCnt+1)*DstEleCnt

When the DMA of this embodiment performs a broadcast data transmission, the host DMA issues broadcast read requests for a data block in the off-core DDR storage space, and the returned data is sent to all DSP cores. SrcArrCnt encodes the number of frames to move from off-core storage, which is SrcArrCnt+1; SrcEleCnt is the number of units remaining in the current source frame, a data unit being the smallest unit of DMA transfer; and the size of the broadcast transfer is (SrcArrCnt+1)*SrcEleCnt. When SrcEleCnt reaches 0, the read requests for the current frame have all been computed and SrcArrCnt is decremented by 1; when SrcArrCnt and SrcEleCnt are both 0, read-request computation is complete. DstArrCnt is the destination frame count and DstEleCnt is the number of units remaining in the current destination frame, subject to the relation above.
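The counting rule described above can be sketched as a two-level down-counter. The class `HostCounter` is a hypothetical model; in particular, reloading SrcEleCnt with the per-frame unit count when a frame finishes is an assumption, since the text only states that SrcArrCnt is decremented when SrcEleCnt reaches 0.

```python
class HostCounter:
    """Two-level down-counter over frames (arr_cnt) and units within a frame (ele_cnt)."""

    def __init__(self, arr_cnt, ele_cnt):
        if arr_cnt < 0 or ele_cnt <= 0:
            raise ValueError("arr_cnt must be >= 0 and ele_cnt must be > 0")
        self.arr_cnt = arr_cnt        # frames still to come after the current one
        self.ele_cnt = ele_cnt        # units remaining in the current frame
        self._frame_len = ele_cnt     # assumed reload value for each new frame

    @property
    def done(self):
        # Finished only when both counters have reached zero.
        return self.arr_cnt == 0 and self.ele_cnt == 0

    def consume(self):
        """Account for one data unit."""
        if self.done:
            raise RuntimeError("counter already finished")
        self.ele_cnt -= 1
        if self.ele_cnt == 0 and self.arr_cnt > 0:
            self.arr_cnt -= 1                 # current frame finished, move to the next one
            self.ele_cnt = self._frame_len    # assumption: unit counter is reloaded


counter = HostCounter(arr_cnt=3, ele_cnt=96)
steps = 0
while not counter.done:
    counter.consume()
    steps += 1
print(steps)  # 384, i.e. (3 + 1) * 96
```

The number of steps equals (SrcArrCnt+1)*SrcEleCnt, matching the transfer size given in the text.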

This embodiment further includes a transfer mode parameter TMODE for configuring the DMA transfer mode; when TMODE is valid, DMA broadcast data transmission is started. Specifically, TMODE = 2'b11 selects broadcast data transmission, i.e. when TMODE = 2'b11 the host DMA starts the broadcast data transmission.

In this embodiment, the broadcast read request includes a read return selection vector RetVec that identifies which cores the data is to be returned to, and the destination cores of the read return data are determined from RetVec. That is, the broadcast read request carries the data-return flags, and RetVec determines for each DSP core whether read return data is to be returned to it. RetVec has n bits in total, one per DSP core, each indicating whether data is to be returned to the corresponding core. The broadcast read request also includes a read address, a read mask, a read return address, and so on; in other words, the read request carries the read address, read mask, read return address, RetVec, and related information.

In a specific embodiment, when TMODE = 2'b11 (i.e. the DMA performs broadcast data transmission) and the broadcast transmission parameters have been configured, the DMA initiates the broadcast data transmission. From the broadcast parameters SrcArrCnt, SrcEleCnt, DstArrCnt, and DstEleCnt, the DMA generates broadcast read requests and sends them to the on-chip network. Each read request includes the read address, read mask, read return address, and the read return selection vector RetVec. RetVec has n bits; in a broadcast transfer all n bits of RetVec are 1, meaning that the read return data is returned to every DSP core. The off-core storage space returns data to the on-chip network according to the read request, and all slave DMAs passively receive the data over the on-chip network and write it into their cores.
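RetVec behaves like an n-bit mask with one bit per DSP core. The sketch below builds and decodes such a mask; the helper names and the mapping of bit i to core i are assumptions, since the text does not specify the bit order.

```python
def make_ret_vec(target_cores, n_cores):
    """Build an n-bit selection vector with bit i set for each selected core i (assumed order)."""
    vec = 0
    for core in target_cores:
        if not 0 <= core < n_cores:
            raise ValueError(f"core index {core} out of range")
        vec |= 1 << core
    return vec


def cores_selected(ret_vec, n_cores):
    """List the core indices whose bit is set in ret_vec."""
    return [c for c in range(n_cores) if (ret_vec >> c) & 1]


# Broadcast on a 12-core chip: every bit is 1.
all_cores = make_ret_vec(range(12), 12)
print(f"{all_cores:012b}")                                   # 111111111111
# A partial selection, for comparison.
print(cores_selected(make_ret_vec([0, 5, 11], 12), 12))      # [0, 5, 11]
```

In the broadcast case described above, the vector is simply all ones, so every core accepts the returned data.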

Building on the point-to-point DMA transfer mode, this embodiment controls the transfer by configuring five parameters: the transfer mode TMODE, the source frame count SrcArrCnt, the source-frame remaining unit count SrcEleCnt, the destination frame count DstArrCnt, and the destination-frame remaining unit count DstEleCnt. The host DMA generates the broadcast data transfer requests from the configured parameters and keeps count of the moved data until the broadcast ends, so that starting a DMA broadcast transfer transaction on one DSP core is enough to broadcast a data block from the GPDSP's off-core storage space to all DSP cores on the chip.

This embodiment further comprises a buffer-clearing step once the data transmission is confirmed complete: after the host DMA has finished counting, it issues a buffer-clear command to all slave DMAs; each slave DMA, on receiving the buffer-clear command, performs the buffer-clear operation, and the transfer transaction ends once clearing is complete.

In a specific embodiment of the invention, DMA broadcast data transmission is implemented as shown in Fig. 3. The chip has 12 DSP cores, each with its own DMA and LM, where LM is the in-core storage space (comprising the vector memory AM and the scalar memory SM). Array denotes a frame being moved, and C0 to C11 denote the data blocks in each row, each 512 bits, i.e. 8 words, in size. The data block broadcast here is 4 x 96 words: the DMA initiates the broadcast data transmission and transfers 4 frames in total, each of 96 words. Data is moved out of the DDR in the direction of the arrows in the figure; the DMA finishes moving one DDR page before turning to the next page. The DDR sends data to the on-chip network according to the read requests, and the slave DSPs passively receive the data over the network. The broadcast transmission mode of this embodiment therefore greatly reduces the number of DDR page switches and the transfer latency, reduces the number of DDR accesses, and improves both transmission bandwidth and DDR read efficiency.
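The sizes in this example can be checked with a few lines of arithmetic. The constant names are illustrative; the 64-bit word size follows from 512 bits being 8 words, and the point-to-point figure assumes that each of the 12 cores would otherwise read the whole block from DDR on its own.

```python
BLOCK_BITS = 512          # size of one block C0..C11
WORDS_PER_BLOCK = 8       # 512 bits is stated to be 8 words
BLOCKS_PER_FRAME = 12     # C0..C11
FRAMES = 4
CORES = 12

word_bits = BLOCK_BITS // WORDS_PER_BLOCK                # 64-bit words
words_per_frame = BLOCKS_PER_FRAME * WORDS_PER_BLOCK     # 96 words per frame
total_words = FRAMES * words_per_frame                   # 384 words per broadcast
total_bytes = total_words * word_bits // 8               # 3072 bytes
point_to_point_words = CORES * total_words               # 4608 words read from DDR without broadcast

print(word_bits, words_per_frame, total_words, total_bytes, point_to_point_words)
# 64 96 384 3072 4608
```

Under that assumption the broadcast reads one twelfth of the DDR data that 12 independent point-to-point transfers would read.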

Figure 4 shows the transmission parameter word for DMA broadcast data transmission in a specific embodiment of the invention. It comprises the transmission mode TMODE, the source frame number SrcArrCnt, the source frame remaining unit number SrcEleCnt, the destination frame number DstArrCnt, and the destination frame remaining unit number DstEleCnt. TMODE is 2 bits wide; when TMODE is 2'b11, the DMA starts broadcast data transmission and moves the same data block from the off-core storage space DDR into the 12 DSP cores. SrcArrCnt is the source frame number; it is 32 bits wide, and the maximum number of frames is 2^32. SrcEleCnt is the number of remaining units in the current source frame; it is 32 bits wide, with a maximum value of 2^32 - 1. DstArrCnt is the destination frame number; it is 32 bits wide, and the maximum number of frames is 2^32. DstEleCnt is the number of remaining units in the current destination frame, with a maximum value of 2^32 - 1.
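As a minimal sketch of how these parameters fit together, the following Python model stores the five fields and checks their ranges. The bit positions within the parameter word are defined by Figure 4, which is not reproduced here, so no packing is attempted. The class and method names are hypothetical, the 32-bit width of DstEleCnt is assumed (the text gives only its maximum value), and the consistency check is the relation stated in claim 3, where SrcArrCnt + 1 and DstArrCnt + 1 are the frame counts.

```python
from dataclasses import dataclass

TMODE_BROADCAST = 0b11   # 2'b11 starts a broadcast transmission (from the text)
U32_MAX = 2**32 - 1      # largest value of a 32-bit field


@dataclass
class BroadcastParams:
    """Hypothetical software model of the broadcast transmission parameters."""
    tmode: int
    src_arr_cnt: int  # SrcArrCnt: source frames = SrcArrCnt + 1 (claim 3)
    src_ele_cnt: int  # SrcEleCnt: remaining units in the current source frame
    dst_arr_cnt: int  # DstArrCnt: destination frames = DstArrCnt + 1 (claim 3)
    dst_ele_cnt: int  # DstEleCnt: remaining units in the current destination frame

    def validate(self) -> None:
        if not 0 <= self.tmode <= 0b11:
            raise ValueError("TMODE is a 2-bit field")
        for name in ("src_arr_cnt", "src_ele_cnt", "dst_arr_cnt", "dst_ele_cnt"):
            value = getattr(self, name)
            if not 0 <= value <= U32_MAX:  # DstEleCnt width is an assumption
                raise ValueError(f"{name} does not fit in 32 bits")
        # Relation from claim 3: source and destination describe the same amount of data.
        src_units = (self.src_arr_cnt + 1) * self.src_ele_cnt
        dst_units = (self.dst_arr_cnt + 1) * self.dst_ele_cnt
        if src_units != dst_units:
            raise ValueError("source and destination unit counts differ")

    def is_broadcast(self) -> bool:
        return self.tmode == TMODE_BROADCAST


# The Figure 3 transmission (4 frames of 96 words), assuming one unit is one word.
params = BroadcastParams(TMODE_BROADCAST, 3, 96, 3, 96)
params.validate()
```

Because the frame count is the field value plus one, a 32-bit field can describe up to 2^32 frames, which matches the maximum given above.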

As shown in Figure 5, the specific steps by which this embodiment implements DMA broadcast data transmission in the GPDSP are:

S2. After the broadcast transmission parameters are configured, the host DMA starts DMA broadcast data transmission and generates a broadcast read request according to the broadcast transmission parameters SrcArrCnt, SrcEleCnt, DstArrCnt and DstEleCnt. The read request includes the read address, the read mask, the read return address and the read return selection vector RetVec. The generated broadcast read request is sent to the off-core storage space via the on-chip network;

S3. The off-core storage space sends the read return data to the on-chip network according to the broadcast read request; each core in the GPDSP receives the read return data from the on-chip network and writes it into its in-core storage space; the host DMA receives the read return data and updates the broadcast transmission parameters to count the data transmitted to its own core;
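The broadcast read request of step S2 can be pictured as a small record carrying the four fields named there. The sketch below is illustrative only: the field widths are not given in this section, and the mapping of bit i of RetVec to core i is an assumption, since the text says only that each bit indicates whether a participating core needs the read return data. All names are hypothetical.

```python
from dataclasses import dataclass

NUM_CORES = 12  # the embodiment in Figure 3 has 12 DSP cores


@dataclass(frozen=True)
class BroadcastReadRequest:
    """Hypothetical model of the request fields named in step S2."""
    read_addr: int    # read address in the off-core storage space
    read_mask: int    # read mask
    return_addr: int  # read return address
    ret_vec: int      # RetVec: bit i set means core i needs the data (assumed mapping)


def make_ret_vec(core_ids):
    """Build a RetVec with one bit set per target core."""
    vec = 0
    for core_id in core_ids:
        if not 0 <= core_id < NUM_CORES:
            raise ValueError(f"core id {core_id} out of range")
        vec |= 1 << core_id
    return vec


def target_cores(ret_vec):
    """List the cores whose RetVec bit is set."""
    return [i for i in range(NUM_CORES) if (ret_vec >> i) & 1]


# A request that returns data to all 12 cores; the address values are placeholders.
request = BroadcastReadRequest(read_addr=0x0, read_mask=0xFF, return_addr=0x0,
                               ret_vec=make_ret_vec(range(NUM_CORES)))
```

A bit vector lets a single request name any subset of cores, so the on-chip network can deliver one read return to every selected core.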

In this embodiment, when the host DMA receives read return data in step S3, a data validity check is also performed. Specifically, the host DMA determines whether the data is valid: if it is valid, the data is forwarded to the in-core storage space and the host DMA count is started; if it is invalid, the host DMA count is started directly.

As shown in Figure 5, once the broadcast data transmission parameter word has been configured, the host DMA initiates the broadcast data transmission and sends a broadcast read request to the off-core storage space DDR. DDR returns the read return data to the on-chip network; the host DMA receives the read return data for each core from the on-chip network, and all other cores receive the read return data passively from the on-chip network. If the host DMA finds the data valid, it writes the data into its own core's in-core storage space; if the data is invalid, the host DMA only counts it. When the host finishes counting, it sends a clear-buffer command to the other cores; each slave sets its broadcast end flag register to 1 and, after completing the clear operation, issues an interrupt if its interrupt enable bit is 1.
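The flow in Figure 5 can be sketched as a small software simulation. It is a behavioural model under stated assumptions, not the hardware design. The host keeps a single remaining-word counter in place of the SrcArrCnt/SrcEleCnt/DstArrCnt/DstEleCnt updates. Each slave keeps a staged copy of received words in a buffer until the clear command arrives, which is assumed because the text does not say what the buffer holds. The validity flag is supplied by the caller. All class and function names are hypothetical.

```python
class SlaveDMA:
    """Slave side: receives broadcast data passively and never counts."""

    def __init__(self, irq_enable=False):
        self.local_mem = []      # in-core storage space (LM)
        self.buffer = []         # receive buffer; its contents are an assumption
        self.broadcast_end = 0   # broadcast end flag register
        self.irq_enable = irq_enable
        self.irq_raised = False

    def receive(self, word):
        self.buffer.append(word)     # staged copy kept until cleared (assumption)
        self.local_mem.append(word)  # write into in-core storage, as in step S3

    def clear_buffer(self):
        self.broadcast_end = 1       # mark the broadcast as finished
        self.buffer.clear()
        if self.irq_enable:          # interrupt only if the enable bit is 1
            self.irq_raised = True


class HostDMA:
    """Host side: the only DMA that counts the returned data."""

    def __init__(self, frames, words_per_frame):
        self.remaining = frames * words_per_frame  # simplified single counter
        self.local_mem = []

    def receive(self, word, valid):
        if valid:
            self.local_mem.append(word)  # valid data is written to in-core storage
        self.remaining -= 1              # counted whether valid or not
        return self.remaining == 0       # True once the last word has arrived


def broadcast(ddr_words, host, slaves, host_valid=True):
    """Deliver each word to every core; return True if the host count completes."""
    for word in ddr_words:
        for slave in slaves:
            slave.receive(word)
        if host.receive(word, host_valid):
            for slave in slaves:         # host issues the clear-buffer command
                slave.clear_buffer()
            return True
    return False
```

Keeping the count in the host alone means the slaves need no completion logic of their own; they learn that the transmission has ended only from the clear-buffer command.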

The above are only preferred embodiments of the present invention and do not limit the invention in any form. Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, shall fall within the scope of protection of the technical solution of the invention.

Claims

1. A DMA broadcast data transmission method based on host counting in a GPDSP, characterized in that the method comprises: a host DMA starts DMA broadcast data transmission, generates a broadcast read request and sends it via the on-chip network to the off-core storage space; the off-core storage space sends read return data to the on-chip network according to the broadcast read request; each core in the GPDSP receives the read return data from the on-chip network and writes it into its in-core storage space; and the host DMA receives the read return data and counts it to confirm whether the data transmission is complete.

2. The DMA broadcast data transmission method based on host counting in a GPDSP according to claim 1, characterized in that confirming whether the data transmission is complete specifically comprises: setting in advance broadcast transmission parameters comprising a source frame number SrcArrCnt, a source frame remaining unit number SrcEleCnt, a destination frame number DstArrCnt and a destination frame remaining unit number DstEleCnt, wherein the source frame number SrcArrCnt is used to configure the number of frames of data moved from outside the core, the source frame remaining unit number SrcEleCnt is used to count the data units not yet read in the current source frame, the destination frame number DstArrCnt is used to configure the number of data frames written into the in-core storage space, and the destination frame remaining unit number DstEleCnt is used to count the data units not yet written in the current destination frame; and confirming whether the data transmission is complete according to the values of the broadcast transmission parameters.

3. The DMA broadcast data transmission method based on host counting in a GPDSP according to claim 2, characterized in that the source frame number SrcArrCnt, the source frame remaining unit number SrcEleCnt, the destination frame number DstArrCnt and the destination frame remaining unit number DstEleCnt satisfy the following formula:

(SrcArrCnt+1)*SrcEleCnt==(DstArrCnt+1)*DstEleCnt;

where SrcArrCnt+1 is the number of frames of data to be moved from outside the core, SrcEleCnt is the number of data units not yet read in the current source frame, DstArrCnt+1 is the number of data frames to be written into the in-core storage space, and DstEleCnt is the number of data units not yet written in the current destination frame.

4. The DMA broadcast data transmission method based on host counting in a GPDSP according to claim 1, 2 or 3, characterized in that the method further comprises a transmission mode parameter TMODE used to configure the DMA transmission mode, and DMA broadcast data transmission is started when the transmission mode parameter TMODE is valid.

5. The DMA broadcast data transmission method based on host counting in a GPDSP according to claim 1, 2 or 3, characterized in that the broadcast read request comprises a read return selection vector RetVec used to mark the cores to which data is returned, and the target cores to which the read return data must be returned are determined according to the read return selection vector RetVec.

6. The DMA broadcast data transmission method based on host counting in a GPDSP according to claim 5, characterized in that the read return selection vector RetVec has multiple bits, each bit indicating whether a corresponding core participating in the transmission needs the read return data to be returned to it.

7. The DMA broadcast data transmission method based on host counting in a GPDSP according to claim 6, characterized in that the broadcast read request further comprises one or more of a read address, a read mask and a read return address.

8. The DMA broadcast data transmission method based on host counting in a GPDSP according to claim 1, 2 or 3, characterized in that, when the data transmission is confirmed complete, the method further comprises a buffer-clearing step, specifically: the host DMA sends a clear-buffer command to all slave DMAs, and each slave DMA receives the clear-buffer command and performs the clear-buffer operation to end the broadcast transmission.

9. The DMA broadcast data transmission method based on host counting in a GPDSP according to claim 1, 2 or 3, characterized in that the specific steps of the method are:

S1. Preset the broadcast transmission parameters including source frame number SrcArrCnt, source frame remaining unit number SrcEleCnt, destination frame number DstArrCnt and destination frame remaining unit number DstEleCnt;

S2. After the broadcast transmission parameters are configured, the host DMA starts DMA broadcast data transmission, and generates a broadcast read request according to the broadcast transmission parameters, and then sends it to the off-core storage space via the on-chip network;

S3. The off-core storage space sends the read return data to the on-chip network according to the broadcast read request; each core in the GPDSP receives the read return data from the on-chip network and writes it into its in-core storage space; and the host DMA receives the read return data and updates the broadcast transmission parameters to perform counting;

S4. When the host DMA receives the last piece of data, counting is complete; the host DMA sends a clear-buffer command to all DSP cores; each slave DMA performs the clear operation after receiving the command and, once clearing is complete, issues an interrupt request according to its interrupt enable bit; when the slave DMA buffer has been cleared, the value of the preset internal broadcast end register BOR is set, and the broadcast transmission transaction ends.

10. The DMA broadcast data transmission method based on host counting in a GPDSP according to claim 9, characterized in that, when the host DMA receives read return data in step S3, the method further comprises a data validity check, specifically: determining whether the data is valid; if it is valid, forwarding the data to the in-core storage space and starting the host DMA count; if it is invalid, starting the host DMA count directly.