CN116822595A

CN116822595A - Configuration method of processing unit PE array and related equipment

Info

Publication number: CN116822595A
Application number: CN202210264327.4A
Authority: CN
Inventors: 张鑫; 蔡兆晖; 何雷骏; 邵芳琳
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2022-03-17
Filing date: 2022-03-17
Publication date: 2023-09-29
Also published as: WO2023173912A1

Abstract

The embodiments of the application disclose a configuration method for a processing element (PE) array and related equipment, which are used for configuring the PE array. The application can be applied to a chip that comprises a processing module and a PE array. The processing module generates isomorphic features of M operators, determines a static configuration of N PEs in the PE array according to the isomorphic features, and determines M dynamic configurations based on the static configuration and the overall configurations of the M operators in the PE array, wherein each dynamic configuration is the part of the corresponding overall configuration other than the static configuration. The PE array can then be configured based on the static configuration and one of the M dynamic configurations, without switching the static configuration of the PE array, thereby reducing the switching overhead.

Description

Configuration method of processing unit PE array and related equipment

Technical Field

The application relates to the technical field of chips, and in particular to a configuration method of a processing element (PE) array and related equipment.

Background

A coarse-grained reconfigurable architecture (CGRA) chip is a new-generation programmable acceleration architecture that combines the flexibility of a field programmable gate array (FPGA) chip with the high energy efficiency of an application specific integrated circuit (ASIC) chip. The PE array in the CGRA chip is configured by configuration words, so that the CGRA chip can execute a corresponding algorithm.

Currently, the CGRA chip can configure the PE array according to an operator of a program to obtain the configured PE array. When the CGRA chip executes the operator, the operator can be executed based on the service data by only transmitting the service data to the PE array without switching the configuration of the PE array.

However, in a CGRA chip, because the number of PEs in the PE array is limited, multiple different operators often need to multiplex the same one or more PEs in the PE array. When the CGRA chip executes operator 1 and then executes operator 2, the configuration of the one or more multiplexed PEs needs to be switched, so the switching overhead is high, which restricts further improvement of the performance of the CGRA chip.

Disclosure of Invention

The embodiment of the application provides a configuration method and related equipment of a PE array, which are used for configuring the PE array.

The first aspect of the present application provides a method for configuring a processing element (PE) array, which is applicable to a chip, where the chip includes a processing module and a PE array. The processing module generates isomorphic features of M operators, where M is a positive integer, and then determines a static configuration of N PEs in the PE array according to the isomorphic features, where N is a positive integer. The processing module determines M dynamic configurations based on the static configuration and the overall configurations of the M operators in the PE array, where each dynamic configuration comprises the part of the corresponding overall configuration other than the static configuration. The PE array may then be configured based on the static configuration and one of the M dynamic configurations; that is, only the dynamic configuration needs to be switched, and the static configuration does not, which reduces the switching overhead.

In some possible implementations, the step in which the PE array is configured based on the static configuration and at least one of the M dynamic configurations may include: when a first operator of the M operators is executed, the PE array is configured based on the static configuration and a first dynamic configuration corresponding to the first operator, wherein the first dynamic configuration is one of the M dynamic configurations; when a second operator of the M operators is executed, the PE array switches the first dynamic configuration to a second dynamic configuration corresponding to the second operator, wherein the second operator is an operator executed after the first operator, and the second dynamic configuration is one of the M dynamic configurations. Therefore, the PE array only needs to switch its dynamic configuration and does not need to switch the static configuration, thereby reducing the switching overhead.
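The switching behaviour described above can be illustrated with a minimal sketch. The class `PEArray`, its write counter, and the `(node, field)` keys are hypothetical and only model how many configuration fields are rewritten; they are not taken from the application.

```python
from dataclasses import dataclass, field


@dataclass
class PEArray:
    """Toy model that counts how many configuration fields are written."""
    static: dict = field(default_factory=dict)    # loaded once, never switched
    dynamic: dict = field(default_factory=dict)   # switched for each operator
    writes: int = 0

    def load_static(self, static: dict) -> None:
        self.static = dict(static)
        self.writes += len(static)

    def switch_dynamic(self, dynamic: dict) -> None:
        self.dynamic = dict(dynamic)
        self.writes += len(dynamic)


# Assumed static configuration: routes and functions shared by both operators.
static = {(src, "route"): dst for src, dst in
          [("a", "e"), ("b", "e"), ("c", "f"), ("d", "f"), ("e", "g"), ("f", "g")]}
static.update({(n, "func"): "MUL" for n in "abcd"})
static.update({(n, "func"): "ADD" for n in "ef"})

# Assumed dynamic configurations: only the function of node g differs.
dyn_first = {("g", "func"): "ADD"}
dyn_second = {("g", "func"): "SUB"}

array = PEArray()
array.load_static(static)          # static part is written once
array.switch_dynamic(dyn_first)    # first operator
array.switch_dynamic(dyn_second)   # second operator: only the dynamic part changes

# Baseline for comparison: rewriting the overall configuration for each operator.
full_reconfig_writes = 2 * (len(static) + len(dyn_first))
```

In this toy model the static fields are written once and each operator switch rewrites only the dynamic fields, so `array.writes` is smaller than `full_reconfig_writes`.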

In some possible implementations, before the processing module performs the step of generating the isomorphic features of the M operators, the method may further include: the processing module acquires M data flow graphs corresponding to the M operators and extracts the isomorphic features according to the M data flow graphs, thereby obtaining the isomorphic features of the M operators.

In some possible implementations, the chip further includes a storage module, and before the step of configuring the PE array based on the static configuration and at least one of the M dynamic configurations, the method may further include: the storage module stores the mapping relationship between the static configuration and an index number; the processing module transmits a configuration word to the PE array, wherein the configuration word comprises the index number and at least one of the M dynamic configurations; and the PE array acquires, from the storage module, the static configuration that has the mapping relationship with the index number. When the processing module transmits multiple configuration words, the static configuration does not need to be transmitted directly and is replaced by the index number, which reduces the transmission overhead and improves the transmission efficiency of the configuration words.
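As a minimal sketch of this index-based lookup, the following code resolves a configuration word into a complete configuration. The names `ConfigWord`, `template_lib`, `resolve`, and the field names such as `"g.func"` are hypothetical illustrations, not the encoding used by the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigWord:
    index: int       # index number that refers to a stored static configuration
    dynamic: tuple   # dynamic configuration carried in the word, as (key, value) pairs


# Storage module side: mapping from index number to static configuration.
template_lib = {
    0: {"a.route": "e", "e.func": "ADD", "f.func": "ADD"},
}


def resolve(word: ConfigWord, lib: dict) -> dict:
    """Combine the static part (fetched by index) with the word's dynamic part."""
    config = dict(lib[word.index])   # static configuration looked up by index number
    config.update(word.dynamic)      # dynamic configuration taken from the word itself
    return config


word = ConfigWord(index=0, dynamic=(("g.func", "SUB"),))
full_config = resolve(word, template_lib)
```

Only the index number and the dynamic part travel in the word; the static part is fetched from the library on the receiving side.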

In some possible implementations, the configuration word further includes a configuration count, which indicates the number of times that configuration is performed based on the configuration word. Multiple configuration words with the same static configuration and the same dynamic configuration can thus be abbreviated as 1 configuration word, so that the transmission overhead is further reduced and the transmission efficiency is improved.
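The effect of the configuration count can be sketched as a run-length encoding of the configuration word stream. This is only an illustration: the tuple layout `(index, dynamic, count)` is hypothetical, and the sketch assumes that only consecutive identical words are merged.

```python
from itertools import groupby


def compress(words):
    """Merge runs of identical (index, dynamic) words into (index, dynamic, count)."""
    return [(index, dynamic, len(list(run)))
            for (index, dynamic), run in groupby(words)]


def expand(compressed):
    """Inverse of compress: repeat each word `count` times."""
    return [(index, dynamic)
            for index, dynamic, count in compressed
            for _ in range(count)]


# Five configuration words, of which only two are distinct.
words = [(0, "g=ADD")] * 3 + [(0, "g=SUB")] * 2
packed = compress(words)
```

Here `packed` holds two entries instead of five, and `expand(packed)` reproduces the original sequence.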

In some possible implementations, the isomorphic feature includes a routing configuration of each of N nodes, where any 2 of the N nodes are directly or indirectly connected, and the static configuration includes the routing configurations of the N PEs. The routing configurations of the N PEs then do not need to be modified, so that the switching overhead is reduced.

In some possible implementations, the isomorphic feature further includes a functional configuration of at least 1 node of the N nodes, and the static configuration further includes a functional configuration of at least 1 PE of the N PEs, so that the functional configuration of the at least 1 PE of the N PEs does not need to be modified, thereby reducing the switching overhead.

In some possible implementations, the chip further includes a MEM interface, where the MEM interface may obtain source code of the program and transmit the source code of the program to the processing module, so that the processing module may generate a dataflow graph of M operators based on the source code of the program, to obtain M dataflow graphs.

In some possible implementations, the processing module may extract isomorphic features from 1 data flow graph, where the isomorphic features are at least two identical local structures in the data flow graph. By multiplexing the N PEs in the PE array that correspond to the isomorphic features, the number of required PEs can be reduced and usability can be enhanced.

In some possible implementations, the dynamic configuration further includes a configuration of at least 1 PE other than the N PEs. The configuration word is then also applicable to operators that cannot be configured by an integer number of isomorphic features, thereby enhancing the applicability of the configuration word.

In some possible implementation manners, isomorphism characteristics can also be local structures with different granularities, so that the processing module can determine the isomorphism characteristics with different granularities according to requirements under different conditions, the number of required PEs is reduced, and the usability is enhanced.

In some possible implementations, the storage module includes a configuration random access memory (config RAM) and a static configuration template library (template lib), where the config RAM is used to store configuration words and the template lib is used to store the mapping relationships between index numbers and static configurations. For multiple configuration words with the same static configuration, the static configuration needs to be stored only once in the template lib, and each configuration word stored in the config RAM carries only the corresponding index number. Compared with storing the static configuration for each configuration word, the storage overhead is reduced.

In some possible implementations, the M operators may be all operators in a program or only part of the operators in a program, so that the chip may determine one or more different isomorphic features for a program as needed. The method is therefore also applicable to a program whose operators do not all share one suitable isomorphic feature, which improves applicability.

A second aspect of the application provides a chip for performing the method of any of the preceding first aspects.

A third aspect of the application provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of any of the first aspects described above.

A fourth aspect of the application provides a computer program product comprising computer-executable instructions stored on a computer-readable storage medium; the processor of the device may read the computer-executable instructions from a computer-readable storage medium, the processor executing the computer-executable instructions causing the device to implement the method provided by the first aspect or any one of the possible implementations of the first aspect.

A fifth aspect of the present application provides a communication device that may include a processor, a memory, and a communication interface. The processor is coupled with the memory and the communication interface. The memory is used for storing instructions, the processor is used for executing the instructions, and the communication interface is used for communicating with other communication devices under the control of the processor. The instructions, when executed by a processor, cause the processor to perform the method of the first aspect or any possible implementation of the first aspect.

The technical effects of the second to fifth aspects or any one of the possible implementation manners of the second to fifth aspects may be referred to the technical effects of the first aspect or the technical effects of the different possible implementation manners of the first aspect, which are not described herein.

Drawings

FIG. 1-1 is a schematic diagram of a PE array;

FIG. 1-2 is a schematic diagram of a data link in an embodiment of the present application;

FIG. 1-3 is another schematic diagram of a data link in an embodiment of the present application;

FIG. 1-4 is a schematic diagram of an embodiment of a chip according to an embodiment of the present application;

FIG. 2-1 is a schematic flow chart of embodiment one of a method for configuring a PE array according to an embodiment of the present application;

FIG. 2-2 is a schematic diagram of data flow graph 1 in an embodiment of the present application;

FIG. 2-3 is a schematic diagram of data flow graph 2 in an embodiment of the present application;

FIG. 2-4 is a schematic diagram of data flow graph 3 in an embodiment of the present application;

FIG. 2-5 is a schematic diagram of isomorphic features in an embodiment of the application;

FIG. 2-6 is a schematic diagram of dividing data flow graph 3 into local structures in an embodiment of the present application;

FIG. 2-7 is another schematic diagram of isomorphic features in an embodiment of the application;

FIG. 2-8 is another schematic diagram of isomorphic features in an embodiment of the application;

FIG. 2-9 is another schematic diagram of isomorphic features in an embodiment of the application;

FIG. 2-10 is a schematic diagram of a static configuration in an embodiment of the application;

FIG. 2-11 is a schematic diagram of data flow graph 4 in an embodiment of the present application;

FIG. 2-12 is a schematic diagram of two separate storage spaces divided in a configuration buffer (cfg buffer) according to an embodiment of the present application;

FIG. 2-13 is a schematic diagram of a cfg buffer sequentially receiving 3 configuration words transmitted by a configuration random access memory (config RAM) according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a configuration device of a PE array according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a communication device according to an embodiment of the present application.

Detailed Description

The embodiments of the application provide a configuration method of a PE array and related equipment, which are used for configuring a processing element (PE) array in a CGRA chip.

Embodiments of the present application are described below with reference to the accompanying drawings.

The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely illustrative of the manner in which embodiments of the application have been described in connection with the description of the objects having the same attributes. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

The CGRA chip is a new-generation programmable acceleration architecture that combines the flexibility of the FPGA chip with the high energy efficiency of the ASIC chip. The CGRA chip contains a PE array, which comprises a plurality of PEs and is used for executing an algorithm. Each PE is provided with a plurality of logic gates for performing corresponding operations, such as addition, subtraction, multiplication and division. The user can configure at least one PE of the PE array in the CGRA chip through a configuration word, so that the CGRA chip can execute a corresponding algorithm.

For example, please refer to fig. 1-1, which shows a 3×3 PE array in a CGRA chip, wherein each element may be denoted as PEij (i=0, 1, 2; j=0, 1, 2) and each arrow indicates the direction of direct data transfer between the connected PEs. For example, the connected PE00 and PE01 may transfer data directly to each other. When a user configures at least one PE in the PE array, a corresponding data link is formed that can be used to execute a corresponding algorithm, such as the data link shown in fig. 1-2.

Currently, the CGRA chip may configure the PE array according to an operator of the program, to obtain a data link corresponding to the operator, where the data link includes configurations of a plurality of PEs in the PE array. When the CGRA chip executes the operator, service data only need to be transmitted to the data link corresponding to the operator, and the configuration of PE in the data link does not need to be switched.

For example, the data link corresponding to operator 2 is shown in fig. 1-3, where SUB denotes subtraction. Compared with the data link shown in fig. 1-2, the only difference is that PE11 is changed from addition (ADD) to subtraction (SUB), yet the CGRA chip still needs to switch the overall configuration of the N PEs corresponding to the data link.

Therefore, the application provides a PE array configuration method and related equipment, which are used for configuring a PE array.

The application is applicable to a chip, wherein the chip comprises a processing module and a PE array. The processing module generates isomorphic features of M operators, where M is a positive integer, and then determines a static configuration of N PEs in the PE array according to the isomorphic features, where N is a positive integer. The processing module determines M dynamic configurations based on the static configuration and the overall configurations of the M operators in the PE array, where each dynamic configuration is the part of the corresponding overall configuration other than the static configuration. The PE array may then be configured based on the static configuration and one of the M dynamic configurations; that is, only the dynamic configuration needs to be switched, and the static configuration does not, which reduces the switching overhead.
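The relation "dynamic configuration = overall configuration minus static configuration" can be sketched as follows. The function `dynamic_part` and the field names such as `"g.func"` are hypothetical; the sketch assumes a configuration can be modelled as a flat mapping from field names to values.

```python
def dynamic_part(overall: dict, static: dict) -> dict:
    """Return the entries of one overall configuration not covered by the static one."""
    for key, value in static.items():
        if overall.get(key) != value:
            raise ValueError(f"overall configuration disagrees with static entry {key!r}")
    return {key: value for key, value in overall.items() if key not in static}


# Hypothetical fields: "<node>.func" is a function, "<node>.route" a destination.
static = {"e.func": "ADD", "f.func": "ADD", "e.route": "g", "f.route": "g"}
overall = {
    "operator 1": {**static, "g.func": "ADD"},
    "operator 2": {**static, "g.func": "SUB"},
}

# One dynamic configuration per operator (M = 2 in this toy example).
dynamic = {name: dynamic_part(cfg, static) for name, cfg in overall.items()}
```

Each operator ends up with a dynamic configuration that contains only the fields in which it differs from the shared static part.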

The present application is illustratively applicable to the chip 100 shown in fig. 1-4, wherein the chip 100 includes a memory (MEM) interface 110, a processing module 120, a storage module 130, and a PE array 140. It should be noted that the chip 100 may be an FPGA chip, a CGRA chip, or another chip with reconfigurable properties, which is not limited herein.

The MEM interface 110 is an interface for interaction between the internal devices of the chip 100 and external devices. Illustratively, the MEM interface 110 may receive the source code of the program and the service data from a device external to the chip 100, transmit the source code of the program to the processing module 120, and transmit the service data to the storage module 130.

A compiler 121 may be built into the processing module 120, where the compiler 121 is a logic module. The compiler 121 may be used to: generate isomorphic features of M operators based on the source code of the program, determine a static configuration of N PEs in the PE array according to the isomorphic features, and determine M dynamic configurations based on the static configuration and the overall configurations of the M operators in the PE array, wherein each dynamic configuration comprises the part of the corresponding overall configuration other than the static configuration. The compiler 121 may store the static configuration and at least one of the M dynamic configurations in the storage module 130. In some possible implementations, the compiler 121 may forward the static configuration and at least one of the M dynamic configurations to the storage module 130 through the MEM interface 110. In some possible implementations, the compiler 121 may also be directly connected to the storage module 130, thereby forwarding the static configuration and at least one of the M dynamic configurations directly to the storage module 130.

The storage module 130 may be a random access memory (RAM) built into the chip 100. The storage module 130 may transmit the static configuration and at least one of the M dynamic configurations to the PE array 140, so that the PE array 140 is configured based on the static configuration and the at least one of the M dynamic configurations.

A configuration buffer (Cfg buffer) 141 is built into the PE array 140. The Cfg buffer 141 is configured to receive the static configuration and at least one of the M dynamic configurations transmitted by the storage module 130, so that the PE array 140 is configured based on the static configuration and one of the M dynamic configurations. The processing module 120 also includes a configuration switch (Cfg switch) 122 that can be used to switch the dynamic configuration in the Cfg buffer 141.

In some possible implementations, the compiler 121 can store the static configuration and at least one of the M dynamic configurations in the storage module 130. In some possible implementations, the compiler 121 further stores the mapping relationship between the static configuration and the index number in the storage module 130, and transmits a configuration word to the PE array 140, where the configuration word includes the index number and at least one of the M dynamic configurations. Note that the compiler 121 may send the configuration word to the storage module 130, which then forwards it to the PE array 140, or the compiler 121 may forward the configuration word directly to the PE array 140, which is not limited herein.

In some possible implementations, the storage module 130 may be divided into a plurality of areas, which are a configuration random access memory (config RAM) 131, a static configuration template library (template lib) 132, and a data random access memory (data RAM) 133, respectively. Wherein config RAM 131 is used to store configuration words and transmit configuration words to PE array 140; the template lib 132 is configured to store a mapping relationship between the static configuration and the index number, and return a corresponding static configuration to the PE array 140 based on the index number in the configuration word; the data RAM 133 is used to store traffic data and to transfer it to the PE array 140. Then, the Cfg buffer 141 may receive the configuration word transmitted by the config RAM 131 of the storage module 130, obtain a static configuration from the template lib 132 of the storage module 130 based on the index number in the configuration word, configure the PE array 140 based on the static configuration and one of the M dynamic configurations, and calculate the service data based on the configured PE array 140 to execute the corresponding operator.

The chip 100 has been described above. Next, a method for configuring a PE array that is executed on the chip 100 is described. Referring to fig. 2-1, the method embodiment mainly includes the following steps:

201. The processing module generates M Data Flow Graphs (DFGs) corresponding to M operators based on source codes of the program, wherein M is a positive integer.

In the embodiment of the application, the chip can receive the source code and service data of the program through the MEM interface, and then the MEM interface transmits the source code of the program to the processing module and transmits the service data to the storage module. When the processing module receives the source code of the program, a data flow diagram corresponding to each operator in the M operators can be generated based on the source code of the program, so that M data flow diagrams are obtained. Wherein one dataflow graph includes a functional configuration and a routing configuration for each of a plurality of nodes. In some possible implementations, the M operators may be all operators in a program, or may be part of operators in a program, which is not limited herein.

For example, M=3, that is, there are 3 operators: operator 1, operator 2 and operator 3. Operator 1 is used to calculate the multiply-add operation between 2×2 matrices: A*B+C*D; operator 2 is used to calculate the multiply-subtract operation between 2×2 matrices: A*B-C*D; operator 3 is used to calculate the multiply-add operation between 4×4 matrices:

$$K_0 * K_1 + K_2 * K_3 + K_4 * K_5 + K_6 * K_7$$

wherein A, B, C and D are 2×2 matrices whose elements are written Aij, Bij, Cij and Dij (i=0, 1; j=0, 1), and K₀, K₁, K₂, K₃, K₄, K₅, K₆ and K₇ are all 4×4 matrices whose elements are written Kpij (i=0, 1, 2, 3; j=0, 1, 2, 3), wherein p=0, 1, 2, 3, 4, 5, 6, 7.

Illustratively, taking operator 1 as an example, let the 2×2 matrix E₁ = A*B + C*D. Then E₁ has 4 elements E₁ij (i=0, 1; j=0, 1), and for any element E₁ij the following operation needs to be performed:

$$E_{1,ij} = (A_{i0} * B_{0i} + A_{i1} * B_{1i}) + (C_{i0} * D_{0i} + C_{i1} * D_{1i})$$

as such, the processing module may generate the dataflow graph 1 shown in fig. 2-2 based on the source code of operator 1. The data flow diagram 1 includes 7 nodes, namely MUL0, MUL1, MUL2, MUL3, ADD0, ADD1, ADD2. Wherein, the functional configurations of MUL0, MUL1, MUL2 and MUL3 are all multiplications, and the functional configurations of ADD0, ADD1 and ADD2 are all additions; the routes of MUL0, MUL1 are configured to point to ADD0, the routes of MUL2, MUL3 are configured to point to ADD1, and the routes of ADD0, ADD1 are configured to point to ADD2.

Wherein, MUL0 is used for executing the operation Ai0*B0i, MUL1 is used for executing the operation Ai1*B1i, MUL2 is used for executing the operation Ci0*D0i, MUL3 is used for executing the operation Ci1*D1i, ADD0 is used for executing the operation MUL0+MUL1, ADD1 is used for executing the operation MUL2+MUL3, and ADD2 is used for executing the operation ADD0+ADD1, finally obtaining the value of E₁ij.
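To make the structure of data flow graph 1 concrete, the following sketch represents the 7 nodes as a dictionary and evaluates the graph. The dictionary layout, the `evaluate` helper, and the operand names `x0`..`x3`, `y0`..`y3` (standing for the two inputs of each multiplication) are hypothetical and chosen only for illustration.

```python
import operator

OPS = {"MUL": operator.mul, "ADD": operator.add, "SUB": operator.sub}

# node -> (function, inputs); an input is another node or an external operand name
DFG1 = {
    "MUL0": ("MUL", ("x0", "y0")),
    "MUL1": ("MUL", ("x1", "y1")),
    "MUL2": ("MUL", ("x2", "y2")),
    "MUL3": ("MUL", ("x3", "y3")),
    "ADD0": ("ADD", ("MUL0", "MUL1")),
    "ADD1": ("ADD", ("MUL2", "MUL3")),
    "ADD2": ("ADD", ("ADD0", "ADD1")),
}


def evaluate(dfg, node, operands):
    """Recursively compute the output of `node` from the external operands."""
    if node in operands:
        return operands[node]
    func, inputs = dfg[node]
    return OPS[func](*(evaluate(dfg, name, operands) for name in inputs))


operands = {"x0": 1, "y0": 2, "x1": 3, "y1": 4, "x2": 5, "y2": 6, "x3": 7, "y3": 8}
result = evaluate(DFG1, "ADD2", operands)   # (1*2 + 3*4) + (5*6 + 7*8) = 100
```

Replacing the function of the last node by `"SUB"` while leaving every route unchanged yields a graph with the same shape, which is the kind of similarity that the later steps exploit.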

Taking operator 2 as an example, let the 2×2 matrix E₂ = A*B - C*D. Then E₂ has 4 elements E₂ij (i=0, 1; j=0, 1), and for any element E₂ij the following operation needs to be performed:

$$E_{2,ij} = (A_{i0} * B_{0i} + A_{i1} * B_{1i}) - (C_{i0} * D_{0i} + C_{i1} * D_{1i})$$

as such, the processing module may generate the dataflow graph 2 shown in fig. 2-3 based on the source code of operator 2. The data flow diagram 2 includes 7 nodes, namely MUL0, MUL1, MUL2, MUL3, ADD0, ADD1, SUB0. Wherein, the functional configurations of MUL0, MUL1, MUL2 and MUL3 are all multiplications, the functional configurations of ADD0 and ADD1 are all additions, and the functional configuration of SUB0 is subtractions; the routes of MUL0, MUL1 are configured to point to ADD0, the routes of MUL2, MUL3 are configured to point to ADD1, and the routes of ADD0, ADD1 are configured to point to SUB0.

Wherein, MUL0 is used for executing the operation Ai0*B0i, MUL1 is used for executing the operation Ai1*B1i, MUL2 is used for executing the operation Ci0*D0i, MUL3 is used for executing the operation Ci1*D1i, ADD0 is used for executing the operation MUL0+MUL1, ADD1 is used for executing the operation MUL2+MUL3, and SUB0 is used for executing the operation ADD0-ADD1, finally obtaining the value of E₂ij.

Taking operator 3 as an example, let the 4×4 matrix E₃ = K₀*K₁ + K₂*K₃ + K₄*K₅ + K₆*K₇. Then E₃ has 16 elements E₃ij (i=0, 1, 2, 3; j=0, 1, 2, 3), and for any element E₃ij the following operation needs to be performed:

$$
\begin{aligned}
E_{3,ij} ={} & (K_{0,i0} * K_{1,i0} + K_{0,i1} * K_{1,i1} + K_{0,i2} * K_{1,i2} + K_{0,i3} * K_{1,i3}) \\
& + (K_{2,i0} * K_{3,i0} + K_{2,i1} * K_{3,i1} + K_{2,i2} * K_{3,i2} + K_{2,i3} * K_{3,i3}) \\
& + (K_{4,i0} * K_{5,i0} + K_{4,i1} * K_{5,i1} + K_{4,i2} * K_{5,i2} + K_{4,i3} * K_{5,i3}) \\
& + (K_{6,i0} * K_{7,i0} + K_{6,i1} * K_{7,i1} + K_{6,i2} * K_{7,i2} + K_{6,i3} * K_{7,i3})
\end{aligned}
$$

In an embodiment of the application, the processing module may generate the data flow graph 3 shown in fig. 2-4 based on the source code of operator 3. Data flow graph 3 comprises 31 nodes, namely MUL0 to MUL15 and ADD0 to ADD14, wherein the functional configurations of MUL0 to MUL15 are all multiplications, and the functional configurations of ADD0 to ADD14 are all additions; the routing configurations of MUL0 to MUL15 and ADD0 to ADD14 are shown in fig. 2-4 and are not described in detail herein. ADD14 is used for executing the operation ADD13+ADD12, finally obtaining the value of E₃ij.
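The node count of data flow graph 3 can be checked with a small sketch that builds 16 multiplications and sums them pairwise. The function `build_reduction_dfg`, the operand names, and the level-by-level numbering of the ADD nodes are assumptions; the actual numbering in fig. 2-4 may differ, and the sketch assumes the number of products is a power of two.

```python
def build_reduction_dfg(num_products: int) -> dict:
    """Return {node: (function, inputs)} for products that are summed pairwise.

    Assumes `num_products` is a power of two so that every level pairs up evenly.
    """
    dfg = {f"MUL{i}": ("MUL", (f"x{i}", f"y{i}")) for i in range(num_products)}
    level = list(dfg)        # nodes whose outputs still have to be added together
    next_add = 0
    while len(level) > 1:
        next_level = []
        for left, right in zip(level[0::2], level[1::2]):
            name = f"ADD{next_add}"
            next_add += 1
            dfg[name] = ("ADD", (left, right))
            next_level.append(name)
        level = next_level
    return dfg


dfg3 = build_reduction_dfg(16)
num_mul = sum(1 for func, _ in dfg3.values() if func == "MUL")
num_add = sum(1 for func, _ in dfg3.values() if func == "ADD")
```

With 16 products the tree needs 8 + 4 + 2 + 1 additions, so the graph has 16 MUL nodes and 15 ADD nodes, and under this numbering the last node ADD14 adds the outputs of ADD12 and ADD13.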

202. The processing module extracts isomorphism characteristics according to the M data flow diagrams, wherein the isomorphism characteristics correspond to the same local structures among the M operators.

In some possible implementations, the isomorphic feature may include a routing configuration for each of N nodes, with any 2 of the N nodes being directly or indirectly connected. In some possible implementations, the isomorphic feature further includes a functional configuration of at least 1 node of the N nodes. By way of example, the isomorphic feature determined between the data flow graph 1 shown in fig. 2-2 and the data flow graph 2 shown in fig. 2-3 may be as shown in fig. 2-5. The isomorphic feature includes 7 nodes, namely a, b, c, d, e, f and g, any 2 of which are directly or indirectly connected, where the routes of a and b are configured to point to e, the routes of c and d are configured to point to f, and the routes of e and f are configured to point to g. Illustratively, the isomorphic feature further includes the functional configuration of at least 1 of the 7 nodes: as shown in fig. 2-5, the functional configurations of a, b, c and d are all multiplications, the functional configurations of e and f are all additions, and the functional configuration of g is not limited.
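A minimal sketch of extracting such a feature from two graphs is given below. It assumes the node correspondence (the labels a to g) is already known, so the graph-matching step itself is omitted; `extract_feature` and `seven_node_graph` are hypothetical helpers, and `None` is used here to mean "not limited".

```python
def extract_feature(graphs):
    """graphs: list of {node: (function, route)} that use the same node labels."""
    first, *rest = graphs
    feature = {}
    for node, (func, route) in first.items():
        if any(g[node][1] != route for g in rest):
            raise ValueError(f"routing of node {node!r} differs between graphs")
        shared = all(g[node][0] == func for g in rest)
        feature[node] = (func if shared else None, route)   # None = not limited
    return feature


def seven_node_graph(last_function):
    """Seven-node tree: a, b -> e; c, d -> f; e, f -> g."""
    graph = {n: ("MUL", "e") for n in "ab"}
    graph.update({n: ("MUL", "f") for n in "cd"})
    graph.update({"e": ("ADD", "g"), "f": ("ADD", "g"), "g": (last_function, None)})
    return graph


graph1 = seven_node_graph("ADD")   # same shape as data flow graph 1
graph2 = seven_node_graph("SUB")   # same shape as data flow graph 2
feature = extract_feature([graph1, graph2])
```

The resulting feature keeps every route, keeps the functions of a to f, and leaves the function of g unset because the two graphs disagree on it.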

In some possible implementations, the processing module may extract isomorphic features from 1 data flow graph, where the isomorphic features are at least two identical local structures in the data flow graph. By multiplexing the N PEs in the PE array that correspond to the isomorphic features, the number of required PEs is reduced and usability is enhanced. For example, the processing module may divide fig. 2-4 as shown in fig. 2-6, thereby dividing data flow graph 3 into 5 local structures with similar structure, and the isomorphic feature shown in fig. 2-7 may be extracted based on these 5 local structures. The isomorphic feature comprises 7 nodes, namely a, b, c, d, e, f and g, wherein the routes of a and b are configured to point to e, the routes of c and d are configured to point to f, and the routes of e and f are configured to point to g. The functional configurations of a, b, c and d are not limited, and the functional configurations of e, f and g are all additions.
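One division that is consistent with these node counts can be sketched as follows. The helpers `build_dfg` and `local_structure`, the level-by-level node numbering, and the choice of the five root nodes are assumptions, since the exact division of fig. 2-6 is not reproduced in the text.

```python
def build_dfg(num_products: int = 16) -> dict:
    """16 MUL nodes summed pairwise; ADD nodes are numbered level by level."""
    dfg = {f"MUL{i}": ("MUL", ()) for i in range(num_products)}
    level, k = list(dfg), 0
    while len(level) > 1:
        pairs = list(zip(level[0::2], level[1::2]))
        level = []
        for left, right in pairs:
            dfg[f"ADD{k}"] = ("ADD", (left, right))
            level.append(f"ADD{k}")
            k += 1
    return dfg


def local_structure(dfg: dict, root: str) -> dict:
    """Return the 7 nodes a..g of the two-level tree that ends at `root`."""
    e, f = dfg[root][1]
    a, b = dfg[e][1]
    c, d = dfg[f][1]
    return dict(zip("abcdefg", (a, b, c, d, e, f, root)))


dfg = build_dfg()
roots = ["ADD8", "ADD9", "ADD10", "ADD11", "ADD14"]   # assumed division
structures = [local_structure(dfg, root) for root in roots]

covered = {node for s in structures for node in s.values()}
leaf_functions = {dfg[s[label]][0] for s in structures for label in "abcd"}
inner_functions = {dfg[s[label]][0] for s in structures for label in "efg"}
```

In this division the four lower structures have multiplications at a to d while the upper one has additions there, which is why the functions of a to d are left unlimited and only e, f and g are fixed to addition.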

In the embodiment of the application, the processing module may extract the isomorphic feature shown in fig. 2-8 based on data flow graph 1, data flow graph 2 and data flow graph 3, wherein the isomorphic feature comprises 7 nodes, namely a, b, c, d, e, f and g. The routes of a and b are configured to point to e, the routes of c and d are configured to point to f, and the routes of e and f are configured to point to g. The functional configurations of a, b, c, d and g are not limited, and the functional configurations of e and f are all additions.

In some possible implementation manners, isomorphism characteristics can also be local structures with different granularities, so that the chip can determine the isomorphism characteristics with different granularities according to requirements under different conditions, the number of required PE is reduced, and the usability is enhanced. By way of example, based on data flow graph 1, data flow graph 2 and data flow graph 3, isomorphism characteristics as shown in fig. 2-9 can be extracted, wherein the isomorphism characteristics comprise 3 nodes, namely a, b and c, respectively, and the routing configuration of a and b is directed to c, and the functional configurations of a, b and c are not limited. The isomorphic features shown in fig. 2-9 have a smaller granularity than the isomorphic features shown in fig. 2-8.

In some possible implementations, the M operators may be all of the operators in a program or only some of the operators in a program, so that the chip may determine one or more different isomorphic features for one program as needed. The method is therefore also applicable to a program whose operators do not all share one suitable isomorphic feature, which improves applicability. For example, if the program includes 6 operators that correspond to 6 data flow diagrams, namely data flow diagrams 1/2/3/4/5/6, the processing module may extract isomorphic feature 1 based on data flow diagrams 1/2/3, and extract isomorphic feature 2 based on data flow diagrams 4/5/6.

It should be noted that steps 201 to 202 are optional, as long as the processing module can generate the isomorphic features of the M operators; the manner of doing so is not limited herein. For example, the chip may determine the isomorphism characteristics based on the formulas of the M operators, which is not limited herein.

203. The processing module determines the static configuration of N PEs in the PE array according to the isomorphism characteristics, wherein N is a positive integer.

In some possible implementations, the isomorphism feature includes N nodes, and N available PEs are selected from the PE array based on the connection relationships between the N nodes in the isomorphism feature, where the connection relationships of the N PEs are the same as the connection relationships of the N nodes, and the nodes in the isomorphism feature correspond one-to-one to the N PEs. Then, based on the configuration of each node in the isomorphism feature, the corresponding PE among the N PEs is configured accordingly to obtain the static configuration of the N PEs. Correspondingly, if the isomorphism feature includes the routing configuration of each of the N nodes, the static configuration also includes the routing configuration of each of the N PEs; if the isomorphism feature includes a functional configuration of at least 1 node of the N nodes, the static configuration also includes a functional configuration of at least 1 PE of the N PEs.

As shown in fig. 1-1, the PE array is a 3×3 architecture. Mapping the isomorphism feature shown in fig. 2-8 onto N PEs (PE00, PE01, PE02, PE11, PE20, PE21, PE22, i.e., N=7) of the PE array shown in fig. 1-1 results in the static configuration of 7 PEs shown in fig. 2-10. The routing configurations of the N PEs (PE00, PE01, PE02, PE11, PE20, PE21, PE22) form one transmission path. In some possible implementations, the functional configurations of PE01 and PE21 are addition, and the functional configurations of PE00, PE02, PE11, PE20 and PE22 are not limited.
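The mapping from the nodes of the isomorphism feature to PEs can be sketched as follows. This is an illustration under stated assumptions, not the method of the application: the node-to-PE placement in `PLACEMENT` is inferred from the statement that PE01 and PE21 perform addition and is not given explicitly in the text, the rule that a route may only link neighbouring PEs of the 3×3 array is assumed, the output route of node g is left as `None` because the text does not state it, and all names are hypothetical.

```python
# Isomorphism feature of fig. 2-8: routes and the fixed functional configurations.
ROUTES = {"a": "e", "b": "e", "c": "f", "d": "f", "e": "g", "f": "g", "g": None}
FUNCS = {"e": "add", "f": "add"}

# Assumed placement of nodes on (row, column) positions of the 3x3 PE array.
PLACEMENT = {"a": (0, 0), "e": (0, 1), "b": (0, 2),
             "g": (1, 1),
             "c": (2, 0), "f": (2, 1), "d": (2, 2)}


def pe_name(pos):
    """Format a (row, column) position as a PE name such as 'PE01'."""
    row, col = pos
    return f"PE{row}{col}"


def adjacent(p, q):
    """True if two positions are direct neighbours in the mesh (assumption)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1


def build_static_config(routes, funcs, placement):
    """Translate per-node routes and functions into a per-PE static configuration."""
    static = {}
    for node, dst in routes.items():
        src = placement[node]
        if dst is not None and not adjacent(src, placement[dst]):
            raise ValueError(f"{node} -> {dst} is not a neighbour link")
        entry = {"route": pe_name(placement[dst]) if dst is not None else None}
        if node in funcs:  # only fixed functions belong to the static part
            entry["func"] = funcs[node]
        static[pe_name(src)] = entry
    return static


STATIC = build_static_config(ROUTES, FUNCS, PLACEMENT)
```

With this placement, `STATIC` covers the 7 PEs listed above, fixes the addition on PE01 and PE21, and leaves the functions of the other 5 PEs open.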

204. The processing module determines M dynamic configurations based on the static configuration and an overall configuration of the M operators in the PE array, the dynamic configurations including configurations other than the static configuration in the overall configuration.

Illustratively, if the static configuration extracted based on operator 1, operator 2 and operator 3 is as shown in fig. 2-10, then the dynamic configuration is the functional configuration of PE00, PE02, PE11, PE20 and PE22. The dynamic configuration corresponding to operator 1 is: the functional configurations of PE00, PE02, PE20 and PE22 are all multiplication, and the functional configuration of PE11 is addition. The dynamic configuration corresponding to operator 2 is: the functional configurations of PE00, PE02, PE20 and PE22 are all multiplication, and the functional configuration of PE11 is subtraction. Operator 3 corresponds to 5 dynamic configurations, 4 of which are: the functional configurations of PE00, PE02, PE20 and PE22 are all multiplication, and the functional configuration of PE11 is subtraction; the remaining 1 dynamic configuration is: the functional configurations of PE00, PE02, PE11, PE20 and PE22 are all subtraction.
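The relationship "dynamic configuration = overall configuration minus static configuration" can be illustrated with operator 1. This is a minimal sketch under assumptions: the dictionary layout, the field names `route` and `func`, and the function name `dynamic_part` are hypothetical, and the routing of PE11 is omitted because the text does not state it.

```python
# Static configuration of the 7 PEs (fig. 2-10), in an assumed dictionary layout.
STATIC = {
    "PE00": {"route": "PE01"},
    "PE02": {"route": "PE01"},
    "PE20": {"route": "PE21"},
    "PE22": {"route": "PE21"},
    "PE01": {"route": "PE11", "func": "add"},
    "PE21": {"route": "PE11", "func": "add"},
    "PE11": {},  # routing of PE11 is not stated in the text
}

# Overall configuration of operator 1: the static part plus the open functions.
OVERALL_OP1 = {pe: dict(cfg) for pe, cfg in STATIC.items()}
for pe in ("PE00", "PE02", "PE20", "PE22"):
    OVERALL_OP1[pe]["func"] = "mul"
OVERALL_OP1["PE11"]["func"] = "add"


def dynamic_part(overall, static):
    """Keep every field of the overall configuration that the static one does not fix."""
    dynamic = {}
    for pe, fields in overall.items():
        fixed = static.get(pe, {})
        rest = {key: value for key, value in fields.items() if key not in fixed}
        if rest:
            dynamic[pe] = rest
    return dynamic


DYNAMIC_OP1 = dynamic_part(OVERALL_OP1, STATIC)
```

Because a PE that is absent from the static configuration keeps all of its fields, the same function also covers the case in which the dynamic configuration includes a PE outside the N PEs, such as PE10.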

In some possible implementations, the dynamic configuration further includes a configuration of at least 1 PE other than the N PEs in the PE array. As shown in fig. 2-11, for the data flow diagram 4 corresponding to operator 4, based on the isomorphism characteristics shown in FIGS. 2-10, the dynamic configuration corresponding to operator 4 may also include the configuration of PE10: the routing configuration of PE10 points to PE11, and the functional configuration of PE10 is addition.

205. The processing module stores the mapping relation between the static configuration and the index number in the storage module.

Optionally, in some possible implementations, the processing module may generate an index number for the static configuration and store the mapping relationship between the index number and the static configuration in the storage module. In some possible implementations, the storage module may store the mapping relationship in a template lib therein. For example, if there are 2 static configurations, namely static configuration 1 and static configuration 2, the processing module may generate 2 index numbers, namely index number 1 and index number 2, where index number 1 has a mapping relationship with static configuration 1 and index number 2 has a mapping relationship with static configuration 2, and store the mapping relationships between the index numbers and the static configurations in the template lib of the storage module.

For example, the template lib of the storage module is shown in Table 1:

TABLE 1

Idx	Cfg
#0	Cfg_template_0
#1	Cfg_template_1
#n	Cfg_template_n

The items under the Idx column are index numbers, and the items under the Cfg column are the static configurations of the data link.
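The template lib of Table 1 behaves like a lookup table from index number to static configuration. The sketch below is illustrative only: the class `TemplateLib`, its methods, and the choice to return the existing index number when the same static configuration is registered twice are assumptions, not details given in the application.

```python
class TemplateLib:
    """Index-number -> static-configuration table, in the spirit of Table 1."""

    def __init__(self):
        self._templates = []  # position in the list is the index number

    def register(self, static_cfg):
        """Store a static configuration and return its index number.

        Re-registering an identical configuration returns the existing
        index number (an assumed design choice).
        """
        for idx, existing in enumerate(self._templates):
            if existing == static_cfg:
                return idx
        self._templates.append(static_cfg)
        return len(self._templates) - 1

    def lookup(self, idx):
        """Return the static configuration mapped to an index number."""
        return self._templates[idx]
```

A processing module would call `register` once per static configuration and place only the returned index number in each configuration word; the PE array side would call `lookup` when it needs the static configuration itself.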

206. The processing module transmits the configuration word to the PE array.

In some possible implementations, the configuration word includes the static configuration and at least one of the M dynamic configurations.

For example, Table 2 shows configuration words corresponding to 3 operators in the embodiment of the present application.

TABLE 2

The static configurations of operator 1, operator 2 and operator 3 are all the same. Operator 1 corresponds to 1 configuration word, operator 2 corresponds to 1 configuration word, and operator 3 corresponds to 5 configuration words. Among the 5 configuration words of operator 3, the dynamic configurations of the first 4 configuration words are the same, and only that of the 5th configuration word is different.

In some possible implementations, the configuration word may further include a configuration number indicating the number of times configuration is performed based on the configuration word. Multiple configuration words that have the same static configuration and the same dynamic configuration can then be abbreviated as 1 configuration word, which further reduces transmission overhead. For example, Table 3 shows configuration words for 3 operators in an embodiment of the present application.

TABLE 3

The items under the num column represent the configuration number. It should be noted that, as shown in Table 3, each set of configuration words is different, i.e., 1 set of configuration words corresponds to one reconfigurable period.
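The abbreviation of repeated configuration words by a configuration number can be sketched as a run-length encoding. This is an assumption-laden illustration: the tuple layout `(static, dynamic)`, the placeholder labels such as `"dyn_op3_a"`, the function names, and the decision to merge only consecutive identical configuration words are not taken from the application.

```python
from itertools import groupby


def compress(words):
    """Merge runs of identical (static, dynamic) words into (static, dynamic, num)."""
    return [(static, dynamic, sum(1 for _ in run))
            for (static, dynamic), run in groupby(words)]


def expand(compressed):
    """Inverse of compress: repeat each word `num` times."""
    return [(static, dynamic)
            for static, dynamic, num in compressed
            for _ in range(num)]


# Operator 1 and operator 2 need 1 configuration word each; operator 3 needs 5,
# of which the first 4 share one dynamic configuration (placeholder labels).
WORDS = ([("static", "dyn_op1"), ("static", "dyn_op2")]
         + [("static", "dyn_op3_a")] * 4
         + [("static", "dyn_op3_b")])

COMPRESSED = compress(WORDS)
```

In this example the 7 configuration words shrink to 4 entries, with the num field of the third entry equal to 4.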

In some possible implementations, in addition to the configurations of the N PEs that are not part of the static configuration, the dynamic configuration includes the configuration of at least 1 PE other than the N PEs. The dynamic configuration can thus also cover PEs of the PE array outside the N PEs, which enhances applicability.

For example, fig. 2-11 shows the data flow diagram 4 corresponding to operator 4. The isomorphism characteristic shown in fig. 2-5 covers only a local structure of the data flow diagram 4 of operator 4; the remaining node may be mapped to PE10, whose configuration is the configuration of at least 1 PE other than the N PEs. For example, Table 4-1 or Table 4-2 shows configuration words for 3 operators in embodiments of the present application.

TABLE 4-1

TABLE 4-2

The dynamic configuration is divided into a Cfg_operation_list part and an other Cfg part, wherein the Cfg_operation_list part is the configuration of the N PEs other than the static configuration, and the other Cfg part is the configuration of at least 1 PE other than the N PEs.

In some possible implementations, the configuration word includes the index number and at least one of the M dynamic configurations. Because the static configuration is represented by the index number, transmission overhead is effectively reduced and transmission efficiency is improved. For example, Table 5-1, Table 5-2, Table 5-3 or Table 5-4 shows configuration words corresponding to 3 operators in an embodiment of the present application.

TABLE 5-1

TABLE 5-2

TABLE 5-3

TABLE 5-4

In some possible implementations, the processing module may transmit the configuration word to the storage module, which stores the configuration word in its built-in config ram and then transmits it to the PE array. In some possible implementations, the processing module may transmit all the configuration words to the storage module at one time, and the storage module transmits the configuration words to the PE array sequentially according to a certain rule, one configuration word at a time. Configuration words that the storage module has received but not yet transmitted to the PE array may be stored in the config ram. Since the configuration word includes the index number rather than the static configuration itself, the storage requirement is greatly reduced.

207. The PE array acquires static configuration with a mapping relation with the index number from the storage module.

Optionally, in some possible implementations, the cfg buffer in the PE array may obtain the static configuration from the template lib of the storage module based on the index number. Illustratively, the cfg buffer in the PE array requests the static configuration from the storage module based on the index number; the storage module determines the static configuration from the template lib based on the index number and the mapping relationship, and returns the static configuration to the PE array.

It should be noted that, when the PE array receives a new configuration word, it checks the index number therein. If the index number is the same as the index number in the previously received configuration word, the PE array does not need to acquire the static configuration from the storage module; it keeps using the static configuration of the previous configuration word and only switches the dynamic configuration, so that transmission overhead is reduced.

In some possible implementations, as shown in fig. 2-12, two separate storage spaces may be divided within the cfg buffer, namely storage space 1 and storage space 2, where storage space 1 is used for storing the static configuration and storage space 2 is used for storing the dynamic configuration. As shown in fig. 2-13, the cfg buffer in the PE array receives configuration word 1, configuration word 2 and configuration word 3 sequentially transmitted by the storage module, wherein configuration word 1 comprises an index number and dynamic configuration dynamic0, configuration word 2 comprises the index number and dynamic configuration dynamic1, and configuration word 3 comprises the index number and dynamic configuration dynamic2. When the cfg buffer in the PE array receives configuration word 1, it obtains the static configuration (static) from the template lib in the storage module based on the index number, and stores the static configuration and dynamic configuration dynamic0. When the cfg buffer receives configuration word 2, it determines that the index number in configuration word 2 is the same as the index number in configuration word 1; the static configuration (static) therefore does not need to be obtained from the storage module again, and dynamic configuration dynamic0 is switched to dynamic configuration dynamic1 of configuration word 2. When the cfg buffer receives configuration word 3, it determines that the index number in configuration word 3 is the same as the index number in configuration word 2; the static configuration (static) again does not need to be obtained from the storage module, and dynamic configuration dynamic1 is switched to dynamic configuration dynamic2 of configuration word 3. Since only the dynamic configuration needs to be switched and the static configuration does not, the switching overhead is reduced.
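The behaviour of the cfg buffer for three configuration words that carry the same index number can be sketched as follows. This is a simplified software model under assumptions, not the hardware described in the application: the class `CfgBuffer`, its attributes, the use of a plain dictionary for the template lib, and the counter `static_fetches` are all hypothetical.

```python
class CfgBuffer:
    """Toy model of a cfg buffer with separate static and dynamic storage spaces."""

    def __init__(self, template_lib):
        self.template_lib = template_lib  # index number -> static configuration
        self.last_index = None            # index number of the previous word
        self.static = None                # storage space 1
        self.dynamic = None               # storage space 2
        self.static_fetches = 0           # how often the template lib was read

    def receive(self, index, dynamic):
        """Apply one configuration word consisting of (index number, dynamic part)."""
        if index != self.last_index:
            # New index number: fetch the static configuration from the template lib.
            self.static = self.template_lib[index]
            self.last_index = index
            self.static_fetches += 1
        # The dynamic configuration is replaced on every configuration word.
        self.dynamic = dynamic


buf = CfgBuffer({0: "static"})
for dyn in ("dynamic0", "dynamic1", "dynamic2"):
    buf.receive(0, dyn)
```

After the three configuration words, the static configuration has been fetched once and storage space 2 holds dynamic2, which matches the sequence described for fig. 2-13.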

208. The PE array is configured based on the static configuration and at least one of the M dynamic configurations.

Illustratively, when the chip executes a first operator of the M operators, the PE array is configured based on the static configuration and a first dynamic configuration corresponding to the first operator, the first dynamic configuration being one of the M dynamic configurations; when the chip executes a second operator in the M operators, the PE array switches the first dynamic configuration into a second dynamic configuration corresponding to the second operator, the second operator is an operator executed after the first operator is executed, and the second dynamic configuration is one of the M dynamic configurations. Because only the dynamic configuration is switched, the static configuration is not required to be switched, and the switching overhead is reduced.

For example, when the operator 1, the operator 2 and the operator 3 correspond to the same static configuration, and the PE array sequentially executes the operator 1, the operator 2 and the operator 3 in any order, the cfg switch in the processing module only needs to switch the dynamic configuration in the cfg buffer of the PE array, and does not need to switch the static configuration, thereby saving the switching overhead.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.

In order to facilitate better implementation of the above-described aspects of embodiments of the present application, the following provides related devices for implementing the above-described aspects.

Referring to fig. 3, a chip 300 according to an embodiment of the present application includes:

a processing module 310 and a PE array 320; wherein,

the processing module 310 is configured to generate isomorphism characteristics of M operators, where the isomorphism characteristics correspond to the same local structures among the M operators, and M is a positive integer; determining static configuration of N PE in the PE array according to the isomorphism characteristics, wherein N is a positive integer; determining M dynamic configurations based on the static configuration and the overall configuration of the M operators in the PE array, wherein the dynamic configurations are other configurations except the static configuration in the overall configuration;

the PE array 320 is configured based on the static configuration and at least one of the M dynamic configurations.

In some possible implementations, the PE array 320 is specifically configured to: when a first operator of the M operators is executed, configuring based on the static configuration and a first dynamic configuration corresponding to the first operator, wherein the first dynamic configuration is one of the M dynamic configurations; when executing a second operator in the M operators, switching the first dynamic configuration into a second dynamic configuration corresponding to the second operator, wherein the second operator is an operator executed after the first operator is executed, and the second dynamic configuration is one of the M dynamic configurations.

In some possible implementations, the processing module 310 is configured to obtain M data flow graphs corresponding to the M operators, and extract the isomorphism feature according to the M data flow graphs.

In some possible implementations, the chip 300 further includes: a storage module 330; the processing module 310 is further configured to transmit, to the storage module 330, a mapping relationship between the static configuration and the index number, and transmit, to the PE array 320, a configuration word, where the configuration word includes the index number and at least one of the M dynamic configurations; the PE array 320 is further configured to obtain the static configuration from the storage module 330 based on the index number.

It should be noted that, because the content of information interaction and execution process between the modules/units of the above-mentioned device is based on the same concept as the method embodiment of the present application, the technical effects brought by the content are the same as the method embodiment of the present application, and the specific content can be referred to the description in the foregoing illustrated method embodiment of the present application, which is not repeated herein.

The embodiment of the application also provides a computer storage medium, wherein the computer storage medium stores a program, and the program executes part or all of the steps described in the embodiment of the method.

The embodiment of the application also provides a computer program product, wherein the computer program product stores a program, and the program executes part or all of the steps described in the embodiment of the method.

Referring to fig. 4, a communication device provided in an embodiment of the present application is described below. The communication device 400 includes:

a receiver 401, a transmitter 402, a processor 403 and a memory 404. In some embodiments of the application, the receiver 401, transmitter 402, processor 403, and memory 404 may be connected by a bus or otherwise, where a bus connection is illustrated in FIG. 4.

Memory 404 may include read only memory and random access memory and provides instructions and data to processor 403. A portion of memory 404 may also include non-volatile random access memory (non-volatile random access memory, NVRAM). The memory 404 stores an operating system and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for performing various operations. The operating system may include various system programs for implementing various underlying services and handling hardware-based tasks.

The processor 403 controls the operation of the communication device 400; the processor 403 may also be referred to as a central processing unit (central processing unit, CPU). In a specific application, the various components of the communication device 400 are coupled together by a bus system, which may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. For clarity of illustration, however, the various buses are all labeled as the bus system in the figure.

The method disclosed in the above embodiment of the present application may be applied to the processor 403 or implemented by the processor 403. Processor 403 may include a chip as described in fig. 3. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory 404, and the processor 403 reads the information in the memory 404 and, in combination with its hardware, performs the steps of the method described above.

The receiver 401 may be used to receive input digital or character information and generate signal inputs related to relevant settings and function control of the communication apparatus 400, the transmitter 402 may include a display device such as a display screen, and the transmitter 402 may be used to output digital or character information through an external interface.

In the embodiment of the present application, the processor 403 is configured to execute the method for configuring the processing unit PE array executed by the foregoing communication apparatus 400.

It should be further noted that the above-described apparatus embodiments are merely illustrative, and that the units described as separate units may or may not be physically separate, and that units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the device provided by the application, the connection relation between the modules represents that the modules have communication connection, and can be specifically implemented as one or more communication buses or signal lines.

The technical solution of the present application may be embodied in essence or contributing to the prior art in the form of a software product stored in a readable storage medium such as a floppy disk, a U-disk, a mobile hard disk, a ROM, a RAM, a magnetic disk or an optical disk of a computer, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present application.

In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.

Claims

1. A method for configuring a PE array of a processing unit, the method being for a chip, the chip including a processing module and a PE array, the method comprising:

the processing module generates isomorphism characteristics of M operators, wherein the isomorphism characteristics correspond to the same local structures among the M operators, and M is a positive integer;

the processing module determines the static configuration of N PE in the PE array according to the isomorphism characteristic, wherein N is a positive integer;

the processing module determining M dynamic configurations based on the static configuration and an overall configuration of the M operators in the PE array, the dynamic configurations including other configurations of the overall configuration than the static configuration;

the PE array is configured based on the static configuration and at least one of the M dynamic configurations.

2. The method of claim 1, wherein the PE array being configured based on the static configuration and at least one of the M dynamic configurations comprises:

when executing a first operator of the M operators, configuring the PE array based on the static configuration and a first dynamic configuration corresponding to the first operator, wherein the first dynamic configuration is one of the M dynamic configurations;

when executing a second operator of the M operators, the PE array switches the first dynamic configuration to a second dynamic configuration corresponding to the second operator, wherein the second operator is an operator executed after the first operator is executed, and the second dynamic configuration is one of the M dynamic configurations.

3. The method of claim 1 or 2, wherein before the processing module generates isomorphic features for M operators, further comprising:

the processing module acquires M data flow diagrams corresponding to the M operators;

the processing module generating isomorphism characteristics of the M operators includes:

and the processing module extracts the isomorphism characteristics according to the M data flow diagrams.

4. The method of any of claims 1-3, wherein the chip further comprises a storage module, and before the PE array is configured based on the static configuration and at least one of the M dynamic configurations, the method further comprises:

the storage module stores the mapping relation between the static configuration and the index number;

the processing module transmits a configuration word to the PE array, wherein the configuration word comprises the index number and at least one of the M dynamic configurations;

the PE array acquires, from the storage module, the static configuration having the mapping relation with the index number.

5. The method of claim 4, wherein the configuration word further comprises a configuration number of times, the configuration number of times being used to indicate a number of times configuration is performed based on the configuration word.

6. The method according to any of claims 1-5, wherein the isomorphism characteristic comprises a routing configuration for each of N nodes, any 2 of the N nodes being directly connected or indirectly connected; the static configuration includes routing configurations of the N PEs.

7. The method of claim 6, wherein the isomorphism characteristic further comprises a functional configuration of at least 1 node of the N nodes; the static configuration further includes a functional configuration of at least 1 PE of the N PEs.

8. The method of any of claims 1-7, wherein the dynamic configuration further comprises a configuration of at least 1 PE other than the N PEs.

9. A chip, comprising:

a processing module and a PE array;

the processing module is used for: generating isomorphism characteristics of M operators, wherein the isomorphism characteristics correspond to the same local structure among the M operators, and M is a positive integer; determining static configuration of N PE in the PE array according to the isomorphism characteristics, wherein N is a positive integer; determining M dynamic configurations based on the static configuration and an overall configuration of the M operators in the PE array, wherein the dynamic configurations comprise other configurations except the static configuration in the overall configuration;

the PE array is used for: configuring based on the static configuration and at least one of the M dynamic configurations.

10. The chip of claim 9, wherein the PE array is specifically configured to:

when a first operator of the M operators is executed, configuring based on the static configuration and a first dynamic configuration corresponding to the first operator, wherein the first dynamic configuration is one of the M dynamic configurations;

when executing a second operator in the M operators, switching the first dynamic configuration into a second dynamic configuration corresponding to the second operator, wherein the second operator is an operator executed after the first operator is executed, and the second dynamic configuration is one of the M dynamic configurations.

11. The chip according to claim 9 or 10, wherein,

the processing module is further configured to: and obtaining M data flow graphs corresponding to the M operators, and extracting the isomorphism characteristics according to the M data flow graphs.

12. The chip of any of claims 9-11, further comprising: a storage module;

the storage module is used for: storing the mapping relation between the index number and the static configuration;

the processing module is further configured to: transmit a configuration word to the PE array, wherein the configuration word comprises the index number and at least one of the M dynamic configurations;

the PE array is further configured to: acquire, from the storage module, the static configuration having the mapping relation with the index number.

13. A computer readable storage medium, characterized in that the computer readable storage medium stores a program, which causes a computer device to execute the method according to any one of claims 1-8.

14. A computer program product, the computer program product comprising computer-executable instructions stored on a computer-readable storage medium; a processor of an apparatus reads the computer-executable instructions from the computer-readable storage medium, the execution of the computer-executable instructions by the processor causing the apparatus to perform the method of any one of claims 1-8.

15. A communication device comprising a processor, a memory, and a communication interface;

the processor is coupled with the memory and the communication interface;

the memory is used for storing instructions, the processor is used for executing the instructions, and the communication interface is used for communicating with other communication devices under the control of the processor;

the instructions, when executed by the processor, cause the processor to perform the method of any of claims 1-8.