CN113407350B

CN113407350B - Instruction processing device, processor, chip, computing equipment and corresponding method

Info

Publication number: CN113407350B
Application number: CN202110815712.9A
Authority: CN
Inventors: 李浩洋
Original assignee: Kunlun Core Beijing Technology Co ltd
Current assignee: Kunlun Core Beijing Technology Co ltd
Priority date: 2021-07-19
Filing date: 2021-07-19
Publication date: 2024-07-12
Anticipated expiration: 2041-07-19
Also published as: CN113407350A

Abstract

The disclosure provides an instruction processing device, a processor, a chip, a computing device and a corresponding method, relates to the technical field of computers, and particularly relates to the chip technology. The implementation scheme of the instruction processing device is as follows: comprises an execution unit; at least one performance register group for recording the execution information of the execution unit during the execution of the instruction to be debugged; a configuration register for configuring debug parameters; and trigger circuitry configured to enable one or more of the at least one set of performance registers to record execution information in accordance with the debug parameters.

Description

Instruction processing device, processor, chip, computing equipment and corresponding method

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a chip technology, and in particular, to an instruction processing apparatus, a processor, a chip, a computing device, and a corresponding method.

Background

The quality of the programming of program instructions can affect the computational efficiency of the processor. This effect is particularly pronounced in the context of neural network calculations using hardware accelerators (e.g., NPU, GPU, FPGA, dedicated processors). The hardware accelerator generally comprises a plurality of co-processing units and an internal memory, wherein each co-processing unit is internally integrated with a large number of concurrent computing units for accelerating the processing of computing tasks. When the hardware accelerator performs calculation, a neural network algorithm needs to be mapped onto a concurrent calculation unit of the hardware accelerator through program instructions. Different mapping schemes are possible for the same neural network algorithm. Because of the internal structure of the hardware accelerator, the limitation of the computing resources and the data paths, the computing efficiency may be greatly different although the different mapping schemes can obtain the correct computing result.

Thus, in some cases, a user may wish to obtain execution information of a processor when executing certain instruction or instructions in order to evaluate the performance of the instructions accordingly.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.

Disclosure of Invention

The present disclosure provides an instruction processing apparatus, processor, chip, computing device, and corresponding method.

According to an aspect of the present disclosure, there is provided an instruction processing apparatus including: an execution unit; at least one performance register set, which is used for recording the execution information of the execution unit during the execution of the instruction to be debugged; a configuration register for configuring debug parameters; and trigger circuitry configured to enable one or more of the at least one set of performance registers to record the execution information in accordance with the debug parameters.

According to another aspect of the present disclosure, there is provided a processor comprising at least one instruction processing apparatus as described above.

According to another aspect of the present disclosure, there is provided a chip comprising at least one of the above processors.

According to another aspect of the present disclosure, there is provided a computing device including the above chip.

According to another aspect of the present disclosure, there is provided a method of evaluating instruction execution performance, comprising: acquiring execution information of an instruction to be debugged when the instruction to be debugged is executed by one or more instruction processing devices; determining performance indexes of the instructions to be debugged according to the execution information; judging whether the execution performance of the instruction to be debugged reaches a target according to the performance index.

According to another aspect of the present disclosure, there is provided an apparatus for evaluating instruction execution performance, including: an information acquisition module configured to acquire execution information of an instruction to be debugged when executed by one or more of the above-described instruction processing apparatuses; the index determining module is configured to determine the performance index of the instruction to be debugged according to the execution information; and the performance judging module is configured to judge whether the execution performance of the instruction to be debugged reaches a target according to the performance index.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of evaluating instruction execution performance described above.

According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided. The computer instructions are for causing a computer to perform the method of evaluating instruction execution performance described above.

According to another aspect of the present disclosure, a computer program product is provided, including a computer program. The computer program when executed by a processor implements the above-described method of evaluating instruction execution performance.

According to one or more embodiments of the present disclosure, by setting a configuration register, a trigger circuit, and at least one performance register group in an instruction processing apparatus, execution information of an execution unit during execution of an instruction to be debugged can be recorded.

According to the embodiment of the disclosure, the execution information of the instruction to be debugged can be recorded only by setting a small number of registers (a configuration register and at least one performance register group) and a trigger circuit with a simple structure in the instruction processing device. The recording and the transmission of the execution information can multiplex the original register read-write access in the instruction processing device, and no additional data access or special memory structure connected with the instruction processing device is required to be arranged, so that the production cost and the design cost are lower, and the occupied chip area is smaller. In addition, the embodiment of the disclosure has no invasion to the original data path of the instruction processing device and the memory connected with the instruction processing device when recording the execution information, and can accurately record the execution information of the instruction to be debugged.

Further, according to the embodiment of the disclosure, according to the execution information of the instruction to be debugged, the execution performance of the instruction to be debugged can be accurately estimated, so that the instruction to be debugged is optimized.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.

FIG. 1 shows a schematic diagram of an instruction processing apparatus according to an embodiment of the present disclosure;

FIG. 2 shows a schematic diagram of a processor according to an embodiment of the present disclosure;

FIG. 3 shows a schematic diagram of a chip according to an embodiment of the disclosure;

FIG. 4 illustrates a flowchart of a method of evaluating instruction execution performance according to an embodiment of the present disclosure;

FIG. 5 shows a schematic diagram of an execution timing diagram according to an embodiment of the present disclosure;

FIG. 6 illustrates a schematic diagram of calibrating execution time information according to an embodiment of the present disclosure;

FIG. 7 illustrates a flow chart of an instruction evaluation and optimization process according to an embodiment of the present disclosure;

FIG. 8 illustrates a block diagram of an apparatus for evaluating instruction execution performance according to an embodiment of the present disclosure; and

Fig. 9 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.

The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.

In some cases, a user may wish to obtain execution information of a processor when executing certain instruction or instructions in order to evaluate the performance of the instructions accordingly.

In the related art, the acquisition schemes of the processor execution information mainly include the following two types:

One scheme is to add special hardware in the processor to realize the recording and transmission of the execution information in the execution process of the processor. In order to record and uniformly process and transmit the execution information of each component (such as a processor core, a memory and the like) in the processor, a special storage module is required to be arranged for temporarily storing the information, and an independent data path is designed for summarizing and uniformly outputting the execution information of each component. In addition, in order to ensure that the output bandwidth is sufficient, a special data compression module may be further required to reduce the amount of data transferred, and these designs all occupy extra chip area, thereby increasing the production cost of the chip. Meanwhile, the added hardware also needs to pay development and verification cost additionally, so that the design cost of the chip is increased.

Alternatively, the relevant execution information is transferred and recorded by multiplexing the memory resources and data paths already present in the processor. The disadvantage of this solution is that the recording and transfer of the execution information is intrusive to the working state of the processor. Because the recorded execution information shares a storage bandwidth and a data path with a control flow and a data flow generated in the working process of the processor, the working of the processor is often slow due to occupied storage and network bandwidth, so that the finally obtained processor execution information cannot truly reflect the normal working state of the processor. Especially for neural network processors, since the calculation is aimed at a high-density, large-data-volume and high-throughput neural network (such as deep convolutional neural network, convolutional Deep Neural Networks, CDNN), the storage and network bandwidth are key resources of the processor in most cases, so that the misalignment degree of the obtained execution information of the processor is further aggravated, and it is difficult for a user to obtain a key instruction optimization direction and an optimization point according to the obtained execution information.

For this reason, the embodiments of the present disclosure provide an instruction processing apparatus capable of recording execution information at low cost and accurately when executing instructions. Further, based on the execution information of the instruction recorded by the instruction processing device, the disclosure also provides a method and a device for evaluating the execution performance of the instruction, which can accurately evaluate the execution performance of the instruction based on the execution information of the instruction so as to optimize the instruction. In the embodiments of the present disclosure, a process of executing an instruction and recording execution information of the instruction is described as "debugging" the instruction, and accordingly, the instruction executed and recorded with the execution information is "instruction to be debugged".

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 shows a schematic diagram of an instruction processing apparatus 100 according to an embodiment of the present disclosure. Instruction processing apparatus 100 may be, for example, a single-core processor, a processor core of a multi-core processor, or a processing element in an electronic system. It should be noted that the processors herein include, but are not limited to, central processing units (Central Processing Unit, CPU), neural network processors (Neural-network Processing Unit, NPU), graphics processors (Graphics Processing Unit, GPU), tensor processors (Tensor Processing Unit, TPU), digital signal processors (DIGITAL SIGNAL Processor, DSP), and the like.

As shown in fig. 1, the instruction processing apparatus 100 includes: an execution unit 110; at least one performance register set 122 (four performance register sets 122-1 to 122-4 are shown in fig. 1) for recording execution information during execution of instructions to be debugged by the execution unit 110; a configuration register 124 for configuring debug parameters; and trigger circuitry 130 configured to enable one or more of the at least one set of performance registers 122 to record execution information according to the debug parameters.

According to the embodiment of the disclosure, the execution information of the instruction to be debugged can be recorded only by setting a small number of registers (namely a configuration register and at least one performance register group) and a trigger circuit with a simple structure in the instruction processing device. The recording and the transmission of the execution information can multiplex the original register read-write access in the instruction processing device, and no additional data access or special memory structure connected with the instruction processing device is required to be arranged, so that the production cost and the design cost are lower, and the occupied chip area is smaller. In addition, the embodiment of the disclosure has no invasion to the original data path of the instruction processing device and the memory connected with the instruction processing device when recording the execution information, and can accurately record the execution information of the instruction to be debugged.

According to some embodiments, execution unit 110 includes circuitry operable to execute instructions, which may include, for example, decoders and different types of computing units.

The decoder may, for example, fetch instructions in the form of high-level machine instructions or macro-instructions in the memory 102 and decode the instructions to generate low-level micro-operations, micro-code entry points, micro-instructions, or other low-level instructions or control signals. The low-level instructions or control signals may implement the operation of the high-level instructions through low-level (e.g., circuit-level or hardware-level) operations. The decoder may be implemented in different ways including, but not limited to, microcode, look-up tables, hardware implementations, programmable Logic Arrays (PLAs), and the like. The present disclosure is not limited by the manner in which the decoder may be implemented, and any manner in which a decoder may be implemented is within the scope of the present disclosure.

The computing unit performs an operation according to the decoded instruction. The computational units include, but are not limited to, arithmetic Logic units (ARITHMETIC AND Logic units), multiplier array circuits, adder array circuits, vector processing circuits, format conversion (e.g., converting floating point numbers to fixed point numbers or vice versa) circuits, and the like.

According to some embodiments, as shown in fig. 1, the instruction processing apparatus 100 includes a register unit 120, the register unit 120 is coupled to an execution unit 110, and the execution unit 110 can read and write registers in the register unit 120 through a preset register read and write path. Register unit 120 may include different bit widths, different types, different numbers of register sets or registers that may be used to store control information, status information, operands of instructions, and the like, during execution of the instructions by execution unit 110.

In the embodiment of the present disclosure, the register unit 120 includes at least one performance register set 122 and the configuration register 124, so that the execution unit 110 or the trigger circuit 130 can read and write the performance register set 122 and the configuration register 124 through the original register read and write path of the register unit 120 without setting an additional data path.

It will be appreciated that other registers (e.g., general purpose registers, vector registers, etc.) may be included in register unit 120 in addition to performance register set 122 and configuration registers 124, without undue limitation.

According to some embodiments, the performance register set comprises at least one time counter and/or at least one event counter, wherein the time counter is used for recording the time when the first preset event occurs, and the event counter is used for recording the number of times when the second preset event occurs. According to some embodiments, the first preset event may include, for example, one or more of the following: instruction execution begins, instruction execution completes, read memory begins, read memory ends, write memory begins, write memory ends. The second preset event may include, for example, one or more of the following: a computing unit, a read memory block, a write memory block is used.

For example, as shown in FIG. 1, the performance register 122-1 includes a time counter 122-1A and an event counter 122-1B. The time counter 122-1A is configured to record the current time when a specific event (a first preset event) occurs. For example, the time at which an instruction to be debugged starts executing, the time at which an instruction to be debugged ends executing, the time at which an instruction to be debugged enters and exits a particular module (e.g., memory, etc.), etc. are recorded. The event counter 122-1B is configured to perform a self-increment operation when a specific event (a second preset event) occurs, so as to record the number of times the specific event occurs. For example, the number of times the memory 102 is accessed, the number of times the external memory (external memory is not shown in fig. 1) is accessed, the number of uses of various computing units (e.g., multipliers, adders, etc.) inside the execution unit 110, and the like are recorded.

Configuration registers 124 are used to configure debug parameters. Trigger circuitry 130 is configured to enable one or more of at least one set of performance registers 122 to record execution information of an instruction to be debugged, according to debug parameters in configuration registers 124. Only one trigger circuit 130 is shown in fig. 1, but it will be appreciated that in other embodiments, multiple trigger circuits may be provided, each configured to enable one or more sets of performance registers. For example, the same number of trigger circuits as the performance register sets may be provided, each for enabling a corresponding one of the performance register sets.

According to some embodiments, in particular, the trigger circuit 130 may further comprise one or more trigger sub-circuits, each for enabling one or more registers (time registers or event registers) of the respective performance registers.

According to some embodiments, the configurable debug parameters of configuration register 124 include a debug MODE, and accordingly, configuration register 124 includes a MODE register 124A for configuring the debug MODE, the name of MODE register 124A may be TRACE_MODE, for example. There may be a variety of debug modes, and different debug modes may be distinguished by the value in mode register 124A. For example, the debug mode may include a flag mode and an automatic mode. When the value in the mode register 124A is 0, it indicates that the current debug mode is a flag mode; when the value in the mode register 124A is 1, it indicates that the current debug mode is an automatic mode.

In the marking mode, the instruction processing apparatus 100 takes one or more instructions marked by a marking instruction in a target instruction sequence as instructions to be debugged, and records execution information of the execution unit 110 during execution of the instructions to be debugged. For example, the marking instructions include a debug start marking instruction (trace_begin) and a debug end marking instruction (trace_end), and an instruction in the target instruction sequence between the debug start marking instruction trace_begin and the debug end marking instruction trace_end is an instruction to be debugged. Accordingly, according to some embodiments, the trigger circuit 130 is further configured to: in the case that the debug mode is the tag mode, identifying whether the current instruction (i.e. the instruction currently executed by the execution unit 110) is a tag instruction; and in response to determining that the current instruction is a marker instruction, enabling one or more of the at least one performance register sets 122 to record execution information.

According to some embodiments, the marker instruction includes a first field for indicating a marker type and a second field for indicating a performance register set, the marker type including a debug start and a debug end. For example, when the value of the first field is 1, it indicates that the current marker instruction is the debug start marker instruction trace_begin; when the value of the first field is 0, it indicates that the current marker instruction is the debug end marker instruction trace_end. The value of the second field may be, for example, an identification of the set of performance registers. Accordingly, the trigger circuit 130 is further configured to: and in response to determining that the current instruction is a marked instruction, enabling the performance register set corresponding to the second field to record the execution information. That is, the flag instruction is used to start or end the debugging process of an instruction to be debugged, and can specify into which performance register group execution information in the debugging process is stored.

According to some embodiments, where the debug mode is an auto mode, the debug parameters also include loop parameters corresponding to the auto mode, and accordingly, the configuration registers 124 include registers (e.g. start register 124B, number register 124C, step size register 124D, self-increment register 124E, etc. shown in fig. 1 and described below) for configuring the loop parameters. The loop parameters are used to control the debugging process in the automatic mode. In the automatic mode, some or all of the instructions in the target instruction sequence may be regarded as instructions to be debugged based on the configured loop parameters, and the execution information of the execution unit 110 during execution of the instructions to be debugged may be automatically recorded without requiring a user to manually set a marking instruction in the target instruction sequence. The automatic mode is suitable for the situation that the instruction to be debugged contains a large number of instructions, for example, the instruction to be debugged is the codes of a plurality of convolution layers of the neural network, or the codes of residual units of ResNet, and the like.

Specifically, the loop parameter is used to divide the instruction to be debugged into a plurality of instruction groups, and accordingly, the trigger circuit 130 is further configured to: in the case that the debug mode is an auto mode, each of the at least one performance register set is cyclically enabled to record execution information during which each of the plurality of instruction sets is executed, respectively. According to some embodiments, execution unit 110 may automatically generate a marker instruction for each instruction group (i.e., add trace_begin instruction at the front of each instruction group and trace_end instruction at the back of each instruction group) based on the loop parameters in configuration registers 124, the marker instruction including a second field therein for indicating the performance register group. Accordingly, the trigger circuit 130 may cycle to enable the corresponding set of performance registers to record execution information based on the marking instructions of each instruction set.

For example, in the instruction processing apparatus 100 shown in fig. 1, there are 4 performance register sets 122-1 to 122-4 in total. According to the loop parameters stored in the configuration register 124, 9 instruction groups, i.e., instruction groups 1 to 9, are divided from the instructions to be debugged. In the automatic mode, the trigger circuit 130 enables the performance register set 122-1 to record the execution information of the instruction set 1, enables the performance register set 122-2 to record the execution information of the instruction set 2, enables the performance register set 122-3 to record the execution information of the instruction set 3, enables the performance register set 122-4 to record the execution information of the instruction set 4, enables the performance register set 122-1 to record the execution information of the instruction set 5 again, and so on until all the execution information of the instruction sets are recorded.

According to some embodiments, after the performance register set 122 records the execution information of a certain instruction set, the recorded execution information may be transferred to the memory 102 for storage (further, the execution information may also be transferred from the memory 102 to an external memory with a larger capacity, where the external memory is not shown in fig. 1), so as to avoid that the execution information of the instruction set currently stored in the performance register set 122 is overwritten by the execution information of the next instruction set and lost.

According to some embodiments, the cycle parameters include: a start instruction number indicating the number of a first instruction in a first instruction group in a loop; a single debug instruction number for indicating the number of instructions included in a single instruction group; the instruction interval is used for indicating the serial number difference of the first instruction in the two adjacent instruction groups; and a loop increment for indicating a value by which the start instruction number increases after one loop ends. Accordingly, the configuration registers 124 may include a start register 124B for configuring a start instruction number, a number register 124C for configuring a number of single debug instructions, a step size register 124D for configuring an instruction interval, and a self increment register 124E for configuring a loop increment.

Through the multiple circulation parameters, the division mode of the instruction group can be flexibly configured, and flexible debugging in an automatic mode is realized.

Tables 1 to 3 below show cases where the instruction for debugging is associated with each performance register group (performance register groups 122-1 to 122-4) in each cycle when the MODE register 124A (register name trace_mode, value 0 indicates the flag MODE, value 1 indicates the auto MODE), the start register 124B (register name trace_begin), the number register 124C (register name trace_num), the STEP size register 124D (register name trace_step), and the self-increment register 124E (register name trace_ OVERHEAD) are set to different values.

TABLE 1

TABLE 2

TABLE 3 Table 3

According to some embodiments, instruction processing apparatus 100 further comprises a timer (the timer is not shown in fig. 1) configured to start timing when the execution of the instruction to be debugged starts, and to end timing when the execution of the instruction to be debugged is completed, to generate a timeline of the debugging process. When the first preset event occurs, the trigger circuit 130 may enable the corresponding performance register set to write the time of the current timer into the time counter of the corresponding performance register set.

The timer may be implemented, for example, as a Write-Only (WO) counter in the register unit 120. When the instruction to be debugged starts to be executed, the timer starts to count up; and finishing timing when the execution of the instruction to be debugged is completed, namely resetting the value in the timer.

To avoid obscuring the description, a relatively simple instruction processing apparatus 100 is shown in fig. 1. It will be appreciated that in other embodiments, the instruction processing apparatus may also include other modules, such as an instruction fetch unit, a Cache (Cache), and so on. The present disclosure is not limited to the specific structure of the instruction processing apparatus 100.

The instruction processing apparatus of the present disclosure may be applied in a processor as a processor core in the processor for performing specific computing tasks. Correspondingly, the embodiment of the disclosure also provides a processor, which comprises at least one instruction processing device.

Fig. 2 shows a schematic diagram of a processor 200 according to an embodiment of the present disclosure. As shown in fig. 2, the processor 200 includes three instruction processing devices 210 to 230, a control unit 240, a memory 250, and a DMA (Direct Memory Access ) 260. The processor 200 may be, for example, an NPU and the instruction processing devices 210-230 may be co-processing units in the NPU having the features of the instruction processing device 100 described above. The control unit 240 is used for issuing calculation tasks to the instruction processing devices 210 to 230 and coordinating the calculation processes of the three. DMA260 may be coupled to other modules (e.g., other processors, memory, etc.) of the chip on which processor 200 is located via on-chip interconnect unit 202.

The processor of the disclosed embodiments may be integrated in a chip to enable the chip to provide the processing functions supported by the processor. Correspondingly, the embodiment of the disclosure also provides a chip comprising at least one processor.

Fig. 3 shows a schematic diagram of a chip 300 according to an embodiment of the disclosure. The chip 300 may be, for example, an embedded chip. As shown in fig. 3, the chip 300 includes an on-chip interconnect unit 310, and a Central Processing Unit (CPU) 320, one or more coprocessors 330, a memory unit 340, and a display unit 350 interconnected by the on-chip interconnect unit 310. Coprocessor 330 may be NPU, GPU, TPU or the like, for example. One or more of the central processor 320 and the co-processor 330 may be a processor that integrates instruction processing apparatus of embodiments of the present disclosure. The storage unit 340 may be, for example, a Static Random-Access Memory (SRAM), a high-bandwidth Memory (High Bandwidth Memory, HBM), a double data rate Memory (Graphics Double Data Rate, GDDR) for graphics, or the like. The display unit 350 is used to drive one or more external displays.

The chips described above may be included in a computing device to implement corresponding functions in the computing device, including, but not limited to, executing related control programs, performing data analysis, computing and processing, network communications, controlling peripherals of the computing device, and the like. Correspondingly, the embodiment of the disclosure also provides a computing device comprising the chip. The computing device may be, for example, but not limited to, a vehicle-mounted device, an industrial control device, a sensing device, an intelligent home device (intelligent speaker, intelligent door lock, intelligent display device), etc.

As described above, the instruction processing apparatus according to the embodiment of the present disclosure can accurately record the execution information of an instruction to be debugged. Based on recorded execution information of an instruction to be debugged, the embodiment of the disclosure provides a method for evaluating the execution performance of the instruction, which can accurately evaluate the execution performance of the instruction to be debugged based on the execution information of the instruction to be debugged so as to optimize the instruction to be debugged.

The method of evaluating instruction performance of embodiments of the present disclosure may be performed in a debug host coupled to an instruction processing apparatus. For example, the instruction processing apparatus (or a processor, a chip where the instruction processing apparatus is located) may be connected to a debug host through a debug interface, such as a PCI-e (PERIPHERAL COMPONENT INTERCONNECT-express) interface, a JTAG (Joint Test Action Group) interface, etc., execution information of an instruction to be debugged recorded by the instruction processing apparatus is transferred to the debug host, and then the method for evaluating instruction performance of the embodiments of the present disclosure is performed by the debug host based on the execution information. The debug host may be, for example, a desktop personal computer, a notebook computer, or the like. In some embodiments, the debug host may also be a server or mobile device.

FIG. 4 illustrates a flow chart of a method 400 of evaluating instruction execution performance according to an embodiment of the present disclosure. As shown in fig. 4, the method 400 includes:

Step 410, obtaining execution information of an instruction to be debugged when executed by one or more instruction processing apparatuses (e.g., instruction processing apparatus 100 shown in fig. 1) according to an embodiment of the present invention;

Step 420, determining performance index of the instruction to be debugged according to the execution information; and

Step 430, judging whether the execution performance of the instruction to be debugged reaches the target according to the performance index.

According to the embodiment of the disclosure, according to the execution information of the instruction to be debugged, the execution performance of the instruction to be debugged can be accurately estimated, so that the instruction to be debugged is optimized.

The various steps of method 400 are described in detail below.

As described above, the instruction processing apparatus may collect various execution information including the time and the number of times of occurrence of various events during execution of the instruction to be debugged. The execution information acquired in step 410 may be all or part of the execution information acquired by the instruction processing apparatus.

The kind of the execution information acquired in step 410 may be determined according to the kind of the performance index to be calculated in step 420.

According to some embodiments, the instruction to be debugged is executed by an instruction processing apparatus, and the performance index of step 420 includes a first amount of data read by the instruction processing apparatus from the external memory per unit time and a second amount of data calculated by a calculation unit of the instruction processing apparatus per unit time. Accordingly, step 430 includes: judging whether the first data volume is matched with the second data volume or not; and in response to determining that the first data amount matches the second data amount, determining that the execution performance of the instruction to be debugged achieves the target.

According to some embodiments, the first data amount read from the external memory by the instruction processing apparatus in a unit time may be calculated according to the formula mn/(t 2-t 1), where m is a maximum throughput of the bus (i.e. a maximum data amount that can be read each time), n is a number of occurrences of a read data valid signal on the bus during execution of the instruction to be debugged, t1 is a time when reading the external memory starts, and t2 is a time when reading the external memory ends. n, t1, t2 are the execution information acquired in step 410, and m is a theoretical value (not belonging to the execution information to be acquired).

According to some embodiments, the second data amount calculated by the calculating unit of the instruction processing apparatus in a unit time may be calculated according to the formula pq/(t 4-t 3), where p is a theoretical throughput (i.e. a maximum data amount that can be calculated each time) of the calculating unit (e.g. a multiplier-adder) of the instruction processing apparatus, q is a total number of times the calculating unit is occupied during execution of the instruction to be debugged, t3 is a time when the calculating unit starts to perform the calculation, and t4 is a time when the calculating unit ends the calculation. q, t3, t4 are the execution information acquired in step 410, and p is a theoretical value (not belonging to the execution information to be acquired).

It will be appreciated that the criterion for determining whether the first data amount and the second data amount match may be set according to the specific situation. In some embodiments, the data read from the external memory is only involved in one operation, in which case a match of the first amount of data to the second amount of data may mean that the first amount of data and the second amount of data are equal (or substantially equal). In other embodiments, the data read from the external memory may each participate in multiple operations, in which case the first data amount and the second data amount match may refer to the second data amount being an integer multiple of (or approximately an integer multiple of) the first data amount.

Based on the above embodiments, the first data amount may represent access performance of the instruction to be debugged, and the second data amount may represent computing performance of the instruction to be debugged. The more matched the two, the more fully utilized the storage resource and the computing resource in the execution process of the instruction to be debugged, and no obvious short board exists, the better the execution performance of the instruction to be debugged. If the two are not matched, the instruction to be debugged can be optimized until the two are matched.

In the above-described embodiments, the first data amount and the second data amount may be used to characterize the overall performance of the instruction to be debugged, and thus both may be denoted as "characterization parameters".

According to some embodiments, the instruction to be debugged is executed by an instruction processing apparatus, and the performance metrics in step 420 include at least one of: the first proportion of the times of partial writing of the external memory by the instruction processing device and the total times of writing of the external memory, the second proportion of the times of blocking the access of the internal memory by the instruction processing device and the total times of accessing the internal memory the number of uses of the calculation unit in the instruction processing apparatus is a third proportion of the maximum number of times the calculation unit is available. Accordingly, step 430 includes: judging whether the execution performance of the instruction to be debugged reaches the target according to the relative size of the performance index and the preset value.

In embodiments of the present disclosure, "partial write" to a memory refers to the amount of data written to the memory once being less than the data bit width of the memory. For example, a certain memory has a data bit width of 4 bytes, and a certain write operation writes 2 bytes of data to the memory, and the write operation is "partial write".

According to some embodiments, the number of partial writes a1 by the instruction processing apparatus to an external memory (i.e. a memory external to the processor in which the instruction processing apparatus is located, such as HBM, GDDR, etc.), and the total number of writes b1 to the external memory may be obtained in step 410, with a first ratio = a1/b1, respectively. If the first ratio is too large (greater than a predetermined value), a large external memory performance penalty is incurred. To improve performance, the instructions to be debugged may be optimized to increase aligned access to the external memory (i.e., the amount of data written once is an integer multiple of the external memory data bit width), reducing the first ratio.

According to some embodiments, in order to improve the access performance of an internal memory (i.e., a memory located inside a processor where an instruction processing apparatus is located), a storage space of the internal memory is generally divided into a plurality of blocks (banks), and a read-write interface for sharing data with data located in the same block cannot be accessed by a plurality of instruction processing apparatuses at the same time; data located in different blocks can be accessed by different instruction processing devices simultaneously by using different read-write interfaces. In the execution process of the instruction to be debugged, if the data to be accessed are located on the same block, the access can only be completed in sequence in an arbitration mode, and the request which is not arbitrated is blocked temporarily, so that the access performance is reduced.

According to some embodiments, the number of times the instruction processing apparatus accesses the internal memory a2 is blocked and the total number of times the internal memory b2 is accessed may be obtained in step 410, the second ratio = a2/b2 accordingly. If the second ratio is too large (greater than a preset value), it indicates that the data distribution in the internal memory is not reasonable, and the internal memory is easy to be blocked. To improve performance, the instructions to be debugged may be optimized such that the data may only be distributed among different blocks of internal memory to reduce the second proportion, reducing the likelihood of access blocking.

According to some embodiments, for NPUs and GPUs, the computing units (e.g., multipliers) therein are arranged in an array (i.e., an array of computing units), thereby obtaining a relatively high parallel computing capability. However, some computing tasks have a small computing scale, which results in a portion of the computing units being unused, and thus, a waste of computing power.

According to some embodiments, the number of uses a3 of the computing units in the instruction processing apparatus and the number of uses b3 of the computing unit array may be obtained in step 410, and accordingly, the third ratio=a3/(b 3×c), where c is the number of computing units included in the computing unit array. If the third ratio is too small (smaller than a preset value), the calculation unit is not fully utilized, and a large resource waste exists. In order to improve performance, the instructions to be debugged may be optimized to increase the third ratio, so that the computing unit is fully utilized.

In the above embodiment, the first proportion, the second proportion and the third proportion can help to locate specific problems existing in the instruction to be debugged, and represent specific factors influencing the execution efficiency, so that a specific optimization direction can be provided for a user. Therefore, the three can be noted as "traceability parameters".

According to some embodiments, the instructions to be debugged are cooperatively executed by a plurality of instruction processing apparatuses, each of the plurality of instruction processing apparatuses executing a portion of the instructions to be debugged. In this case, the time period information for each of the plurality of instruction processing apparatuses to execute the corresponding partial instruction may be acquired through step 410. And accordingly, in step 420, an execution timing diagram of the instruction to be debugged is generated according to the respective time period information of the plurality of instruction processing apparatuses. Further, bottleneck resources and optimization directions of instructions to be debugged can be determined according to the execution timing diagram.

For example, a neural network algorithm may be cooperatively executed by a plurality of co-processing units (i.e., instruction processing devices) in the NPU, each co-processing unit executing a portion of the neural network algorithm. For example, some co-processing units perform matrix operations, some co-processing units perform vector operations, some co-processing units perform data format conversion operations, etc. Through step 410, time period information of each co-processing unit executing the corresponding part of the instruction can be obtained, for example, the time period information of the co-processing unit COP0 executing the instruction includes t0 to t1, t3 to t5; the time period information of the instruction execution of the co-processing unit COP1 comprises t2 to t4 and t6 to t7; etc. Then, through step 420, the time slot information of the instruction executed by each co-processing unit may be sequentially arranged according to the unified time line, so as to generate an execution timing diagram of the instruction to be debugged.

Fig. 5 illustrates one example of an execution timing diagram according to an embodiment of the present disclosure. The execution timing diagram in fig. 5 is generated according to the time period information of executing instructions by the six co-processing units COP0 to COP5 of the NPU, each row corresponds to one co-processing unit, and one shadow block in each row corresponds to one time period, which indicates that the corresponding co-processing unit is occupied in the time period and is in a busy state. As shown in fig. 5, the co-processing unit COP3 is always busy, and therefore COP3 is a bottleneck resource. In order to improve the execution performance, the part to be debugged, which is executed by the COP3, can be optimized to improve the efficiency of executing the instruction by the COP3, and reduce the execution time of the COP3, thereby reducing the time of waiting for the COP3 of other co-processing units and improving the execution efficiency of the instruction to be debugged on the whole NPU.

According to some embodiments, only a limited number of performance register sets can be integrated in an instruction processing apparatus, typically due to chip area overhead considerations. In the case where the instruction to be debugged includes a large number of instructions, in order to obtain a timing chart of the entire execution process of the instruction to be debugged, the instruction to be debugged needs to be executed multiple times (for example, may be executed multiple times by the same instruction processing apparatus, or may be executed once by a plurality of different instruction processing apparatuses, respectively, etc.), execution information of a part of the instructions is recorded each time, and the execution information of the parts of the instructions obtained by the multiple times of execution are combined. It should be noted that, because the hardware states of different instruction processing apparatuses are different, and the storage environments of the same instruction processing apparatus at different times are also different, when the instruction to be debugged is executed for multiple times, the time spent for the same instruction is different, so that a certain instruction or a certain group of instructions cannot be executed independently to record the execution information of the instruction to be debugged, but the instruction to be debugged needs to be executed completely, and then the execution information of the required certain instruction or the required group of instructions is acquired from the instruction to be debugged, so that the environment of each time of executing the instruction to be debugged is approximately the same, and the combined timing diagram is consistent with the actual running state of the instruction to be debugged.

Further, since the hardware states of different instruction processing apparatuses are different and the storage environments of the same instruction processing apparatus at different times are also different, there may be fluctuation in the time of executing the instruction to be debugged each time, and the execution information of each execution cannot be directly combined according to the time line. Therefore, according to some embodiments, the execution time information of each execution may be calibrated, and then the calibrated execution time information is summarized and combined to obtain a timing diagram of the instruction to be debugged, so as to determine the total execution time of the instruction to be debugged. The scheme can acquire corresponding execution time information in real time and combine the execution time information during the execution of the instruction to be debugged (the mode can generate a time sequence diagram in real time), and can uniformly acquire the execution time information and combine the execution time information after the whole execution of the instruction to be debugged is finished (the mode can further adopt means such as multiple running average taking and the like to improve the accuracy of time calibration and reduce accidental influence).

In addition, when the instruction number of the instructions to be debugged is huge, if the user has a plurality of execution environments (such as a chip, a processor and other hardware environments containing the instruction processing device), the debugging work of the instructions to be debugged can be distributed to the plurality of execution environments for parallel execution, each execution environment records the execution time information of a part of instructions in the instructions to be debugged, and finally, the time sequence diagram of the whole instructions to be debugged is obtained through combination, so that the aim of shortening the evaluation time is fulfilled.

Specifically, according to some embodiments, an instruction to be debugged is divided into a plurality of instruction fragments, each instruction fragment includes a plurality of instruction groups (in some cases, one instruction group may include only one instruction), and two adjacent instruction fragments exist with a part of the same instruction groups, the instruction to be debugged is executed by one or more instruction processing apparatuses a plurality of times, and execution time information of one instruction fragment including execution time of each instruction group included in the corresponding instruction fragment is recorded by a performance register group of the corresponding instruction processing apparatus each time (in the above-described automatic mode).

In step 410, the execution time information of the corresponding instruction segment recorded each time the instruction to be debugged is executed is acquired, and accordingly, in step 420, the time offset is determined according to the execution time of the same instruction group of the two adjacent instruction segments; calibrating the execution time information of the instruction fragments except the first instruction fragment in the plurality of instruction fragments according to the time offset; and determining the total execution time of the instruction to be debugged according to the calibrated execution time information. The time offset may be, for example, an average value of differences in execution start times and execution end times of the same instruction group.

Further, in step 430, it may be determined whether the execution performance of the instruction to be debugged reaches the target according to the relative size of the total execution time and the preset value. If the total execution time is smaller than a preset value, the execution performance of the instruction to be debugged can be considered to reach the target; otherwise, the instruction to be debugged may be optimized to reduce its execution time until the total execution time is less than a preset value.

Fig. 6 shows a schematic diagram of calibrating execution time information according to an embodiment of the present disclosure. In the embodiment shown in FIG. 6, the instructions to be debugged instr 0-instr 6 are divided into two instruction fragments 630, 640, each instruction fragment comprising four instruction groups, each instruction group comprising one instruction. As shown in FIG. 6, instruction segment 630 includes instructions instr0 through instr3, and instruction segment 640 includes instructions instr3 through instr6, both of which have the same instruction(s), i.e., instruction instr3.

The instruction to be debugged is executed twice, and the execution time information of the corresponding instruction fragments is recorded by the corresponding instruction processing device in an automatic mode in the two execution processes, and the values of the adopted debugging parameters can be shown in the table 1. On the first execution, the execution time information of instruction segment 630, i.e., the execution times of instructions instr0 through instr3, are recorded (as shown in Table 1 above, the execution times of instructions instr0 through instr3 may be recorded by performance register banks 122-1 through 122-4, respectively), and the execution times of instructions instr0 through instr3 are shown as rectangular blocks 610 through 613 in FIG. 6. Similarly, at the second execution, the execution time information of instruction segment 640, i.e., the execution times of instructions instr3 through instr6, are recorded, and the execution times of instructions instr3 through instr6 are shown as rectangular blocks 623 through 626 in FIG. 6.

When combining the execution time information of instruction segment 630 and instruction segment 640, the time offset is first determined based on the execution times of the same instructions of both, i.e., the execution times 613, 623 of instruction instr 3. The time offset Δt may be an average value of a difference Δt1 between the start times and a difference Δt2 between the end times of the execution times 613, 623, that is, Δt= (Δt1++Δt2)/2. It will be appreciated that the calculated Δt1, Δt2, Δt may be positive or negative.

The execution time information of the instruction fragment 640 is then calibrated according to the time offset. Specifically, for instruction instr3 that is repeated with instruction segment 630, the execution time 613 in the preceding instruction segment 630 is directly employed, and the execution time 623 in the following instruction segment 640 is no longer employed. For instructions instr4 through instr6, the execution times are calibrated according to the time offset Δt, and the execution times 624 through 626 of instructions instr4 through instr6 are added with the time offset Δt, respectively, to obtain calibrated execution times 624 'through 626', as shown in FIG. 6.

Then, the total execution time of the instructions instr 0-instr 6 to be debugged is determined according to the execution time of each instruction after calibration, namely the execution time 610-613, 624 '-626'. The total execution time is the difference between the end time t of the last instruction instr6 and the start time t0 of the first instruction instr 0.

Based on the instruction processing device and the method for evaluating the execution performance of the instructions, the performance of the instructions to be debugged can be evaluated and optimized. FIG. 7 illustrates an exemplary flow chart of an instruction evaluation and optimization process according to an embodiment of the present disclosure. In this embodiment, the instruction to be debugged is a neural network algorithm instruction, and the instruction processing apparatus is each co-processing unit in the NPU.

As shown in fig. 7, in step 702, configuration registers of each co-processing unit are configured, and an instruction to be debugged is executed multiple times using an auto mode, generating an execution timing chart.

Then, in step 704, according to the execution timing diagram, it is checked whether the synchronization situation between the co-processing units is expected, i.e. whether the time at which the co-processing units start up the calculation is expected.

If not, step 706 is performed to modify the program code to solve the synchronization problem and step 702 is performed again.

In step 706, if it is expected, step 708 is executed to find bottleneck processing units and bottleneck instructions by the occupation time of each co-processing unit, and step 712 is executed.

In step 712, the characterization parameters of the bottleneck instruction are checked to determine if the bottleneck instruction has an optimization space.

In step 714, if there is an optimization space, step 718 is performed to check each tracing parameter of the instruction, find an optimization point and an optimization direction according to the tracing parameters, and modify the related instruction. Step 720 is then performed to tag the modified instruction or group of instructions with a tagging instruction, and by configuring a configuration register, use the tagging mode to record execution information of the modified instruction or group of instructions, and perform performance assessment based on the execution information. Subsequently, step 722 is executed to check whether the characterization parameters are improved, if not, step 718 is executed again, and if so, step 724 is executed.

In step 714, if there is no optimization space, then step 716 is performed to modify the mapping scheme. Step 724 is then performed to re-execute the instruction to be debugged, checking whether the total execution time reaches the optimization goal. In step 726, if the target is reached, step 728 is executed to complete the optimization; if the target is not reached, step 702 is re-executed.

The instruction processing apparatus and the scheme for evaluating instruction execution performance of the present disclosure have the following advantageous effects:

1. The hardware cost is low, the original register read-write channel of the instruction processing device is multiplexed to transfer data, and no additional storage resource or bus channel is required to be added;

2. the method has the advantages that the data path and the memory of the processor (the instruction processing device can be one processor core in the multi-core processor) where the instruction processing device is located are not invaded, and the recorded execution information is not misaligned because the original instruction execution process is influenced in the debugging process;

3. Both a tag mode and an automatic mode debug mode are provided. The marking mode has strong pertinence, the debugging process is rapid, and the method is suitable for rapidly evaluating the optimizing effect; the automatic mode can obtain the overall execution condition of instructions to be debugged (such as a plurality of convolution layers of a neural network, a residual unit of ResNet and the like) with larger instruction scale, so that the performance bottleneck is conveniently positioned, but the instructions to be debugged need to be executed for a plurality of times, the time consumption is slightly long, and a user can flexibly select two modes according to the needs;

4. a simple and feasible scheme for merging execution time information obtained by executing instructions to be debugged for multiple times in an automatic mode is provided. In the automatic mode, the execution time information obtained in multiple execution can be combined by a simple data processing mode of adding the reference and the offset, so that a time sequence diagram is obtained, the post-processing precision is acceptable, and the calculated amount is controllable;

5. aiming at the automatic mode, the debugging efficiency can be improved in parallel through multiple hardware in a resource time-changing mode;

6. Is non-invasive to the software stack such as drivers, frameworks, runtime, etc. The processor (such as a software stack of a neural network processor) does not need to be changed, and the compatibility cost is low.

It will be appreciated that the instruction processing apparatus of the embodiments of the present disclosure is preferably a single processor core in a single core processor or multi-core processor for processing coarse-grained instructions, and accordingly, the instructions to be debugged are preferably coarse-grained instructions. Coarse-grained instructions refer to instructions capable of performing a series of operations. For example, for an NPU, the coarse-grained instruction may be a convolution instruction, the operations performed by the convolution instruction including a batch load operation, a multiply operation, an add operation, etc. of the data. Because the coarse-grained instructions can execute a series of operations, the number of coarse-grained instructions contained in a program written for realizing a specific task (namely the target instruction sequence) is not too large, the number of instructions to be debugged is correspondingly not too large, the time consumed by the instruction processing device for debugging the instructions to be debugged is less, and the debugging efficiency is higher. In addition, for a coarse-grained instruction, the size of the processed data is different, and the corresponding computing efficiency and memory efficiency are often different. For a multi-core processor comprising a plurality of instruction processing devices, by acquiring and analyzing the execution information recorded when each instruction processing device executes a coarse-grained instruction to be debugged, on one hand, whether the dependency relationship among the plurality of instruction processing devices executing asynchronously is reasonable can be checked, and on the other hand, the division mode of the instruction to be debugged on each instruction processing device (that is, which instruction processing device is used for executing each part of the instruction to be debugged) can be optimized, so that the execution efficiency of the instruction to be debugged on the whole processor is optimized.

In addition, it may be appreciated that the instruction processing apparatus of the embodiments of the present disclosure may also be a single processor core in a single core processor or a multi-core processor for processing fine-grained instructions, and accordingly, the instruction to be debugged is a fine-grained instruction. Fine-grained instructions refer to instructions capable of performing one operation or a small number of operations of the same type. Fine-grained instructions are relative to coarse-grained instructions. The number and types of operations that a fine-grained instruction can perform is smaller than a coarse-grained instruction. However, since the number and types of operations that can be performed by the fine-grained instruction are small, the number of fine-grained instructions included in a program written for implementing a specific task (i.e., the above target instruction sequence) is generally large, and accordingly, the number of instructions to be debugged is large, so that the time taken for the instruction processing apparatus to debug the instructions to be debugged is long, and the debugging efficiency is low. On the other hand, since fine-grained instructions generally do not involve the problem of large performance variations due to changes in the size of data processed, execution information recorded when an instruction processing apparatus executes fine-grained instructions to be debugged is generally limited in optimizing effect on instructions to be debugged.

According to an embodiment of the present disclosure, there is also provided an apparatus for evaluating instruction execution performance.

Fig. 8 shows a block diagram of an apparatus 800 for evaluating instruction execution performance according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 includes:

An information obtaining module 810 configured to obtain execution information when an instruction to be debugged is executed by one or more instruction processing apparatuses (for example, the instruction processing apparatus 100 shown in fig. 1) according to an embodiment of the present invention;

an index determination module 820 configured to determine a performance index of the instruction to be debugged according to the execution information; and

The performance judging module 830 is configured to judge whether the execution performance of the instruction to be debugged reaches the target according to the performance index.

According to some embodiments, the instruction to be debugged is executed by an instruction processing apparatus, and the performance index includes a first data amount read by the instruction processing apparatus from an external memory in a unit time and a second data amount calculated by a calculation unit of the instruction processing apparatus in the unit time; the performance judging module comprises: a matching unit configured to determine whether the first data amount and the second data amount match; and a judging unit configured to judge that the execution performance of the instruction to be debugged reaches a target in response to determining that the first data amount matches the second data amount.

According to some embodiments, the instruction to be debugged is executed by an instruction processing apparatus, and the performance metrics include at least one of: the method comprises the steps that the number of partial writing of an external memory by an instruction processing device is a first proportion of the total number of writing of the external memory, the number of blocked access of the internal memory by the instruction processing device is a second proportion of the total number of access of the internal memory, and the number of using of a computing unit in the instruction processing device is a third proportion of the maximum number of available computing units; the performance determination module is further configured to: judging whether the execution performance of the instruction to be debugged reaches a target according to the relative size of the performance index and a preset value.

According to some embodiments, the instruction to be debugged is cooperatively executed by a plurality of instruction processing apparatuses, each of the plurality of instruction processing apparatuses executing a part of the instructions in the instruction to be debugged, the execution information including time period information for each of the plurality of instruction processing apparatuses to execute the corresponding part of the instructions; the index determination module is further configured to: and generating an execution timing diagram of the instruction to be debugged according to the time period information of each of the plurality of instruction processing devices.

According to some embodiments, the instruction to be debugged is divided into a plurality of instruction fragments, each instruction fragment comprises a plurality of instruction groups, and a part of the same instruction groups exist in two adjacent instruction fragments, the instruction to be debugged is executed by the one or more instruction processing devices multiple times, and each execution time information of one instruction fragment is recorded by a performance register group of the corresponding instruction processing device, wherein the execution time information comprises the execution time of each instruction group included in the corresponding instruction fragment; the information acquisition module is further configured to: respectively acquiring the execution time information of the corresponding instruction fragments recorded when the instruction to be debugged is executed each time; the index determination module comprises: an offset determining unit configured to determine a time offset according to execution times of the same instruction group of the adjacent two instruction fragments; a time calibration unit configured to calibrate execution time information of instruction fragments other than a first instruction fragment among the plurality of instruction fragments according to the time offset; and a total time determining unit configured to determine a total execution time of the instruction to be debugged according to the calibrated execution time information.

It should be appreciated that the various modules of the apparatus 800 shown in fig. 8 may correspond to the various steps in the method 400 described with reference to fig. 4. Thus, the operations, features, and advantages described above with respect to method 400 apply equally to apparatus 800 and the modules that it comprises. For brevity, certain operations, features and advantages are not described in detail herein.

Although specific functions are discussed above with reference to specific modules, it should be noted that the functions of the various modules discussed herein may be divided into multiple modules and/or at least some of the functions of the multiple modules may be combined into a single module. For example, the index determination module 820 and the performance determination module 830 described above may be combined into a single module in some embodiments.

It should also be appreciated that various techniques may be described herein in the general context of software hardware elements or program modules. The various modules described above with respect to fig. 8 may be implemented in hardware or in hardware in combination with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the information acquisition module 810, the index determination module 820, the performance determination module 830 may be implemented together in a system on a chip (SoC). The SoC may include an integrated circuit chip including one or more components of a Processor (e.g., a central processing unit (Central Processing Unit, CPU), microcontroller, microprocessor, digital signal Processor (DIGITAL SIGNAL Processor, DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.

According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.

Referring to fig. 9, a block diagram of an electronic device 900 that may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

Various components in device 900 are connected to I/O interface 905, including: an input unit 906, an output unit 907, a storage unit 908, and a communication unit 909. The input unit 906 may be any type of device capable of inputting information to the device 900, the input unit 906 may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 907 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. Storage unit 908 may include, but is not limited to, magnetic disks, optical disks. Communication unit 909 allows device 900 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as bluetooth ^TM devices, 1302.11 devices, wi-Fi devices, wi-Max devices, cellular communication devices, and/or the like.

The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the various methods and processes described above, such as the method 400 described above. For example, in some embodiments, the method 400 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into RAM 903 and executed by computing unit 901, one or more steps of method 400 described above may be performed. Alternatively, in other embodiments, computing unit 901 may be configured to perform method 400 by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.

It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely illustrative embodiments or examples and that the scope of the present disclosure is not limited by these embodiments or examples but only by the claims following the grant and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. It is important that as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the disclosure.

Claims

1. An instruction processing apparatus comprising:

An execution unit;

At least one performance register set, which is used for recording the execution information of the execution unit during the execution of the instruction to be debugged;

a configuration register for configuring debug parameters; and

Trigger circuitry configured to enable one or more of said at least one set of performance registers to record said execution information in accordance with said debug parameters,

The instruction to be debugged is divided into a plurality of instruction fragments, each instruction fragment comprises a plurality of instruction groups, a part of the same instruction groups exist in two adjacent instruction fragments, the instruction to be debugged is executed by the instruction processing device for a plurality of times, and the one or more enabled performance register groups of the instruction processing device record execution time information of one instruction fragment each time, wherein the execution time information comprises the execution time of each instruction group included in the corresponding instruction fragment.

2. The instruction processing apparatus of claim 1, wherein the set of performance registers comprises:

at least one time counter for recording the time of occurrence of the first preset event; and/or

And the at least one event counter is used for recording the occurrence times of the second preset event.

3. The instruction processing apparatus of claim 2, wherein the first preset event comprises one or more of: instruction execution, instruction execution completion, read memory start, read memory end, write memory start, write memory end;

the second preset event includes one or more of the following: a computing unit, a read memory block, a write memory block is used.

4. An instruction processing apparatus according to any one of claims 1-3, wherein the instruction to be debugged is one or more instructions in a target instruction sequence marked by a marking instruction, the debug parameters comprise a debug mode comprising a marking mode,

The trigger circuit is further configured to:

Identifying whether the current instruction is the marking instruction under the condition that the debugging mode is the marking mode; and

One or more of the at least one performance register sets are enabled to record the execution information in response to determining that the current instruction is a marker instruction.

5. The instruction processing apparatus of claim 4, wherein the marker instruction includes a first field for indicating a marker type and a second field for indicating a performance register set, the marker type including a debug start and a debug end,

The trigger circuit is further configured to: and in response to determining that the current instruction is a marked instruction, enabling the performance register set corresponding to the second field to record the execution information.

6. An instruction processing apparatus according to any one of claims 1-3, wherein the debug parameters include a debug mode including an auto mode, the debug parameters further including a loop parameter corresponding to the auto mode, the loop parameter being used to divide the instructions to be debugged into the plurality of instruction groups,

The trigger circuit is further configured to:

in the case that the debug mode is an auto mode, each of the at least one performance register set is enabled in a loop to record execution information during which each of the plurality of instruction sets is executed, respectively.

7. The instruction processing apparatus of claim 6, wherein the loop parameters comprise:

A start instruction number indicating the number of a first instruction in a first instruction group in a loop;

A single debug instruction number for indicating the number of instructions included in the instruction set;

the instruction interval is used for indicating the serial number difference of the first instruction in the two adjacent instruction groups; and

And the loop increment is used for indicating the value of the increase of the initial instruction number after one loop is ended.

8. The instruction processing apparatus according to any one of claims 1 to 3, further comprising:

a timer configured to start timing when the instruction to be debugged starts executing, and to end timing when the instruction to be debugged completes executing.

9. A processor comprising at least one instruction processing apparatus as claimed in any one of claims 1 to 8.

10. A chip comprising at least one processor as claimed in claim 9.

11. A computing device comprising the chip of claim 10.

12. A method of evaluating instruction execution performance, comprising:

Acquiring execution information of an instruction to be debugged when executed by one or more instruction processing apparatuses, the instruction processing apparatus being an instruction processing apparatus according to any one of claims 1 to 8;

determining performance indexes of the instructions to be debugged according to the execution information; and

Judging whether the execution performance of the instruction to be debugged reaches a target according to the performance index.

13. The method of claim 12, wherein the instruction to be debugged is executed by an instruction processing apparatus, the performance index includes a first amount of data read from an external memory by the instruction processing apparatus per unit time and a second amount of data calculated by a calculation unit of the instruction processing apparatus per unit time,

The step of judging whether the execution performance of the instruction to be debugged reaches a target according to the performance index comprises the following steps:

judging whether the first data volume is matched with the second data volume; and

And in response to determining that the first data amount matches the second data amount, determining that the execution performance of the instruction to be debugged reaches a target.

14. The method of claim 12, wherein the instruction to be debugged is executed by an instruction processing apparatus, the performance metrics comprising at least one of: the first proportion of the times of partial writing of the external memory by the instruction processing device is the total times of writing of the external memory, the second proportion of the times of blocking the access of the internal memory by the instruction processing device is the total times of accessing the internal memory, the third proportion of the times of using the computing unit in the instruction processing device is the maximum times of using the computing unit,

The step of judging whether the execution performance of the instruction to be debugged reaches a target according to the performance index comprises the following steps: judging whether the execution performance of the instruction to be debugged reaches a target according to the relative size of the performance index and a preset value.

15. The method of any of claims 12-14, wherein the instructions to be debugged are cooperatively executed by a plurality of instruction processing devices, each of the plurality of instruction processing devices executing a portion of the instructions to be debugged, the execution information including time period information for each of the plurality of instruction processing devices to execute the corresponding portion of instructions,

Determining the performance index of the instruction to be debugged according to the execution information comprises the following steps:

and generating an execution timing diagram of the instruction to be debugged according to the time period information of each of the plurality of instruction processing devices.

16. The method of any of claims 12-14, wherein the obtaining execution information of the instruction to be debugged, when executed by one or more instruction processing apparatus, comprises: respectively acquiring the execution time information of the corresponding instruction fragments recorded when the instruction to be debugged is executed each time;

the determining the performance index of the instruction to be debugged according to the execution information comprises the following steps:

determining a time offset according to the execution time of the same instruction group of two adjacent instruction fragments;

Calibrating execution time information of the instruction fragments except for the first instruction fragment in the plurality of instruction fragments according to the time offset; and

And determining the total execution time of the instruction to be debugged according to the calibrated execution time information.

17. An apparatus for evaluating instruction execution performance, comprising:

an information acquisition module configured to acquire execution information of instructions to be debugged when executed by one or more instruction processing apparatuses, the instruction processing apparatuses being the instruction processing apparatus according to any one of claims 1 to 8;

The index determining module is configured to determine the performance index of the instruction to be debugged according to the execution information; and

And the performance judging module is configured to judge whether the execution performance of the instruction to be debugged reaches a target according to the performance index.

18. An electronic device, comprising:

At least one processor; and

A memory communicatively coupled to the at least one processor; wherein the method comprises the steps of

The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 12-16.

19. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 12-16.

20. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 12-16.