
CN111062858B - Efficient rendering-ahead method, device and computer storage medium - Google Patents

Efficient rendering-ahead method, device and computer storage medium

Info

Publication number
CN111062858B
CN111062858B (application CN201911380883.2A)
Authority
CN
China
Prior art keywords
rendering
primitive
tile
completed
rendering core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911380883.2A
Other languages
Chinese (zh)
Other versions
CN111062858A (en)
Inventor
张竞丹
李洋
樊良辉
陈成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Xintong Semiconductor Technology Co ltd
Original Assignee
Xi'an Xintong Semiconductor Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Xintong Semiconductor Technology Co ltd filed Critical Xi'an Xintong Semiconductor Technology Co ltd
Priority to CN201911380883.2A priority Critical patent/CN111062858B/en
Publication of CN111062858A publication Critical patent/CN111062858A/en
Application granted granted Critical
Publication of CN111062858B publication Critical patent/CN111062858B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 — General purpose image data processing
    • G06T 1/20 — Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Generation (AREA)

Abstract

The embodiments of the invention disclose an efficient render-ahead method, a device, and a computer storage medium. The method may include: once the clipping and dividing module finishes dividing a primitive according to the tile size, immediately passing the divided primitive to the rasterization module for rasterization; for each primitive whose rasterization has completed, scheduling, by a scheduler, a general-purpose rendering core from the rendering core array to render it, according to the working states of the general-purpose rendering cores in the rendering core array and the tiles covered by the primitive; and performing, by the fragment shader module implemented on the general-purpose rendering core, fragment shading on the primitive for the tiles it covers, based on the scheduling of the scheduler.

Description

Efficient rendering-ahead method, device and computer storage medium
Technical Field
The embodiments of the invention relate to the technical field of graphics processing units (GPU, Graphics Processing Unit), and in particular to an efficient render-ahead method, device, and computer storage medium.
Background
Generally, GPUs are specialized graphics rendering devices that process and display computerized graphics. GPUs are constructed in a highly parallel architecture that provides more efficient processing than a typical general purpose Central Processing Unit (CPU) for a range of complex algorithms. For example, the complex algorithm may correspond to a representation of a two-dimensional (2D) or three-dimensional (3D) computerized graphic.
In rendering graphics, a GPU typically adopts one of two rendering schemes: the immediate mode rendering (IMR, Immediate Mode Rendering) scheme and the tile-based rendering (TBR, Tile Based Rendering) scheme. In the IMR scheme, once a command for drawing a primitive is generated during rendering of a frame, the GPU immediately performs a series of graphics rendering pipeline operations on the primitive (which may include, in turn, vertex shading, primitive assembly, clipping, rasterization, fragment shading, depth testing, blending, etc.) and writes the rendering result directly back into the frame buffer before processing the next primitive. To reduce accesses to the frame buffer, an on-chip cache with high memory bandwidth is added inside the GPU. However, because of the constraint of the GPU's physical area, the size of the on-chip cache is generally limited; when the on-chip cache does not have enough capacity to hold the entire screen, the screen is typically split into tiles such that each tile fits in the on-chip cache. For example, if the on-chip cache can store 512 kB of data, the picture may be divided into tiles such that the pixel data contained in each tile is less than or equal to 512 kB. In this way, a scene may be rendered by dividing the picture into tiles that fit in the on-chip cache, rendering each tile of the scene into the on-chip cache individually, storing the rendered tile from the on-chip cache into the frame buffer, and repeating the rendering and storing for each tile of the picture. A picture is thus rendered tile by tile. This technique is known as the TBR scheme. It can be appreciated that the TBR scheme is a form of deferred rendering and, owing to its low power consumption, is widely used in mobile devices where power and system bandwidth are at a premium.
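The tile split described above can be sketched numerically. The 512 kB cache figure matches the example in the text; the 4-bytes-per-pixel format and the square, power-of-two tile shape are assumptions for illustration.

```python
# Sketch: split a screen into tiles so each tile's pixel data fits the
# on-chip cache. 512 kB matches the example in the text; 4 bytes/pixel
# and square power-of-two tiles are assumptions.
import math

def tile_grid(width, height, cache_bytes=512 * 1024, bytes_per_pixel=4):
    """Return (tile_size, tiles_x, tiles_y) for a square tile layout."""
    max_pixels = cache_bytes // bytes_per_pixel              # pixels one tile may hold
    tile_size = 1 << int(math.log2(math.isqrt(max_pixels)))  # round down to a power of two
    tiles_x = math.ceil(width / tile_size)
    tiles_y = math.ceil(height / tile_size)
    return tile_size, tiles_x, tiles_y

print(tile_grid(1920, 1080))  # each 256x256 tile uses 256 kB, within the 512 kB budget
```

For a 1920x1080 screen this yields 256-pixel tiles in an 8x5 grid; rounding the tile size down to a power of two keeps the per-tile footprint comfortably under the cache budget.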
In the TBR scheme, the rasterization module and the general-purpose rendering cores must wait until every tile has finished building its primitive list (building the primitive list means that each tile records which primitives in a frame of picture cover its own pixel area) before the rasterization module starts processing primitives, and before the general-purpose rendering cores start fragment shading for each tile once rasterization has finished. That is, while the primitive lists are being built, the rasterization module and the general-purpose rendering cores are idle; and once the lists are built, a current primitive or tile must wait for the rasterization module or a general-purpose rendering core to finish the previous primitive or tile before it can proceed. The current TBR scheme therefore suffers from load imbalance.
Disclosure of Invention
In view of this, embodiments of the present invention are intended to provide an efficient render-ahead method, apparatus, and computer storage medium that can balance the load within the GPU and improve rendering efficiency.
The technical scheme of the embodiment of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides an efficient render-ahead method, the method including:

after the clipping and dividing module finishes dividing each primitive according to the tile size, immediately passing the divided primitive to the rasterization module for rasterization;

for a primitive whose rasterization has completed, scheduling, by a scheduler, a general-purpose rendering core from the rendering core array to render the primitive, according to the working states of the general-purpose rendering cores in the rendering core array and the tiles covered by the primitive;

performing, by the fragment shader module implemented on the general-purpose rendering core, fragment shading on the primitive for the tiles it covers, based on the scheduling of the scheduler.
In a second aspect, an embodiment of the present invention provides a graphics processor (GPU), the GPU including: a clipping and dividing module, a rasterization module, a scheduler, and a general-purpose rendering core; wherein

the clipping and dividing module is configured to immediately pass each primitive to the rasterization module after dividing it according to the tile size;

the rasterization module is configured to rasterize the incoming primitives and, after rasterization is finished, notify the general-purpose rendering core to perform fragment shading on the primitives;

the scheduler is configured to schedule, from the rendering core array, the general-purpose rendering core that will render a primitive whose rasterization has completed, according to the working states of the general-purpose rendering cores in the rendering core array and the tiles covered by the primitive;

the fragment shader module is configured to perform fragment shading on the primitive for the tiles it covers, based on the scheduling of the scheduler.
In a third aspect, embodiments of the present invention provide a computer storage medium storing an efficient render-ahead program that, when executed by at least one processor, implements the steps of the efficient render-ahead method of the first aspect.
The embodiments of the invention provide an efficient render-ahead method, device, and computer storage medium. As soon as the clipping and dividing module finishes dividing a primitive, the primitive is passed to the rasterization module for rasterization, and fragment shading follows once rasterization completes. There is thus no need, while primitives are being divided, to wait until all primitives are divided before invoking the rasterization module, or until rasterization of all primitives is done before invoking the fragment shader module. This raises the utilization of each rendering core in the rendering core array of the GPU's graphics rendering pipeline and balances the load across the rendering core array.
Drawings
FIG. 1 is a block diagram of a computing device in which one or more aspects of embodiments of the invention may be implemented;
FIG. 2 is a block diagram of a GPU in which one or more aspects of embodiments of the present invention may be implemented;
FIG. 3 is a block diagram of a graphics processing pipeline formed by the GPU architecture of FIG. 2;
FIG. 4 is a schematic diagram of a graphic to be rendered according to an embodiment of the present invention;
fig. 5 is a flow chart of an efficient render-ahead method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to FIG. 1, which illustrates a computing device 100 configured to implement one or more aspects of embodiments of the invention, the computing device 100 may include, but is not limited to, the following: wireless devices, mobile or cellular telephones (including so-called smart phones), Personal Digital Assistants (PDAs), video game consoles (including video displays, mobile video gaming devices, mobile video conferencing units), laptop computers, desktop computers, television set-top boxes, tablet computing devices, electronic book readers, fixed or mobile media players, and the like. In the example of FIG. 1, computing device 100 may include a Central Processing Unit (CPU) 102 and a system memory 104 that communicate via an interconnection path that may include a memory bridge 105. Memory bridge 105, which may be, for example, a north bridge chip, is connected to an I/O (input/output) bridge 107 via a bus or other communication path 106, such as a HyperTransport link. I/O bridge 107, which may be, for example, a south bridge chip, receives user input from one or more user input devices 108 (e.g., keyboard, mouse, trackball, touch screen, or other type of input means capable of being incorporated as part of display device 110) and forwards the input to CPU 102 via path 106 and memory bridge 105. A graphics processor (GPU) 112 is coupled to memory bridge 105 via a bus or other communication path 113 (e.g., PCI Express, Accelerated Graphics Port, or HyperTransport link); in one embodiment, GPU 112 may be a graphics subsystem that delivers pixels to a display device 110 (e.g., a conventional CRT or LCD based monitor). A system disk 114 is also connected to I/O bridge 107. Switch 116 provides a connection between I/O bridge 107 and other components such as network adapter 118 and various add-in cards 120 and 121.
Other components (not explicitly shown), including USB or other port connections, CD drives, DVD drives, film recording devices, and the like, may also be connected to I/O bridge 107. The communication paths interconnecting the various components in FIG. 1 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol, and the connections between the different devices may use different protocols as known in the art.
In one embodiment, GPU 112 includes circuitry optimized for graphics and video processing, including, for example, video output circuitry. In another embodiment, GPU 112 includes circuitry optimized for general purpose processing while preserving the underlying computing architecture. In yet another embodiment, GPU 112 may be integrated with one or more other system elements, such as memory bridge 105, CPU 102, and I/O bridge 107, to form a system on chip (SoC).
It should be understood that the system shown herein is exemplary and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of CPUs 102, and the number of GPUs 112, may be modified as desired. For example, in some embodiments, system memory 104 is directly connected to CPU 102 rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, GPU 112 is connected to I/O bridge 107 or directly to CPU 102 instead of to memory bridge 105. While in other embodiments the I/O bridge 107 and memory bridge 105 may be integrated onto a single chip. Numerous embodiments may include two or more CPUs 102 and two or more GPUs 112. The specific components shown herein are optional; for example, any number of add-in cards or peripheral devices may be supported. In some embodiments, switch 116 is removed and network adapter 118 and add-in cards 120, 121 are directly connected to I/O bridge 107.
Based on the computing device 100 shown in FIG. 1, FIG. 2 illustrates a schematic block diagram of a GPU 112 in which one or more aspects of embodiments of the present invention may be implemented, in which a graphics memory 204 may be part of the GPU 112. Thus, GPU 112 may read data from graphics memory 204 and write data to graphics memory 204 without using a bus. In other words, GPU 112 may process data locally using a local storage device rather than off-chip memory. Such graphics memory 204 may be referred to as on-chip memory. This allows GPU 112 to operate in a more efficient manner by eliminating the need for GPU 112 to read and write data via a bus, which may experience heavy bus traffic. However, in some cases, GPU 112 may not include separate memory, but rather utilize system memory 104 via a bus. Graphics memory 204 may include one or more volatile or nonvolatile memory or storage devices, such as random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, magnetic data media, or optical storage media.
Based on this, GPU 112 may be configured to perform various operations related to: pixel data is generated from graphics data provided by CPU 102 and/or system memory 104 via memory bridge 105 and bus 113, interacted with local graphics memory 204 (e.g., a common frame buffer) to store and update pixel data, transfer pixel data to display device 110, and so forth.
In operation, CPU 102 is the main processor of computing device 100, controlling and coordinating the operation of other system components. Specifically, CPU 102 issues commands that control the operation of GPU 112. In some embodiments, CPU 102 writes a command stream for GPU 112 into a data structure (not explicitly shown in fig. 1 or 2), which may be located in system memory 104, graphics memory 204, or other storage locations accessible to both CPU 102 and GPU 112. A pointer to each data structure is written to a push buffer (pushbuffer) to initiate processing of the command stream in the data structure. GPU 112 reads the command stream from the one or more push buffers and then executes the commands asynchronously with respect to the operation of CPU 102. Each push buffer may be assigned an execution priority to control scheduling of the different push buffers.
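The push-buffer mechanism above can be sketched as a priority queue over command streams. The priority values, buffer contents, and FIFO tie-breaking within one priority level are illustrative assumptions; the text only states that each push buffer may be assigned an execution priority.

```python
# Sketch: push buffers holding command streams, dispatched by priority.
# Lower value = higher priority here (an assumption); ties keep FIFO order.
import heapq

class PushBufferQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserving submission order within a priority

    def push(self, priority, commands):
        """Register a push buffer (a pointer to a command-stream data structure)."""
        heapq.heappush(self._heap, (priority, self._seq, commands))
        self._seq += 1

    def pop(self):
        """Return the command stream of the highest-priority pending buffer."""
        return heapq.heappop(self._heap)[2]

q = PushBufferQueue()
q.push(2, ["draw A"])
q.push(1, ["draw B"])
print(q.pop())  # the priority-1 buffer's commands are dispatched first
```

Because GPU 112 executes these streams asynchronously with respect to CPU 102, such a queue lets the driver reorder work without stalling the CPU.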
As particularly depicted in fig. 2, GPU 112 includes an I/O (input/output) unit 205 that communicates with the rest of computing device 100 via a communication path 113 that is connected to memory bridge 105 (or, in an alternative embodiment, directly to CPU 102). The connection of GPU 112 to the rest of computing device 100 may also vary. In some embodiments, GPU 112 may be implemented as an add-in card that may be inserted into an expansion slot of computer system 100. In other embodiments, GPU 112 may be integrated on a single chip with a bus bridge, such as memory bridge 105 or I/O bridge 107. While in other embodiments some or all of the elements of GPU 112 may be integrated with CPU 102 on a single chip.
In one embodiment, communication path 113 can be a PCI-EXPRESS link in which dedicated channels are assigned to GPU 112, as is known in the art. The I/O unit 205 generates data packets (or other signals) for transmission over the communication path 113 and also receives all incoming data packets (or other signals) from the communication path 113, directing the incoming data packets to the appropriate components of the GPU 112. For example, commands related to processing tasks may be directed to the scheduler 207, while commands related to memory operations (e.g., reads or writes to the graphics memory 204) may be directed to the graphics memory 204.
GPU 112 may include a rendering core array 230, which may include C general-purpose rendering cores 208 and D fixed-function rendering cores 209, where C > 1. Based on the general-purpose rendering cores 208 in array 230, GPU 112 is capable of concurrently executing a large number of program tasks or computing tasks. For example, each rendering core may be programmed to perform processing tasks related to a wide variety of programs, including, but not limited to, linear and nonlinear data transforms, video and/or audio data filtering, modeling operations (e.g., applying the laws of physics to determine the position, velocity, and other attributes of objects), graphics rendering operations (e.g., tessellation shader, vertex shader, geometry shader, and/or fragment shader programs), and so forth.
A fixed-function rendering core 209, by contrast, may include hardware hardwired to perform certain functions. Although fixed-function hardware may be configured to perform different functions via, for example, one or more control signals, it typically does not include program memory capable of receiving user-compiled programs. In some examples, the fixed-function rendering cores 209 may include, for example, a processing unit performing primitive assembly, a processing unit performing clipping and dividing operations, a processing unit performing rasterization, and a processing unit performing fragment operations. The processing unit performing primitive assembly restores the vertices fully shaded by the vertex shader unit into the mesh structure of the graphic, i.e., the primitives, according to their original connectivity, for processing by subsequent stages. The clipping and dividing operation comprises clipping and culling the assembled primitives and dividing them according to the tile size. The rasterization operation comprises converting primitives into fragments and outputting the fragments to the fragment shader. The fragment operations include, for example, the depth test, the scissor test, alpha blending, and the like; the pixel data output by these operations can be displayed as graphics data by display device 110. Combining the general-purpose rendering cores 208 and fixed-function rendering cores 209 in rendering core array 230 implements the logical model of a complete graphics rendering pipeline.
In addition, rendering core array 230 may receive processing tasks to be performed from scheduler 207. Scheduler 207 may independently schedule the tasks to be performed by resources of GPU 112, such as one or more rendering cores 208, 209 in rendering core array 230. In one example, the scheduler 207 may be a hardware processor. In the example shown in fig. 2, scheduler 207 may be included in GPU 112. In other examples, scheduler 207 may also be a separate unit from CPU 102 and GPU 112. Scheduler 207 may also be configured as any processor that receives a stream of commands and/or operations.
Scheduler 207 may process one or more command streams that include scheduling operations that are included in one or more command streams executed by GPU 112. In particular, scheduler 207 may process one or more command streams and schedule operations in the one or more command streams for execution by rendering core array 230. In operation, CPU 102, via GPU driver 103 included in system memory 104 in FIG. 1, may send a command stream to scheduler 207 that includes a series of operations to be performed by GPU 112. The scheduler 207 may receive an operation stream comprising a command stream through the I/O unit 205 and may sequentially process the operations of the command stream based on the order of operations in the command stream, and may schedule the operations in the command stream to be performed by one or more rendering cores in the array of rendering cores 230.
Tile cache 232 is a small amount of very high bandwidth memory located on-chip with GPU 112. However, the size of tile cache 232 is too small to hold the entire graphics data, so rendering core array 230 must perform multiple rendering passes to render all the graphics data. For example, rendering core array 230 may perform one rendering pass per tile of a frame of image. In particular, tile cache 232 may include one or more volatile or non-volatile memory or storage devices, such as random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), and the like. In some examples, tile cache 232 may be an on-chip buffer, meaning a buffer formed on, located on, and/or disposed on the same microchip, integrated circuit, and/or die as GPU 112. Furthermore, when tile cache 232 is implemented on the same chip as GPU 112, GPU 112 need not access tile cache 232 via communication path 113, but can instead use an internal communication interface (e.g., a bus) implemented on the same chip; because this interface is on-chip, it can operate at a higher bandwidth than communication path 113. Thus, although tile cache 232 has a limited storage capacity, adds some hardware overhead, and can only hold the data of one or a few small tiles at a time, it avoids the overhead of repeatedly accessing video memory, reduces bandwidth demand, and saves power.
Based on the description of FIGS. 1 and 2, FIG. 3 illustrates an example of a graphics rendering pipeline 80 formed by the structure of GPU 112 illustrated in FIG. 2. The core of graphics rendering pipeline 80 is a logical structure formed by cascading the general-purpose rendering cores 208 and fixed-function rendering cores 209 included in rendering core array 230; the scheduler 207, graphics memory 204, tile cache 232, and I/O unit 205 included in GPU 112 act as peripheral circuits or devices supporting this logical structure. Accordingly, graphics rendering pipeline 80 typically includes programmable stage modules (shown as rounded boxes in FIG. 3) and fixed-function stage modules (shown as square boxes in FIG. 3): the functions of the programmable stage modules may be performed by the general-purpose rendering cores 208 in rendering core array 230, and the functions of the fixed-function stage modules may be implemented by the fixed-function rendering cores 209 in rendering core array 230. As shown in FIG. 3, the graphics rendering pipeline 80 includes, in order, the following stages:
vertex grabbing module 82, shown in the example of FIG. 3 as a fixed function stage, is generally responsible for supplying graphics data (triangles, lines, and points) to graphics rendering pipeline 80. For example, vertex grabbing module 82 may collect vertex data for higher order surfaces, primitives, etc., and output the vertex data and attributes to vertex shader module 84.
Vertex shader module 84, shown as a programmable stage in fig. 3, is responsible for processing received vertex data and attributes, and processing vertex data by performing a set of operations for each vertex at a time.
Primitive assembly block 86, shown in FIG. 3 as a fixed function stage, is responsible for collecting vertices output by vertex shader block 84 and assembling the vertices into geometric primitives. For example, primitive assembly module 86 may be configured to group each three consecutive vertices into a geometric primitive (i.e., triangle). In some embodiments, a particular vertex may be reused for consecutive geometric primitives (e.g., two consecutive triangles in a triangle strip may share two vertices).
The clipping and dividing module 88, which is shown as a fixed function level in fig. 3, is responsible for dividing the assembled primitives according to the size of tile after clipping and eliminating the primitives;
rasterization module 90 is typically a fixed functional level responsible for preparing primitives for primitive shader module 92. For example, rasterizing module 90 may generate fragments for shading by fragment shader module 92.
The fragment shader module 92, shown in FIG. 3 as a programmable stage, receives fragments from the rasterization module 90 and generates per-pixel data such as color. The fragment shader module 92 may also perform per-pixel processing such as texture blending and illumination model computation.
The output merger module 94, shown in FIG. 3 as a fixed-function stage, is generally responsible for performing various operations on the pixel data, such as performing a transparency test (alpha test) and a stencil test, and blending the pixel data with other pixel data corresponding to other fragments associated with the pixel. When the output merger module 94 has completed processing pixel data (i.e., output data), the processed pixel data may be written to a render target to produce a final result.
In a conventional TBR scheme, the screen area is generally divided into equally sized tiles. For a frame of image, after the primitive assembly stage finishes, GPU 112 computes, from each primitive's extent, which tiles in the screen the primitive covers, and establishes a primitive list for each tile; whenever a tile is covered by a primitive, the corresponding primitive information is added to that tile's primitive list, until all primitives have been collected. In the rasterization and subsequent stages after collection, GPU 112 traverses the primitive list of each tile (a tile may be covered by multiple primitives), and each time a primitive in the list is rendered, the tile's data is written into the on-chip cache. The tile's final data is written to memory only after all primitives in its list have been processed. Specifically, in connection with the graphics rendering pipeline 80 shown in FIG. 3, the conventional TBR scheme may include the following steps: 1. vertex shader module 84 executes the vertex shading program on the vertices; 2. primitive assembly module 86 performs primitive assembly, and clipping and dividing module 88 performs clipping and tile-dividing operations; 3. step 2 is repeated until all primitives have been divided; 4. each tile is traversed, and rasterization module 90 performs the rasterization operation on each primitive in each tile's primitive list; 5. fragment shader module 92 executes the fragment shading program on the pixels of each primitive; 6. output merger module 94 performs depth testing, blending, and similar operations on the pixels of each primitive; 7. after each primitive is processed, the results are written back to on-chip storage, and once all primitives in a tile have been processed, the tile is written back to system memory 104.
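The steps above can be sketched as follows: all primitives are first binned into per-tile primitive lists, and only then does per-tile rasterization and shading begin. Modeling primitives as screen-space bounding boxes, and the shading/write-back steps as placeholder strings, are assumptions for illustration.

```python
# Sketch of the conventional TBR flow (steps 1-7 above): bin everything
# first, then render tile by tile. Primitives are modeled as inclusive
# screen-space bounding boxes (x0, y0, x1, y1) - an illustrative assumption.
def bin_primitives(primitives, tile_size, tiles_x, tiles_y):
    """Steps 2-3: append each primitive to the list of every tile it covers."""
    tile_lists = {(tx, ty): [] for tx in range(tiles_x) for ty in range(tiles_y)}
    for prim_id, (x0, y0, x1, y1) in enumerate(primitives):
        for ty in range(y0 // tile_size, y1 // tile_size + 1):
            for tx in range(x0 // tile_size, x1 // tile_size + 1):
                if (tx, ty) in tile_lists:
                    tile_lists[(tx, ty)].append(prim_id)
    return tile_lists

def render_frame(primitives, tile_size, tiles_x, tiles_y):
    tile_lists = bin_primitives(primitives, tile_size, tiles_x, tiles_y)  # steps 1-3
    framebuffer = {}
    for tile, prim_ids in tile_lists.items():                # step 4: traverse tiles
        on_chip = [f"shaded prim {p}" for p in prim_ids]     # steps 5-6, per primitive
        framebuffer[tile] = on_chip                          # step 7: one write-back
    return framebuffer

fb = render_frame([(0, 0, 40, 40), (50, 10, 70, 30)], tile_size=32, tiles_x=3, tiles_y=2)
print(fb[(0, 0)])
```

Note that `render_frame` touches the rasterizer and shader only after `bin_primitives` has finished for the whole frame; that serialization is exactly the idle period the next paragraphs identify.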
As can be seen from the above process, while steps 1, 2, and 3 are being performed, the fixed-function rendering core 209 used for the rasterization operation performs no work; and the general-purpose rendering cores 208 perform no work during steps 2, 3, and 4.
From the above description it can be seen that, in the current common unified rendering architecture, most rendering cores are idle while the primitive lists are being built, causing load imbalance and low rendering efficiency. The embodiments of the invention therefore aim to provide a technique that improves rendering efficiency: when rendering with a TBR scheme, load balancing of the rendering cores is achieved and rendering efficiency is improved. For example, after the tiles covered by each assembled primitive have been determined, the GPU proceeds to the subsequent rendering operations (such as rasterization, fragment shading, and test/blend operations) without waiting for all primitives to be assembled and divided first, thereby improving rendering efficiency and rendering core utilization and balancing the rendering core load of the GPU.
In some examples, the clipping and dividing module 88 is configured to, immediately after the division by tile size is completed for each primitive, pass the divided primitive to the rasterization module 90; the rasterization module 90 is configured to rasterize the incoming primitive and, after the rasterization operation is completed, notify the fragment shader module 92, implemented by the general rendering core 208, to perform fragment shading on the rasterized primitive; the scheduler 207 is configured to schedule, from the rendering core array 230, the general rendering core 208 used for rendering the rasterized primitive, according to the working state of the general rendering cores 208 in the rendering core array 230 and the tiles covered by the rasterized primitive; and the fragment shader module 92 is configured to perform, based on the scheduling by the scheduler 207, fragment shading on the tiles covered by the primitive that has completed the rasterization operation.
According to the above example, compared with the conventional TBR scheme, each primitive is rasterized as soon as the clipping and dividing module 88 finishes dividing it, and the subsequent fragment shading is performed once the rasterization operation is completed; there is no need to wait until all primitives have been divided before invoking the rasterization module 90 and the general rendering cores 208. This improves the utilization of each rendering core 208, 209 in the rendering core array 230 of the graphics rendering pipeline 80 in the GPU 112, and balances the load across the rendering core array 230.
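The contrast drawn above can be illustrated with two driver loops; `divide` and `raster` are placeholders standing in for the clipping and dividing module 88 and the rasterization module 90, not the patent's actual interfaces.

```python
# Hedged sketch: the conventional flow defers all rasterization until binning
# finishes, while the proposed flow forwards each primitive immediately.

def conventional(prims):
    log = []
    for p in prims:
        log.append(("divide", p))   # build every tile's primitive list first
    for p in prims:
        log.append(("raster", p))   # rasterization starts only after ALL divisions
    return log

def proposed(prims):
    log = []
    for p in prims:
        log.append(("divide", p))   # as soon as one primitive is divided...
        log.append(("raster", p))   # ...it is forwarded to rasterization
    return log

conv = conventional([1, 2])
prop = proposed([1, 2])
```

The interleaved order of the second log is what keeps the rasterization hardware and the general rendering cores busy during the binning phase.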
In the above example, the clipping and dividing module 88 is further configured to create, after the division by tile size is completed for each primitive, a primitive relation table for the divided primitive, and to store the primitive relation table in the system memory 104. In a specific implementation, the primitive relation table includes at least: all state information required for rendering the divided primitive; the tiles covered by the divided primitive; a first flag bit, corresponding to each tile, identifying whether that tile of the divided primitive is being processed by a general rendering core 208; and a second flag bit identifying whether the divided primitive has been completely rendered.
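A possible layout of the primitive relation table just described can be sketched as follows; the field names are assumptions for illustration, since the patent only specifies what each entry must hold.

```python
# Illustrative model of one primitive relation table entry: render state, the
# covered tiles, a per-tile "being processed" flag (first flag bit), and a
# "fully rendered" flag (second flag bit).
from dataclasses import dataclass, field

@dataclass
class PrimitiveRelationTable:
    render_state: dict                 # all state needed to render the primitive
    covered_tiles: list                # tiles covered by the divided primitive
    in_progress: dict = field(default_factory=dict)
    fully_rendered: bool = False       # second flag bit: rendering complete?

    def __post_init__(self):
        # First flag bit: one "being processed by a core" marker per tile.
        self.in_progress = {t: False for t in self.covered_tiles}

entry = PrimitiveRelationTable(render_state={"shader": "flat"},
                               covered_tiles=[(0, 0), (1, 0)])
```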
In the above example, preferably, the scheduler 207 is configured to read, from the system memory 104, the primitive relation table corresponding to the primitive that has completed the rasterization operation; to allocate general rendering cores 208 according to the tiles, recorded in the primitive relation table, covered by the primitive that has completed the rasterization operation, and to update a schedule table based on this allocation, wherein the schedule table characterizes, based on the correspondence between general rendering cores 208 and tiles, the identifier of each general rendering core 208 included in the rendering core array 230, the working state of the general rendering core 208, the tile identifier covered by the primitive, and the identifier of the primitive covering the tile; and to schedule the general rendering cores 208 according to the schedule table and the current working state of the general rendering cores 208, so as to perform the fragment shading operation on the primitive that has completed the rasterization operation.
For the preferred example described above, specifically, the scheduler 207 is configured to check the working state of all general rendering cores 208 in the rendering core array 230 after the primitive relation table is read; for a general rendering core 208 in the idle state, to schedule the tile, among the tiles covered by the primitive that has completed the rasterization operation, whose tile mark is the same as the tile mark corresponding to that idle general rendering core 208, to that core for fragment shading; and, for a general rendering core 208 in the busy state, to wait until it is converted into the idle state and then schedule the tile, among the tiles covered by the primitive that has completed the rasterization operation, whose tile mark is the same as the tile mark corresponding to that core, to it for fragment shading processing.
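The dispatch rule just described can be sketched as follows; the dictionary-based core table and the function names are illustrative assumptions, not the scheduler 207's actual structures.

```python
# Sketch of the idle-core dispatch rule: a tile whose mark matches the tile
# already bound to a core must go back to that same core; an unbound tile may
# take any idle core; a tile bound to a busy core must wait.

def dispatch(tiles, cores):
    """cores: {core_id: {"busy": bool, "tile": tile_mark_or_None}}.
    Returns {tile: core_id} for tiles schedulable right now; the rest wait."""
    assigned = {}
    for tile in tiles:
        bound = [c for c, s in cores.items() if s["tile"] == tile]
        if bound:
            core = bound[0]
            if not cores[core]["busy"]:        # same tile mark, core idle: reuse it
                assigned[tile] = core
                cores[core]["busy"] = True
            continue                           # same tile mark, core busy: wait
        for c, s in cores.items():             # otherwise take an idle, unbound core
            if not s["busy"] and s["tile"] is None:
                assigned[tile] = c
                s["busy"] = True
                s["tile"] = tile
                break
    return assigned

cores = {0: {"busy": False, "tile": None},
         1: {"busy": False, "tile": (2, 2)},
         2: {"busy": True,  "tile": (3, 3)}}
result = dispatch([(2, 2), (5, 5), (3, 3)], cores)
```

Here tile (2, 2) returns to core 1 (same tile mark), tile (5, 5) takes the unbound idle core 0, and tile (3, 3) waits for busy core 2.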
Further to the above, the scheduler 207 is also configured to determine, according to a set maintenance policy, an eviction operation or a maintenance operation for the situation in which all general rendering cores 208 are busy after some of the tiles covered by the primitive that has completed the rasterization operation have been scheduled to the corresponding general rendering cores 208. The eviction operation includes evicting already-processed tiles from busy general rendering cores 208, so that those cores can render the remaining tiles covered by the primitive that has completed the rasterization operation; the maintenance operation includes keeping the current state unchanged and waiting for a busy general rendering core 208 to be converted into the idle state before scheduling work onto it.
For the above example, take the figure shown in FIG. 4 as an example: the size of a single tile is set to 32×32, and the rendering core array 230 in the GPU 112 includes 32 general rendering cores 208 for performing vertex shading and fragment shading operations; each general rendering core 208 has at least one on-chip cache of 32×32×3 bytes, i.e., each on-chip cache can store at least one tile's color, stencil, and depth values. The screen size of the display device 110 is 1080×1920, so the screen can be divided into 60×34 tiles in total. Suppose three triangle primitives need to be drawn in the current frame, corresponding respectively to triangle No. 1 (the solid-outline transparent triangle in FIG. 4), triangle No. 2 (the solid-outline gray-filled triangle in FIG. 4), and triangle No. 3 in FIG. 4; in the embodiment of the present invention, for the two cases to be described later, triangle No. 3 is exemplified by triangle No. 3A (the dashed-outline transparent triangle in FIG. 4) and triangle No. 3B (the dashed-outline gray-filled triangle in FIG. 4). The tiles covered by each triangle are shown in FIG. 4. In connection with FIG. 1, FIG. 2, and FIG. 3, if the graphics shown in FIG. 4 are rendered by the above exemplary technical solution, the specific flow may include the following steps:
Step 1: after the CPU 102 has prepared the vertex data, the GPU driver 203 controls the general rendering core 208 to perform the vertex shading operation on the 9 vertices and to write the shaded vertex data back to the system memory 104.
Step 2: vertex grabbing module 82 included in graphics rendering pipeline 80 retrieves processed vertex data from system memory 104; primitive assembling module 86 assembles the three vertices into a triangle based on the primitive information;
step 3: the clipping and dividing module 88 performs clipping operations on the first assembled triangle after it is received;
step 4: the clipping and dividing module 88 divides the first clipped triangle by the size of tile; after the division, the triangle is added into the created primitive list, specifically, the primitive list needs to contain all state information needed by triangle rendering, and two flag bits are used to respectively mark whether the current tile is being processed by the general rendering core 208, and whether all primitives in the list are processed. Finally, the primitive list is written back to the system memory 104;
Step 5: the vertex grabbing module 82, primitive assembling module 86, and clipping and dividing module 88 process the second triangle according to the same procedure as steps 2 to 4 above; at the same time, the rasterization module 90 begins the rasterization operation on the first triangle. After the rasterization operation is completed, the fragment shader module 92 implemented by the general rendering core 208 is notified to process the tiles covered by the first triangle, and the rasterization module then rasterizes the second triangle once it has passed through steps 2 to 4.
Step 6: after receiving the processing request, the fragment shader module 92 obtains the primitive list constructed in step 4 from the system memory 104, allocates the tiles covered by the first triangle, and records the general rendering core ID and triangle ID corresponding to each tile. The processing request for the second triangle is then handled in the same way.
Step 7: the fragment shader module 92 begins processing the tiles covered by the first triangle; since each general rendering core 208 processes one tile, the 32 general rendering cores 208 can process 32 tiles simultaneously.
For step 7, the scheduler 207 maintains a schedule table whose exemplary template is shown in Table 1. In Table 1, the first column indicates the ID of the general rendering core; the second column indicates whether that general rendering core is currently busy; the third column indicates the ID of the tile covered by the triangle; and the last column indicates which primitive the tile currently being executed by the general rendering core 208 belongs to.
TABLE 1
General rendering core ID | General rendering core state | Tile ID | Triangle ID
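The schedule table of Table 1 can be modeled as rows with the four columns named there; the values below are illustrative, not taken from FIG. 4.

```python
# Illustrative in-memory model of the scheduler 207's schedule table: one row
# per general rendering core, with the four columns of Table 1.
schedule = [
    {"core_id": 0, "state": "busy", "tile_id": (0, 0), "triangle_id": 1},
    {"core_id": 1, "state": "idle", "tile_id": None,   "triangle_id": None},
]

def cores_in_state(table, state):
    """Return the IDs of cores whose state column matches the given state."""
    return [row["core_id"] for row in table if row["state"] == state]

idle_cores = cores_in_state(schedule, "idle")
```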
Based on the schedule-table template shown in Table 1, when the fragment shader module 92 begins processing the first triangle, e.g., triangle No. 1 in FIG. 4, the schedule table is updated as shown in Table 2:
TABLE 2
After the fragment shader module 92 finishes processing triangle No. 1, the color, depth, and stencil values of the pixels in each tile are written back to the on-chip cache. When triangle No. 2 arrives, the scheduler 207 first checks the state of each general rendering core 208: if there are idle general rendering cores 208, the tiles covered by triangle No. 2 are allocated to them, and a tile whose tile ID matches the tile ID already bound to a core is allocated to that same general rendering core 208. When a general rendering core 208 is already occupied by a tile and is in the idle state, a tile with a different tile ID cannot be allocated to it, otherwise an error would occur. If the required general rendering core 208 is busy at this time, it is necessary to wait for it to change from busy to idle, as with general rendering cores 208 Nos. 9, 10, and 11. At this point the schedule table is updated as shown in Table 3:
TABLE 3
When triangle No. 3 arrives, for example triangle No. 3A, the scheduler 207 again checks for idle general rendering cores 208; as can be seen from Table 3, only 2 general rendering cores 208 remain idle and unoccupied by any tile. As can be seen from FIG. 4, the number of tiles covered by triangle No. 3A is greater than 2; two tiles of triangle No. 3A are therefore allocated to general rendering cores 208 Nos. 30 and 31, and, per the foregoing description, all general rendering cores 208 are now fully occupied. The scheduler 207 then analyzes the current situation to decide whether to evict tiles from the table, freeing general rendering cores 208 to process the remaining tiles of triangle No. 3A, or to maintain the state of the current table. In the embodiment of the present invention, the scheduler 207 may use the following three principles as the maintenance policy for the schedule table: 1. preferentially process tiles covered by more primitives; 2. preferentially process triangle-dense regions; 3. preferentially discard tiles at the boundary. Based on these exemplary maintenance principles, the scheduler 207 finds through analysis that triangle No. 3A is far from the triangles in the current table, so the scheduler 207 decides to maintain the current state of the table and not to process the remaining tiles of triangle No. 3A for now.
If, however, the last triangle is triangle No. 3B, then according to the maintenance principles above, tiles (0, 0), (0, 1), (0, 2), (0, 3), and (0, 4) are discarded; accordingly, the on-chip caches of the general rendering cores 208 occupied by these tiles are cleared, and the scheduler 207 also updates the relevant flag bits in the primitive list.
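The three maintenance principles applied in the 3A and 3B cases can be folded into one illustrative tile score; the weights, the density flag, and the evict-if-outscored rule are assumptions for illustration, not the scheduler 207's specified behavior.

```python
# Hedged sketch of the maintenance policy: score each tile by the three stated
# principles, then evict only if the incoming tile outscores the weakest
# resident tile; otherwise maintain the current schedule table.

def tile_score(prim_count, in_dense_region, on_boundary):
    """Higher = worth keeping a core for; lower = candidate for eviction."""
    score = prim_count                # principle 1: tiles covered by more primitives
    if in_dense_region:
        score += 2                    # principle 2: triangle-dense regions first
    if on_boundary:
        score -= 3                    # principle 3: boundary tiles are dropped first
    return score

def decide(incoming, resident):
    """incoming / resident entries: (prim_count, dense, boundary) tuples."""
    worst = min(tile_score(*t) for t in resident)
    return "evict" if tile_score(*incoming) > worst else "maintain"

# A far-away, sparse tile (the 3A case) loses: the table is maintained.
case_a = decide((1, False, False), [(2, True, False), (1, True, False)])
# A dense tile vs. a resident boundary tile (the 3B case): eviction wins.
case_b = decide((2, True, False), [(1, False, True), (2, True, False)])
```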
Step 8: after all tiles have been processed according to step 7, the output merger module 94 of the graphics rendering pipeline 80 performs depth testing and blending operations on each tile, and the results are finally written back to the system memory 104.
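The sizing figures used throughout this example (32×32 tiles, a 60×34 tile grid for the 1080×1920 screen, and a 32×32×3-byte on-chip cache) can be checked in a few lines; the one-byte-per-component cache layout is an assumption implied by the 3-byte-per-pixel figure.

```python
# Quick arithmetic check of the example's parameters.
import math

tile = 32
screen_long, screen_short = 1920, 1080        # display device 110 is 1080x1920
tiles_long = math.ceil(screen_long / tile)    # tiles along the 1920-pixel edge
tiles_short = math.ceil(screen_short / tile)  # tiles along the 1080-pixel edge (rounded up)
cache_bytes = tile * tile * 3                 # color + stencil + depth, one byte each
```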
From the foregoing description it can be understood that, compared with the conventional TBR scheme, with the scheme for improving rendering efficiency according to the embodiment of the present invention, by the time all primitive lists have been created, some of the tiles covered by a larger number of primitives have already been rendered; the rendering efficiency of TBR is thereby improved, and the gain becomes more significant as the number of general rendering cores and the amount of on-chip storage increase.
Accordingly, referring to fig. 5, an efficient rendering-ahead method provided by an embodiment of the present invention may be applied to the GPUs shown in fig. 1, 2 and 3, and the method may include:
S501: immediately after the clipping and dividing module 88 completes the division by tile size for each primitive, the divided primitive is sent to the rasterization module 90 for the rasterization operation;
S502: for a primitive that has completed the rasterization operation, the scheduler 207 schedules, from the rendering core array, the general rendering core 208 used for rendering that primitive, according to the working state of the general rendering cores in the rendering core array and the tiles covered by the primitive that has completed the rasterization operation;
S503: based on the scheduling of the scheduler, the fragment shader module 92 implemented by the general rendering core performs fragment shading processing on the primitive that has completed the rasterization operation, for the tiles covered by that primitive.
In some examples, the method further comprises: after the clipping and dividing module 88 completes the division by tile size for each primitive, creating a primitive relation table for the divided primitive and storing the primitive relation table in the system memory 104; the primitive relation table includes at least all state information required for rendering the divided primitive, the tiles covered by the divided primitive, a first flag bit, corresponding to each tile, identifying whether that tile of the divided primitive is being processed by a general rendering core 208, and a second flag bit identifying whether the divided primitive has been completely rendered.
In some examples, the scheduling, by the scheduler 207, the universal rendering core 208 for rendering the primitive that has completed the rasterization operation from the rendering core array according to the working state of the universal rendering core in the rendering core array and the tile covered by the primitive that has completed the rasterization operation, includes:
reading, by the scheduler 207, a primitive relation table corresponding to the primitive for which the rasterization operation has been completed from the system memory 104;
allocating, by the scheduler 207, a general rendering core 208 according to the tile covered by the primitive that has completed the rasterization operation and recorded in the primitive relation table, and updating a scheduling table based on the allocation of the general rendering core 208, where the scheduling table is used to characterize an identifier of the general rendering core 208, a working state of the general rendering core 208, a tile identifier covered by the primitive, and a primitive identifier covered by the tile included in the rendering core array 230 based on a correspondence between the general rendering core 208 and the tile;
and dispatching the general purpose rendering core 208 by the dispatcher 207 according to the dispatching table and the working state of the current general purpose rendering core 208 so as to perform fragment shading operation on the primitive which has completed rasterization operation.
In some examples, the scheduling, by the scheduler 207, the general purpose rendering core 208 to perform a fragment shading operation on the primitive that has completed the rasterization operation according to the schedule table and the current working state of the general purpose rendering core 208 includes:
after the primitive relation table is read by the scheduler 207, checking the working states of all the general rendering cores 208 in the rendering core array 230;
corresponding to the idle-state general rendering core 208, scheduling, by the scheduler 207, the tile with the same tile mark as the tile mark corresponding to the idle-state general rendering core 208 in the tile covered by the primitive which has completed the rasterization operation to the idle-state general rendering core 208 for fragment coloring;
and, for a general rendering core 208 in the busy state, after the scheduler 207 waits for it to be converted into the idle state, scheduling the tile, among the tiles covered by the primitive that has completed the rasterization operation, whose tile mark is the same as the tile mark corresponding to the general rendering core 208 converted into the idle state, to that general rendering core 208 for fragment coloring.
In some examples, the method further comprises:
For the situation in which all general rendering cores 208 are busy after some of the tiles covered by the primitive that has completed the rasterization operation have been scheduled to the corresponding general rendering cores 208, determining, by the scheduler 207, an eviction operation or a maintenance operation according to a set maintenance policy; the eviction operation includes evicting already-processed tiles from busy general rendering cores 208 so that those cores can render the remaining tiles covered by the primitive that has completed the rasterization operation; the maintenance operation includes keeping the current state unchanged and waiting for a busy general rendering core 208 to be converted into the idle state before scheduling it.
For the above example, the maintenance policy includes at least one of:
preferentially processing tiles covered by more primitives;
preferentially processing triangle-dense regions;
preferentially discarding tiles at the boundary.
In one or more of the examples described above, the described functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media may include computer data storage media or communication media, including any medium that facilitates transfer of a computer program from one place to another. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementing the techniques described in this disclosure. By way of example, and not limitation, such computer-readable media can comprise a USB flash drive, removable hard disk, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The code may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other equivalent programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. Thus, the terms "processor" and "processing unit" as used herein may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. Additionally, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Moreover, the techniques may be fully implemented in one or more circuits or logic elements.
The techniques of embodiments of the present invention may be implemented in a wide variety of devices or apparatuses including a wireless handset, an Integrated Circuit (IC), or a set of ICs (i.e., a chipset). The various components, modules, or units are described in this disclosure in order to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Indeed, as described above, the various units may be combined in a codec hardware unit in combination with suitable software and/or firmware, or provided by a collection of interoperable hardware units, including one or more processors as described above.
Various aspects of the invention have been described. These and other embodiments are within the scope of the following claims. It should be noted that the technical schemes described in the embodiments of the present invention may be combined arbitrarily, provided there is no conflict between them.
The foregoing is merely a specific embodiment of the present invention, and the present invention is not limited thereto; any person skilled in the art can readily conceive of variations or substitutions within the technical scope disclosed herein, and these shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. An efficient render-ahead method, the method comprising:
after the clipping and dividing module finishes dividing each primitive according to tile size, immediately transmitting the divided primitive to a rasterization module for the rasterization operation;
for the primitives with the completed rasterization operation, scheduling the universal rendering core for rendering the primitives with the completed rasterization operation from the rendering core array by a scheduler according to the working state of the universal rendering core in the rendering core array and the tile covered by the primitives with the completed rasterization operation;
performing, by a fragment shader module implemented by the general rendering core, fragment coloring processing on the primitive that has completed the rasterization operation, for the tiles covered by that primitive, based on the scheduling of the scheduler.
2. The method according to claim 1, wherein the method further comprises:
after the clipping and dividing module finishes dividing each primitive according to tile size, creating a primitive relation table for the divided primitive, and storing the primitive relation table in a system memory; the primitive relation table comprises at least all state information required for rendering the divided primitive, the tiles covered by the divided primitive, a first flag bit, corresponding to each tile, for identifying whether the divided primitive is being processed by a general rendering core, and a second flag bit for identifying whether the divided primitive has been completely rendered.
3. The method of claim 2, wherein the scheduling, by the scheduler, the generic rendering core for rendering the rasterized primitive from the array of rendering cores according to the operational state of the generic rendering core in the array of rendering cores and tile covered by the rasterized primitive, comprises:
Reading a primitive relation table corresponding to the primitive which has completed the rasterization operation from the system memory through the scheduler;
distributing universal rendering cores according to the tiles covered by the primitives which are recorded in the primitive relation table and completed with the rasterization operation through the scheduler, and updating a scheduling table based on the distribution of the universal rendering cores, wherein the scheduling table is used for representing the identifiers of the universal rendering cores, the working states of the universal rendering cores, the tile identifiers covered by the primitives and the primitive identifiers covered by the tiles, which are included in the rendering core array, based on the corresponding relation between the universal rendering cores and the tiles;
and dispatching the general rendering core by the dispatcher according to the dispatching table and the working state of the current general rendering core so as to carry out fragment coloring operation on the primitives subjected to the rasterization operation.
4. A method according to claim 3, wherein said scheduling, by the scheduler, the general purpose rendering core to perform a fragment shading operation on the primitives for which the rasterization operation has been completed according to the schedule table and the working state of the current general purpose rendering core, comprises:
after the primitive relation table is read by the scheduler, checking the working states of all the general rendering cores in the rendering core array;
Corresponding to the idle-state general rendering core, scheduling the tile with the same tile mark as the tile mark corresponding to the idle-state general rendering core in the tiles covered by the primitives subjected to the rasterization operation to the idle-state general rendering core through the scheduler to perform fragment coloring processing;
and, for a busy-state general rendering core, after the scheduler waits for it to be converted into an idle state, scheduling the tile, among the tiles covered by the primitive that has completed the rasterization operation, whose tile mark is the same as the tile mark corresponding to the general rendering core converted into the idle state, to that general rendering core for fragment coloring processing.
5. The method according to claim 4, wherein the method further comprises:
for the situation in which all the general rendering cores are busy after some of the tiles covered by the primitive that has completed the rasterization operation have been scheduled to the corresponding general rendering cores, determining, by the scheduler, an eviction operation or a maintenance operation according to a set maintenance policy; the eviction operation comprises evicting already-processed tiles from busy general rendering cores so that those cores can render the remaining tiles covered by the primitive that has completed the rasterization operation; the maintenance operation comprises keeping the current state unchanged and waiting for a busy general rendering core to be converted into the idle state before scheduling it.
6. The method of claim 5, wherein the maintenance policy comprises at least one of:
preferentially processing tiles covered by more primitives;
preferentially processing triangle-dense regions;
preferentially discarding tiles at the boundary.
7. A graphics processor GPU, the GPU comprising: a clipping and dividing module, a rasterization module, a scheduler and a general rendering core; wherein,,
the clipping and dividing module is configured to immediately transmit the divided primitives to the rasterization module after dividing each primitive according to the size of tile;
the rasterization module is configured to perform the rasterization operation on the incoming primitive and, after the rasterization operation is completed, notify the general rendering core to perform fragment coloring processing on the primitive;
the scheduler is configured to schedule the universal rendering core for rendering the primitive which has completed the rasterization operation from the rendering core array according to the working state of the universal rendering core in the rendering core array and the tile covered by the primitive which has completed the rasterization operation;
the general rendering core is configured to perform fragment coloring processing on the primitive which has completed the rasterization operation for the tile covered by the primitive which has completed the rasterization operation based on the scheduling of the scheduler.
8. The GPU of claim 7, wherein the clipping and dividing module is further configured to create, after finishing dividing each primitive according to tile size, a primitive relation table for the divided primitive, and to store the primitive relation table in a system memory; the primitive relation table comprises at least all state information required for rendering the divided primitive, the tiles covered by the divided primitive, a first flag bit, corresponding to each tile, for identifying whether the divided primitive is being processed by a general rendering core, and a second flag bit for identifying whether the divided primitive has been completely rendered.
9. The GPU of claim 8, wherein the scheduler is configured to:
reading, from the system memory, the primitive relation table corresponding to the primitive that has completed the rasterization operation;
allocating general rendering cores according to the tiles, recorded in the primitive relation table, that are covered by the primitive that has completed the rasterization operation, and updating a schedule table based on the allocation of the general rendering cores, wherein the schedule table represents, based on the correspondence between general rendering cores and tiles, the identifier of each general rendering core in the rendering core array, the working state of each general rendering core, the identifier of the tile covered by the primitive, and the identifier of the primitive covering that tile; and
scheduling a general rendering core according to the schedule table and the current working states of the general rendering cores, to perform the fragment shading operation on the primitive that has completed the rasterization operation.
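The claim-9 schedule table and its update can be sketched as a simple mapping. The row layout and the `allocate` policy (reuse the core that already owns the tile, else take any idle core) are illustrative assumptions, not the claimed algorithm's exact form.

```python
# Hypothetical schedule table: one row per general rendering core in the array.
# Each row records the core's working state, the tile it owns, and the ids of
# the primitives covering that tile, as required by claim 9.
schedule_table = {
    0: {"state": "idle", "tile": None, "primitives": []},
    1: {"state": "busy", "tile": 4, "primitives": [17]},
}

def allocate(schedule_table, tile_id, primitive_id):
    """Record the core<->tile correspondence when a tile of a rasterized
    primitive is handed to a general rendering core."""
    for core_id, row in schedule_table.items():
        if row["tile"] == tile_id:            # a core already owns this tile
            row["primitives"].append(primitive_id)
            return core_id
    for core_id, row in schedule_table.items():
        if row["state"] == "idle":            # otherwise take any idle core
            row.update(state="busy", tile=tile_id)
            row["primitives"].append(primitive_id)
            return core_id
    return None                               # every core busy: claim 11 applies

core = allocate(schedule_table, tile_id=3, primitive_id=21)
```

Keeping the core-to-tile binding stable means all primitives touching one tile are shaded by the same core, so per-tile framebuffer data stays local to that core.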
10. The GPU of claim 9, wherein the scheduler is configured to:
after the primitive relation table has been read, checking the working states of all general rendering cores in the rendering core array;
for a general rendering core in the idle state, scheduling, among the tiles covered by the primitive that has completed the rasterization operation, the tile whose identifier matches the tile identifier corresponding to that idle general rendering core, to that general rendering core for fragment shading; and
after a general rendering core in the busy state has turned idle, scheduling, among the tiles covered by the primitive that has completed the rasterization operation, the tile whose identifier matches the tile identifier corresponding to the general rendering core that turned idle, to that general rendering core for fragment shading.
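The claim-10 dispatch policy (send each covered tile to its matching core when idle, retry when a busy core turns idle) might look like the loop below. The `core_of_tile` map, the synchronous `shade` callback, and the retry list are all hypothetical simplifications of the hardware behavior.

```python
def dispatch(covered_tiles, core_of_tile, core_state, shade):
    """Send each covered tile to the core whose tile identifier matches;
    tiles whose core is busy are retried once that core turns idle."""
    pending = list(covered_tiles)
    while pending:
        still_waiting = []
        for tile in pending:
            core = core_of_tile[tile]       # fixed core<->tile correspondence
            if core_state[core] == "idle":
                core_state[core] = "busy"
                shade(core, tile)           # fragment shading on that core
                core_state[core] = "idle"   # core returns to idle when done
            else:
                still_waiting.append(tile)  # hold the tile until the core idles
        pending = still_waiting

shaded = []
states = {0: "idle", 1: "idle"}
dispatch([3, 4], core_of_tile={3: 0, 4: 1}, core_state=states,
         shade=lambda core, tile: shaded.append((core, tile)))
```

In this sketch shading is synchronous, so every core returns to idle immediately; in hardware the busy-to-idle transition is asynchronous and the scheduler simply reacts to it.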
11. The GPU of claim 10, wherein the scheduler is further configured to:
when all general rendering cores are in the busy state after some of the tiles covered by the primitive that has completed the rasterization operation have been scheduled to their corresponding general rendering cores, determine a culling operation or a holding operation according to a set maintenance strategy; wherein the culling operation comprises culling an already-processed tile from a general rendering core in the busy state so that the core can render the remaining tiles covered by the primitive that has completed the rasterization operation, and the holding operation comprises keeping the current state unchanged and waiting until a general rendering core in the busy state turns idle before scheduling to the core that turned idle.
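The claim-11 maintenance strategy reduces to a two-way policy choice when every core is busy. The sketch below is a minimal illustration; the policy names, the `evict` callback, and picking the first busy core are assumptions, not the claimed strategy's selection rule.

```python
def maintain(policy, busy_cores, evict):
    """When every core is busy, either cull ("cull") an already-shaded tile
    from a busy core so the remaining covered tiles can render there, or
    hold ("hold") the current state and wait for a core to turn idle."""
    if policy == "cull":
        core = busy_cores[0]   # assumption: pick some busy core whose tile is done
        evict(core)            # free that core for the remaining covered tiles
        return core
    return None                # hold: state unchanged, caller keeps waiting

evicted = []
chosen = maintain("cull", busy_cores=[1, 2], evict=evicted.append)
```

The trade-off the claim encodes: culling frees a core sooner at the cost of flushing its tile state, while holding avoids that flush but stalls the remaining tiles.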
12. A computer storage medium storing an efficient render-ahead program which, when executed by at least one processor, implements the steps of the efficient render-ahead method of any one of claims 1-7.
CN201911380883.2A 2019-12-27 2019-12-27 Efficient rendering-ahead method, device and computer storage medium Active CN111062858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911380883.2A CN111062858B (en) 2019-12-27 2019-12-27 Efficient rendering-ahead method, device and computer storage medium


Publications (2)

Publication Number Publication Date
CN111062858A CN111062858A (en) 2020-04-24
CN111062858B true CN111062858B (en) 2023-09-15

Family

ID=70304206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911380883.2A Active CN111062858B (en) 2019-12-27 2019-12-27 Efficient rendering-ahead method, device and computer storage medium

Country Status (1)

Country Link
CN (1) CN111062858B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446816B (en) * 2021-02-01 2021-04-09 成都点泽智能科技有限公司 Video memory dynamic data storage method and device and server
CN112801855B (en) * 2021-04-14 2021-07-20 南京芯瞳半导体技术有限公司 Method and device for scheduling rendering task based on graphics primitive and storage medium
CN116703693A (en) * 2021-08-18 2023-09-05 荣耀终端有限公司 Image rendering method and electronic equipment
CN116263982B (en) * 2022-04-20 2023-10-20 象帝先计算技术(重庆)有限公司 Graphics processor, system, method, electronic device and apparatus
WO2023202367A1 (en) * 2022-04-20 2023-10-26 象帝先计算技术(重庆)有限公司 Graphics processing unit, system, apparatus, device, and method
CN115908102A (en) * 2022-08-23 2023-04-04 芯动微电子科技(珠海)有限公司 Graphic processing method and system
CN116385253B (en) * 2023-01-06 2024-07-23 格兰菲智能科技股份有限公司 Primitive drawing method, device, computer equipment and storage medium
CN115841433B (en) * 2023-02-20 2023-05-09 摩尔线程智能科技(北京)有限责任公司 Method and device for rasterizing based on image blocks, and method and device for rendering image
CN116188244B (en) * 2023-04-25 2023-07-25 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment and storage medium for distributing image blocks
CN116681575B (en) * 2023-07-27 2023-12-19 南京砺算科技有限公司 Graphics processing unit, graphics rendering method, storage medium, and terminal device
CN117689790B (en) * 2024-01-29 2024-05-14 深圳中微电科技有限公司 Pixel rendering order-preserving method, system and storage medium based on buffer zone

Citations (1)

Publication number Priority date Publication date Assignee Title
CN110544290A (en) * 2019-09-06 2019-12-06 广东省城乡规划设计研究院 data rendering method and device

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
KR102646906B1 (en) * 2016-11-17 2024-03-12 삼성전자주식회사 Tile-based rendering method and apparatus

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN110544290A (en) * 2019-09-06 2019-12-06 广东省城乡规划设计研究院 data rendering method and device

Non-Patent Citations (1)

Title
Yu Ping. Research and application of a GPU-accelerated radiosity lighting algorithm. Foreign Electronic Measurement Technology. 2016, (11), full text. *

Also Published As

Publication number Publication date
CN111062858A (en) 2020-04-24

Similar Documents

Publication Publication Date Title
CN111062858B (en) Efficient rendering-ahead method, device and computer storage medium
CN112801855B (en) Method and device for scheduling rendering task based on graphics primitive and storage medium
CN106897143B (en) Graphics processing system and method for processing primitive fragments in a graphics processing system
KR101813429B1 (en) Shader pipeline with shared data channels
EP2016560B1 (en) Advanced anti-aliasing with multiple graphics processing units
KR101697910B1 (en) Fault-tolerant preemption mechanism at arbitrary control points for graphics processing
WO2024040815A1 (en) Graphic processing method and system
US20080055321A1 (en) Parallel physics simulation and graphics processing
CN111080761B (en) Scheduling method and device for rendering tasks and computer storage medium
CN103793893A (en) Primitive re-ordering between world-space and screen-space pipelines with buffer limited processing
EP3350766B1 (en) Storing bandwidth-compressed graphics data
TW201432609A (en) Distributed tiled caching
EP3353746B1 (en) Dynamically switching between late depth testing and conservative depth testing
US9396515B2 (en) Rendering using multiple render target sample masks
CN111127299A (en) Method and device for accelerating rasterization traversal and computer storage medium
CN117058288A (en) Graphics processor, multi-core graphics processing system, electronic device, and apparatus
US20130286034A1 (en) Compressing graphics data rendered on a primary computer for transmission to a remote computer
CN112991143A (en) Method and device for assembling graphics primitives and computer storage medium
CN114037795A (en) Invisible pixel eliminating method and device and storage medium
US20140354671A1 (en) Graphics processing systems
US11250611B1 (en) Graphics processing
CN110928610B (en) Method, device and computer storage medium for verifying shader function
CN116263982B (en) Graphics processor, system, method, electronic device and apparatus
CN111383314A (en) Method and device for verifying shader function and computer storage medium
US20230377086A1 (en) Pipeline delay elimination with parallel two level primitive batch binning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 301, Building D, Yeda Science and Technology Park, No. 300 Changjiang Road, Yantai Area, China (Shandong) Pilot Free Trade Zone, Yantai City, Shandong Province

Patentee after: Xi'an Xintong Semiconductor Technology Co.,Ltd.

Address before: Room 21101, 11 / F, unit 2, building 1, Wangdu, No. 3, zhangbayi Road, Zhangba Street office, hi tech Zone, Xi'an City, Shaanxi Province

Patentee before: Xi'an Xintong Semiconductor Technology Co.,Ltd.

CP02 Change in the address of a patent holder

Address after: Room 301, Building D, Yeda Science and Technology Park, No. 300 Changjiang Road, Yantai Area, China (Shandong) Pilot Free Trade Zone, Yantai City, Shandong Province, 265503

Patentee after: Xi'an Xintong Semiconductor Technology Co.,Ltd.

Address before: Room 301, Building D, Yeda Science and Technology Park, No. 300 Changjiang Road, Yantai Area, China (Shandong) Pilot Free Trade Zone, Yantai City, Shandong Province

Patentee before: Xi'an Xintong Semiconductor Technology Co.,Ltd.