CN116012217A

CN116012217A - Graphics processor, method of operation, and machine-readable storage medium

Info

Publication number: CN116012217A
Application number: CN202310084761.9A
Authority: CN
Inventors: Name withheld on request
Original assignee: Shanghai Biren Intelligent Technology Co Ltd
Current assignee: Shanghai Biren Intelligent Technology Co Ltd
Priority date: 2023-01-18
Filing date: 2023-01-18
Publication date: 2023-04-25

Abstract

A graphics processor (GPU), a method of operation, and a machine-readable storage medium are provided. The GPU includes a command processor circuit and a geometry pipeline circuit. The command processor circuit sends a current primitive block to the geometry pipeline circuit for geometry processing. The geometry pipeline circuit decides whether to enable a coarse-granularity depth test or a fine-granularity depth test based on a hardware descriptor. When the geometry pipeline circuit enables the fine-granularity depth test, it performs fine-granularity depth test culling on the plurality of primitives of the current primitive block sent by the command processor circuit, stores the fine-granularity depth test results in a fine-granularity depth buffer, and discards the drawing of the current primitive block.

Description

Graphics processor, method of operation, and machine-readable storage medium

Technical Field

The present invention relates to electronic devices, and more particularly, to a graphics processor, a method of operation thereof, and a machine-readable storage medium.

Background

A graphics processor (graphics processing unit, GPU), also known as a display core, a visual processor, a display chip, or a graphics chip, is a microprocessor that performs drawing operations (e.g., rendering) on personal computers, workstations, game consoles, and some mobile devices (e.g., tablets and smartphones). Shadow rendering is an important technique for enhancing the realism of rendered scenes.

One typical shadow rendering method is shadow mapping, which is split into two passes (also referred to herein as actions). The first pass renders the scene once from the viewpoint of the light source and writes the depth information of the scene objects into a depth buffer. A rendered scene typically contains a large number of light sources, so the first pass must generate one depth buffer for each light source. More complex algorithms generate depth buffers of different resolutions for each light source to improve the memory efficiency of shading in the second pass. The second pass renders the whole scene again from the viewpoint of the actual scene camera and then determines, according to the depth information generated by the first pass, whether the part of the object corresponding to each pixel is in shadow.
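The two-pass comparison described above can be illustrated with a minimal Python sketch. The function names, the tiny 2x2 depth buffer, the "smaller depth is nearer" convention, and the bias value are illustrative assumptions and are not taken from the disclosure.

```python
def build_depth_buffer(width, height, fragments):
    """First pass: keep the nearest depth seen from the light for each texel.

    `fragments` is an iterable of (x, y, depth) tuples in light space
    (hypothetical input format; smaller depth means nearer to the light).
    """
    far = float("inf")
    depth = [[far] * width for _ in range(height)]
    for x, y, z in fragments:
        if z < depth[y][x]:
            depth[y][x] = z
    return depth


def in_shadow(depth, x, y, z, bias=1e-3):
    """Second pass: a point is shadowed if something nearer to the light covers it.

    The bias is an assumed tolerance that avoids self-shadowing artifacts.
    """
    return z > depth[y][x] + bias


# Two surfaces overlap texel (0, 0); only the nearer one (0.30) is lit.
light_fragments = [(0, 0, 0.30), (0, 0, 0.70), (1, 0, 0.50)]
shadow_map = build_depth_buffer(2, 2, light_fragments)
```

With this data, a point at depth 0.70 under texel (0, 0) is reported as shadowed, while the point at depth 0.30 is not.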

According to the standard application programming interface (API) of a GPU, a rendering pass needs to include at least a vertex shading stage, a rasterization stage, and a pixel shading stage. The mainstream architectures are immediate mode rendering (IMR) and tile-based deferred rendering (TBDR), which map to the hardware architecture of the GPU. Both architectures are fundamental concepts in the industry and are not discussed in detail herein.

The TBDR architecture is typically split into two hardware pipelines: a binning pipeline (merge pipeline) and a rendering pipeline. The binning pipeline mainly completes vertex shading and some preliminary primitive culling techniques, such as back-face culling and coarse/fine depth culling. The surviving primitives are stored in memory at tile granularity as input to the rendering pipeline. A tile is the minimum unit obtained by cutting the rendered screen at a certain granularity. For example, when a 512x512 screen is split at 64x64 granularity, each 64x64 region is called a tile. When rendered geometry falls into a tile, its post-vertex information is stored into the memory of that tile. The rendering pipeline reads the vertex shading results from memory tile by tile and performs rasterization and pixel shading at tile granularity. Working at tile granularity can greatly reduce the number of depth buffer reads and writes to external double data rate (DDR) memory, which is very effective in reducing external bandwidth requirements and power consumption. Despite these advantages, the TBDR architecture increases the rendering time of a single valid primitive because the primitive must go through two pipelines.

The first pass in shadow mapping is typically a depth-only pass. That is, in most scenes the corresponding pixel shader is a pass-through (transparent) shader, i.e., the pixel shading portion performs no operation in the shadow mapping operation. Nevertheless, in the TBDR architecture a depth-only pass still needs to go through the binning pipeline and the rendering pipeline in their entirety, which is a significant drawback relative to the IMR architecture. Depth-only passes account for a very heavy workload in scene rendering. Although the pixel shader of a depth-only pass is transparent in many cases, in mainstream tile-based GPU architectures a depth-only pass currently always needs to go through the binning pipeline to execute the vertex shader and then through the rendering pipeline to perform triangle setup, rasterization, depth testing (Z-testing), and the pixel shader. The time of the whole path of the depth-only pass is therefore greatly prolonged, and the performance of the rendering algorithm is greatly affected.

Disclosure of Invention

The present invention provides a graphics processor (graphics processing unit, GPU), a method of operation thereof, and a machine-readable storage medium, to reduce the time of depth-only passes.

In an embodiment according to the invention, the graphics processor includes a command processor (CP) circuit and a geometry pipeline (GP) circuit. The command processor circuit is configured to split a task chain (job chain) into rendering tasks for a plurality of primitive blocks (PB), wherein each primitive block includes a plurality of primitives. The geometry pipeline circuit is coupled to the command processor circuit. The command processor circuit sends one of the plurality of primitive blocks to the geometry pipeline circuit for geometry processing. The geometry pipeline circuit decides whether to enable a coarse-granularity depth test (coarse Z test) or a fine-granularity depth test (fine Z test) based on a hardware descriptor. When the geometry pipeline circuit enables the fine-granularity depth test, the geometry pipeline circuit performs fine-granularity depth test culling on the plurality of primitives of the current primitive block that the command processor circuit sent to the geometry pipeline circuit, stores the fine-granularity depth test results in a fine-granularity depth buffer, and discards the drawing of the current primitive block.

In an embodiment according to the invention, the method of operation comprises: splitting a task chain into rendering tasks for a plurality of primitive blocks, wherein each primitive block comprises a plurality of primitives; sending one of the plurality of primitive blocks to a geometry pipeline circuit of the graphics processor for geometry processing; determining, by the geometry pipeline circuit, whether to enable a coarse-granularity depth test or a fine-granularity depth test based on a hardware descriptor; and when the geometry pipeline circuit enables the fine-granularity depth test, performing, by the geometry pipeline circuit, fine-granularity depth test culling on the plurality of primitives of the current primitive block sent to the geometry pipeline circuit, storing the fine-granularity depth test results in a fine-granularity depth buffer, and discarding the drawing of the current primitive block.

In an embodiment according to the invention, the machine-readable storage medium is for storing non-transitory machine-readable instructions. The method of operation of the graphics processor may be implemented when the non-transitory machine readable instructions are executed by a computer.

Based on the above, after receiving a task, the geometry pipeline circuit completes the shading task operations as usual. The geometry pipeline circuit may dynamically decide to enable the coarse-granularity depth test or the fine-granularity depth test based on the hardware descriptor. If the geometry pipeline circuit enables the fine-granularity depth test, it performs fine-granularity depth test culling on the primitives, stores/updates the fine-granularity depth test results in the fine-granularity depth buffer, then discards the drawing (does not send pixel tasks to the stream processor cluster circuit), and returns a task done signal telling the command processor circuit that the task has been completed. A depth-only pass therefore does not need to go through the stream processor cluster circuit, which saves the processing time of pixel tasks. Thus, the graphics processor may reduce the time of depth-only passes.

Drawings

FIG. 1 is a schematic diagram of a circuit block (circuit block) of a Graphics Processor (GPU) in accordance with an embodiment of the present invention.

FIG. 2 is a schematic diagram of a rendered screen divided into tiles according to one embodiment of the present invention.

FIG. 3 is a schematic block diagram of a geometry pipeline circuit and a stream processor cluster circuit in a GPU according to an embodiment of the present invention.

FIG. 4 is a flow chart of a method of operating a GPU according to an embodiment of the present invention.

Description of the reference numerals

100: graphics processor (GPU)

2DP0: two-dimensional pipeline

CDB0: coarse-granularity depth buffer

CMP0: computation pipeline

CP1: command processor circuit

CU0: computation unit

EZ0: enhanced depth test module

FDB0: fine-granularity depth buffer

GP1_0, GP1_1, GP1_N: geometry pipeline circuit

GPBE0: geometry pipeline back end

GPFE0: geometry pipeline front end

MEM1: memory

PP0: pixel pipeline

RTC0: ray tracing core

SPC1_0, SPC1_4, SPC1_N×4: stream processor cluster circuit

S410, S420, S430, S441, S442, S443, S451, S452, S453, S454: step

T0, T1, T2, T3: primitive

TC0: tensor core

TIL0: tiling module

TILE0, TILE1, TILE2, TILE3, TILE4, TILE5, TILE6, TILE7, TILE8, TILE9, TILE10, TILE11, TILE12, TILE13, TILE14, TILE15: tile

VPT0: viewport transformation module

VTG0: shading module

Detailed Description

Reference will now be made in detail to the exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings and the description to refer to the same or like parts.

The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection. For example, if a first device is coupled (or connected) to a second device, the connection may be a direct connection or an indirect connection via other devices and connections. Terms such as "first" and "second" in the description (including the claims) are used for naming components, and are not used for limiting the upper or lower bound of the number of components or the order of the components. In addition, wherever possible, the same reference numbers will be used throughout the drawings and the description to refer to the same or like parts. Components/elements/steps that use the same reference numerals or the same terminology in different embodiments may refer to one another's descriptions.

FIG. 1 is a schematic diagram of circuit blocks of a graphics processor (graphics processing unit, GPU) 100 in accordance with an embodiment of the present invention. The GPU 100 shown in FIG. 1 includes a command processor (CP) circuit CP1, a plurality of geometry pipeline circuits (e.g., GP1_0, GP1_1, and GP1_N shown in FIG. 1), a plurality of stream processor cluster (streaming processor cluster, SPC) circuits (e.g., SPC1_0, SPC1_4, and SPC1_N×4 shown in FIG. 1), and a memory MEM1. The memory MEM1 is a broad concept and may include various levels of memory, such as on-chip cache, high bandwidth memory (HBM), double data rate (DDR) memory, and the like. The number of geometry pipeline circuits and the number of stream processor cluster circuits may be determined according to the actual design. The command processor circuit, geometry pipeline circuits, stream processor cluster circuits, memory, etc. may be connected by any interconnection means to communicate information. For example, one implementation of the interconnection between subsystems is a network on chip (NOC). In some embodiments, the command processor circuit, geometry pipeline circuit, and/or stream processor cluster circuit may be implemented as hardware circuits, depending on the design requirements. In other embodiments, the command processor circuit, geometry pipeline circuit, and/or stream processor cluster circuit may be implemented in firmware, software (i.e., a program), or a combination of the two. In still other embodiments, the command processor circuit, geometry pipeline circuit, and/or stream processor cluster circuit may be implemented as a combination of several of hardware, firmware, and software.

In hardware, the command processor circuit, geometry pipeline circuit, and/or stream processor cluster circuit may be implemented as logic circuits on an integrated circuit. For example, the functions associated with the command processor circuit, geometry pipeline circuit, and/or stream processor cluster circuit may be implemented in various logic blocks, modules, and circuits in one or more controllers, microcontrollers, microprocessors, application-specific integrated circuits (ASICs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), central processing units (CPUs), and/or other processing units. The relevant functions of the command processor circuit, geometry pipeline circuit, and/or stream processor cluster circuit may be implemented as hardware circuits, such as various logic blocks, modules, and circuits in an integrated circuit, using a hardware description language (such as Verilog HDL or VHDL) or another suitable programming language.

The functions associated with the command processor circuit, geometry pipeline circuit, and/or stream processor cluster circuit described above may be implemented as programming code in software and/or firmware. For example, the command processor circuit, geometry pipeline circuit, and/or stream processor cluster circuit may be implemented using a general programming language (e.g., C, C++, or assembly language) or another suitable programming language. The programming code may be recorded/stored on a non-transitory machine-readable storage medium. In some embodiments, the machine-readable storage medium includes, for example, semiconductor memory and/or storage devices. The semiconductor memory includes a memory card, a read only memory (ROM), a flash memory, a programmable logic circuit, or other semiconductor memory. The storage device includes a tape, a disk, a hard disk drive (HDD), a solid-state drive (SSD), or other storage devices. An electronic device (e.g., a CPU, controller, microcontroller, or microprocessor) may read and execute the programming code from the machine-readable storage medium to implement the relevant functions of the command processor circuit, geometry pipeline circuit, and/or stream processor cluster circuit. Alternatively, the programming code may be provided to the electronic device via any transmission medium, such as a communication network or broadcast waves. The communication network is, for example, the Internet, a wired communication network, a wireless communication network, or another communication medium.

The main function of the command processor circuit CP1 is to issue task commands to other subsystems in the GPU 100 for execution. Depending on the actual design, the command processor circuit CP1 may also provide global synchronization, task scheduling, and/or other functions. The geometry pipeline circuits GP1_0 to GP1_N are coupled to the command processor circuit CP1. The command processor circuit CP1 may split a task chain (job chain) into rendering tasks for a plurality of primitive blocks (PB), each primitive block comprising a plurality of primitives. The command processor circuit CP1 sends the primitive blocks to the geometry pipeline circuits GP1_0 to GP1_N for geometry processing in the order in which they appear in the task chain. The number of primitives per primitive block may be determined according to the actual design. In some embodiments, each primitive block includes a draw call. The command processor circuit CP1 sends tasks in parallel to the geometry pipeline circuits GP1_0 to GP1_N at draw call granularity. For example, a first draw call is sent to geometry pipeline circuit GP1_0, a second draw call is sent to geometry pipeline circuit GP1_1, and so on, until the (N+1)th draw call is sent to geometry pipeline circuit GP1_N. In other embodiments, each primitive block has the same number of primitives. The command processor circuit CP1 uses the number of primitives as the splitting granularity to send tasks in parallel to the geometry pipeline circuits GP1_0 to GP1_N. For example, a task chain or a draw call is split into primitive blocks at a granularity of 512 primitives (or another number of primitives, as determined by the actual design). The first primitive block is sent to geometry pipeline circuit GP1_0, the second primitive block is sent to geometry pipeline circuit GP1_1, and so on, until the (N+1)th primitive block is sent to geometry pipeline circuit GP1_N.
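As a rough illustration of the fixed-size splitting described above, the following Python sketch cuts a list of primitives into blocks of 512 and assigns the blocks to geometry pipelines in order. The function names are hypothetical, and the wrap-around to the first pipeline after the last one is an assumption; the description only states the assignment of the first N+1 blocks.

```python
def split_into_primitive_blocks(primitives, block_size=512):
    """Cut a task chain or draw call into primitive blocks of `block_size`.

    The last block may hold fewer primitives.
    """
    return [primitives[i:i + block_size]
            for i in range(0, len(primitives), block_size)]


def dispatch_round_robin(blocks, num_pipelines):
    """Assign block k to pipeline k modulo `num_pipelines`.

    Returns, per pipeline, the list of block indices it receives.
    The modulo wrap-around is an assumed policy.
    """
    queues = [[] for _ in range(num_pipelines)]
    for index, _block in enumerate(blocks):
        queues[index % num_pipelines].append(index)
    return queues


primitives = list(range(1300))            # stand-in for 1300 primitives
blocks = split_into_primitive_blocks(primitives)
queues = dispatch_round_robin(blocks, num_pipelines=2)
```

Here the 1300 primitives yield three blocks of 512, 512, and 276 primitives; with two pipelines, the first receives blocks 0 and 2 and the second receives block 1.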

The geometry pipeline circuits GP1_0 through GP1_N are responsible for geometry-related task processing, such as task construction (task constructors) for vertex shading tasks, geometry shading tasks, tessellation shading tasks, and so on. The geometry pipeline circuits GP1_0 through GP1_N are also responsible for delivering geometry processing results to pixel pipelines (not shown) in the stream processor cluster circuits. Each stream processor cluster circuit mainly includes an arithmetic logic unit (ALU), a special function unit, a load store unit, a tensor core, and other circuits (not shown). A draw call typically includes a geometry processing task and a pixel processing task. The command processor circuit CP1 sends the geometry processing tasks to the geometry pipeline circuits GP1_0 to GP1_N for execution, and the geometry pipeline circuits GP1_0 to GP1_N then generate, tile by tile, pixel processing tasks for the stream processor cluster circuits (e.g., SPC1_0, SPC1_4, and SPC1_N×4 shown in FIG. 1). The stream processor cluster circuits perform pixel processing at a granularity of one or more tiles.

FIG. 2 is a schematic diagram of a rendered screen divided into tiles according to one embodiment of the present invention. The granularity of each tile may be 32x32 pixels, 64x64 pixels, or another number of pixels based on the actual design. In general, the tile size may be set according to the architecture, and larger or smaller tiles are possible implementations. In the embodiment shown in FIG. 2, a screen is split into 4x4 tiles TILE0, TILE1, TILE2, TILE3, TILE4, TILE5, TILE6, TILE7, TILE8, TILE9, TILE10, TILE11, TILE12, TILE13, TILE14, and TILE15. Primitives are eventually rendered onto the screen. A primitive may fall within one tile or span multiple tiles. FIG. 2 shows primitives T0, T1, T2, and T3, where primitive T0 covers tiles TILE2, TILE3, TILE8, and TILE9, primitive T1 covers tiles TILE3 and TILE9, primitive T2 covers tile TILE0, and primitive T3 covers tile TILE1.
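One common way to find which tiles a primitive may touch is a bounding-box test, sketched below in Python. This is an illustrative assumption rather than the method of the embodiment: the 64x64 tile size, the row-major tile numbering, and the function name are all hypothetical, and the actual tile layout of FIG. 2 is not specified in the text. A bounding box is conservative, so it may list a tile that the triangle itself does not overlap.

```python
import math


def tiles_covered(vertices, tile_size=64, tiles_per_row=4, tiles_per_col=4):
    """Return row-major indices of the tiles overlapped by the bounding box.

    `vertices` is a list of (x, y) screen-space positions of one primitive.
    Primitives entirely outside the screen yield an empty list.
    """
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    first_col = max(0, math.floor(min(xs)) // tile_size)
    last_col = min(tiles_per_row - 1, math.floor(max(xs)) // tile_size)
    first_row = max(0, math.floor(min(ys)) // tile_size)
    last_row = min(tiles_per_col - 1, math.floor(max(ys)) // tile_size)
    return [row * tiles_per_row + col
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


inside_one_tile = tiles_covered([(10, 10), (50, 20), (30, 60)])
spanning_tiles = tiles_covered([(100, 40), (180, 40), (140, 100)])
```

The first triangle stays inside a single tile, while the second one spans two columns and two rows of the grid.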

FIG. 3 is a schematic block diagram of a geometry pipeline circuit (e.g., GP1_0) and a stream processor cluster circuit (e.g., SPC1_0) in GPU 100 according to an embodiment of the present invention. The geometry pipeline circuit GP1_0 and the stream processor cluster circuit SPC1_0 shown in FIG. 3 can serve as one of many embodiments of the geometry pipeline circuit GP1_0 and the stream processor cluster circuit SPC1_0 shown in FIG. 1. For the other geometry pipeline circuits GP1_1 to GP1_N shown in FIG. 1, refer by analogy to the description of the geometry pipeline circuit GP1_0, and for the other stream processor cluster circuits shown in FIG. 1, refer by analogy to the description of the stream processor cluster circuit SPC1_0; these descriptions are not repeated. The command processor circuit CP1 sends a draw call or a primitive block (PB) to the geometry pipeline circuit GP1_0 for geometry processing. The geometry pipeline circuit GP1_0 includes a geometry pipeline front end (GP front end) GPFE0 and a geometry pipeline back end (GP back end) GPBE0. The split into a front end and a back end is a functional division made for conceptual clarity; the two parts may be named differently in practice. Geometry processing by the geometry pipeline front end GPFE0 includes building shading tasks and sending the shading tasks to the stream processor cluster circuits (e.g., SPC1_0 or other stream processor cluster circuits) of the GPU 100 for execution. Based on the actual design, the shading tasks include vertex shading tasks, geometry shading tasks, and tessellation shading tasks. The geometry pipeline front end GPFE0 retrieves post-vertex information for further viewport transformation.

Post-vertex information refers to vertex information that has been processed by a shader (e.g., a vertex shader or other shader) of the stream processor cluster circuit, and typically includes position information and other vertex attributes (e.g., normal vectors, texture coordinates, etc.). Based on the actual design, the viewport transformation includes one or more primitive culling operations, such as back-face culling, small primitive culling (small triangle culling), and so forth. Primitives that meet any of the culling conditions are not issued to the geometry pipeline back end.
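The culling conditions mentioned above can be sketched with a signed-area test in screen space. The Python example below is a minimal sketch under stated assumptions: counter-clockwise triangles are treated as front-facing, and a fixed area threshold stands in for small primitive culling (real hardware typically checks sample coverage instead). The function names and the threshold value are hypothetical.

```python
def signed_area(v0, v1, v2):
    """Signed screen-space area of a triangle.

    Positive when the vertices run counter-clockwise in a y-up coordinate system.
    """
    return 0.5 * ((v1[0] - v0[0]) * (v2[1] - v0[1])
                  - (v2[0] - v0[0]) * (v1[1] - v0[1]))


def survives_culling(triangle, min_area=0.5, front_is_ccw=True):
    """Return True if the triangle should be issued to the back end."""
    area = signed_area(*triangle)
    if not front_is_ccw:
        area = -area
    if area <= 0.0:
        return False  # back-facing or degenerate: back-face culling
    if area < min_area:
        return False  # assumed stand-in for small primitive culling
    return True


front_facing = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
back_facing = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)]
tiny = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)]
```

Only the first triangle survives; the second is rejected by its winding order and the third by its area.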

As an example, the geometry pipeline front end GPFE0 includes a shading module VTG0 and a viewport transformation module VPT0. For the vertices corresponding to the primitives in the received draw call (or primitive block), the shading module VTG0 constructs the vertex shader, tessellation shader, and geometry shader tasks and sends them to the stream processor cluster circuits (e.g., SPC1_0 or other stream processor cluster circuits) for execution. The viewport transformation module VPT0 sequentially reads back the post-vertex information corresponding to the draw call (or primitive block) from the memory MEM1 in order to perform further standard graphics operations such as culling of occluded faces and perspective transformation.

As an example, the geometry pipeline back end GPBE0 includes an enhanced depth test (enhanced Z test) module EZ0 and a tiling module TIL0. In the geometry pipeline back end GPBE0, the enhanced depth test module EZ0 may decide to enable the coarse-granularity depth test (coarse Z test) or the fine-granularity depth test (fine Z test) based on the hardware descriptor. For example, when the hardware descriptor indicates that the task for the current primitive includes a depth-only pass (or depth-only task), the enhanced depth test module EZ0 enables the fine-granularity depth test; otherwise, the enhanced depth test module EZ0 enables the coarse-granularity depth test. As an example, the fine-granularity depth test is a depth test in units of pixels. GPU 100 may maintain one (or more) depth values for each pixel in the fine-granularity depth buffer FDB0. In contrast to the fine-granularity depth test, the coarse-granularity depth test is a depth test in units of pixel blocks. For example, but not limited to, one pixel block includes 4x4 pixels. GPU 100 maintains a maximum depth value in the coarse-granularity depth buffer CDB0 for each pixel block.
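The relationship between the two buffers can be illustrated with a small Python sketch: the coarse buffer keeps one maximum depth per 4x4 pixel block, while the fine buffer keeps one depth per pixel. The function names, the "smaller depth is nearer" comparison, and the conservative coarse test against a primitive's nearest depth are assumptions made for illustration, not details confirmed by the description.

```python
BLOCK = 4  # one pixel block is 4x4 pixels in this example


def build_coarse_buffer(fine, block=BLOCK):
    """Derive one maximum depth value per pixel block from a per-pixel buffer."""
    rows, cols = len(fine), len(fine[0])
    return [[max(fine[y][x]
                 for y in range(by, min(by + block, rows))
                 for x in range(bx, min(bx + block, cols)))
             for bx in range(0, cols, block)]
            for by in range(0, rows, block)]


def coarse_test(coarse, bx, by, prim_min_depth):
    """Conservative block test: fail only if the primitive's nearest depth
    lies behind every pixel already stored in the block."""
    return prim_min_depth <= coarse[by][bx]


def fine_test(fine, x, y, depth):
    """Per-pixel test: pass and update the buffer if the fragment is nearer."""
    if depth < fine[y][x]:
        fine[y][x] = depth
        return True
    return False


# 4 rows x 8 columns: the left block is mostly at 0.4, the right block is empty (1.0).
fine = [[0.4] * 4 + [1.0] * 4 for _ in range(4)]
fine[3][3] = 0.6
coarse = build_coarse_buffer(fine)
```

A primitive whose nearest depth is 0.7 fails the coarse test in the left block without any per-pixel work, whereas one at 0.5 passes the coarse test and is then resolved per pixel by the fine test.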

When the enhanced depth test module EZ0 decides to enable the fine-granularity depth test, the enhanced depth test module EZ0 performs fine-granularity depth test culling on the current primitives (the results of the geometry processing of the geometry pipeline front end GPFE0), stores the fine-granularity depth test results in the fine-granularity depth buffer FDB0, and decides to discard the drawing of the current primitive block. When the geometry pipeline circuit GP1_0 decides to discard the drawing of the current primitive block, the geometry pipeline circuit GP1_0 does not send a pixel processing task to any stream processor cluster circuit, and the geometry pipeline circuit GP1_0 returns a task complete (job done) signal to the command processor circuit CP1 to indicate that the current primitive block has been completed. After the command processor circuit CP1 receives the task complete signal returned by the geometry pipeline circuit GP1_0, the command processor circuit CP1 issues the tasks that depend on the fine-granularity depth buffer FDB0.

When the enhanced depth test module EZ0 decides to enable the coarse-granularity depth test, the enhanced depth test module EZ0 performs coarse-granularity depth test culling on the primitives (the results of the geometry processing of the geometry pipeline front end GPFE0) and stores the coarse-granularity depth test results in the coarse-granularity depth buffer CDB0. When the enhanced depth test module EZ0 enables the coarse-granularity depth test, the pixel pipeline PP0 in a stream processor cluster circuit (e.g., SPC1_0 or other stream processor cluster circuits) of the GPU 100 takes the coarse-granularity depth test results from the coarse-granularity depth buffer CDB0 to perform fine-granularity depth test culling, and stores the fine-granularity depth test results in the fine-granularity depth buffer FDB0. The tiling module TIL0 places the primitives that pass the coarse-granularity depth test on the corresponding tile lists to facilitate the subsequent pixel processing by the stream processor cluster circuits. As an example, FIG. 3 shows the tile lists of tiles TILE0, TILE1, …, TILE15.

The stream processor cluster circuit SPC1_0 includes a plurality of pixel pipelines PP0 for processing the geometry processing results of the geometry pipeline circuits (e.g., GP1_0 or other geometry pipeline circuits). For example, the pixel pipeline PP0 may further perform triangle setup, rasterization, the fine Z test, and pixel shader task construction on the primitives of a tile. The pixel pipeline PP0 performs the fine Z test to output pixel depth values to the fine-granularity depth buffer FDB0. The fine-granularity depth buffer FDB0 is read by the subsequent stages of the shadow mapping algorithm. The stream processor cluster circuit SPC1_0 may also contain a computation pipeline (compute pipeline) CMP0 for processing computation tasks. The stream processor cluster circuit SPC1_0 may also contain a two-dimensional pipeline (2D pipeline) 2DP0 for handling operations such as 2D blit functions, memory copy, etc. The stream processor cluster circuit SPC1_0 includes a plurality of computation units CU0. The computation unit CU0 includes circuits (not shown) such as a numeric computation unit, an arithmetic logic unit (ALU), a special function unit, and a load store unit, and is a main component of the GPU 100. Each pixel task, two-dimensional task, and/or computation task may be issued to the computation unit CU0 for execution. The stream processor cluster circuit SPC1_0 may further include a ray tracing core RTC0 for accelerating ray tracing task processing. The stream processor cluster circuit SPC1_0 may also contain a tensor core TC0 for accelerating matrix and convolution operations.

FIG. 4 is a flow chart of a method of operation of a Graphics Processor (GPU) in accordance with an embodiment of the present invention. In some embodiments, the method of operation illustrated in FIG. 4 may be implemented in firmware or software (i.e., a program). For example, the operations associated with the method of operation illustrated in FIG. 4 may be implemented as non-transitory machine-readable instructions (programming code or program) that may be stored on a machine-readable storage medium. The method of operation illustrated in fig. 4 may be implemented when non-transitory machine readable instructions are executed by a computer. In other embodiments, the method of operation illustrated in FIG. 4 may be implemented in hardware, such as GPU 100 illustrated in FIG. 1.

Please refer to FIG. 3 and FIG. 4. In step S410, the command processor circuit CP1 splits a task chain (job chain) into rendering tasks for a plurality of primitive blocks (PB), each of which includes a plurality of primitives. In step S420, the command processor circuit CP1 sends one of the plurality of primitive blocks (hereinafter referred to as the current primitive block) to the geometry pipeline circuit GP1_0 for geometry processing. In step S430, the geometry pipeline circuit GP1_0 decides whether to enable the coarse-granularity depth test (coarse Z test) or the fine-granularity depth test (fine Z test) according to the hardware descriptor. For example, when the hardware descriptor indicates that the task for the current primitive includes a depth-only pass (or depth-only task), the geometry pipeline circuit GP1_0 enables the fine-granularity depth test; otherwise, the geometry pipeline circuit GP1_0 enables the coarse-granularity depth test.

For a depth-only pass (or depth-only task), the corresponding pixel shader is a pass-through shader, i.e., the pixel shader performs no substantial operation during the shadow mapping operation. Thus, when a compiler compiles the pixel shader, the compiler can identify a pass-through pixel shader. In a typical depth pass, the compiler can recognize the pass-through characteristic from the way the pixel shader is written. For example, if the main function of the pixel shader is an empty function, the pixel shader does nothing and is therefore a pass-through shader. The compiler then passes this information to the driver, and the driver notifies the GPU 100 via the hardware descriptor to enable the fine-granularity depth test (fine Z test) function of the geometry pipeline circuit GP1_0. Alternatively, the driver may recognize that the corresponding rendering pass does not bind a pixel shader, in which case the driver likewise notifies the GPU 100 via the hardware descriptor to enable the fine-granularity depth test function of the geometry pipeline circuit GP1_0. After the geometry pipeline circuit GP1_0 receives a draw call task, it completes operations such as the vertex shading task, the geometry shading task, and the tessellation shading task as usual, and the enhanced depth test module EZ0 of the geometry pipeline back end GPBE0 decides, according to the hardware descriptor, whether to enable the coarse-granularity depth test (coarse Z test) or the fine-granularity depth test (fine Z test).
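The two detection routes described above (an empty pixel shader main function found by the compiler, or no pixel shader bound to the rendering pass as seen by the driver) can be sketched as follows. This is a hedged illustration, not the actual compiler or driver interface: `PixelShader`, `is_passthrough`, `build_descriptor`, and the descriptor key are hypothetical names, and a shader is modeled simply as the list of statements in its main function.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PixelShader:
    # Hypothetical representation: the statements of the shader's main function.
    main_body: List[str] = field(default_factory=list)


def is_passthrough(shader: PixelShader) -> bool:
    """Compiler-side check: an empty main function means a pass-through shader."""
    return len(shader.main_body) == 0


def build_descriptor(bound_shader: Optional[PixelShader]) -> Dict[str, bool]:
    """Driver-side decision recorded in a (hypothetical) hardware descriptor.

    The fine Z test in the geometry pipeline is requested when no pixel
    shader is bound, or when the bound shader is pass-through.
    """
    depth_only = bound_shader is None or is_passthrough(bound_shader)
    return {"enable_fine_z_in_geometry_pipeline": depth_only}


shadow_pass = build_descriptor(PixelShader(main_body=[]))
unbound_pass = build_descriptor(None)
color_pass = build_descriptor(PixelShader(main_body=["out_color = shade();"]))
```

Only the shadow pass and the pass with no bound shader request the fine Z test here; a pass whose pixel shader does real work keeps the coarse Z test path.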

When the enhanced depth test module EZ0 of the geometry pipeline circuit GP1_0 enables the fine-granularity depth test according to the hardware descriptor, the geometry pipeline circuit GP1_0 performs fine-granularity depth test culling on the plurality of primitives of the current primitive block (step S441), stores the fine-granularity depth test result in the fine-granularity depth buffer FDB0 (step S442), and discards the drawing of the current primitive block (step S443). When the geometry pipeline circuit GP1_0 discards the drawing of the current primitive block, it does not send pixel processing tasks to any stream processor cluster circuit, and it returns a task completion (job done) signal to the command processor circuit CP1 indicating that the task for the current primitive block has been completed. After the command processor circuit CP1 receives the task completion signal returned by the geometry pipeline circuit GP1_0, the command processor circuit CP1 issues the tasks that depend on the fine-granularity depth buffer FDB0 to the corresponding subsystems in the GPU 100 for execution. Because the geometry pipeline circuit GP1_0 can complete the depth-only pass by itself, no unnecessary workload is executed in the stream processor cluster circuits, and the execution time of the depth-only pass is greatly shortened. Compared with steps S451 to S454, steps S441 to S443 move the fine-granularity depth test forward from the pixel pipeline of the stream processor cluster circuit (e.g., the pixel pipeline PP0 of the stream processor cluster circuit SPC1_0) to the geometry processing stage.
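The fine-granularity path (steps S441 to S443) can be modeled in a few lines. This is a minimal sketch under assumptions that the description does not state: primitives are represented as lists of already-rasterized `(x, y, depth)` samples, the depth compare function is "less than", and the dictionary `fdb0` stands in for the fine-granularity depth buffer FDB0. The function name and return format are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Sample = Tuple[int, int, float]  # (x, y, depth)


def run_depth_only_block(primitive_block: List[List[Sample]],
                         fdb: Dict[Tuple[int, int], float]) -> dict:
    """Model of steps S441 to S443 for one primitive block."""
    pixel_tasks: list = []  # stays empty: drawing is discarded (S443)
    for primitive in primitive_block:
        for x, y, depth in primitive:
            # S441: fine Z test; "less than" is an assumed compare function.
            if depth < fdb.get((x, y), math.inf):
                # S442: store the result in the fine-granularity depth buffer.
                fdb[(x, y)] = depth
    # No pixel task is sent to a stream processor cluster; the job done
    # signal goes back to the command processor.
    return {"job_done": True, "pixel_tasks": pixel_tasks}


fdb0: Dict[Tuple[int, int], float] = {}
block = [
    [(0, 0, 0.8), (1, 0, 0.5)],  # first primitive
    [(0, 0, 0.3)],               # second primitive, nearer at pixel (0, 0)
]
result = run_depth_only_block(block, fdb0)
```

After the call, `fdb0` holds the nearest depth per pixel and `result["pixel_tasks"]` is empty, which corresponds to the command processor being free to issue tasks that depend on FDB0 as soon as it sees the job done signal.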

When the enhanced depth test module EZ0 of the geometry pipeline circuit GP1_0 enables the coarse-granularity depth test according to the hardware descriptor, the geometry pipeline circuit GP1_0 performs coarse-granularity depth test culling on the result of the geometry processing (step S451) and stores the coarse-granularity depth test result in the coarse-granularity depth buffer CDB0 (step S452), so that the pixel pipeline PP0 in a stream processor cluster circuit (e.g., SPC1_0 or another stream processor cluster circuit) can subsequently perform fine-granularity depth test culling. When the geometry pipeline circuit GP1_0 enables the coarse-granularity depth test, the pixel pipeline PP0 fetches the coarse-granularity depth test result from the coarse-granularity depth buffer CDB0 to perform fine-granularity depth test culling (step S453) and stores the fine-granularity depth test result in the fine-granularity depth buffer FDB0 (step S454).
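For contrast, the coarse-granularity path (steps S451 to S454) can be sketched as a two-stage model. The description does not say what the coarse-granularity depth buffer stores or how the coarse test is computed, so the following assumes a common tile-based scheme: `cdb` keeps a conservative farthest visible depth per tile, a primitive is culled when its nearest depth lies behind that bound, and the bound is tightened only by primitives that cover the whole tile. All names and the primitive layout are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def coarse_z_stage(primitives: List[dict], cdb: Dict[int, float]) -> List[dict]:
    """S451 and S452 in the geometry pipeline: tile-level culling.

    Each primitive is a dict with keys: tile, zmin, zmax, covers_tile, samples.
    cdb maps a tile index to a conservative farthest visible depth.
    """
    survivors = []
    for prim in primitives:
        bound = cdb.get(prim["tile"], math.inf)
        if prim["zmin"] > bound:
            continue  # entirely behind what the tile already holds: culled
        if prim["covers_tile"]:
            # A fully covering primitive tightens the tile's far bound.
            cdb[prim["tile"]] = min(bound, prim["zmax"])
        survivors.append(prim)
    return survivors


def fine_z_stage(survivors: List[dict], fdb: Dict[Tuple[int, int], float]) -> None:
    """S453 and S454 in the pixel pipeline: per-sample test on the survivors."""
    for prim in survivors:
        for x, y, depth in prim["samples"]:
            if depth < fdb.get((x, y), math.inf):  # assumed "less than" compare
                fdb[(x, y)] = depth


cdb0: Dict[int, float] = {}
fdb0: Dict[Tuple[int, int], float] = {}
prims = [
    {"tile": 0, "zmin": 0.2, "zmax": 0.4, "covers_tile": True,
     "samples": [(0, 0, 0.2), (1, 0, 0.4)]},
    {"tile": 0, "zmin": 0.6, "zmax": 0.9, "covers_tile": False,
     "samples": [(0, 0, 0.6)]},
]
survivors = coarse_z_stage(prims, cdb0)
fine_z_stage(survivors, fdb0)
```

The second primitive is rejected at the coarse stage and never reaches the pixel pipeline, while the surviving primitive still has to pass through the per-sample test. In the fine-granularity path that extra hop to the pixel pipeline is what the geometry pipeline avoids.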

In summary, after receiving a task, the geometry pipeline circuit GP1_0 completes the shading task operations as usual. The geometry pipeline circuit GP1_0 can dynamically decide whether to enable the coarse-granularity depth test or the fine-granularity depth test based on the hardware descriptor. If the geometry pipeline circuit GP1_0 enables the fine-granularity depth test, it performs fine-granularity depth test culling on the primitives, stores or updates the fine-granularity depth test result in the fine-granularity depth buffer FDB0, then discards the drawing (no pixel task is sent to the stream processor cluster circuit), and returns a task completion signal telling the command processor circuit CP1 that the task for the current primitive block is complete. A depth-only pass therefore no longer needs to go through the stream processor cluster circuit, which saves the processing time of the pixel tasks. Thus, the GPU 100 can shorten the execution time of the depth-only pass.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A graphics processor, the graphics processor comprising:

a command processor circuit for splitting a task chain into rendering tasks for a plurality of primitive blocks, wherein each primitive block comprises a plurality of primitives; and

a geometry pipeline circuit coupled to the command processor circuit, wherein the command processor circuit sends one of the plurality of primitive blocks to the geometry pipeline circuit for geometry processing, the geometry pipeline circuit determining whether to enable coarse-grain depth testing or fine-grain depth testing based on hardware descriptors, and

when the geometry pipeline circuit enables the fine-grain depth test, the geometry pipeline circuit performs fine-grain depth test culling on the plurality of primitives of a current primitive block that the command processor circuit sends to the geometry pipeline circuit and stores fine-grain depth test results in a fine-grain depth buffer, and the geometry pipeline circuit discards drawing on the current primitive block.

2. The graphics processor of claim 1 wherein the geometry pipeline circuitry does not send pixel processing tasks to stream processor cluster circuitry of the graphics processor when the geometry pipeline circuitry discards drawing on the current primitive block, and wherein the geometry pipeline circuitry returns a task complete signal to the command processor circuitry indicating that the current primitive block has been completed.

3. The graphics processor of claim 1, wherein the geometry pipeline circuitry selects to enable the fine-granularity depth test when the hardware descriptor indicates that the task for the current primitive comprises a depth-only task, and otherwise the geometry pipeline circuitry selects to enable the coarse-granularity depth test.

4. The graphics processor of claim 1, wherein the geometric processing comprises:

constructing a shading task to send the shading task to a stream processor cluster circuit of the graphics processor for execution; and

retrieving vertex information processed by the stream processor cluster circuit to further perform viewport transformation.

5. The graphics processor of claim 4, wherein the shading tasks include a vertex shading task, a geometry shading task, and a tessellation shading task.

6. The graphics processor of claim 4, wherein the viewport transformation comprises a back-face-oriented culling or a small primitive culling.

7. The graphics processor of claim 1, wherein when the geometry pipeline circuitry enables the coarse-granularity depth test, the geometry pipeline circuitry performs coarse-granularity depth test culling on the processing results of the geometry processing and stores coarse-granularity depth test results in a coarse-granularity depth buffer to facilitate fine-granularity depth test culling of pixel pipelines in a stream processor cluster circuit of the graphics processor.

8. The graphics processor of claim 7, wherein when the geometry pipeline circuitry enables the coarse-grain depth test, the pixel pipeline fetches the coarse-grain depth test results from the coarse-grain depth buffer for the fine-grain depth test culling, and stores fine-grain depth test results in a fine-grain depth buffer.

9. The graphics processor of claim 1 wherein the command processor circuit issues a task that is dependent on the fine-grained depth buffer after the command processor circuit receives the task completion signal.

10. A method of operation of a graphics processor, the method of operation comprising:

splitting a task chain into rendering tasks for a plurality of primitive blocks, wherein each primitive block comprises a plurality of primitives;

transmitting one of the plurality of primitive blocks to a geometry pipeline circuit of the graphics processor for geometry processing;

determining, by the geometry pipeline circuitry, whether to enable coarse-grain depth testing or fine-grain depth testing based on a hardware descriptor; and

when the geometry pipeline circuit enables the fine-grained depth test, performing, by the geometry pipeline circuit, fine-grained depth test culling on the plurality of primitives of a current primitive block sent to the geometry pipeline circuit, storing fine-grained depth test results in a fine-grained depth buffer, and discarding drawing on the current primitive block.

11. The method of operation of claim 10, further comprising:

when the geometry pipeline circuit discards drawing the current primitive block, the geometry pipeline circuit does not send a pixel processing task to a stream processor cluster circuit of the graphics processor, and the geometry pipeline circuit returns a task complete signal to the command processor circuit indicating that the current primitive block has been completed.

12. The method of operation of claim 10, further comprising:

the geometry pipeline circuitry selects to enable the fine-granularity depth test when the hardware descriptor indicates that the task for the current primitive includes a depth-only task, otherwise the geometry pipeline circuitry selects to enable the coarse-granularity depth test.

13. The method of operation of claim 10, wherein the geometric processing comprises:

constructing a shading task to send the shading task to a stream processor cluster circuit of the graphics processor for execution; and

retrieving vertex information processed by the stream processor cluster circuit to further perform viewport transformation.

14. The method of operation of claim 13, wherein the shading tasks comprise a vertex shading task, a geometry shading task, and a tessellation shading task.

15. The method of operation of claim 13, wherein the viewport transformation comprises a back-face-oriented culling or a small primitive culling.

16. The method of operation of claim 10, further comprising:

when the geometry pipeline circuit enables the coarse-granularity depth test, the geometry pipeline circuit performs coarse-granularity depth test culling on the processing results of the geometry processing and stores coarse-granularity depth test results in a coarse-granularity depth buffer to facilitate fine-granularity depth test culling of pixel pipelines in a stream processor cluster circuit of the graphics processor.

17. The method of operation of claim 16, further comprising:

when the geometry pipeline circuitry enables the coarse-grain depth test, the pixel pipeline takes the coarse-grain depth test results from the coarse-grain depth buffer for the fine-grain depth test culling, and stores fine-grain depth test results in a fine-grain depth buffer.

18. The method of operation of claim 10, further comprising:

after the command processor circuit receives the task completion signal, the command processor circuit issues a task that is dependent on the fine-grained depth buffer.

19. A machine-readable storage medium storing non-transitory machine-readable instructions which, when executed by a computer, implement the method of operation of a graphics processor of any one of claims 10-18.