TWI425440B

TWI425440B - Hybrid multisample/supersample antialiasing

Info

Publication number: TWI425440B
Application number: TW098122471A
Authority: TW
Inventors: Cass W Everitt; Steven E Molnar
Original assignee: Nvidia Corp
Priority date: 2008-07-03
Filing date: 2009-07-02
Publication date: 2014-02-01
Also published as: KR20100004890A; JP4744624B2; TW201007610A; KR101009557B1; JP2010020764A

Description

Hybrid multisample/supersample antialiasing

Embodiments of the present invention relate generally to antialiasing techniques for graphics processing, and more specifically to dynamically adjusting the number of samples shaded for each pixel fragment.

Conventional graphics processors are configured to perform antialiasing by either multisampling or supersampling. In multisampling, each pixel fragment is shaded once, and the resulting color value is replicated to all covered sub-pixel samples. In supersampling, each pixel fragment is shaded N times, once for each covered sub-pixel sample.
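The difference in shading cost between the two conventional modes can be illustrated with a minimal sketch. The function name `shade_invocations` and the representation of coverage as a list of booleans are assumptions made for illustration; they do not come from the patent.

```python
def shade_invocations(coverage_mask, mode):
    """Return how many times the fragment shader runs for one pixel fragment.

    coverage_mask: list of bool, one entry per sub-pixel sample (hypothetical layout).
    mode: "multisample" or "supersample".
    """
    covered = sum(1 for is_covered in coverage_mask if is_covered)
    if covered == 0:
        return 0        # the fragment touches no sample of this pixel
    if mode == "multisample":
        return 1        # one color, replicated to every covered sample
    if mode == "supersample":
        return covered  # one shader run per covered sample
    raise ValueError(f"unknown mode: {mode}")


mask = [True, True, False, True]
print(shade_invocations(mask, "multisample"))
print(shade_invocations(mask, "supersample"))
```

With three of four samples covered, multisampling shades once while supersampling shades three times, which is the cost gap the hybrid scheme described later aims to bridge.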

Multisampling works well for antialiasing primitive edges, because what matters there is which samples are covered by the incoming primitive. Textures are typically pre-filtered, so the shaded color values have sufficiently low spatial frequency that shading once per pixel is adequate. However, some effects, such as alpha transparency from textures and high-frequency specular highlights, can have frequency content above the pixel rate and must be shaded at a rate higher than once per pixel to avoid aliasing artifacts. Supersampling is typically needed to avoid these kinds of aliasing. However, shading every sample in the pixel can be quite expensive, since shading is typically the most expensive operation in rendering. In addition, some supersampling implementations require the input primitives to be processed multiple times, once for each sub-pixel sample, which introduces further inefficiency. A shading rate greater than once per pixel but less than once per sample may be sufficient to mitigate these sources of aliasing.

Accordingly, what is needed in the art is a system and method for using a pixel shading rate suited to the geometry currently being rendered. The shading rate may be increased to improve image quality or decreased to improve shading performance.

A system and method for dynamically adjusting the pixel sampling rate during primitive shading can improve image quality or increase shading performance. The shading rate varies from once per pixel (multisampling) to once per sample (supersampling), or anywhere in between, to improve image quality or increase shading performance. Given a specified number of samples per pixel for a render target (image buffer), the number of shader passes can be selected dynamically. A combination of supersample and multisample antialiasing is used when a cluster of sub-pixel samples (a multisample) is processed for each pass through a fragment shader. The supersample clusters are combined for each pixel to produce an antialiased pixel.

Various embodiments of a method of the invention for shading primitives using hybrid antialiasing, in a computing device configured to produce multiple samples per pixel, include receiving a graphics primitive and determining the number of supersample clusters used to antialias each pixel that intersects the graphics primitive. The graphics primitive is shaded using multiple passes through a fragment shading unit within the computing device, where the number of passes used to produce each hybrid antialiased pixel intersecting the graphics primitive is less than or equal to the number of supersample clusters.
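The relationship between supersample clusters and shader passes can be sketched as follows. This is only an illustration under stated assumptions: samples are assigned to clusters as contiguous index ranges, a cluster with no covered sample is skipped (one way the pass count can fall below the cluster count), and `shader` is a hypothetical callable standing in for one fragment shader pass. None of these details is specified by the text above.

```python
def hybrid_shade(coverage, num_clusters, shader):
    """Shade one pixel fragment with hybrid antialiasing (illustrative sketch).

    coverage     : list of bool, one per sub-pixel sample (length N)
    num_clusters : number of supersample clusters; assumed to divide N evenly
    shader       : hypothetical callable(cluster_index) -> color value
    Returns (per-sample colors, with None where uncovered; passes used).
    """
    n = len(coverage)
    if num_clusters < 1 or n % num_clusters != 0:
        raise ValueError("num_clusters must evenly divide the sample count")
    size = n // num_clusters
    colors = [None] * n
    passes = 0
    for cluster in range(num_clusters):
        indices = range(cluster * size, (cluster + 1) * size)
        if not any(coverage[i] for i in indices):
            continue                 # assumption: uncovered clusters need no pass
        color = shader(cluster)      # one shader pass per cluster (supersample step)
        passes += 1
        for i in indices:
            if coverage[i]:
                colors[i] = color    # replicate inside the cluster (multisample step)
    return colors, passes


coverage = [True, True, False, False, False, False, True, False]
colors, passes = hybrid_shade(coverage, 4, lambda cluster: cluster * 10)
print(colors, passes)
```

Setting `num_clusters` to 1 reproduces multisampling (a single pass), and setting it to the sample count reproduces supersampling (one pass per covered sample); intermediate values give the in-between shading rates.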

Various embodiments of the invention include a computing device configured to shade graphics primitives using hybrid antialiasing. The computing device includes a rasterizer coupled to a fragment shading unit. The rasterizer includes a hybrid antialiasing control unit configured to receive the graphics primitives and determine the number of supersample clusters used to antialias each pixel that intersects the graphics primitives. The fragment shading unit is configured to shade the graphics primitives using multiple passes, where the number of passes used to produce each hybrid antialiased pixel intersecting a graphics primitive is less than or equal to the number of supersample clusters.

In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the present invention.

System Overview

FIG. 1 is a block diagram of a computer system 100 configured to implement one or more aspects of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 that communicate via a bus path including a memory bridge 105. Memory bridge 105, which may be, e.g., a Northbridge chip, is connected via a bus or other communication path 106 (e.g., a HyperTransport link) to an I/O (input/output) bridge 107. I/O bridge 107, which may be, e.g., a Southbridge chip, receives user input from one or more user input devices 108 (e.g., keyboard, mouse) and forwards the input to CPU 102 via path 106 and memory bridge 105. A parallel processing subsystem 112 is coupled to memory bridge 105 via a bus or other communication path 113 (e.g., a PCI Express, Accelerated Graphics Port, or HyperTransport link); in one embodiment, parallel processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110 (e.g., a conventional CRT or LCD based monitor). A device driver 103, stored in system memory 104, interfaces between processes executed by CPU 102, such as application programs, and parallel processing subsystem 112, translating program instructions as needed for execution by parallel processing subsystem 112.

A system disk 114 is also connected to I/O bridge 107. A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown), including USB or other port connections, CD drives, DVD drives, film recording devices, and the like, may also be connected to I/O bridge 107. The communication paths interconnecting the various components in FIG. 1 may be implemented using any suitable protocols, such as PCI (Peripheral Component Interconnect), PCI Express (PCI-E), AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol, and connections between different devices may use different protocols, as is known in the art.

An embodiment of parallel processing subsystem 112 is shown in FIG. 2. Parallel processing subsystem 112 includes one or more parallel processing units (PPUs) 202, each of which is coupled to a local parallel processing (PP) memory 204. In general, a parallel processing subsystem includes a number U of PPUs, where U ≥ 1. (Herein, multiple instances of like objects are denoted with reference numbers identifying the object and parenthetical numbers identifying the instance where needed.) PPUs 202 and PP memories 204 may be implemented, e.g., using one or more integrated circuit devices such as programmable processors, application specific integrated circuits (ASICs), and memory devices.

As shown in detail for PPU 202(0), each PPU 202 includes a host interface 206 that communicates with the rest of system 100 via communication path 113, which connects to memory bridge 105 (or, in one alternative embodiment, directly to CPU 102). In one embodiment, communication path 113 is a PCI-E link in which dedicated lanes are allocated to each PPU 202, as is known in the art. Other communication paths may also be used. Host interface 206 generates packets (or other signals) for transmission on communication path 113, and also receives all incoming packets (or other signals) from communication path 113 and directs them to the appropriate components of PPU 202. For example, commands related to processing tasks may be directed to a front end unit 212, while commands related to memory operations (e.g., reading from or writing to PP memory 204) may be directed to a memory interface 214. Host interface 206, front end unit 212, and memory interface 214 may be of generally conventional design, and a detailed description is omitted as not being critical to the present invention.

Each PPU 202 advantageously implements a highly parallel processor. As shown in detail for PPU 202(0), a PPU 202 includes a number C of cores 208, where C ≥ 1. Each processing core 208 is capable of executing a large number (e.g., tens or hundreds) of threads concurrently, where each thread is an instance of a program; one embodiment of a multithreaded processing core 208 is described below. Cores 208 receive processing tasks to be executed via a work distribution unit 210, which receives commands defining processing tasks from a front end unit 212. Work distribution unit 210 can implement a variety of algorithms for distributing work. For instance, in one embodiment, work distribution unit 210 receives a "ready" (READY) signal from each core 208 indicating whether that core has sufficient resources to accept a new processing task. When a new processing task arrives, work distribution unit 210 assigns the task to a core 208 that is asserting the ready signal; if no core 208 is asserting the ready signal, work distribution unit 210 holds the new processing task until a ready signal is asserted by a core 208. Those skilled in the art will recognize that other algorithms may also be used and that the particular manner in which work distribution unit 210 distributes incoming processing tasks is not critical to the present invention.
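The ready-signal policy described above can be modeled in a few lines. This is a software sketch only, not the hardware design; the class name `WorkDistributor`, its methods, the lowest-index selection rule, and the choice to deassert READY as soon as a core accepts a task are all assumptions made for illustration.

```python
from collections import deque


class WorkDistributor:
    """Toy model of a work distribution unit driven by per-core READY signals."""

    def __init__(self, num_cores):
        self.ready = [True] * num_cores  # READY signal of each core
        self.pending = deque()           # tasks held until some core is ready
        self.assigned = []               # log of (task, core) assignments

    def submit(self, task):
        """A new processing task arrives from the front end."""
        self.pending.append(task)
        self._dispatch()

    def set_ready(self, core, value=True):
        """A core asserts (or deasserts) its READY signal."""
        self.ready[core] = value
        self._dispatch()

    def _dispatch(self):
        while self.pending:
            # Assumption: pick the lowest-numbered core asserting READY.
            core = next((i for i, r in enumerate(self.ready) if r), None)
            if core is None:
                return                   # hold tasks until a core becomes ready
            task = self.pending.popleft()
            self.ready[core] = False     # assumption: a busy core drops READY
            self.assigned.append((task, core))


wd = WorkDistributor(num_cores=2)
for task in ("a", "b", "c"):
    wd.submit(task)
print(wd.assigned, list(wd.pending))
wd.set_ready(1)
print(wd.assigned, list(wd.pending))
```

In this run the third task is held because both cores are busy, and it is dispatched only once core 1 asserts READY again.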

Cores 208 communicate with memory interface 214 to read from or write to various external memory devices. In one embodiment, memory interface 214 includes an interface adapted to communicate with local PP memory 204, as well as a connection to host interface 206, thereby enabling the cores 208 to communicate with system memory 104 or other memory that is not local to PPU 202. Memory interface 214 can be of generally conventional design, and a detailed description is omitted.

Cores 208 can be programmed to execute processing tasks relating to a wide variety of applications, including but not limited to linear and nonlinear data transforms, filtering of video and/or audio data, modeling operations (e.g., applying laws of physics to determine position, velocity, and other attributes of objects), image rendering operations (e.g., vertex shader, geometry shader, and/or pixel shader programs), and so on. PPUs 202 may transfer data from system memory 104 and/or local PP memories 204 into internal (on-chip) memory, process the data, and write result data back to system memory 104 and/or local PP memories 204, where such data can be accessed by other system components, including, e.g., CPU 102 or another parallel processing subsystem 112.

Referring again to FIG. 1, in some embodiments, some or all of PPUs 202 in parallel processing subsystem 112 are graphics processors with rendering pipelines that can be configured to perform various tasks related to generating pixel data from graphics data supplied by CPU 102 and/or system memory 104 via memory bridge 105 and bus 113, interacting with local PP memory 204 (which can be used as graphics memory including, e.g., a conventional frame buffer) to store and update pixel data, delivering pixel data to display device 110, and the like. In some embodiments, parallel processing subsystem 112 may include one or more PPUs 202 that operate as graphics processors and one or more other PPUs 202 that are used for general-purpose computations. The PPUs 202 may be identical or different, and each PPU 202 may have its own dedicated PP memory device(s) 204 or no dedicated PP memory device(s).

In operation, CPU 102 is the master processor of system 100, controlling and coordinating operations of other system components. In particular, CPU 102 issues commands that control the operation of PPUs 202. In some embodiments, CPU 102 writes a stream of commands for each PPU 202 to a pushbuffer (not explicitly shown in FIG. 1), which may be located in system memory 104, PP memory 204, or another storage location accessible to both CPU 102 and PPU 202. PPU 202 reads the command stream from the pushbuffer and executes commands asynchronously with the operation of CPU 102. PPUs 202 can therefore be configured to offload processing from CPU 102, increasing the processing throughput and/or performance of system 100.
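The asynchronous producer/consumer relationship between CPU and PPU can be sketched with a thread-safe queue standing in for the pushbuffer. This is a loose software analogy, not the actual command format or hardware mechanism; the function name `run_demo`, the command strings, and the `None` end-of-stream sentinel are assumptions introduced for the example.

```python
import queue
import threading


def run_demo(commands):
    """Write commands to a pushbuffer on one thread and execute them on another."""
    pushbuffer = queue.Queue()   # stands in for memory visible to both CPU and PPU
    executed = []

    def ppu():
        # PPU side: read the command stream in order, independently of the CPU.
        while True:
            command = pushbuffer.get()
            if command is None:  # hypothetical end-of-stream marker
                break
            executed.append(command)

    worker = threading.Thread(target=ppu)
    worker.start()
    for command in commands:
        pushbuffer.put(command)  # CPU side: enqueue and move on without waiting
    pushbuffer.put(None)
    worker.join()
    return executed


print(run_demo(["set_state", "draw", "draw"]))
```

The CPU thread never waits for a command to finish executing; it only appends to the stream, which is the property that lets the PPU offload work from the CPU.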

It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, I/O bridge 107 and memory bridge 105 might be integrated into a single chip. The particular components shown herein are optional; for instance, any number of add-in cards or peripheral devices might be supported. In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in cards 120, 121 connect directly to I/O bridge 107.

The connection of PPU 202 to the rest of system 100 may also be varied. In some embodiments, parallel processing subsystem 112 is implemented as an add-in card that can be inserted into an expansion slot of system 100. In other embodiments, a PPU 202 can be integrated on a single chip with a bus bridge, such as memory bridge 105 or I/O bridge 107. In still other embodiments, some or all elements of PPU 202 may be integrated on a single chip with CPU 102.

A PPU may be provided with any amount of local PP memory, including no local memory, and may use local memory and system memory in any combination. For instance, a PPU 202 can be a graphics processor in a unified memory architecture (UMA) embodiment; in such embodiments, little or no dedicated graphics (PP) memory is provided, and PPU 202 would use system memory exclusively or almost exclusively. In UMA embodiments, a PPU 202 may be integrated into a bridge chip or processor chip, or provided as a discrete chip with a high-speed link (e.g., PCI-E) connecting the PPU to system memory, e.g., via a bridge chip.

As noted above, any number of PPUs 202 can be included in a parallel processing subsystem. For instance, multiple PPUs 202 can be provided on a single add-in card, or multiple add-in cards can be connected to communication path 113, or one or more of the PPUs 202 could be integrated into a bridge chip. The PPUs in a multi-PPU system may be identical to or different from each other; for instance, different PPUs might have different numbers of cores, different amounts of local PP memory, and so on. Where multiple PPUs 202 are present, they may be operated in parallel to process data at higher throughput than is possible with a single PPU 202. Systems incorporating one or more PPUs 202 may be implemented in a variety of configurations and form factors, including desktop, laptop, or handheld personal computers, servers, workstations, game consoles, embedded systems, and the like.

Core Overview

FIG. 3 is a block diagram of a core 208 of the parallel processing subsystem 112 of FIG. 2, in accordance with one or more aspects of the present invention. PPU 202 includes a core 208 (or multiple cores 208) configured to execute a large number of threads in parallel, where the term "thread" refers to an instance of a context, i.e., a particular program executing on a particular set of input data. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units.

In one embodiment, each core 208 includes an array of P (e.g., 8, 16, etc.) parallel processing engines 302 configured to receive SIMD instructions from a single instruction unit 312. Each processing engine 302 advantageously includes an identical set of functional units (e.g., arithmetic logic units, etc.). The functional units may be pipelined, allowing a new instruction to be issued before a previous instruction has finished, as is known in the art. Any combination of functional units may be provided. In one embodiment, the functional units support a variety of operations including integer and floating point arithmetic (e.g., addition and multiplication), comparison operations, Boolean operations (AND, OR, XOR), bit-shifting, and computation of various algebraic functions (e.g., planar interpolation, trigonometric, exponential, and logarithmic functions, etc.); and the same functional-unit hardware can be leveraged to perform different operations.

Each processing engine 302 uses space in a local register file (LRF) 304 for storing its local input data, intermediate results, and the like. In one embodiment, local register file 304 is physically or logically divided into P lanes, each having some number of entries (where each entry might store, e.g., a 32-bit word). One lane is assigned to each processing engine 302, and corresponding entries in different lanes can be populated with data for different threads executing the same program to facilitate SIMD execution. In some embodiments, each processing engine 302 can only access LRF entries in the lane assigned to it. The total number of entries in local register file 304 is advantageously large enough to support multiple concurrent threads per processing engine 302.
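The lane organization described above can be pictured with a small software model. This is only a sketch: the class `LocalRegisterFile`, its `read`/`write` methods, and the `simd_add` helper are hypothetical names, and sequential Python iteration merely stands in for the P engines acting in lockstep.

```python
class LocalRegisterFile:
    """Toy register file split into P lanes, one lane per processing engine."""

    def __init__(self, lanes, entries):
        self.data = [[0] * entries for _ in range(lanes)]

    def write(self, lane, entry, value):
        self.data[lane][entry] = value & 0xFFFFFFFF  # each entry holds a 32-bit word

    def read(self, lane, entry):
        return self.data[lane][entry]


def simd_add(lrf, dst, src_a, src_b):
    """Apply one 'add' instruction to the same entries of every lane.

    Each engine touches only its own lane, so the same instruction operates
    on different per-thread data in each lane.
    """
    for lane in range(len(lrf.data)):
        lrf.write(lane, dst, lrf.read(lane, src_a) + lrf.read(lane, src_b))


lrf = LocalRegisterFile(lanes=4, entries=8)
for lane in range(4):
    lrf.write(lane, 0, lane)  # per-thread input differs by lane
    lrf.write(lane, 1, 10)
simd_add(lrf, dst=2, src_a=0, src_b=1)
print([lrf.read(lane, 2) for lane in range(4)])
```

Entry 2 ends up holding a different result in each lane even though a single instruction was issued, which is the essence of SIMD execution over corresponding register entries.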

Each processing engine 302 also has access to an on-chip shared memory 306 that is shared among all of the processing engines 302 in core 208. Shared memory 306 may be as large as desired, and in some embodiments, any processing engine 302 can read from or write to any location in shared memory 306 with equally low latency (e.g., comparable to accessing local register file 304). In some embodiments, shared memory 306 is implemented as a shared register file; in other embodiments, shared memory 306 can be implemented using shared cache memory.

In addition to shared memory 306, some embodiments also provide additional on-chip parameter memory and/or cache(s) 308, which may be implemented, e.g., as a conventional RAM or cache. Parameter memory/cache 308 can be used, e.g., to hold state parameters and/or other data (e.g., various constants) that may be needed by multiple threads. Processing engines 302 also have access via memory interface 214 to off-chip "global" memory, which can include, e.g., PP memory 204 and/or system memory 104, with system memory 104 being accessible via host interface 206. It is to be understood that any memory external to PPU 202 may be used as global memory.

In one embodiment, each processing engine 302 is multithreaded and can execute up to some number G (e.g., 24) of threads concurrently, e.g., by maintaining current state information associated with each thread in a different portion of its assigned lane in local register file 304. Processing engines 302 are advantageously designed to switch rapidly from one thread to another so that instructions from different threads can be issued in any sequence without loss of efficiency. Since each thread may correspond to a different context, multiple contexts may be processed over multiple cycles as different threads are issued for each cycle.

Instruction unit 312 is configured such that, for any given processing cycle, an instruction (INSTR) is issued to each of the P processing engines 302. Each processing engine 302 may receive a different instruction for any given processing cycle when multiple contexts are being processed simultaneously. When all P processing engines 302 process a single context, core 208 implements a P-way SIMD microarchitecture. Since each processing engine 302 is also multithreaded, supporting up to G threads concurrently, core 208 in this embodiment can have up to P*G threads executing concurrently. For instance, if P=16 and G=24, then core 208 supports up to 384 concurrent threads for a single context, or N*24 concurrent threads for each context, where N is the number of processing engines 302 allocated to the context.
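The thread-capacity arithmetic above is simple enough to check directly. The helper below is a sketch; the function name and its optional `engines_for_context` parameter are assumptions introduced for the example.

```python
def max_concurrent_threads(p, g, engines_for_context=None):
    """Upper bound on concurrent threads for one context.

    p: number of processing engines in the core
    g: maximum concurrent threads per processing engine
    engines_for_context: N, the engines allocated to the context
                         (defaults to all P engines, i.e. a single context)
    """
    n = p if engines_for_context is None else engines_for_context
    if not 0 <= n <= p:
        raise ValueError("a context cannot use more engines than the core has")
    return n * g


print(max_concurrent_threads(16, 24))     # single context on all 16 engines
print(max_concurrent_threads(16, 24, 4))  # a context given 4 of the 16 engines
```

With P=16 and G=24 the first call gives the 384 threads quoted in the text, and a context allocated N=4 engines is bounded by 4*24 = 96 threads.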

Operation of core 208 is advantageously controlled via work distribution unit 210. In some embodiments, work distribution unit 210 receives pointers to data to be processed (e.g., primitive data, vertex data, and/or pixel data), as well as the locations of pushbuffers containing data or instructions defining how the data is to be processed (e.g., what program is to be executed). Work distribution unit 210 can load data to be processed into shared memory 306 and parameters into parameter memory 308. Work distribution unit 210 also initializes each new context in instruction unit 312, then signals instruction unit 312 to begin executing the context. Instruction unit 312 reads instruction pushbuffers and executes the instructions to produce processed data. When execution of a context is completed, core 208 advantageously notifies work distribution unit 210. Work distribution unit 210 can then initiate other processes, e.g., to retrieve output data from shared memory 306 and/or to prepare core 208 for execution of additional contexts.

It will be appreciated that the parallel processing unit and core architecture described herein is illustrative and that variations and modifications are possible. Any number of processing engines may be included. In some embodiments, each processing engine 302 has its own local register file, and the allocation of local register file entries per thread can be fixed or configurable as desired. In particular, entries of local register file 304 may be allocated for processing each context. Further, while only one core 208 is shown, a PPU 202 may include any number of cores 208, which are advantageously of identical design to each other so that execution behavior does not depend on which core 208 receives a particular processing task. Each core 208 advantageously operates independently of other cores 208 and has its own processing engines, shared memory, and so on.

Graphics Pipeline Architecture

FIG. 4 is a conceptual diagram of a graphics processing pipeline 400, in accordance with one or more aspects of the present invention. PPU 202 may be configured to form a graphics processing pipeline 400. For example, core 208 may be configured to perform the functions of one or more of a vertex processing unit 444, a geometry processing unit 448, and a fragment processing unit 460. The functions of data assembler 442, primitive assembler 446, rasterizer 455, and raster operations unit 465 may also be performed by core 208. Alternatively, graphics processing pipeline 400 may be implemented using dedicated processing units for one or more of vertex processing unit 444, geometry processing unit 448, fragment processing unit 460, data assembler 442, primitive assembler 446, rasterizer 455, and raster operations unit 465.

Data assembler 442 is a processing unit that collects vertex data for high-order surfaces, primitives, and the like, and outputs the vertex data to vertex processing unit 444. Vertex processing unit 444 is a programmable execution unit configured to execute vertex shader programs, transforming vertex data as specified by the vertex shader programs. For example, vertex processing unit 444 may be programmed to transform the vertex data from an object-based coordinate representation (object space) to an alternative coordinate system such as world space or normalized device coordinates (NDC) space. Vertex processing unit 444 may read data stored in PP memory 204 or system memory 104 for use in processing the vertex data.
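The object-space to NDC transformation mentioned above can be sketched as follows. This is only an illustration: the text leaves the actual transform to the vertex shader program, so the use of a single combined 4x4 matrix followed by a perspective divide, and the name `to_ndc`, are assumptions.

```python
def to_ndc(matrix, vertex):
    """Transform an object-space vertex to normalized device coordinates.

    matrix: 4x4 row-major list of lists (assumed to be a combined
            model-view-projection matrix; hypothetical input).
    vertex: (x, y, z) position in object space.
    """
    x, y, z = vertex
    homogeneous = (x, y, z, 1.0)
    # Matrix-vector product yields clip-space coordinates (x, y, z, w).
    clip = [sum(row[i] * homogeneous[i] for i in range(4)) for row in matrix]
    w = clip[3]
    if w == 0:
        raise ValueError("w is zero; vertex cannot be projected")
    # Perspective divide maps clip space to NDC.
    return (clip[0] / w, clip[1] / w, clip[2] / w)
```

For example, a matrix that leaves x, y, and z unchanged but produces w = 2 halves every coordinate.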

Primitive assembler 446 receives processed vertex data from vertex processing unit 444 and constructs graphics primitives, e.g., points, lines, triangles, or the like, for processing by geometry processing unit 448. Geometry processing unit 448 is a programmable execution unit configured to execute geometry shader programs, transforming graphics primitives received from primitive assembler 446 as specified by the geometry shader programs. For example, geometry processing unit 448 may be programmed to subdivide the graphics primitives into one or more new graphics primitives and to calculate parameters, such as plane equation coefficients, that are used to rasterize the new graphics primitives. In some embodiments of the present invention, geometry processing unit 448 may also add or delete elements in the geometry stream. Geometry processing unit 448 outputs the parameters and vertices specifying the new graphics primitives to rasterizer 455 or to memory interface 214. Geometry processing unit 448 may read data stored in PP memory 204 or system memory 104 for use in processing the geometry data.

Rasterizer 455 scan converts the new graphics primitives and outputs fragments and coverage data to fragment processing unit 460. When antialiasing is used to produce image data, rasterizer 455 is configured to produce sub-pixel sample coverage data. When hybrid antialiasing is used, a hybrid antialiasing control unit 500, which may reside within rasterizer 455, is configured to determine the number of passes through fragment processing unit 460 that are used to process each primitive, as described in conjunction with FIGS. 5C and 6.

Fragment processing unit 460 is a programmable execution unit configured to execute fragment shader programs, transforming fragments received from rasterizer 455 as specified by the fragment shader programs. For example, fragment processing unit 460 may be programmed to perform operations such as perspective correction, texture mapping, shading, blending, and the like, to produce shaded fragments that are output to raster operations unit 465. Fragment processing unit 460 may read data stored in PP memory 204 or system memory 104 for use in processing the fragment data. Fragments are shaded at pixel, sample, or supersample cluster granularity, depending on the sampling rate selected by the hybrid antialiasing control unit.

Memory interface 214 produces read requests for data stored in graphics memory and performs texture filtering operations, e.g., bilinear, trilinear, anisotropic, and the like. In some embodiments of the present invention, memory interface 214 may be configured to decompress data. In particular, memory interface 214 may be configured to decompress fixed-length block-encoded data, such as compressed data represented in a DXT format. Raster operations unit 465 is a processing unit that performs raster operations, such as stencil, z test, and the like, and outputs pixel data as processed graphics data for storage in graphics memory. The processed graphics data may be stored in graphics memory, e.g., PP memory 204 and/or system memory 104, for display on display device 110 or for further processing by CPU 102 or parallel processing subsystem 112. In some embodiments of the present invention, raster operations unit 465 is configured to compress z or color data that is written to memory and to decompress z or color data that is read from memory.

Hybrid Antialiasing

As previously described, PPU 202 may be configured to perform shading at a variety of sampling rates in order to improve image quality or shading performance. A hybrid antialiasing control unit determines the number of shader passes used to shade each pixel within a primitive. In each pass, a supersample cluster of one or more multisamples (sub-pixel samples) per pixel may be processed by a core 208 configured as fragment processing unit 460 to produce a single shaded color value that is replicated for all of the multisamples in the supersample cluster. After a scene is rendered, the samples of the supersample clusters are combined to produce an antialiased image.
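The final step, combining the samples of a pixel into one antialiased value, can be sketched as below. The text does not specify the filter, so the equal-weight (box filter) average and the name `resolve_pixel` are assumptions made purely for illustration.

```python
def resolve_pixel(sample_colors):
    """Average per-sample RGB colors into one pixel color.

    sample_colors: list of (r, g, b) tuples, one per sub-pixel sample.
    An equal-weight box filter is assumed; real hardware may use
    a different resolve filter.
    """
    n = len(sample_colors)
    if n == 0:
        raise ValueError("a pixel needs at least one sample")
    return tuple(sum(color[i] for color in sample_colors) / n for i in range(3))
```

With eight samples, a pixel whose first three clusters (six samples) are red and whose last cluster (two samples) is blue resolves to a blend weighted 6:2.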

The number of sub-pixel samples and the number of shader passes per primitive may be increased to improve image quality. The number of sub-pixel samples is determined when the application is launched and remains constant for every pixel of a render target (image buffer). The hybrid antialiasing control unit may dynamically determine the number of shader passes based on the rendering state, such as alpha test enable/disable, texture map content, user-supplied quality/performance controls, or the like.

FIG. 5A illustrates supersample clusters 503 and 511 and multisamples 502, 504, and 513 within a pixel 501, in accordance with one or more aspects of the present invention. When antialiasing with eight sub-pixel samples is used, a variety of combinations of multisamples and supersample clusters can produce the eight sub-pixel samples. In the example shown in FIG. 5A, each of the three supersample clusters 503 and supersample cluster 511 includes two multisamples, for a total of eight sub-pixel sample positions in pixel 501; for example, supersample cluster 511 includes multisamples 502 and 504. Other eight-sample configurations range from as many as eight supersample clusters with one multisample each to as few as one supersample cluster with eight multisamples. Shading is performed once per supersample cluster, and the shaded value, e.g., color, is stored for all of the multisamples within the supersample cluster.
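The cluster arrangement described above can be sketched as follows: samples are partitioned into supersample clusters, the shader runs once per cluster, and the result is replicated to the covered samples of that cluster. Grouping samples by contiguous index is an assumption (the actual assignment is given by FIG. 5A), and `make_clusters` and `shade_pixel` are hypothetical names.

```python
def make_clusters(num_samples, num_clusters):
    """Split sample indices 0..num_samples-1 into equal contiguous clusters."""
    if num_samples % num_clusters != 0:
        raise ValueError("cluster count must divide the sample count")
    size = num_samples // num_clusters
    return [list(range(i * size, (i + 1) * size)) for i in range(num_clusters)]


def shade_pixel(clusters, coverage, shade):
    """Shade once per cluster and replicate the value to covered samples.

    coverage: set of sample indices covered by the fragment.
    shade:    callable taking a cluster index and returning a color.
    Returns (colors, invocations): a dict sample index -> color and the
    number of times the shader was invoked.
    """
    colors = {}
    invocations = 0
    for cluster_id, samples in enumerate(clusters):
        covered = [s for s in samples if s in coverage]
        if not covered:
            continue  # no covered sample in this cluster, nothing to shade
        color = shade(cluster_id)
        invocations += 1
        for s in covered:
            colors[s] = color
    return colors, invocations
```

With eight samples and four clusters, a fully covered pixel costs four shader invocations, compared with one for pure multisampling and eight for pure supersampling.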

Shader attributes may be sampled at the position of a particular multisample in the supersample cluster, or they may be sampled at some other position within or near the supersample cluster. For example, in FIG. 5A fragment attributes (color, texture coordinates, and the like) may be sampled at a physical multisample position, such as multisample 502 in supersample cluster 511. Furthermore, when a fragment only partially covers a supersample cluster, it is preferable to adjust the position at which attributes are sampled so that it lies within the region of the covered multisamples in the supersample cluster. This is commonly called centroid sampling, although here the term applies to the supersample cluster rather than to the entire pixel fragment.

FIG. 5B illustrates a fragment 509 and a centroid position 517 within supersample cluster 511, in accordance with one or more aspects of the present invention. In some embodiments of the present invention, centroid sampling is used to modify the position at which attributes are evaluated so that it better corresponds to the screen area actually covered by the fragment. In some embodiments of the present invention, sample interpolation unit 510 may be configured to sample each supersample cluster either at a particular multisample position or at an approximate centroid position.

The centroid may be the geometric centroid of the covered multisamples, or it may be approximated, for example by selecting the covered multisample in the supersample cluster that is closest to the centroid of the entire covered supersample cluster. For example, centroid position 517 is a computed sample position at the geometric center of supersample cluster 511 that is used to represent the sampled color of supersample cluster 511, because multisample 502 lies near an edge rather than near the center of fragment 509. A shaded value computed at centroid position 517 represents the fragment color more accurately than one computed at multisample 502.
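Both centroid options described above can be sketched in a few lines. The sample coordinates, the squared-distance metric, and the function names are assumptions for illustration; the approximation shown picks the covered multisample nearest to the geometric centroid of the covered samples, which is one reading of the text.

```python
def geometric_centroid(points):
    """Mean position of a non-empty list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def sampling_position(sample_positions, covered, approximate=False):
    """Choose where to evaluate attributes for one supersample cluster.

    sample_positions: list of (x, y) multisample positions in the cluster.
    covered:          list of booleans, one per multisample.
    approximate:      if True, snap to the covered multisample nearest to
                      the centroid instead of returning the centroid itself.
    Returns None when no multisample is covered.
    """
    covered_points = [p for p, c in zip(sample_positions, covered) if c]
    if not covered_points:
        return None
    centroid = geometric_centroid(covered_points)
    if not approximate:
        return centroid
    return min(
        covered_points,
        key=lambda p: (p[0] - centroid[0]) ** 2 + (p[1] - centroid[1]) ** 2,
    )
```

Snapping to an existing multisample avoids evaluating attributes at an arbitrary position, at the cost of a slightly less representative sample.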

FIG. 5C is a block diagram of a portion of graphics processing pipeline 400, including rasterizer 455, fragment processing unit 460, and raster operations unit 465, in accordance with one or more aspects of the present invention. Other processing units may be included within rasterizer 455, fragment processing unit 460, and raster operations unit 465. Those other processing units are not shown in FIG. 5C because they may be of generally conventional design, and a detailed description is omitted because it is not critical to understanding the present invention.

Rasterizer 455 receives primitives from geometry processing unit 448 and produces a fragment for each pixel that the primitive intersects. A hybrid antialiasing control unit 500 (optionally within rasterizer 455) may be configured to dynamically determine the number of shader passes used to process the fragments of each primitive based on the rendering state, such as alpha test enable/disable, texture map content, user-supplied quality/performance controls, or the like.

Hybrid antialiasing control unit 500 improves antialiasing efficiency by performing more shader passes for primitives that benefit from a higher shading rate and by lowering the shading rate for other primitives. Hybrid antialiasing control unit 500 may be configured by the user, the application, or device driver 103 to operate at a variety of quality settings. These range from a lowest quality setting of "always multisample" to a highest quality setting of "always supersample." Intermediate quality settings may take the rendering pipeline state into account when determining the number of shader passes. For example, if alpha test or shader pixel kill is enabled, more shader passes are needed. Conversely, when high performance is specified and alpha test and shader pixel kill are disabled, the sampling rate may be reduced by hybrid antialiasing control unit 500. Hybrid antialiasing control unit 500 may also take into account characteristics of the pixel shader or texture sampler settings when determining the number of shader passes. Those skilled in the art will recognize a variety of conditions that hybrid antialiasing control unit 500 may use to determine the number of shader passes. In conventional graphics systems, the sampling rate is determined for all primitives in a scene based on user-supplied or fixed settings. Furthermore, sampling in such conventional systems is limited to either multisampling or supersampling, with no intermediate options.
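One possible decision policy for the control unit is sketched below. Only the two endpoint settings ("always multisample" and "always supersample") and the influence of alpha test and shader pixel kill come from the text; the `ADAPTIVE` setting, the `adaptive_passes` default of 2, and all names are hypothetical.

```python
from enum import Enum


class Quality(Enum):
    ALWAYS_MULTISAMPLE = 0   # lowest quality: one shader pass per pixel
    ADAPTIVE = 1             # hypothetical intermediate setting
    ALWAYS_SUPERSAMPLE = 2   # highest quality: one pass per sample


def choose_shader_passes(quality, num_samples, alpha_test_enabled,
                         shader_kill_enabled, adaptive_passes=2):
    """Return the number of shader passes for a primitive.

    The intermediate policy (raise the pass count only when alpha test
    or shader pixel kill is enabled) is an illustrative assumption.
    """
    if quality is Quality.ALWAYS_MULTISAMPLE:
        return 1
    if quality is Quality.ALWAYS_SUPERSAMPLE:
        return num_samples
    if alpha_test_enabled or shader_kill_enabled:
        return min(adaptive_passes, num_samples)
    return 1
```

Because the decision is made per primitive, an alpha-tested foliage primitive and an opaque wall in the same scene could receive different pass counts.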

In one embodiment, rasterizer 455 generates 2x2 quads of pixel fragments, which are received by hybrid antialiasing iterator unit 515. When hybrid antialiasing control unit 500 sets passes=1 (i.e., when multisampling), hybrid antialiasing iterator unit 515 passes the quads unmodified to fragment processing unit 460. However, when hybrid antialiasing control unit 500 sets passes to N>1, hybrid antialiasing iterator unit 515 outputs each quad to fragment processing unit 460 multiple times, each time including the pass number that corresponds to the shader pass. Hybrid antialiasing iterator unit 515 may mask the coverage sent to fragment processing unit 460 so that only the multisamples within the supersample cluster corresponding to the current pass are enabled. In other embodiments, fragment processing unit 460 may mask the coverage based on the pass number provided to it by hybrid antialiasing iterator unit 515. Note that other embodiments may iterate over a region other than a 2x2 fragment quad, such as a single pixel, a 4x4 fragment tile, or the like. Iterating over small regions of pixels (quads) rather than over primitives is advantageous because texture map data is likely to be reused by subsequent shader passes of a particular quad, whereas iterating over primitives, which may be large, is likely to cause texture data to be re-fetched from memory (e.g., PP memory 204 or system memory 104).
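The per-pass coverage masking can be sketched with bitmasks, one bit per multisample. Assigning contiguous bit ranges to clusters is an assumption, and `cluster_mask` and `iterate_quad` are hypothetical names rather than anything defined in the text.

```python
def cluster_mask(pass_index, num_passes, num_samples):
    """Bitmask selecting the multisamples shaded in the given pass.

    Assumes each pass owns a contiguous run of num_samples // num_passes bits.
    """
    size = num_samples // num_passes
    return ((1 << size) - 1) << (pass_index * size)


def iterate_quad(quad_coverage, num_passes, num_samples):
    """Yield (pass_index, masked_coverage) once per shader pass.

    quad_coverage: list of four per-pixel coverage bitmasks for a 2x2 quad.
    With num_passes == 1 the quad is emitted once with unmodified coverage.
    """
    for pass_index in range(num_passes):
        mask = cluster_mask(pass_index, num_passes, num_samples)
        yield pass_index, [coverage & mask for coverage in quad_coverage]
```

Each emitted item carries the pass number alongside the quad, mirroring how the pass number accompanies the quad to the fragment processing unit.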

Importantly, the geometric computations needed to generate the fragments are not repeated for each shader pass. In contrast, conventional systems that use a sample mask to supersample into a multisample buffer typically repeat the geometric computations for each shader pass. Note that primitive attributes that are sampled in fragment processing unit 460 need to be computed only once, regardless of the number of hybrid antialiasing passes, since they are referenced by the successively iterated quads and then discarded.

A sample lookup table 505 in fragment processing unit 460 uses the hybrid antialiasing parameters and the pass number to determine the positions at which interpolated fragment parameters are sampled. Sample lookup table 505 may select a centroid position or a multisample position for each supersample cluster. The sample positions are output to a sample interpolation unit 510, which computes one or more interpolated parameters, e.g., color channels (red, green, blue, alpha), texture coordinates, and the like, for each supersample cluster, i.e., one set of interpolated parameters for each pixel in the pixel quad. A shader 520 processes the set of interpolated parameters for each pixel in the pixel quad, using techniques known to those skilled in the art for executing a fragment shader program or the like, to produce a shaded pixel value, e.g., color, for each supersample cluster.
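A rough sketch of a sample lookup table and the interpolation it feeds is given below. Everything here is an assumption for illustration: the table is keyed by pass number for one fixed configuration, each entry is the mean position of a contiguous group of samples, and attributes are evaluated from a plane equation `a + b*x + c*y`.

```python
def build_sample_table(sample_positions, num_passes):
    """Map each pass index to the position where attributes are sampled.

    sample_positions: list of (x, y) positions of all multisamples in a pixel.
    Each pass is assumed to own a contiguous group of samples, and the
    stored position is the mean of that group.
    """
    num_samples = len(sample_positions)
    if num_samples % num_passes != 0:
        raise ValueError("pass count must divide the sample count")
    size = num_samples // num_passes
    table = {}
    for pass_index in range(num_passes):
        cluster = sample_positions[pass_index * size:(pass_index + 1) * size]
        table[pass_index] = (
            sum(x for x, _ in cluster) / size,
            sum(y for _, y in cluster) / size,
        )
    return table


def interpolate(a, b, c, position):
    """Evaluate a plane equation a + b*x + c*y at the given position."""
    x, y = position
    return a + b * x + c * y
```

A lookup with the pass number then yields the position passed to `interpolate` for every attribute of the fragment.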

During shading, the sub-pixel samples of each supersample cluster may be eliminated (culled or killed) as a result of alpha testing or shader pixel kill, so that the rasterizer-generated coverage is modified based on the pixel kill or alpha test results to produce post-shader coverage. Because the supersample clusters are processed in separate passes through shader 520, supersample clusters may be eliminated individually during alpha testing. In contrast, when conventional multisampling is used to process all sub-pixel samples in a single shader pass, the sub-pixel samples are either all kept or all eliminated, which gives a coarser alpha test granularity and a lower-quality image.
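The difference in alpha test granularity can be illustrated with a small sketch. The `alpha >= alpha_ref` comparison, the bitmask layout, and the function name are assumptions; the alpha test function is configurable in real graphics APIs.

```python
def post_shader_coverage(cluster_masks, raster_coverage, cluster_alphas, alpha_ref):
    """Combine rasterizer coverage with per-cluster alpha test results.

    cluster_masks:   one sample bitmask per supersample cluster.
    raster_coverage: bitmask of samples covered by the primitive.
    cluster_alphas:  shaded alpha value of each cluster.
    A cluster survives when its alpha is >= alpha_ref (assumed test).
    """
    coverage = 0
    for mask, alpha in zip(cluster_masks, cluster_alphas):
        if alpha >= alpha_ref:
            coverage |= raster_coverage & mask
    return coverage
```

With four clusters, the result can keep any subset of the clusters; with a single cluster (plain multisampling), the pixel is either fully kept or fully discarded.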

Shader 520 outputs the shaded pixel values and the sub-pixel coverage (possibly modified relative to the coverage provided by rasterizer 455) to a color buffer 535 and a coverage aggregator 530, respectively. Coverage aggregator 530 accumulates the post-shader coverage of each shader pass to produce aggregated coverage information for each pixel. Color buffer 535 accumulates the shaded values for each pixel. When the shaded values of the last shader pass are received, the aggregated coverage information is output to raster operations unit 465. The shaded values for the pixel quad may be output along with the aggregated coverage information, or they may be output later, e.g., after z testing is completed by raster operations unit 465. In other embodiments of the present invention, coverage aggregator 530 and color buffer 535 may be omitted.
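The accumulation performed across shader passes can be sketched as a small class that merges the role of the coverage aggregator and the color buffer for one pixel. Merging the two units into one object, the bitmask representation, and the class name are simplifications assumed for illustration.

```python
class CoverageAggregator:
    """Accumulates post-shader coverage and colors for one pixel."""

    def __init__(self):
        self.coverage = 0   # bitmask of samples that survived shading
        self.colors = {}    # sample index -> shaded color

    def add_pass(self, post_shader_coverage, color):
        """Merge the result of one shader pass into the pixel state."""
        self.coverage |= post_shader_coverage
        sample = 0
        bits = post_shader_coverage
        while bits:
            if bits & 1:
                # Replicate the pass color to every surviving sample.
                self.colors[sample] = color
            bits >>= 1
            sample += 1
```

After the last pass, `coverage` and `colors` together describe the pixel and can be handed on in a single step.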

Aggregating coverage and coalescing color values in a color buffer is advantageous in systems that store the samples of each pixel together in memory, because multiple samples can then be written or read using a single memory transaction. Other embodiments may omit coverage aggregator 530, which is of less benefit in systems that do not store the sample values of a pixel contiguously in memory.

An optional z/color compression unit 550 within raster operations unit 465 receives the aggregated coverage information and the z values, or another representation of z or depth, of the fragments (after z testing) and produces compressed z values for a region of pixels. Z/color compression unit 550 may also receive the aggregated color values of the fragments and produce compressed color values for a region of pixels. Compression improves when it is applied to a larger group of pixels. Therefore, several pixel quads may be aggregated together and z tested before the results are compressed. Importantly, hybrid antialiasing does not preclude or reduce the effectiveness of z compression. Z compression is advantageously used to reduce the memory bandwidth required to access the z buffer and, in some embodiments, the memory footprint as well.

FIG. 6 is a flow diagram of method steps for performing hybrid antialiasing, in accordance with one or more aspects of the present invention. In step 610, hybrid antialiasing control unit 500 receives a primitive. In step 615, hybrid antialiasing control unit 500 determines whether hybrid antialiasing is enabled; if it is not, the fragments are processed using conventional antialiasing. If hybrid antialiasing is enabled in step 615, then in step 635 hybrid antialiasing control unit 500 determines the hybrid antialiasing parameters for the primitive. More specifically, hybrid antialiasing control unit 500 determines the number of supersample clusters (shader passes) to use when shading each pixel that the primitive intersects.

In step 640, rasterizer 455 produces sample-level coverage for the covered portion of the primitive. The granularity of this coverage may be coarse or fine, but it is at least the size of a pixel quad. Rasterizer 455 outputs the coverage information for a quad that intersects the primitive to hybrid antialiasing iterator unit 515. Hybrid antialiasing iterator unit 515 expands each quad, based on the hybrid antialiasing parameters, so that the quad is shaded in multiple passes. Hybrid antialiasing iterator unit 515 may be configured to skip a shader pass when the coverage information indicates that none of the multisamples in a supersample cluster is covered. In step 643, hybrid antialiasing iterator unit 515 determines the pass number (first, second, and so on) and outputs the pixel quad and the pass number to fragment processing unit 460. As previously described, when the number of passes is greater than one, hybrid antialiasing iterator unit 515 may mask the coverage information. Sample lookup table 505 is indexed using the pass number and the number of multisamples to read programmed values for the multisample positions, including an indication of the position within the supersample cluster that is used to interpolate the fragment parameters. The interpolated parameters are computed for the supersample cluster by sample interpolation unit 510.

In step 645, fragment processing unit 460 shades the pixel quad, producing a shaded value for each supersample cluster, i.e., one shaded value for each pixel in the pixel quad. Within a supersample cluster, the shaded value is used for every multisample that is covered by the primitive. Fragment processing unit 460 also outputs the post-shader coverage for the pixel quad. The post-shader coverage may differ from the rasterized pixel coverage information because multisamples may be eliminated during shading, as previously described.

In step 650, hybrid antialiasing iterator unit 515 determines whether another shader pass will be used to process the pixel quad; if so, steps 643 and 645 are repeated for another shader pass (second, third, and so on). If, in step 650, hybrid antialiasing iterator unit 515 determines that another shader pass is not needed to process the pixel quad, then in step 660 coverage aggregator 530 combines the post-shader coverage of each shader pass to produce aggregated coverage information for the pixel quad. In step 660, coverage aggregator 530 may also combine the post-shader color values of each shader pass to produce aggregated color values for the pixel quad. Coverage aggregator 530 may be configured to aggregate post-shader color and coverage information at a multi-quad level. In step 665, raster operations unit 465 performs the raster operations to determine which shaded values will be written to the frame buffer. The raster operations may be performed at the quad or multi-quad level. Z/color compression unit 550 within raster operations unit 465 may compress the z and/or color data of the pixel quad before the z and/or color data is stored in the z buffer and/or color buffer.

In step 670, rasterizer 455 determines whether another pixel quad intersects the primitive; if so, then in step 640 rasterizer 455 processes a different pixel quad covered by the primitive. If, in step 670, rasterizer 455 determines that all pixel quads intersected by the primitive have been shaded, then in step 675 processing of the primitive is complete. In a pipelined system, one or more of the steps shown in FIG. 6 may be performed in parallel for different quads.
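The loop structure of steps 640 through 675 can be sketched as follows for the case where hybrid antialiasing is enabled. This is a sequential software model under several assumptions: coverage is a bitmask per pixel, clusters own contiguous bit ranges, the shader is a callback returning a color and a keep flag, and raster operations (step 665) are left out. All names are hypothetical.

```python
def process_primitive(quads, num_passes, num_samples, shade):
    """Model of the per-primitive loop of FIG. 6 (hybrid path only).

    quads: list of quads, each a list of four per-pixel coverage bitmasks
           as produced by rasterization (step 640).
    shade: callable (quad_index, pixel_index, pass_index) -> (color, keep);
           keep=False models alpha test failure or shader pixel kill.
    Returns, per quad, (aggregated coverage per pixel, colors per pixel).
    """
    size = num_samples // num_passes
    results = []
    for q, quad in enumerate(quads):                    # step 670: next quad
        agg_coverage = [0] * 4
        agg_colors = [dict() for _ in range(4)]
        for p in range(num_passes):                     # steps 643 and 650
            mask = ((1 << size) - 1) << (p * size)
            masked = [coverage & mask for coverage in quad]
            if not any(masked):
                continue                                # skip an uncovered pass
            for px, coverage in enumerate(masked):      # step 645: shade
                if coverage == 0:
                    continue
                color, keep = shade(q, px, p)
                if not keep:
                    continue
                agg_coverage[px] |= coverage            # step 660: aggregate
                for s in range(num_samples):
                    if (coverage >> s) & 1:
                        agg_colors[px][s] = color
        results.append((agg_coverage, agg_colors))
    return results                                      # step 675: done
```

A hardware pipeline would overlap these steps for different quads rather than run them in sequence as this model does.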

Hybrid antialiasing control unit 500 may dynamically determine the hybrid antialiasing parameters for each primitive (e.g., the number of supersample clusters per pixel) based on the rendering state, such as alpha test enable/disable, texture map content, user-supplied quality/performance controls, or the like. Adjusting the antialiasing based on the rendering state improves efficiency because primitives that benefit from high-quality antialiasing are shaded using more samples while other primitives are shaded using fewer samples, which balances image quality against performance.

The invention has been described above with reference to specific embodiments. Persons skilled in the art, however, will understand that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. One embodiment of the invention may be implemented as a program product for use with a computer system. The programs of the program product define the functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer, such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips, or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive, a hard disk drive, or any type of solid-state random-access semiconductor memory) on which alterable information is stored. The foregoing description and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

100 ... Computing system

102 ... Central processing unit

103 ... Device driver

104 ... System memory

105 ... Memory bridge

106 ... Communication path

107 ... Input/output bridge

108 ... Input device

110 ... Display device

112 ... Parallel processing subsystem

113 ... Communication path

114 ... System disk

116 ... Switch

118 ... Network adapter

120 ... Add-in card

121 ... Add-in card

202(0)~(1) ... Parallel processing unit

204(0)~(U-1) ... Parallel processing memory

206 ... Host interface

208(0)~(C-1) ... Core

210 ... Work distribution unit

212 ... Front end unit

214 ... Memory interface

302(0)~(P-1) ... Processing engine

304 ... Local register file

306 ... Shared memory

308 ... Parameter memory/cache

312 ... Instruction unit

400 ... Graphics processing pipeline

442 ... Data assembler

444 ... Vertex processing unit

446 ... Primitive assembler

448 ... Geometry processing unit

455 ... Rasterizer

460 ... Fragment processing unit

465 ... Raster operations unit

500 ... Hybrid antialiasing control unit

501 ... Pixel

502 ... Multisample

503 ... Supersample cluster

504 ... Multisample

505 ... Sample lookup table

509 ... Fragment

510 ... Sample interpolation unit

511 ... Supersample cluster

513 ... Multisample

515 ... Hybrid antialiasing iterator unit

517 ... Centroid position

520 ... Shader

530 ... Coverage aggregator

535 ... Color buffer

550 ... Z/color compression unit

So that the manner in which the above-recited features of the invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of the invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.

Figure 1 is a block diagram of a computing system configured to implement one or more aspects of the present invention;

Figure 2 is a block diagram of a parallel processing subsystem for the computing system of Figure 1, in accordance with one or more aspects of the present invention;

Figure 3 is a block diagram of a core for the parallel processing subsystem of Figure 2, in accordance with one or more aspects of the present invention;

Figure 4 is a conceptual diagram of a graphics processing pipeline, in accordance with one or more aspects of the present invention;

Figure 5A illustrates supersample clusters and multisample positions within a pixel, in accordance with one or more aspects of the present invention;

Figure 5B illustrates a fragment and a centroid position within a multisample cluster, in accordance with one or more aspects of the present invention;

Figure 5C is a block diagram of a portion of the graphics processing pipeline, in accordance with one or more aspects of the present invention; and

Figure 6 is a flow diagram of method steps for performing hybrid antialiasing, in accordance with one or more aspects of the present invention.

Claims

1. A computing device configured to shade a graphics primitive using hybrid antialiasing, the computing device configured to: receive a graphics primitive; determine a number of supersample clusters for each antialiased pixel of the graphics primitive; determine, based on the determined number of supersample clusters, a number of multipasses to use when shading each pixel intersected by the graphics primitive; rasterize the graphics primitive to generate a fragment for each pixel intersected by the graphics primitive; and shade each fragment, by a fragment processing unit in the computing device, using the determined number of multipasses to produce at least one hybrid antialiased pixel for the graphics primitive, wherein, for each pass through the fragment processing unit, the multiple geometry operations performed when rasterizing the graphics primitive are not repeated.

2. The computing device of claim 1, wherein the number of supersample clusters is determined based on a rendering state of the computing device.

3. The computing device of claim 2, wherein the rendering state comprises one or more of a high-quality mode setting, a high-performance setting, an alpha test setting, and a texture map with high-frequency content.

4. The computing device of claim 1, wherein the computing device is further configured to generate post-shader coverage that indicates which multisamples in each of the supersample clusters are covered by the graphics primitive.

5. The computing device of claim 4, further comprising a raster operations unit configured to z-test the graphics primitive for each multisample that is covered by the graphics primitive according to the post-shader coverage, to produce z-test values.

6. The computing device of claim 5, wherein the raster operations unit is further configured to compress the z-test values for a portion of a z buffer that is intersected by each graphics primitive.

7. The computing device of claim 1, wherein the number of supersample clusters for each antialiased pixel of a first graphics primitive differs from the number of supersample clusters for each antialiased pixel of a second graphics primitive.

8. The computing device of claim 1, wherein the number of multipasses used to produce each hybrid antialiased pixel intersected by the graphics primitive excludes any supersample cluster that does not have at least one multisample covered by the graphics primitive.

9. The computing device of claim 1, wherein the computing device is further configured to shade the graphics primitive by computing a shaded value for only one multisample within each supersample cluster and replicating the shaded value to the other multisamples within the same supersample cluster.

10. The computing device of claim 1, wherein the computing device is further configured to compute a shaded value for a first supersample cluster of the supersample clusters by using: a position of a first multisample within the first supersample cluster; a geometric centroid of the multisamples within the first supersample cluster that are covered by the graphics primitive; or an approximate centroid, which is the multisample within the first supersample cluster that is covered by the graphics primitive and is closest to the geometric centroid of the first supersample cluster.
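
The per-cluster shading recited in the claims (one shaded value per supersample cluster, replicated to the covered samples of that cluster, with no pass spent on a cluster that has no covered sample) can be illustrated in software. The following Python sketch is not the claimed hardware and is not taken from the patent: the function names `pick_shading_position` and `shade_pixel`, the list-based cluster layout, and the `shade` callback are all hypothetical. The approximate-centroid option is read here as the covered sample nearest the centroid of the covered samples, which is an assumption about the claim wording.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Color = Tuple[float, float, float]


def pick_shading_position(cluster: Sequence[Point],
                          covered: Sequence[bool],
                          mode: str = "approx_centroid") -> Optional[Point]:
    """Choose where to evaluate the shader for one supersample cluster.

    Returns None when no sample of the cluster is covered, so the caller
    can skip the shading pass for that cluster (hypothetical helper).
    """
    hits = [p for p, c in zip(cluster, covered) if c]
    if not hits:
        return None
    if mode == "first_sample":
        # Position of the first sample in the cluster.
        return cluster[0]
    # Geometric centroid of the covered samples.
    cx = sum(x for x, _ in hits) / len(hits)
    cy = sum(y for _, y in hits) / len(hits)
    if mode == "centroid":
        return (cx, cy)
    if mode == "approx_centroid":
        # Assumption: the covered sample nearest that centroid.
        return min(hits, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    raise ValueError(f"unknown mode: {mode}")


def shade_pixel(clusters: Sequence[Sequence[Point]],
                coverage: Sequence[Sequence[bool]],
                shade: Callable[[Point], Color],
                mode: str = "approx_centroid",
                ) -> Tuple[List[List[Optional[Color]]], int]:
    """Shade one pixel cluster by cluster.

    Returns the per-sample colors (None for uncovered samples) and the
    number of shading passes actually spent on the pixel.
    """
    colors: List[List[Optional[Color]]] = []
    passes = 0
    for cluster, covered in zip(clusters, coverage):
        pos = pick_shading_position(cluster, covered, mode)
        if pos is None:
            # No covered sample in this cluster: no pass is spent on it.
            colors.append([None] * len(cluster))
            continue
        passes += 1
        value = shade(pos)  # one shader evaluation per cluster
        # Replicate the single shaded value to every covered sample.
        colors.append([value if c else None for c in covered])
    return colors, passes
```

With four samples per pixel grouped into two clusters, a fragment that covers samples in only one cluster costs one shading pass in this sketch, while a fragment that touches both clusters costs two. That is the intermediate shading rate between once per pixel and once per sample that the description motivates.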