Video: GTX 1080 Pascal Async Compute Explained

Ali Güngör · May 17, 2016

Nvidia explains asynchronous compute on the new GTX 1080 and Pascal GPUs. Asynchronous shaders and other details are explained along with their causes and effects.

This video was recorded at Nvidia's global launch presentation of the GeForce GTX 1080 and GTX 1070 in Austin, Texas, USA. Now that the agreed NDA date has passed, we are making the recording publicly available to all technology enthusiasts. This series covers a great many details about the new Nvidia Pascal architecture, the new 16-nanometer production process, new drivers, software features and VR (Virtual Reality).

Nvidia GeForce GTX 1080 Review: NVIDIA GeForce GTX 1080 İncelemesi - Technopat (Turkish language)

Modern gaming workloads are increasingly complex, with multiple independent, or "asynchronous," workloads that ultimately work together to contribute to the final rendered image. Some examples of asynchronous compute workloads include:

GPU-based physics and audio processing
Postprocessing of rendered frames
Asynchronous time warp, a technique used in VR to regenerate a final frame based on head position just before display scanout, interrupting the rendering of the next frame to do so

These asynchronous workloads create two new scenarios for the GPU architecture to consider.
The first scenario involves overlapping workloads. Certain types of workloads do not fill the GPU completely by themselves. In these cases there is a performance opportunity to run two workloads at the same time, sharing the GPU and running more efficiently—for example a PhysX workload running concurrently with graphics rendering.

For overlapping workloads, Pascal introduces support for "dynamic load balancing." In Maxwell generation GPUs, overlapping workloads were implemented with static partitioning of the GPU into a subset that runs graphics and a subset that runs compute. This is efficient provided that the balance of work between the two loads roughly matches the partitioning ratio. However, if the compute workload takes longer than the graphics workload, and both need to complete before new work can be done, the portion of the GPU configured to run graphics will go idle. This can reduce performance by more than the benefit that running the workloads overlapped would have provided. Hardware dynamic load balancing addresses this issue by allowing either workload to fill the rest of the machine if idle resources are available.

Pascal's Dynamic Load Balancing reduces GPU idle time when graphics work finishes early, allowing the GPU to quickly switch to compute.
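The difference between the two schemes can be sketched with a toy timing model. This is only an illustration under stated assumptions: the GPU is treated as a pool of identical units, work is perfectly divisible, and the function names and numbers are hypothetical rather than taken from Nvidia's material.

```python
def static_partition_time(graphics_work, compute_work, units, graphics_share):
    """Time until both workloads finish when each is confined to a fixed slice."""
    graphics_units = units * graphics_share
    compute_units = units - graphics_units
    graphics_time = graphics_work / graphics_units
    compute_time = compute_work / compute_units
    # New work can only start once the slower of the two slices is done.
    return max(graphics_time, compute_time)


def dynamic_balance_time(graphics_work, compute_work, units):
    """Idealized time when idle units may pick up whichever workload remains."""
    return (graphics_work + compute_work) / units


# Hypothetical numbers: 20 units split 50/50, compute three times heavier.
static = static_partition_time(100.0, 300.0, units=20, graphics_share=0.5)
dynamic = dynamic_balance_time(100.0, 300.0, units=20)
print(f"static partitioning: {static:.1f} time units")
print(f"dynamic balancing:   {dynamic:.1f} time units")
```

In this model the static split only matches the dynamic result when the partition ratio equals the work ratio (here a 25/75 split), which mirrors the condition described above.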

Time critical workloads are the second important asynchronous compute scenario. For example, an asynchronous timewarp operation must complete before scan out starts or a frame will be dropped. In this scenario, the GPU needs to support very fast and low latency preemption to move the less critical workload off of the GPU so that the more critical workload can run as soon as possible.

A single rendering command from a game engine can potentially contain hundreds of draw calls, with each draw call containing hundreds of triangles, and each triangle containing hundreds of pixels that have to be shaded and rendered. A traditional GPU implementation that implements preemption at a high level in the graphics pipeline would have to complete all of this work before switching tasks, resulting in a potentially very long delay.

To address this issue, Pascal is the first GPU architecture to implement Pixel Level Preemption. The graphics units of Pascal have been enhanced to keep track of their intermediate progress on rendering work, so that when preemption is requested, they can stop where they are, save off context information about where to start up again later, and preempt quickly. The illustration below shows a preemption request being executed.

Pascal supports pixel-level graphics preemption, allowing the GPU to switch workloads mid-triangle.

In the command pushbuffer, three draw calls have been executed, one is in process and two are waiting. The current draw call has six triangles: three have been processed, one is being rasterized and two are waiting. The triangle being rasterized is about halfway through. When a preemption request is received, the rasterizer, triangle shading and command pushbuffer processor will all stop and save off their current position. Pixels that have already been rasterized will finish pixel shading, and then the GPU is ready to take on the new high priority workload. The entire process of switching to a new workload can complete in less than 100 microseconds (µs) after the pixel shading work is finished.
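To make the example concrete, the sketch below counts how many pixels would still have to be rasterized before the GPU could switch, depending on where preemption is allowed to take effect. The draw and triangle counts follow the scenario described above; the uniform 400 pixels per triangle, the assumption that every draw call has six triangles, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PushbufferState:
    draws_waiting: int        # draw calls not yet started
    triangles_per_draw: int   # assumed identical for every draw call
    triangles_waiting: int    # triangles left in the current draw call
    pixels_per_triangle: int  # assumed identical for every triangle
    triangle_progress: float  # fraction of the current triangle rasterized


def pixels_before_switch(state: PushbufferState, granularity: str) -> int:
    """Pixels still to be rasterized before the GPU can hand over."""
    rest_of_triangle = round(
        state.pixels_per_triangle * (1.0 - state.triangle_progress)
    )
    rest_of_draw = rest_of_triangle + (
        state.triangles_waiting * state.pixels_per_triangle
    )
    rest_of_pushbuffer = rest_of_draw + (
        state.draws_waiting * state.triangles_per_draw * state.pixels_per_triangle
    )
    remaining = {
        "pushbuffer": rest_of_pushbuffer,  # finish every queued draw call
        "draw": rest_of_draw,              # finish the current draw call
        "triangle": rest_of_triangle,      # finish the current triangle
        "pixel": 0,                        # stop rasterizing immediately
    }
    return remaining[granularity]


state = PushbufferState(
    draws_waiting=2,
    triangles_per_draw=6,
    triangles_waiting=2,
    pixels_per_triangle=400,
    triangle_progress=0.5,
)
for granularity in ("pushbuffer", "draw", "triangle", "pixel"):
    print(f"{granularity:>10}: {pixels_before_switch(state, granularity)} pixels")
```

At pixel granularity the model reports zero new pixels to rasterize; pixels already in flight still finish shading, which the sketch does not count.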

Pascal also has enhanced preemption support for compute workloads. The illustration below shows the execution of a compute workload.

Pascal supports compute preemption at the thread level for DX12 graphics.

Thread Level Preemption for compute operates similarly to Pixel Level Preemption for graphics.
Compute workloads are composed of multiple grids of thread blocks, each grid containing many threads. When a preemption request is received, the threads that are currently running on the SMs are completed. Other units save their current position to be ready to pick up where they left off later, and then the GPU is ready to switch tasks. The entire process of switching tasks can complete in less than 100 µs after the currently running threads finish.
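A rough way to picture this is a grid whose thread blocks are issued in waves. The sketch below is a simplification under assumptions of our own: blocks are issued in index order, a fixed number of blocks is resident at once, and the function names are hypothetical. It is not a description of the actual hardware scheduler.

```python
def preempt_grid(num_blocks: int, resident_slots: int, request_wave: int):
    """Return (resident blocks that drain before the switch, saved resume index).

    request_wave is the zero-based wave of blocks that is resident when the
    preemption request arrives.
    """
    first_resident = request_wave * resident_slots
    if not 0 <= first_resident < num_blocks:
        raise ValueError("request must arrive while the grid is still running")
    # Blocks already on the SMs run to completion; nothing new is launched.
    resume_index = min(first_resident + resident_slots, num_blocks)
    return resume_index - first_resident, resume_index


def resume_grid(num_blocks: int, resume_index: int):
    """Blocks that still have to run once the grid is scheduled again."""
    return list(range(resume_index, num_blocks))


# Hypothetical grid: 10 blocks, 4 resident at a time, request during wave 1.
drained, resume_index = preempt_grid(num_blocks=10, resident_slots=4, request_wave=1)
print(f"blocks drained before switch: {drained}")
print(f"blocks left for later: {resume_grid(10, resume_index)}")
```

The only state that has to be saved in this model is the resume index, which is why the switch itself can be quick once the resident threads have finished.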

For gaming workloads, the combination of pixel level graphics preemption and thread level compute preemption gives Pascal the ability to switch workloads extremely quickly with minimal preemption overhead.

For CUDA compute tasks, Pascal is also capable of preempting at the finest granularity possible: instruction level.

Pascal GPUs Support Instruction-Level Compute Preemption when running CUDA Apps.

In this mode of operation, when a preemption request is received, all thread processing stops at the current instruction and state is switched out immediately. This mode of operation involves substantially more state information, because all the registers of every running thread must be saved, but this is the most robust approach for general GPU compute workloads that may have substantial per-thread runtimes.
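A back-of-the-envelope estimate shows why the saved state is substantial. The figures used here (20 SMs and 65,536 32-bit registers per SM) are assumptions taken from the publicly listed GTX 1080 configuration, not from this article, and the calculation covers registers only, ignoring shared memory and other per-thread state.

```python
def register_state_bytes(sm_count: int, registers_per_sm: int,
                         bytes_per_register: int = 4) -> int:
    """Upper bound on register state if every register on every SM is live."""
    return sm_count * registers_per_sm * bytes_per_register


# Assumed GTX 1080 configuration: 20 SMs, 65,536 32-bit registers per SM.
full_state = register_state_bytes(sm_count=20, registers_per_sm=65536)
print(f"register state to save: up to {full_state / 2**20:.1f} MiB")
```

By comparison, the pixel level and thread level schemes only need to record positions in the work stream, since running work is allowed to drain first.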

One example application of preemption in gaming is asynchronous timewarp. The left side of the illustration below shows an asynchronous timewarp operation with traditional GPU preemption. The ATW process runs as late as possible before the display refresh interval. However, the ATW work has to be given to the GPU several milliseconds in advance, because without fine-grained preemption, there is variability in the time it will take to preempt and start execution of the ATW process. In the right image, with fine-grained preemption (pixel level graphics plus thread level compute preemption), the preemption time is much faster and more deterministic, so the ATW work can be submitted much later, while still being assured of completion before the display refresh deadline.

Pascal preemption support prevents idling in the async timewarp scenario above.
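The scheduling argument can be written as simple deadline arithmetic. The numbers below are illustrative assumptions only: a 90 Hz display, a 1 ms timewarp pass, a 3 ms worst case for coarse preemption (standing in for "several milliseconds"), and 0.1 ms for fine-grained preemption (based on the sub-100 µs figure quoted earlier, ignoring the time in-flight work needs to drain).

```python
def latest_atw_submit_ms(refresh_interval_ms: float, atw_duration_ms: float,
                         worst_case_preempt_ms: float) -> float:
    """Latest submit time, measured from the start of the refresh interval."""
    return refresh_interval_ms - atw_duration_ms - worst_case_preempt_ms


refresh_ms = 1000.0 / 90.0  # assumed 90 Hz headset, about 11.1 ms per frame
atw_ms = 1.0                # hypothetical duration of the timewarp pass

coarse = latest_atw_submit_ms(refresh_ms, atw_ms, worst_case_preempt_ms=3.0)
fine = latest_atw_submit_ms(refresh_ms, atw_ms, worst_case_preempt_ms=0.1)
print(f"coarse preemption: submit by {coarse:.2f} ms into the frame")
print(f"fine preemption:   submit by {fine:.2f} ms into the frame")
print(f"extra rendering time recovered: {fine - coarse:.2f} ms")
```

The later the timewarp can be submitted, the fresher the head position it uses and the more of the frame interval remains available for rendering.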


