Tuesday, March 4, 2014

The GPU Primer, Part III



The GPU Primer

by

Michael August, Computer Scientist
first published May 15th, 2006


In order to understand the architecture of a modern GPU, it helps to take a look at an actual one, so let’s examine the NVIDIA GeForce 6 series architecture, which is used in the NVIDIA GeForce 6800 series of graphics cards. The overall architecture of a computer system looks like this:

In the image above, the data transfer rate across each bus is annotated next to the bus. The CPU depicted in this image has an 800 MHz Front Side Bus, and the GPU depicted is a GeForce 6 series GPU. The bus connecting the North Bridge to the GPU is a PCI Express bus (using an x32 slot). As one can see, the GPU can transfer data along its bus to its own memory much faster than any of the other buses can transfer data. This means that the bottleneck is generally not onboard the GPU, but rather on the motherboard of the host computer. It also means that programs that run on the GPU and make use of the bandwidth of the GPU’s memory bus will run very efficiently. The NVIDIA GeForce 6 GPU is a 256-bit processor with a core clock rate of around 425 MHz. It sits on a graphics card with a 256-bit memory interface and a memory clock rate of around 550 MHz (using GDDR3 memory). The GPU is manufactured with a 130 nm process technology and contains 222 million transistors.
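As a rough sanity check on these figures, the peak bandwidth of each bus can be worked out from its width and clock rate. The short Python sketch below does the arithmetic; the doubling of the memory clock (GDDR3 transfers on both clock edges) and the 250 MB/s-per-lane rate of first-generation PCI Express are assumptions supplied here, not values from the diagram:

```python
# Back-of-the-envelope peak bandwidths for the buses described above.
# Assumptions (not taken from the diagram): GDDR3 transfers data on both
# clock edges (2 transfers per clock), and each first-generation PCI
# Express lane carries 250 MB/s in each direction.

GB = 1e9  # bytes

# GPU <-> local memory: 256-bit (32-byte) interface at ~550 MHz, double-pumped.
mem_bw = 550e6 * 2 * (256 // 8)
print(f"GPU memory bus:  {mem_bw / GB:.1f} GB/s")   # ~35.2 GB/s

# North Bridge <-> GPU: PCI Express with 32 lanes, per direction.
pcie_bw = 32 * 250e6
print(f"PCI Express x32: {pcie_bw / GB:.1f} GB/s")  # ~8.0 GB/s

# CPU <-> North Bridge: 64-bit Front Side Bus at 800 MT/s.
fsb_bw = 800e6 * (64 // 8)
print(f"Front Side Bus:  {fsb_bw / GB:.1f} GB/s")   # ~6.4 GB/s
```

Note that the memory-bus figure matches the 35.2 GB/s peak memory bandwidth quoted later in this article. Below is a block diagram of the NVIDIA GeForce 6 series architecture: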


The “host” block at the top of the diagram denotes the host computer. The host computer sends commands, vertex data, and textures to the GPU. The GeForce 6 GPU’s implementation of the geometry stage of the graphics pipeline contains up to six programmable vertex processors (i.e. vertex shaders). Each of these vertex processors operates on one vertex at a time, so all six can process vertices in parallel.

Once the vertices have been processed by the programmable vertex processors, they are sent to the remainder of the geometry stage. The vertices are assembled into primitives (points, lines, or triangles) in the process of primitive assembly. Then, the primitives that won’t be visible in the final image are culled, and the pieces of primitives that fall outside the edges of the viewing frustum are clipped. These steps are performed in the “cull/clip/setup” block, and the output is sent to the rasterizer. The rasterizer fills in each primitive’s shape with candidate pixels called fragments. The rasterizer also checks each fragment’s depth to see whether it will be hidden by any other pixel in the scene.

The rasterized version of the primitive is then sent to the fragment processors (i.e. fragment shaders). Note that both the vertex processors and the fragment processors have access to the texture cache. This means that both vertices and individual fragments can be blended with texels (texels are the individual pixels comprising a texture). The GeForce 6 GPU’s implementation of the fragment stage of the graphics pipeline can have up to 16 individual programmable fragment processors, each of which operates on four fragments at a time. This highly parallel design means that many fragments can be processed simultaneously, and since the whole pipeline is broken down into many stages, each part of the pipeline can be working on different, unrelated pieces of data at the same time.

After being shaded by the fragment processors, the fragments pass over the fragment crossbar into 16 raster operation units. The raster operation units perform another depth test to ensure that the fragments aren’t occluded by any other fragments in the final image, perform antialiasing, and then send the resultant fragments to the frame buffer. The frame buffer is split into four partitions, and there is one connection from the GPU to each of these memory partitions. Having four distinct connections to the graphics card’s onboard memory means that a single transfer never ties up the full memory bus; there are effectively four parallel buses over which data can be transferred from the GPU to the frame buffer. Once the pixels have been transferred into the frame buffer, they are in their final form and are ready to be displayed on the screen.
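To make the order of these stages concrete, here is a minimal, purely illustrative Python sketch of the data flow just described. Every function in it is a made-up stand-in (2D positions, bounding-box rasterization, and no culling, clipping, or depth testing), not part of any real GPU interface:

```python
# A toy, sequential model of the logical data flow described above:
# vertex shading -> primitive assembly -> rasterization -> fragment
# shading -> raster operations -> frame buffer. All names and the
# simplified 2D math are illustrative inventions; the real hardware
# runs these stages in parallel and includes cull/clip and depth tests.

def vertex_shader(v):
    # Stand-in for a programmable vertex processor: here it simply
    # translates each (x, y) position.
    x, y = v
    return (x + 1, y + 1)

def assemble_triangles(verts):
    # Primitive assembly: group the vertex stream into triangles.
    return [verts[i:i + 3] for i in range(0, len(verts) - 2, 3)]

def rasterize(tri):
    # Rasterization: emit candidate pixels ("fragments") covering the
    # primitive. For brevity, this just covers its bounding box.
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    for x in range(min(xs), max(xs) + 1):
        for y in range(min(ys), max(ys) + 1):
            yield (x, y)

def fragment_shader(frag):
    # Stand-in for a programmable fragment processor: compute a color
    # (here, a simple gradient) for each fragment.
    x, y = frag
    return x, y, (x * 40 % 256, y * 40 % 256, 128)

def render(vertices, framebuffer):
    shaded = [vertex_shader(v) for v in vertices]  # vertex stage
    for tri in assemble_triangles(shaded):         # primitive assembly
        for frag in rasterize(tri):                # rasterizer
            x, y, color = fragment_shader(frag)    # fragment stage
            framebuffer[(x, y)] = color            # raster operations

fb = {}
render([(0, 0), (4, 0), (0, 4)], fb)
print(len(fb), "pixels written to the frame buffer")
```

In the real GPU, of course, each of these functions corresponds to a pool of parallel hardware units, with many primitives in flight at once. Below is the GeForce 6 series architecture with its modular implementation of the different stages of the graphics pipeline pointed out: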


Most of the data passing through the GeForce 6 series GPU is 32-bit floating point data, though it can also be 16-bit floating point data. A GPU is truly designed to handle floating point data, as most geometric data is in floating point format. Some of the important performance metrics of a GPU are its peak pixel fill rate, its peak texel fill rate, its peak memory bandwidth, and its triangle transform rate. The GeForce 6 series GPU has a peak pixel fill rate of 6400 MegaPixels per second, a peak texel fill rate of 6400 MegaTexels per second, a peak memory bandwidth of 35.2 GB/s, and a triangle transform rate of 600 MegaTriangles per second. Most general-purpose CPUs can’t compete with GPUs in floating point operations per second: Intel’s Pentium 4 Prescott chip, for example, peaks at only about 12 GFLOPS, whereas the GeForce 6 can peak at over 100 GFLOPS.
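As a quick sanity check, the quoted fill rates follow directly from the number of parallel units in the block diagram, assuming each unit retires one pixel (or fetches one texel) per clock. Note that the 6400 MPixels/s figure implies an effective 400 MHz clock for these units, slightly below the ~425 MHz core clock quoted earlier:

```python
# Deriving the quoted peak fill rates from unit counts, assuming each of
# the 16 raster operation units retires one pixel per clock and each of
# the 16 fragment pipelines performs one texture fetch per clock. The
# quoted figures imply an effective 400 MHz clock for these units.

units = 16
clock_hz = 400e6

pixel_fill = units * clock_hz
texel_fill = units * clock_hz
print(f"peak pixel fill rate: {pixel_fill / 1e6:.0f} MPixels/s")  # 6400
print(f"peak texel fill rate: {texel_fill / 1e6:.0f} MTexels/s")  # 6400
```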

The whole process that has been discussed in this paper is rendering: producing the pixels of a scene from a higher-level description of that scene’s components. There have been four major breakthroughs in the technology behind the rendering of real-time 3D computer graphics. The first of these breakthroughs was the concept of modeling a 3D object by connecting together a mesh of lines. This mesh of lines, grouped into triangles, is called a wireframe model. By decomposing a 3D model into smaller, more manageable geometric shapes, the graphics card could build up a model out of its constituent shapes. The second major breakthrough was to apply shading and lighting to the triangles that make up the wireframe model. By doing this, shaded solids could be viewed on the screen and animated in real-time. The third breakthrough was to apply textures on top of the shaded triangles that make up the model, causing the model to appear more realistic. The fourth breakthrough was to allow the appearance of the model’s surfaces to be programmed. In this way, the graphics programmer can make the model’s surfaces appear even more natural by animating parts of them, by adding randomness to them, and by adding a richer blend of colors and textures to them. We are currently in this fourth generation of graphics technology with the programmable vertex shaders and pixel shaders of modern GPUs; only recently did the transition take place from a fixed-function pipeline to a programmable pipeline. The purpose behind these breakthroughs in graphics hardware has been to get closer to the goal of rendering photorealistic 3D computer graphics in real-time so that, eventually, you won’t be able to tell the difference between a real human being and a human being on the screen. The GPU is quickly approaching that dream.


Bibliography


Kilgariff, Emmett; Fernando, Randima. “The GeForce 6 Series GPU Architecture.” GPU Gems 2. 2005.

SIGGRAPH 2005 Course 37 Notes. 2005.

Moya, Victor; Gonzalez, Carlos; Roca, Jordi; Fernandez, Agustin; Espasa, Roger. Shader Performance Analysis on a Modern GPU Architecture. IEEE Computer Society. 2005.

Lefohn, Aaron. GPGPU IEEE 2005 Visualization Tutorial. 2005. 

Datar, Ajit; Padhye, Apurva. Graphics Processing Unit Architecture. 2005. 

Durand, Fredo; Cutler, Barb. Modern Graphics Hardware. 2001.

Part I, Part II, Part III
