Tuesday, March 4, 2014

The GPU Primer, Part I


The GPU Primer

by

Michael August, Computer Scientist
first published May 15th, 2006



As with almost all aspects of computer technology, computer video display technology has experienced almost incomprehensible growth and evolution over the past two decades. Behind this evolution of video technology has been the demand for more photorealistic computer graphics. This demand for photorealism has come primarily from the video game community. At the heart of this photorealism is the hardware necessary to make computer-generated imagery appear in real time. All of today’s 3D-accelerated video cards contain a graphics processing unit (GPU) at their core. A GPU is just a special kind of central processing unit (CPU) that is dedicated to performing video-related computations. The quality of the GPU determines the quality and the kinds of graphics that can be viewed on a display device. Before getting into the details of the inner workings of a modern GPU, it is best to look back at the history of CPUs and how the technology got to where it is today.

In the early days of digital computing (in the 1930s and 1940s), the switching elements in CPUs consisted of electromechanical relays. These CPUs took up the space of whole rooms and were fairly reliable but very slow devices which performed computations that could be output onto punched paper tape or punch cards. The next step up in computational power came (in the 1940s and 1950s) when vacuum tubes were used in place of relays as the switching elements comprising the CPU. Vacuum tubes were not very reliable and had to be replaced regularly. They also took time to warm up when first turned on. However, they were faster than relays because they had no moving parts. Their output was usually sent to punch cards. In the late 1950s and early 1960s, computers began to be built out of transistors. Transistors became, and remain, the fundamental switching elements out of which CPUs are formed. At first, transistors were soldered onto the circuit boards by hand. However, with the techniques of Small Scale Integration, integrated circuits consisting of multiple transistors became common. Chips containing one or more logic gates were wired together to form the CPU. Large Scale Integration and Very Large Scale Integration have since allowed all of these chips to be integrated into a single chip containing tens of millions to hundreds of millions of transistors. The modern Pentium 4 Prescott CPU, for example, has 125 million transistors. The main visual output of most transistor-based computers since the 1970s has been the Cathode Ray Tube (CRT). More recently, newer display technologies, such as the Liquid Crystal Display and the Plasma Display, have become more popular than the CRT. All of these devices display graphics in two spatial dimensions (changes in the displayed graphics throughout the time dimension can create animation). There are also a number of other 2D display devices that are less common, and some experimental 3D display devices exist as well. Alongside the development of CPU technology has been the evolution of video display technology and the video cards responsible for making that new display technology function.

At first, video cards were simply frame buffers with auxiliary electronics built in. A frame buffer is a collection of memory cells that store the data (i.e. color and intensity) about each pixel that is displayed on the screen in a particular frame (a frame is one sample, or time-slice, of everything that is being displayed on the screen. Televisions in America, for example, display 30 frames per second). The CPU on the computer would perform the computations necessary to determine the color and intensity information for each pixel and then send this information along a bus to the frame buffer on the video card. Eventually, graphics coprocessors were included on the motherboard alongside the CPU to perform graphics-specific computations. The use of a coprocessor in graphics-intensive tasks offloaded work from the CPU onto the coprocessor, thereby allowing the CPU to use its cycles on other computations and improving system performance. An example of a graphics coprocessor was the blitter. The blitter performed the Bit Block Transfer operation (also known as BitBLT). Introduced in 1974, the Bit Block Transfer was a special instruction that a graphics programmer could use to incorporate multiple 2D images (called sprites) into a frame. An alternative technique for displaying sprites on a frame was hardware-based sprite acceleration. Hardware-based sprite acceleration was eventually included directly on video cards. The difference between direct hardware support for sprite manipulation and the blitter is as follows. With hardware support for sprites, a sprite would be placed into a special sprite memory. The sprite would not be placed into the frame buffer. Instead, the graphics card (note: the terms “graphics card”, “video card”, and “graphics board” are all interchangeable) would read the frame buffer contents and then read the sprite memory to overlay the sprites that are in the sprite memory on top of the contents of the frame buffer. Effectively, this meant that the frame buffer provided the background image and the sprites were overlaid on top of that background image. With the blitter and the BitBLT operation, a sprite was written directly into the frame buffer by using a clever technique which applied a combination of bitmasks to the frame buffer contents. This meant that individual pixels in the frame buffer were actually overwritten by the sprites. In other words, pieces of the background image were actually replaced by the sprites. Eventually, the blitter and the hardware support for sprites were moved onto the graphics card itself. This combination of frame buffer, blitter, and hardware-enabled sprite acceleration, all in one device, was a forerunner of the modern graphics card. It could be said that the blitter itself was a primitive form of the modern GPU. The offloading of the graphics workload from the main system board onto specialized graphics modules that reside on the video card has been a common theme throughout the history of video technology. These early video cards were designed for displaying 2D images only. In fact, all modern computer monitors display only 2D images, but the use of 3D art techniques tricks the human eye into believing that the images being displayed are 3D images. The implementation of these 3D techniques was first done in software by the CPU. However, these software implementations bogged down the CPU extensively.
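Returning briefly to the blitter: a minimal sketch of a BitBLT-style masked write might look like the following (the function and buffer names are hypothetical, and a simple packed 1-bit-per-pixel format is assumed). For each word, the mask selects which pixels are replaced by sprite data and which keep the existing background.

    #include <cstddef>
    #include <cstdint>

    // Hypothetical 1-bit-per-pixel buffers packed into 32-bit words.
    // The mask selects which pixels come from the sprite and which keep
    // the background already in the frame buffer.
    void bitblt_masked(std::uint32_t* frame, const std::uint32_t* sprite,
                       const std::uint32_t* mask, std::size_t words)
    {
        for (std::size_t i = 0; i < words; ++i) {
            // Clear the masked pixels of the background, then OR in the sprite:
            // the sprite pixels permanently overwrite part of the frame buffer.
            frame[i] = (frame[i] & ~mask[i]) | (sprite[i] & mask[i]);
        }
    }

Unlike the sprite-overlay approach, there is no separate sprite memory here: the background pixels are destroyed wherever the mask is set.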
It wasn’t until specialized hardware for manipulating 3D graphics emerged that realistic-looking 3D games and simulations could be created. Throughout the 1980s and 1990s, Silicon Graphics, Inc. (SGI) played a leading role in the creation of hardware and software solutions for the creation of 3D graphics. These solutions were very expensive, and thus were designed for large organizations that had a need for high-end 3D graphics technology. The consumer market for 3D graphics began in the mid-1990s with the introduction of 3D graphics accelerators that ran alongside the main 2D graphics board. With this generation of 3D graphics cards, the 3D card had a connection to the 2D card, which would then send the video output to the display. With the introduction of the Voodoo Rush graphics accelerator card by 3dfx in 1996, both the 2D and 3D functionality were integrated into one video card. In 1997, Intel introduced the Pentium MMX CPU, which had an enhanced instruction set for performing multimedia computations (referred to as MultiMedia eXtensions by some, and Matrix Math eXtensions by others). The following year, AMD introduced a similar extension to its K6-2 instruction set (referred to as 3DNOW!). In 1999, Intel responded to AMD’s 3DNOW! technology with Streaming SIMD Extensions (SSE) on its Pentium III. The idea behind these CPU instruction set extensions was to allow the CPU to perform more graphics-intensive computations in fewer CPU cycles. Graphics-related computations tend to be based on floating point data, so these extensions improved the performance of Intel’s processors and AMD’s processors when executing instructions on floating point data. These extensions also parallelized operations on data, which improved the performance of these CPUs. In the fall of 1999, NVIDIA Corporation introduced its GeForce 256 3D graphics accelerator card. The GeForce 256 was the first video card to incorporate a processor that had a Transform and Lighting Engine. This characteristic made the GeForce 256 the first graphics card to have a full-fledged GPU. Ever since the introduction of the GPU in 1999, more features have been added to graphics cards and their GPUs in the quest for the display of real-time photorealistic 3D graphics. In order to understand these features, we must first take a look at the purpose of the GPU and video card in general.

All shapes that one sees on a computer screen start out inside the computer as abstract entities that don’t have any physically realizable representation. These abstract entities get transformed into other abstract entities until they are in a form that is suitable for the computer screen to display them. The whole idea of a graphics card is to be the interface, or middleman, between the computer and the video display. The graphics card transforms the abstract entities it receives into a form which is suitable for the computer screen to display. The input to the graphics card is called a primitive, and this primitive is the abstract entity that needs to be transformed into a physical representation. A primitive could be a circle or a sphere or any other geometric object. The GPU on a graphics card takes each primitive that the host computer sends to it and converts that primitive into the required electronic signal that the video display electronics understands. The series of transformations that occur to the primitive to make it into a form which can be displayed is called the graphics pipeline. This pipeline represents the work that is done to each primitive before it can be displayed on the computer’s monitor. It is possible to implement the graphics pipeline, and the work that it entails, entirely in software. However, this would be such a burden on the host computer’s CPU and memory that it would prevent the computer from performing any other computations. Also, most general-purpose CPUs are not designed for performing graphics-related computations and would not be able to handle the workload required by graphics-intensive applications. Furthermore, any piece of software that emulates the functioning of a hardware device is always significantly slower than that hardware device. The goal of the modern graphics pipeline is to convert 3D primitives into a series of pixels on a 2D display device. This process of converting 3D objects that exist in a vector space into a set of pixels that exist in a 2D space is called rasterization. A raster graphics video display only accepts a signal that has encoded in it the information about each pixel on the screen. In the past, there have also existed vector graphics displays. A vector graphics display will accept individual 2D primitives for display on the screen, and then draw each 2D object on the screen in the order that it receives them. A raster graphics display, on the other hand, scans every pixel on the screen during each frame and displays the pixel based on the information that is in the video signal coming from the video card. The difference between these two techniques is that the vector graphics display requires much less information than the raster graphics display. This is because the raster graphics display needs information about each pixel on the screen whereas the vector graphics display only needs to know the shape that it is supposed to draw and the endpoints and location of that shape. This difference in display technology causes differences in the video card’s design. In a video card designed for a vector graphics display, only a display list is required. The display list contains information about the 2D shapes that will be displayed on the screen. The display list is the output of the graphics card, which is then input into the vector graphics display. On the other hand, a raster graphics video card has to have a memory which stores information about every pixel that will be displayed on the screen.
This memory is called the frame buffer, and the frame buffer is the output of the video card which gets sent to the input of the raster graphics video display. Nearly all video displays currently in use are raster graphics displays, and therefore nearly all graphics cards currently in use are raster graphics video cards. Since most modern video displays support resolutions that are greater than 1000 pixels wide by 1000 pixels high, most graphics cards must have a frame buffer which is capable of storing information about more than 1,000,000 pixels (since 1000 x 1000 = 1,000,000) for each frame of a computer screen. This means that the frame buffer’s size and data transfer rate are both important characteristics in the design of graphics cards. Now that we’ve seen that the purpose of the GPU is to transform 3D graphics primitives handed off to it by the computer’s CPU into a series of pixels to be displayed by the computer’s video display, we can look inside the GPU. But before we can understand the elements that comprise a GPU, and since GPUs are just a special kind of CPU, we must first understand how CPUs work in general.
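As a closing aside on the frame buffer itself, here is a back-of-the-envelope sketch of why its size and data transfer rate matter (the resolution, color depth, and refresh rate are example values, not figures for any particular card):

    #include <cstdio>

    int main()
    {
        // Example values: 1280 x 1024 pixels, 32 bits (4 bytes) per pixel,
        // refreshed 60 times per second.
        const double pixels          = 1280.0 * 1024.0;         // ~1.3 million pixels
        const double bytes_per_frame = pixels * 4.0;            // ~5 MB per frame
        const double bytes_per_sec   = bytes_per_frame * 60.0;  // ~300 MB/s for scan-out alone

        std::printf("pixels per frame : %.0f\n", pixels);
        std::printf("frame buffer size: %.1f MB\n", bytes_per_frame / (1024.0 * 1024.0));
        std::printf("scan-out traffic : %.1f MB/s\n", bytes_per_sec / (1024.0 * 1024.0));
        return 0;
    }

Even before any 3D work is done, simply storing and scanning out each frame at such a resolution takes several megabytes of memory and hundreds of megabytes per second of bandwidth, which is why both numbers figure so prominently in a graphics card’s design.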

The GPU Primer, Part II







Ever since Charles Babbage’s original design of the analytical engine in 1837, computer architecture has made use of a mill and a store. The modern day equivalent of the mill is the CPU, and the modern day equivalent of the store is RAM (Random Access Memory). The CPU can be thought of as a fast calculator that can access memory and do arithmetic on numbers that are stored in that memory. The CPU’s purpose is to fetch instructions from memory (typically through an instruction cache), decode each instruction, and then execute it. In order to execute an instruction, the CPU must access any data in memory that is necessary for the instruction execution to proceed. Then, after fetching the data from memory and executing the instruction on that data, the resultant data from that instruction’s execution must be written back to memory. These various stages of a CPU’s execution together make up what is called the instruction pipeline. The particular instruction pipeline of any CPU is determined by the design of the CPU and its instruction set architecture (i.e. the set of instructions that are physically wired into the CPU’s hardware). All modern CPUs utilize superscalar designs, whereas older CPUs used scalar designs. In this context, the term “scalar” means that an average of one instruction can be executed per CPU clock cycle and thus one piece of data can be output per clock cycle. The term “superscalar” means that, on average, more than one instruction can be executed per clock cycle, and therefore multiple pieces of data can be returned from each stage of the pipeline in one clock cycle. A CPU is made superscalar by providing multiple functional units and issuing more than one instruction to them in each clock cycle; breaking the pipeline into many distinct stages (deep pipelining) further increases throughput. In a scalar processor, one instruction can operate on only one piece of data at a time. In a vector processor, one instruction can operate on multiple pieces of data at a time. Most modern general-purpose CPUs are superscalar processors, not vector processors. The vector processor was a common type of processor in supercomputers in the 1980s and 1990s. Today, most general-purpose superscalar CPUs incorporate elements of a vector processor design by providing support for SIMD (Single Instruction, Multiple Data) instructions. The upcoming Cell processor from Sony, Toshiba, and IBM contains eight vector microprocessors (called Synergistic Processing Elements) that are all under the control of a superscalar CPU. Another example of a vector processor is the Digital Signal Processor. The GPU is yet another example of a vector processor, and much of the GPU’s high performance can be attributed to its vector design. Now, let’s look at how a GPU fits into the overall graphics subsystem of a computer.
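Before doing that, here is a small sketch that makes the SIMD idea concrete, using the SSE intrinsics introduced with the Pentium III (the data values are arbitrary): a single instruction operates on four floating point values at once.

    #include <xmmintrin.h>   // SSE intrinsics (Pentium III and later, or any x86-64 CPU)
    #include <cstdio>

    int main()
    {
        alignas(16) float a[4] = { 1.0f,  2.0f,  3.0f,  4.0f};
        alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        alignas(16) float c[4];

        __m128 va = _mm_load_ps(a);       // load four floats into one 128-bit register
        __m128 vb = _mm_load_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   // one SIMD instruction adds all four pairs
        _mm_store_ps(c, vc);

        std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }

A scalar processor would need four separate add instructions to do the same work; a vector processor, like a GPU, applies one operation across many data elements in exactly this way.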

A GPU is a part of the chipset located on a modern video card. It is the part of the chipset which is responsible for 3D graphics acceleration. The output of this graphics chipset is sent to the frame buffer, the on-card memory which contains information about each pixel on the screen (though the frame buffer also acts as a temporary memory location for other data besides the individual pixel information). Every time the screen is refreshed, the frame buffer is read (i.e. sampled) and the information contained within it is displayed on the screen. If the display device happens to require an analog input, then a RAMDAC (Random Access Memory Digital-to-Analog Converter) sits in between the frame buffer and the output connection to the display device. The RAMDAC takes the digitally encoded information about each pixel that is stored in the frame buffer and converts that digital information into an analog signal which can be understood by the video display’s internal electronics. If the display device requires a digital input, then a hardware transcoder on the video card converts the pixel information in the frame buffer into the particular digital format required by the display device. This, in a nutshell, is how video cards work. The primary bottlenecks in this design are the connection from the host computer to the graphics card and the connection from the GPU to the graphics card’s frame buffer. The connection from the computer to the graphics card has traditionally been through the PCI or AGP bus, but now all modern video cards connect to the host computer via the PCI-Express bus. The PCI-Express bus supports a theoretical data transfer rate of 3.7 GB/s (in a PCI-Express x16 slot) in each direction. The theoretical data transfer rate between the GPU and the frame buffer varies widely from one graphics card to another. The transfer rate falls somewhere in the range of 4 GB/s to 50 GB/s for modern high-end graphics cards (this rate also depends on what kind of data is being transferred to and from the frame buffer). The GPU is the key player in the functioning of the video card, so let’s take a look inside.
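As an aside, a loose software analogy for the RAMDAC’s job described above (real RAMDACs do this in dedicated hardware, often through a palette lookup; the pixel layout and voltage range here are illustrative assumptions) is to split each frame buffer entry into its color channels and scale them to the analog levels the monitor expects:

    #include <cstdint>

    // Output "voltages" for one pixel, in the 0.0 .. 0.7 V range used by
    // VGA-style analog video signals.
    struct AnalogRGB { double r, g, b; };

    // Assume a packed 32-bit frame buffer entry: 8 bits each of red, green,
    // and blue (the top 8 bits are ignored here).
    AnalogRGB convert_pixel(std::uint32_t pixel)
    {
        const double r = (pixel >> 16) & 0xFF;   // extract the digital color channels
        const double g = (pixel >> 8)  & 0xFF;
        const double b =  pixel        & 0xFF;
        const double full_scale = 0.7;           // volts at maximum intensity
        return { r / 255.0 * full_scale,
                 g / 255.0 * full_scale,
                 b / 255.0 * full_scale };
    }

This conversion has to be repeated for every pixel of every refresh, so the RAMDAC’s conversion rate limits the resolutions and refresh rates a card can drive.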

At its core, a GPU is just an implementation of the graphics pipeline. The modern graphics pipeline is composed of multiple stages: application, command, geometry, rasterization, texture, fragment, and display. Different people have different views of what the stages in the modern graphics pipeline are, and some group the various functions of the pipeline into different stages. My own rendition of the graphics pipeline as presented in this paper is derived from two sources who are renowned computer graphics experts. The graphics pipeline can also be viewed as a stack, not unlike the TCP/IP protocol stack in computer networking. In the application stage of the pipeline, an application running on the host computer needs to display a geometric object that is stored in the computer’s main memory. The application has information about the vertices, or endpoints, of the geometric object. This object, its vertices, and its location on the screen all represent a geometric primitive that needs to be sent to the graphics card so that it can be processed by the GPU. To do this, the application must send a command to the graphics card via the host computer’s operating system, which it achieves by making a call to the graphics API (Application Programming Interface). The graphics API can be either OpenGL or DirectX (or possibly another proprietary graphics API, though this is rare). The application makes a function call (or a set of function calls) with all of the information about the geometric primitive as a parameter to the function. This process of calling a function built into the graphics API is a part of the command stage of the graphics pipeline. The function call can be viewed as a command to the graphics card to do something with the geometric primitive which was passed to the function as a parameter. The graphics API is implemented as a part of the video card driver. The geometric primitive can be a point, a line, or a polygon, and it is represented at this stage by its vertices. Once the command has been decoded by the GPU, the data sent along with the command is operated on by the geometry stage of the graphics pipeline. This geometric data corresponds to a polygon that can be manipulated. The geometry stage is responsible for taking the vertices of the polygon passed to it by the command stage, and performing geometric transformations on the polygon such as translation, rotation, and scaling. The reason that geometric transformations are needed is as follows. The original shape that needs to be displayed on the screen lies in object space, a 3D space which is centered on that object. The object must be placed into a world that has its own coordinate system. Transforming the object from its own space into the world space is called a modeling transformation. A person’s view of the 3D world in which the object is embedded determines where the object must be placed relative to the screen’s coordinate system. The screen’s coordinate system is the world space as seen by the virtual camera (i.e. the person looking into the screen at the world). Transforming an object’s position and orientation from world space into this camera space is called a viewing transformation. Since the objects that must be displayed on the 2D screen must appear like they are in a 3D world, a further projection transformation of the object is required.
The projection transformation projects the 3D object onto the flat plane (called image space) which corresponds to the screen that the person is looking at. So, first a modeling transformation is applied to the shape, then a viewing transformation is applied to the shape, and finally a projection transformation is applied to the shape. The geometry stage is also responsible for lighting the resulting polygon after it has been transformed. The Transform and Lighting Engine on a graphics card lies in the geometry stage of the graphics pipeline. Since only the shapes that are visible on the screen need to be manipulated by later stages of the pipeline, the geometry stage culls any parts of the shape that will not be visible to the viewer. This process is called hidden surface determination and it plays a large role in the efficiency of the graphics pipeline, as there is no need for the GPU to process shapes that will not be visible. The geometry stage is also responsible for taking the geometric primitive passed to it by the application stage and assembling the primitive into an actual geometric shape that can later be filled with pixels. If the original primitive was a polygon, then that polygon is broken up into many individual triangles (this process is referred to as tessellation, or triangulation). This part of the geometry stage is called triangle setup. GPUs also have another module included in the geometry stage. This module is called a vertex shader, or a vertex processor. The vertex shader takes the individual vertices of a geometric shape and transforms those vertices in various ways. It can animate the vertices, for example, or it can change the lighting on them. Modern vertex shaders are programmable. It is this programmability of the GPU that allows for some very impressive special effects in real time. The output of the geometry stage is a shape which has been fully transformed and lit. This output is then sent to the rasterization stage. The rasterization stage fills the shape with pixels. At this stage, the color information for each vertex is interpolated across the shape (i.e. a color gradient is formed from one vertex to the next) by the rasterizer. This is how the rasterizer knows what color to make each pixel. Technically, at this stage of the pipeline the shape is not filled with pixels. Rather, it is filled with candidate pixels, or potential pixels. These candidate pixels are called fragments. The reason they are considered candidate pixels is that they might not make it all the way to the end of the pipeline or they might be changed in some way before they reach the end of the pipeline. The output of the rasterization stage is a set of fragments which fill the shape that was originally fed into the graphics pipeline. These fragments are sent to the texture stage of the graphics pipeline. The texture stage applies a texture to each fragment sent to it. A texture is just an image file that is overlaid on top of a shape to make the shape’s surface look more realistic. The texture that is applied to each fragment is stored in a high-speed texture cache. Textures are also stored in a special area of the graphics card’s frame buffer for quick access by the GPU. After a texture is combined with each fragment that makes up the shape, the resulting fragments are sent to the fragment stage of the graphics pipeline. The fragment stage allows for mathematical operations to be applied to each fragment to enhance its appearance.
For example, each fragment can be blended with different colors. Shadows can be added to the fragments. Fragments can be made to appear transparent (called alpha blending). Many different effects can be added to each fragment. It is also at this stage of the pipeline where various tests are performed on each fragment. One such test is the z-compare test (i.e. depth test). This test determines whether or not the fragment will be visible. If it isn’t visible, then it is just thrown away. A fragment might not be visible if it is hidden behind other opaque objects that show up on the screen. In such a case, the fragment need not progress down the graphics pipeline any more, as it won’t be made visible on the display device. All of these functions of the fragment stage use buffers that are a part of the memory (i.e. the frame buffer) located on the graphics card. Some of these buffers are the color buffer (used for color blending), the stencil buffer (for creating shadows), and the z buffer (for performing the depth test). At the fragment stage, textures can be combined and overlaid onto each fragment. Fog can also be generated at this stage. Another important module that resides in the fragment stage of the graphics pipeline is the fragment shader, also known as the pixel shader or pixel processor. The fragment shader can uniquely transform the appearance of each individual fragment that makes up the shape. The appearance of each fragment is calculated independently of all of the other fragments. A vertex shader can’t shade pixels at the level of detail that a fragment shader can. This is because the vertex shader can only interpolate the appearance of pixels between adjacent vertices, but fragment shaders can shade each individual pixel independently of the others in its vicinity. Modern fragment shaders are also programmable. This means that the graphics programmer can alter a shape’s appearance and color at a per-pixel level. This fine-grained control over the level of detail allows for enhanced photorealism in real-time. The output of the fragment stage is the set of individual pixels that make up the shape. At this point, the pixels are in their final viewable form. They are sent to the frame buffer so that they can be displayed on the screen. The output of the fragment stage of the graphics pipeline is input to the display stage. This is the final stage of the graphics pipeline. The display stage is responsible for reading the contents of the frame buffer, performing digital to analog conversion if necessary, and sending the output to the display device. If the display device is a CRT or a projector, then gamma correction will also be performed. The end result is the 2D arrangement of pixels in one frame of the image. The frames are flickered in front of our eyes at a high refresh rate, and our own persistence of vision allows us to piece these individual frames together into a moving animation sequence. Now that we’ve taken a trip down the graphics pipeline, let’s look at the architecture of modern GPUs.
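Before looking at a concrete GPU, here is a highly simplified software sketch that ties several of these stages together (all names are hypothetical, the 4x4 matrices are assumed to be supplied by the application, and whole-triangle rasterization, texturing, and blending are omitted): a vertex is carried through the modeling, viewing, and projection transformations of the geometry stage, and a fragment is then depth-tested and written to the frame buffer.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using Vec4 = std::array<float, 4>;                   // homogeneous (x, y, z, w)
    using Mat4 = std::array<std::array<float, 4>, 4>;    // 4x4 transformation matrix

    // Multiply a homogeneous vertex by a 4x4 transformation matrix.
    Vec4 transform(const Mat4& m, const Vec4& v)
    {
        Vec4 out{0.0f, 0.0f, 0.0f, 0.0f};
        for (int row = 0; row < 4; ++row)
            for (int col = 0; col < 4; ++col)
                out[row] += m[row][col] * v[col];
        return out;
    }

    // Geometry stage (greatly simplified): object space -> world space ->
    // camera space -> projected (image) space.
    Vec4 geometry_stage(const Mat4& model, const Mat4& view,
                        const Mat4& projection, Vec4 v)
    {
        v = transform(model, v);        // modeling transformation
        v = transform(view, v);         // viewing transformation
        v = transform(projection, v);   // projection transformation
        return v;
    }

    struct Framebuffer {
        int width = 0, height = 0;
        std::vector<std::uint32_t> color;   // one packed color value per pixel
        std::vector<float>         depth;   // z buffer used for the depth test
    };

    // Fragment stage (greatly simplified): the z-compare test decides whether the
    // candidate pixel survives; if it does, it is written to the frame buffer.
    void write_fragment(Framebuffer& fb, int x, int y, float z, std::uint32_t color)
    {
        const std::size_t i = static_cast<std::size_t>(y) * fb.width + x;
        if (z < fb.depth[i]) {              // keep only the nearest fragment
            fb.depth[i] = z;
            fb.color[i] = color;
        }
    }

On a real GPU, of course, each of these steps runs in parallel, dedicated hardware over millions of vertices and fragments per frame, rather than one at a time in a loop.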


The GPU Primer, Part III





In order to understand the architecture of a modern GPU, it helps to take a look at an actual one, so let’s look at the NVIDIA GeForce 6 series architecture. This architecture is used in the NVIDIA GeForce 6800 series of graphics cards. The overall architecture of a computer system looks like this:
[Figure: block diagram of the overall system architecture, with the data transfer rate of each bus annotated]
In the image above, the data transfer rates across each bus are annotated next to the bus. The CPU depicted in this image has an 800 MHz Front Side Bus, and the GPU depicted is a GeForce 6 series GPU. The bus connecting the North Bridge to the GPU is a PCI-Express bus (using an x32 slot). As one can see, the GPU is capable of transferring data along its bus to its own memory much faster than any of the other buses can transfer data. This means that the bottleneck is generally not onboard the GPU, but rather on the motherboard of the host computer. It also means that programs that run on the GPU and utilize the bandwidth of the GPU’s memory bus will run very efficiently. The NVIDIA GeForce 6 GPU is a 256-bit processor with a core clock rate of around 425 MHz. It is onboard a graphics card that has a 256-bit memory interface with a memory clock rate of around 550 MHz (using GDDR3 memory). It is manufactured with a 130 nm process technology and has 222 million transistors. Below is a block diagram of the NVIDIA GeForce 6 series architecture:
[Figure: block diagram of the NVIDIA GeForce 6 series GPU]

The “host” block at the top of the diagram denotes the host computer. The host computer sends commands, vertex data, and textures to the GPU. The GeForce 6 GPU’s implementation of the geometry stage of the graphics pipeline contains up to six programmable vertex processors (i.e. vertex shaders). Each of these vertex processors operates on one vertex at a time, and all of them can process vertices in parallel. Once the vertices have been processed by the programmable vertex processors, they are sent to the remainder of the geometry stage. The vertices are assembled together into a primitive (i.e. a point, a line, or a triangle) in the process of primitive assembly. Then, the primitives that won’t be visible in the final image are culled, and the pieces of the primitives that are cut off by the edges of the viewing frustum are clipped. These steps are performed in the “cull/clip/setup” block and the output is sent to the rasterizer. The rasterizer fills in the primitive’s shape with candidate pixels called fragments. The rasterizer also checks each fragment’s depth to see if it will be hidden by any other pixel in the scene. The rasterized version of the primitive is then sent to the fragment processors (i.e. fragment shaders). Note that both the vertex processors and the fragment processors have access to the texture cache. This means that both vertices and individual fragments can be blended with texels (texels are the individual pixels comprising a texture). The GeForce 6 GPU’s implementation of the fragment stage of the graphics pipeline can have up to 16 individual programmable fragment processors. Each fragment processor can operate on four fragments at a time. This highly parallel design means that many fragments can be processed simultaneously, and since the whole pipeline is broken down into many stages, each part of the pipeline can be working on different, unrelated pieces of data at the same time. After being shaded by the fragment processors, the fragments pass over the fragment crossbar and into 16 raster operation units. The raster operation units perform another depth test to ensure that the fragments aren’t occluded by any other fragments in the final image, apply antialiasing, and then send the resultant pixels to the frame buffer. The frame buffer is split up into four partitions, and there is one connection from the GPU to each of these memory partitions. This means that there are four distinct connections from the GPU to the graphics card’s onboard memory. This prevents the full memory bus on the graphics card from being tied up at one time, since there are effectively four parallel buses over which data can be transferred from the GPU to the frame buffer. Once the pixels have been transferred into the frame buffer, then they are in their final form and are ready to be displayed on the screen. Below is the GeForce 6 series architecture with its modular implementation of the different stages of the graphics pipeline pointed out:

[Figure: the GeForce 6 series block diagram annotated with the stages of the graphics pipeline]

Most of the data passing through the GeForce 6 series GPU is 32-bit floating point data, though it can also be 16-bit floating point data. A GPU is truly designed to handle floating point data, as most geometric data is in floating point format. Some of the important performance metrics of a GPU are its peak pixel fill rate, its peak texel fill rate, its peak memory bandwidth, and its triangle transform rate. The GeForce 6 series GPU has a peak pixel fill rate of 6400 MegaPixels per second, a peak texel fill rate of 6400 MegaTexels per second, a peak memory bandwidth of 35.2 GB/s, and a triangle transform rate of 600 MegaTriangles per second. Most general-purpose CPUs cannot compete with GPUs in the number of floating point operations they can perform per second. Intel’s Pentium 4 Prescott chip, for example, can only perform at a peak of about 12 GFLOPS, whereas the GeForce 6 can perform at a peak of over 100 GFLOPS.
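Those headline figures follow directly from the clock rates and widths quoted earlier, as the back-of-the-envelope check below shows (it assumes a 400 MHz core clock, which is what the 6400 MegaPixel figure implies, with each of the 16 pixel pipelines producing one pixel per clock, and GDDR3 transferring data on both edges of its 550 MHz clock):

    #include <cstdio>

    int main()
    {
        // 16 pixel pipelines, each producing one pixel per clock at ~400 MHz.
        const double pixel_fill = 16.0 * 400e6;       // pixels per second

        // A 256-bit (32-byte) memory interface; 550 MHz GDDR3 transfers on both
        // clock edges, i.e. 1100 million transfers per second.
        const double mem_bandwidth = 32.0 * 1100e6;   // bytes per second

        std::printf("peak pixel fill rate : %.0f MegaPixels/s\n", pixel_fill / 1e6);
        std::printf("peak memory bandwidth: %.1f GB/s\n", mem_bandwidth / 1e9);
        return 0;
    }

Running this prints 6400 MegaPixels/s and 35.2 GB/s, matching the quoted peak figures.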

The process that has been discussed throughout this paper is rendering. Rendering is the process of producing the pixels of a scene from a higher-level description of that scene’s components. There have been four major breakthroughs in the technology behind the rendering of real-time 3D computer graphics. The first of these breakthroughs was the concept of modeling a 3D object by connecting together a mesh of lines. This mesh of lines that are grouped together into triangles is called a wireframe model. By decomposing a 3D model into smaller, more manageable geometric shapes, the graphics card could now build up a model out of its constituent shapes. The second major breakthrough was to apply shading and lighting to the triangles that make up the wireframe model. By doing this, shaded solids could now be viewed on the screen and animated in real-time. The third breakthrough was to apply textures on top of the shaded triangles that make up the model. These textures cause the model to appear more realistic. The fourth breakthrough was to allow the appearance of the model’s surfaces to be programmed. In this way, the graphics programmer can make the model’s surfaces appear even more natural by animating parts of them, by adding randomness to them, and by adding a richer blend of colors and textures to them. We are currently in this fourth generation of graphics technology with the programmable vertex shaders and pixel shaders of modern GPUs. It was just recently that the transition took place from a fixed-function pipeline to a programmable pipeline. The purpose behind these different breakthroughs in graphics hardware has been to get closer to the goal of rendering photorealistic 3D computer graphics in real-time so that, eventually, you won’t be able to tell the difference between a real human being and a human being on the screen. The GPU is quickly approaching that dream.


Bibliography


Kilgariff, Emmett; Fernando, Randima. “The GeForce 6 Series GPU Architecture.” In GPU Gems 2. 2005.

SIGGRAPH 2005 Course 37 Notes. 2005.

Moya, Victor; Gonzalez, Carlos; Roca, Jordi; Fernandez, Agustin; Espasa, Roger. Shader Performance Analysis on a Modern GPU Architecture. IEEE Computer Society. 2005.

Lefohn, Aaron. GPGPU IEEE 2005 Visualization Tutorial. 2005. 

Datar, Ajit; Padhye, Apurva. Graphics Processing Unit Architecture. 2005. 

Durand, Fredo; Cutler, Barb. Modern Graphics Hardware. 2001.
