The GPU Primer
by
Michael August, Computer Scientist
first published May 15th, 2006
As with almost all aspects of computer technology, computer video display technology has undergone almost incomprehensible growth and evolution over the past two decades. Behind this evolution has been the demand for more photorealistic computer graphics, a demand that has come primarily from the video game community. At the heart of this photorealism is the hardware necessary to make computer-generated imagery appear in real time. All of today’s 3D-acceleration-enabled video cards contain a graphics processing unit (GPU) at their core. A GPU is just a special kind of central processing unit (CPU) that is dedicated to performing video-related computations. The quality of the GPU determines the quality and the kinds of graphics that can be viewed on a display device. Before getting into the details of the inner workings of a modern GPU, it is best to look back at the history of CPUs and how the technology got to where it is today.
In the early days of digital computing (in the 1930s and 1940s), the switching elements in CPUs consisted of electromechanical relays. These CPUs took up whole rooms and were fairly reliable but very slow devices whose computations were output onto punched paper tape or punch cards. The next step up in computational power came in the 1940s and 1950s, when vacuum tubes were used in place of relays as the switching elements comprising the CPU. Vacuum tubes were not very reliable and had to be replaced regularly. They also took time to warm up when first turned on. However, they were faster than relays because they had no moving parts. Their output was usually sent to punch cards. In the late 1950s and early 1960s, computers began to be built out of transistors. Transistors became, and still are, the fundamental switching elements out of which CPUs are formed. At first, transistors were soldered onto circuit boards by hand. However, with the techniques of Small Scale Integration, integrated circuits consisting of multiple transistors became common. Chips containing one or more logic gates were wired together to form the CPU. Large Scale Integration and Very Large Scale Integration have since led to all of these chips being integrated into a single chip containing tens of millions to hundreds of millions of transistors; the modern Pentium 4 Prescott CPU, for example, has 125 million transistors. The main visual output device of most transistor-based computers since the 1970s has been the Cathode Ray Tube (CRT). More recently, newer display technologies, such as the Liquid Crystal Display and the Plasma Display, have become more popular than the CRT. All of these devices display graphics in two spatial dimensions (changes in the displayed graphics over time create animation). There are also a number of other 2D display devices that are less common, and some experimental 3D display devices exist as well. Alongside the development of CPU technology has been the evolution of video display technology and the video cards responsible for making that new display technology function.
At first, video cards were simply frame buffers with auxiliary electronics built in. A frame buffer is a collection of memory cells that store the data (i.e., color and intensity) for each pixel displayed on the screen in a particular frame (a frame is one sample, or time slice, of everything being displayed on the screen; televisions in America, for example, display 30 frames per second). The computer’s CPU would perform the computations necessary to determine the color and intensity information for each pixel and then send this information along a bus to the frame buffer on the video card. Eventually, graphics coprocessors were included on the motherboard alongside the CPU to perform graphics-specific computations. The use of a coprocessor in graphics-intensive tasks offloaded work from the CPU onto the coprocessor, thereby allowing the CPU to use its cycles on other computations and improving system performance. An example of a graphics coprocessor was the blitter. The blitter performed the Bit Block Transfer operation (also known as BitBLT). Introduced in 1974, the Bit Block Transfer was a special instruction that a graphics programmer could use to incorporate multiple 2D images (called sprites) into a frame. An alternative technique for displaying sprites in a frame was hardware-based sprite acceleration, which was eventually included directly on video cards. The difference between direct hardware support for sprite manipulation and the blitter is as follows. With hardware support for sprites, a sprite would be placed into a special sprite memory rather than into the frame buffer. The graphics card (note: the terms “graphics card”, “video card”, and “graphics board” are all interchangeable) would read the frame buffer contents and then read the sprite memory to overlay the sprites in the sprite memory on top of the contents of the frame buffer. Effectively, this meant that the frame buffer provided the background image and the sprites were overlaid on top of that background image. With the blitter and the BitBLT operation, a sprite was written directly into the frame buffer by applying a combination of bitmasks to the frame buffer contents, as the sketch after this paragraph illustrates. This meant that individual pixels in the frame buffer were actually overwritten by the sprites; in other words, pieces of the background image were replaced by the sprites. Eventually, the blitter and the hardware support for sprites were moved onto the graphics card itself. This combination of frame buffer, blitter, and hardware-enabled sprite acceleration, all in one device, was a forerunner of the modern graphics card. It could be said that the blitter itself was a primitive form of the modern GPU. The offloading of the graphics workload from the main system board onto specialized graphics modules that reside on the video card has been a common theme throughout the history of video technology. These early video cards were designed for displaying 2D images only. In fact, all modern computer monitors display only 2D images, but the use of 3D art techniques tricks the human eye into believing that the images being displayed are 3D. The implementation of these 3D techniques was first done in software by the CPU, but doing so bogged down the CPU extensively.
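To make the blitter’s masking trick concrete, here is a minimal sketch in C. Everything in it, including the one-byte-per-pixel format, the buffer dimensions, and the function name blit_sprite, is an illustrative assumption rather than the interface of any real blitter chip; actual blitters performed this masking in hardware, typically with bitwise AND/OR operations on whole words of packed pixels.

    #include <stdint.h>

    #define FB_WIDTH  320
    #define FB_HEIGHT 200

    /* The frame buffer: one byte of color information per pixel
       (real hardware of the era often packed several pixels per byte). */
    static uint8_t frame_buffer[FB_HEIGHT][FB_WIDTH];

    /* A BitBLT-style copy: for each sprite pixel, the mask decides whether
       the background pixel in the frame buffer is kept (mask byte 0) or
       overwritten by the sprite (mask byte 1). */
    void blit_sprite(const uint8_t *sprite, const uint8_t *mask,
                     int sprite_w, int sprite_h, int dest_x, int dest_y)
    {
        for (int y = 0; y < sprite_h; y++) {
            for (int x = 0; x < sprite_w; x++) {
                if (mask[y * sprite_w + x]) {
                    /* The background pixel is destroyed; the sprite replaces it. */
                    frame_buffer[dest_y + y][dest_x + x] = sprite[y * sprite_w + x];
                }
            }
        }
    }

The important detail is in the inner comment: the background pixels under the sprite are overwritten. Hardware sprite overlay avoids exactly this, leaving the frame buffer untouched and mixing the sprite in on the way to the display.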
It wasn’t until specialized hardware for manipulating 3D graphics emerged that realistic-looking 3D games and simulations could be created. Throughout the 1980s and 1990s, Silicon Graphics, Inc. (SGI) played a leading role in the creation of hardware and software solutions for 3D graphics. These solutions were very expensive, and thus were designed for large organizations that needed high-end 3D graphics technology. The consumer market for 3D graphics began in the mid-1990s with the introduction of 3D graphics accelerators that ran alongside the main 2D graphics board. With this generation of 3D graphics cards, the 3D card had a connection to the 2D card, which would then send the video output to the display. With the introduction of the Voodoo Rush graphics accelerator card by 3dfx in 1997, both 2D and 3D functionality were integrated into one video card. In 1997, Intel also introduced the Pentium MMX CPU, which had an enhanced instruction set for performing multimedia computations (referred to as MultiMedia eXtensions by some, and Matrix Math eXtensions by others). The following year, AMD introduced a similar extension to its K6-2 instruction set (referred to as 3DNow!). In 1999, Intel responded to AMD’s 3DNow! technology with Streaming SIMD Extensions (SSE) on its Pentium III. The idea behind these CPU instruction set extensions was to allow the CPU to perform more graphics-intensive computations in fewer CPU cycles. Graphics-related computations tend to be based on floating point data, so these extensions improved the performance of both Intel’s and AMD’s processors when executing instructions on floating point data. These extensions also parallelized operations on data, performing the same operation on several values at once (a short sketch of this kind of packed arithmetic follows this paragraph). In the fall of 1999, NVIDIA Corporation introduced its GeForce 256 3D graphics accelerator card. The GeForce 256 was the first video card to incorporate a processor that had a Transform and Lighting Engine, which made it the first graphics card to have a full-fledged GPU. Ever since the introduction of the GPU in 1999, more features have been added to graphics cards and their GPUs in the quest for real-time photorealistic 3D graphics. In order to understand these features, we must first take a look at the purpose of the GPU and the video card in general.
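Before moving on, here is the packed-arithmetic sketch promised above. It uses Intel’s SSE intrinsics, which compile down to the packed single-precision instructions introduced with the Pentium III; the function name add_vectors_sse and the assumption that the array length is a multiple of four are illustrative choices, not part of any real graphics library.

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Add two arrays of floats four elements at a time.  Scalar code would
       perform one addition per instruction; the ADDPS instruction generated
       by _mm_add_ps adds four packed single-precision floats at once.
       For brevity, n is assumed to be a multiple of 4. */
    void add_vectors_sse(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va   = _mm_loadu_ps(&a[i]);    /* load 4 floats from a        */
            __m128 vb   = _mm_loadu_ps(&b[i]);    /* load 4 floats from b        */
            __m128 vsum = _mm_add_ps(va, vb);     /* 4 additions, 1 instruction  */
            _mm_storeu_ps(&out[i], vsum);         /* store the 4 results         */
        }
    }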
All shapes that one sees on a computer screen start out inside the computer as abstract entities that don’t have any physically realizable representation. These abstract entities get transformed into other abstract entities until they are in a form that the computer screen can display. The whole idea of a graphics card is to be the interface, or middleman, between the computer and the video display. The graphics card transforms the abstract entities it receives into a form suitable for the computer screen to display. The input to the graphics card is called a primitive, and this primitive is the abstract entity that needs to be transformed into a physical representation. A primitive could be a circle or a sphere or any other geometric object. The GPU on a graphics card takes each primitive that the host computer sends to it and converts that primitive into the electronic signal that the video display electronics understands. The series of transformations that a primitive goes through to put it into a displayable form is called the graphics pipeline. This pipeline represents the work that is done to each primitive before it can be displayed on the computer’s monitor. It is possible to implement the graphics pipeline, and the work that it entails, entirely in software. However, this would be such a burden on the host computer’s CPU and memory that it would prevent the computer from performing any other computations. Also, most general-purpose CPUs are not designed for performing graphics-related computations and would not be able to handle the workload required by graphics-intensive applications. Furthermore, a piece of software that emulates the functioning of a hardware device is almost always significantly slower than that hardware device. The goal of the modern graphics pipeline is to convert 3D primitives into a series of pixels on a 2D display device. This process of converting 3D objects that exist in a vector space into a set of pixels that exist in a 2D space is called rasterization (a simplified sketch of the projection step at its heart appears after this paragraph). A raster graphics video display only accepts a signal that has encoded in it the information about each pixel on the screen. In the past, there have also been vector graphics displays. A vector graphics display accepts individual 2D primitives and draws each one on the screen in the order it receives them. A raster graphics display, on the other hand, scans every pixel on the screen during each frame and displays the pixel based on the information in the video signal coming from the video card. The difference between these two techniques is that the vector graphics display requires much less information than the raster graphics display: the raster graphics display needs information about each pixel on the screen, whereas the vector graphics display only needs to know the shape it is supposed to draw and the endpoints and location of that shape. This difference in display technology causes differences in the video card’s design. A video card designed for a vector graphics display requires only a display list, which contains information about the 2D shapes that will be displayed on the screen; the display list is the output of the graphics card, which is then input into the vector graphics display. A raster graphics video card, on the other hand, has to have a memory which stores information about every pixel that will be displayed on the screen.
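Here is the sketch promised above: a minimal C program that carries a single 3D vertex through a perspective projection and viewport transform, ending up with a pixel coordinate. The structure names, the focal-length parameter, and the simple pinhole-style projection are simplifying assumptions; a real GPU performs this step with 4 x 4 matrix transforms and then rasterizes whole triangles rather than individual points.

    #include <stdio.h>

    /* A 3D point in the camera's coordinate system (z grows away from the
       viewer) and the 2D pixel it maps to on the screen. */
    typedef struct { float x, y, z; } vec3;
    typedef struct { int x, y; }      pixel;

    /* One stage of the pipeline: perspective projection followed by the
       viewport transform.  focal_len is the distance from the eye to the
       image plane; width and height are the screen size in pixels. */
    pixel project_vertex(vec3 v, float focal_len, int width, int height)
    {
        /* Perspective divide: points farther away (larger z) land closer
           to the center of the image. */
        float sx = (focal_len * v.x) / v.z;
        float sy = (focal_len * v.y) / v.z;

        /* Viewport transform: shift the projected point into pixel
           coordinates with (0, 0) at the top-left corner of the screen. */
        pixel p;
        p.x = (int)(sx + width / 2.0f);
        p.y = (int)(height / 2.0f - sy);
        return p;
    }

    int main(void)
    {
        vec3  corner = { 1.0f, 1.0f, 5.0f };
        pixel p      = project_vertex(corner, 256.0f, 640, 480);
        printf("vertex maps to pixel (%d, %d)\n", p.x, p.y);  /* prints (371, 188) */
        return 0;
    }

Every pixel produced this way still has to be stored somewhere before it can be sent to the display, which brings us back to the per-pixel memory just mentioned.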
The memory that stores this per-pixel information is called the frame buffer, and the frame buffer is the output of the video card which gets sent to the input of the raster graphics video display. Nearly all video displays currently in use are raster graphics displays, and therefore nearly all graphics cards currently in use are raster graphics video cards. Since most modern video displays support resolutions greater than 1000 pixels wide by 1000 pixels high, most graphics cards must have a frame buffer capable of storing information about more than 1,000,000 pixels (since 1000 x 1000 = 1,000,000) for each frame of a computer screen. This means that the frame buffer’s size and data transfer rate are both important characteristics in the design of graphics cards (the short calculation after this paragraph shows why). Now that we’ve seen that the purpose of the GPU is to transform 3D graphics primitives handed off to it by the computer’s CPU into a series of pixels to be displayed by the computer’s video display, we can look inside the GPU. But before we can understand the elements that comprise a GPU, and since GPUs are just a special kind of CPU, we must first understand how CPUs work in general.
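The calculation referred to above, written out as a short C program. The 1280 x 1024 resolution, 32-bit pixel depth, and 60 frames per second are assumed figures chosen for illustration, not the specifications of any particular card.

    #include <stdio.h>

    int main(void)
    {
        const long width             = 1280;
        const long height            = 1024;
        const long bytes_per_pixel   = 4;    /* 8 bits each for red, green, blue, alpha */
        const long frames_per_second = 60;

        long pixels_per_frame = width * height;                       /* 1,310,720 pixels         */
        long bytes_per_frame  = pixels_per_frame * bytes_per_pixel;   /* 5,242,880 bytes (~5 MB)  */
        long bytes_per_second = bytes_per_frame * frames_per_second;  /* ~315 million bytes/s     */

        printf("pixels per frame : %ld\n", pixels_per_frame);
        printf("bytes per frame  : %ld\n", bytes_per_frame);
        printf("bytes per second : %ld\n", bytes_per_second);
        return 0;
    }

Even at these modest numbers, the card must store several megabytes per frame and move hundreds of megabytes of pixel data every second, which is why both the size and the transfer rate of the frame buffer matter.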