Tuesday, March 4, 2014

The GPU Primer, Part II



The GPU Primer

by

Michael August, Computer Scientist
first published May 15th, 2006




Ever since Charles Babbage’s original design of the analytical engine in 1837, computer architecture has made use of a mill and a store. The modern day equivalent of the mill is the CPU, and the modern day equivalent of the store is RAM (Random Access Memory). The CPU can be thought of as a fast calculator that can access memory and do arithmetic on numbers that are stored in that memory. The CPU’s purpose is to fetch instructions from an instruction cache, decode each instruction, and then execute it. In order to execute an instruction, the CPU must access any data in memory that is necessary for the instruction execution to proceed. Then, after fetching the data from memory and executing the instruction on that data, the CPU must write the result of that instruction’s execution back to memory. These various stages of a CPU’s execution together make up what is called the instruction pipeline. The particular instruction pipeline of any CPU is determined by the design of the CPU and its instruction set architecture (i.e. the set of instructions that are physically wired into the CPU’s hardware).

All modern CPUs utilize superscalar designs, whereas older CPUs used scalar designs. In this context, the term “scalar” means that an average of one instruction can be executed per CPU clock cycle and thus one piece of data can be output per clock cycle. The term “superscalar” means that, on average, more than one instruction can be executed per clock cycle, and therefore multiple pieces of data can be returned from each stage of the pipeline in one clock cycle. A CPU is made superscalar by giving it multiple functional units and the ability to issue more than one instruction into the pipeline in a single clock cycle. In a scalar processor, one instruction can operate on only one piece of data at a time. In a vector processor, one instruction can operate on multiple pieces of data at a time. Most modern general-purpose CPUs are superscalar processors, not vector processors. The vector processor was a common type of processor in supercomputers in the 1980s and 1990s. Today, most general-purpose superscalar CPUs incorporate elements of a vector processor design by providing support for SIMD (Single Instruction, Multiple Data) instructions.

The upcoming Cell processor from Sony, Toshiba, and IBM contains eight vector microprocessors (called Synergistic Processing Elements) that are all under the control of a superscalar CPU. Another example of a vector processor is the Digital Signal Processor. The GPU is yet another example of a vector processor, and much of the GPU’s high performance can be attributed to its vector design. Now, let’s look at how a GPU fits into the overall graphics subsystem of a computer.
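First, though, the scalar-versus-vector distinction can be made concrete with a small C++ sketch using the SSE intrinsics available on x86 CPUs of this era. The arrays and values below are arbitrary illustrations: the scalar loop performs one floating-point addition per instruction, while the single _mm_add_ps intrinsic maps to one SIMD instruction that adds four floats at once.

#include <xmmintrin.h>  // SSE intrinsics: one instruction operating on four floats
#include <cstdio>

int main() {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float scalar_sum[4], simd_sum[4];

    // Scalar style: one add instruction per piece of data, four adds in total.
    for (int i = 0; i < 4; ++i)
        scalar_sum[i] = a[i] + b[i];

    // Vector (SIMD) style: a single addps instruction adds all four lanes at once.
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vsum = _mm_add_ps(va, vb);
    _mm_storeu_ps(simd_sum, vsum);

    for (int i = 0; i < 4; ++i)
        std::printf("%f %f\n", scalar_sum[i], simd_sum[i]);
    return 0;
}

A GPU applies the same idea on a much larger scale, running one operation across many vertices or pixels at a time.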

A GPU is a part of the chipset located on a modern video card. It is the part of the chipset which is responsible for 3D graphics acceleration. The output of this graphics chipset is sent to the frame buffer, the on-card memory which contains information about each pixel on the screen (though the frame buffer also acts as a temporary memory location for other data besides the individual pixel information). Every time the screen is refreshed, the frame buffer is read (i.e. sampled) and the information contained within it is displayed on the screen. If the display device happens to require an analog input, then a RAMDAC (Random Access Memory Digital-to-Analog Converter) sits in between the frame buffer and the output connection to the display device. The RAMDAC takes the digitally encoded information about each pixel that is stored in the frame buffer and converts that digital information into an analog signal which can be understood by the video display’s internal electronics. If the display device requires a digital input, then a hardware transcoder on the video card converts the pixel information in the frame buffer into the particular digital format required by the display device. This, in a nutshell, is how video cards work.

The primary bottlenecks in this design are the connection from the host computer to the graphics card and the connection from the GPU to the graphics card’s frame buffer. The connection from the computer to the graphics card has traditionally been through the PCI or AGP bus, but now all modern video cards connect to the host computer via the PCI-Express bus. The PCI-Express bus supports a theoretical data transfer rate of 3.7 GB/s (in a PCI-Express x16 slot) in both directions. The theoretical data transfer rate between the GPU and the frame buffer varies widely from one graphics card to another. The transfer rate falls somewhere in the range of 4 GB/s to 50 GB/s for modern high-end graphics cards (this rate also depends on what kind of data is being transferred to and from the frame buffer). The GPU is the key player in the functioning of the video card, so let’s take a look inside.
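Before doing so, a rough back-of-the-envelope calculation gives a feel for the frame buffer traffic described above. The display mode below (1600x1200, 32-bit color, 85 Hz) is only an illustrative example, not a measurement of any particular card; the point is that merely scanning the frame buffer out to the display already consumes a noticeable slice of memory bandwidth before any rendering traffic is counted.

#include <cstdio>

int main() {
    // Illustrative display mode.
    const double width         = 1600.0;  // pixels per scanline
    const double height        = 1200.0;  // scanlines per frame
    const double bytesPerPixel = 4.0;     // 32-bit color
    const double refreshHz     = 85.0;    // screen refreshes per second

    // Size of one frame's worth of pixel data in the frame buffer.
    double frameBytes = width * height * bytesPerPixel;

    // Bandwidth consumed just by reading the frame buffer out to the display.
    double scanoutBytesPerSec = frameBytes * refreshHz;

    std::printf("One frame: %.1f MB\n", frameBytes / (1024.0 * 1024.0));
    std::printf("Scan-out bandwidth: %.2f GB/s\n",
                scanoutBytesPerSec / (1024.0 * 1024.0 * 1024.0));
    return 0;
}

This works out to roughly 7.3 MB per frame and a bit over 0.6 GB/s of scan-out bandwidth; the rest of the GPU-to-frame-buffer bandwidth is what rendering itself (texture fetches, z-buffer reads and writes, color writes) has to live within.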

At its core, a GPU is just an implementation of the graphics pipeline. The modern graphics pipeline is composed of multiple stages: application, command, geometry, rasterization, texture, fragment, and display. Different people divide the pipeline into stages differently and group its various functions in different ways; my own rendition of the graphics pipeline as presented in this paper is derived from two sources who are renowned computer graphics experts. The graphics pipeline can also be viewed as a stack, not unlike the TCP/IP protocol stack in computer networking.

In the application stage of the pipeline, an application running on the host computer needs to display a geometric object that is stored in the computer’s main memory. The application has information about the vertices, or endpoints, of the geometric object. This object, its vertices, and its location on the screen all represent a geometric primitive that needs to be sent to the graphics card so that it can be processed by the GPU. To do this, the application must send a command to the graphics card via the host computer’s operating system, which it does by making a call to the graphics API (Application Programming Interface). The graphics API can be either OpenGL or DirectX (or possibly another proprietary graphics API, though this is rare). The application makes a function call (or a set of function calls) with all of the information about the geometric primitive as parameters. This process of calling a function built into the graphics API is part of the command stage of the graphics pipeline. The function call can be viewed as a command to the graphics card to do something with the geometric primitive that was passed to it. The graphics API is implemented as a part of the video card driver. The geometric primitive can be a point, a line, or a polygon, and it is represented at this stage by its vertices.

Once the command has been decoded by the GPU, the data sent along with the command is operated on by the geometry stage of the graphics pipeline. This geometric data corresponds to a polygon that can be manipulated. The geometry stage is responsible for taking the vertices of the polygon passed to it by the command stage and performing geometric transformations on the polygon such as translation, rotation, and scaling. The reason that geometric transformations are needed is as follows. The original shape that needs to be displayed on the screen lies in object space, a 3D space which is centered on that object. The object must be placed into a world that has its own coordinate system. Transforming the object from its own space into the world space is called a modeling transformation. A person’s view of the 3D world in which the object is embedded determines where the object must be placed relative to the screen’s coordinate system. The screen’s coordinate system is the world space as seen by the virtual camera (i.e. the person looking into the screen at the world). Transforming an object’s position and orientation from world space into this camera space is called a viewing transformation. Since the objects displayed on the 2D screen must appear as if they are in a 3D world, a further projection transformation of the object is required. The projection transformation projects the 3D object onto the flat plane (called image space) which corresponds to the screen that the person is looking at. So, first a modeling transformation is applied to the shape, then a viewing transformation is applied to the shape, and finally a projection transformation is applied to the shape.
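As a concrete illustration of the command stage and of these three transformations, here is a minimal sketch of the kind of code an application from this era might contain, written against the fixed-function OpenGL API (one of the two APIs named above). It uses GLUT to create a window and an OpenGL context, and every numeric value in it is an arbitrary example.

#include <GL/glut.h>   // also pulls in GL/gl.h and GL/glu.h

// Hand one triangle to the graphics card through the graphics API.
void drawTriangle() {
    // Projection transformation: map camera space onto the 2D image plane.
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluPerspective(60.0,        // vertical field of view, in degrees
                   4.0 / 3.0,   // aspect ratio of the window
                   0.1, 100.0); // near and far clipping planes

    // Viewing transformation: place the virtual camera in the world.
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    gluLookAt(0.0, 0.0, 5.0,    // eye (camera) position in world space
              0.0, 0.0, 0.0,    // point the camera is looking at
              0.0, 1.0, 0.0);   // "up" direction

    // Modeling transformation: move the object from object space into world space.
    glTranslatef(1.0f, 0.0f, 0.0f);
    glRotatef(30.0f, 0.0f, 0.0f, 1.0f);

    // Command stage: these calls pass the primitive's vertices (and per-vertex
    // colors) to the driver, which forwards them to the GPU.
    glBegin(GL_TRIANGLES);
        glColor3f(1.0f, 0.0f, 0.0f); glVertex3f(-1.0f, -1.0f, 0.0f);
        glColor3f(0.0f, 1.0f, 0.0f); glVertex3f( 1.0f, -1.0f, 0.0f);
        glColor3f(0.0f, 0.0f, 1.0f); glVertex3f( 0.0f,  1.0f, 0.0f);
    glEnd();
}

void display() {
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    drawTriangle();
    glutSwapBuffers();
}

int main(int argc, char** argv) {
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB | GLUT_DEPTH);
    glutInitWindowSize(800, 600);
    glutCreateWindow("one triangle");
    glEnable(GL_DEPTH_TEST);
    glutDisplayFunc(display);
    glutMainLoop();
    return 0;
}

Note that the application only names the transformations and supplies the vertices; the actual matrix arithmetic happens later, in the geometry stage on the GPU.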
The geometry stage is also responsible for lighting the resulting polygon after it has been transformed. The Transform and Lighting Engine on a graphics card lies in the geometry stage of the graphics pipeline. Since only the shapes that are visible on the screen need to be manipulated by later stages of the pipeline, the geometry stage culls any parts of the shape that will not be visible to the viewer. This process is called hidden surface determination, and it plays a large role in the efficiency of the graphics pipeline, as there is no need for the GPU to process shapes that will not be visible. The geometry stage is also responsible for taking the geometric primitive passed to it by the application stage and assembling the primitive into an actual geometric shape that can later be filled with pixels. If the original primitive was a polygon, then that polygon is broken up into many individual triangles (a process referred to as tessellation, or triangulation). This part of the geometry stage is called triangle setup. GPUs also include another module in the geometry stage: the vertex shader, or vertex processor. The vertex shader takes the individual vertices of a geometric shape and transforms those vertices in various ways. It can animate the vertices, for example, or it can change the lighting on them. Modern vertex shaders are programmable. It is this programmability of the GPU that allows for some very impressive special effects in real time. The output of the geometry stage is a shape which has been fully transformed and lit. This output is then sent to the rasterization stage.

The rasterization stage fills the shape with pixels. At this stage, the color information for each vertex is interpolated across the shape (i.e. a color gradient is formed from one vertex to the next) by the rasterizer. This is how the rasterizer knows what color to make each pixel. Technically, at this stage of the pipeline the shape is not filled with pixels. Rather, it is filled with candidate pixels, or potential pixels. These candidate pixels are called fragments. The reason they are considered candidate pixels is that they might not make it all the way to the end of the pipeline, or they might be changed in some way before they reach the end of the pipeline. The output of the rasterization stage is a set of fragments which fill the shape that was originally fed into the graphics pipeline. These fragments are sent to the texture stage of the graphics pipeline.

The texture stage applies a texture to each fragment sent to it. A texture is just an image that is overlaid on top of a shape to make the shape’s surface look more realistic. The texture that is applied to each fragment is stored in a high-speed texture cache. Textures are also stored in a special area of the graphics card’s frame buffer for quick access by the GPU. After a texture is combined with each fragment that makes up the shape, the resulting fragments are sent to the fragment stage of the graphics pipeline.
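What the rasterization and texture stages do to a single fragment can be sketched in a few lines of CPU-side C++. This is only an illustration of the interpolation and texture lookup ideas, not a description of how any particular GPU is wired internally; the triangle, colors, and checkerboard texture below are all invented for the example.

#include <cstdio>

struct Vec2  { float x, y; };
struct Color { float r, g, b; };

// Barycentric weights of point p with respect to triangle (a, b, c).
// Inside the triangle the three weights are positive and sum to 1, and they
// say how strongly each vertex influences the point.
static void barycentric(Vec2 p, Vec2 a, Vec2 b, Vec2 c,
                        float& wa, float& wb, float& wc) {
    float denom = (b.y - c.y) * (a.x - c.x) + (c.x - b.x) * (a.y - c.y);
    wa = ((b.y - c.y) * (p.x - c.x) + (c.x - b.x) * (p.y - c.y)) / denom;
    wb = ((c.y - a.y) * (p.x - c.x) + (a.x - c.x) * (p.y - c.y)) / denom;
    wc = 1.0f - wa - wb;
}

// A tiny 8x8 checkerboard stands in for a texture held in the texture cache.
static Color sampleTexture(float u, float v) {
    int tx = static_cast<int>(u * 8.0f) & 7;
    int ty = static_cast<int>(v * 8.0f) & 7;
    float t = ((tx + ty) & 1) ? 1.0f : 0.25f;
    return Color{t, t, t};
}

int main() {
    // One screen-space triangle with a color and a texture coordinate per vertex.
    Vec2  pos[3] = {{10, 10}, {90, 15}, {50, 80}};
    Color col[3] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
    Vec2  uv[3]  = {{0, 0}, {1, 0}, {0.5f, 1}};

    // One candidate pixel (fragment) that lies inside the triangle.
    Vec2 fragment = {50, 40};

    float wa, wb, wc;
    barycentric(fragment, pos[0], pos[1], pos[2], wa, wb, wc);

    // Rasterization stage: interpolate color and texture coordinates across the shape.
    Color c = {wa * col[0].r + wb * col[1].r + wc * col[2].r,
               wa * col[0].g + wb * col[1].g + wc * col[2].g,
               wa * col[0].b + wb * col[1].b + wc * col[2].b};
    Vec2 t = {wa * uv[0].x + wb * uv[1].x + wc * uv[2].x,
              wa * uv[0].y + wb * uv[1].y + wc * uv[2].y};

    // Texture stage: fetch a texel and combine (modulate) it with the interpolated color.
    Color texel = sampleTexture(t.x, t.y);
    Color out = {c.r * texel.r, c.g * texel.g, c.b * texel.b};

    std::printf("fragment color = (%.2f, %.2f, %.2f)\n", out.r, out.g, out.b);
    return 0;
}

A real GPU performs this per-fragment work in parallel across many fragments at once, which is exactly where its vector design pays off.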
The fragment stage allows for mathematical operations to be applied to each fragment to enhance its appearance. For example, each fragment can be blended with different colors. Shadows can be added to the fragments. Fragments can be made to appear transparent (called alpha blending). Many different effects can be added to each fragment. It is also at this stage of the pipeline that various tests are performed on each fragment. One such test is the z-compare test (i.e. the depth test). This test determines whether or not the fragment will be visible. If it isn’t visible, then it is simply thrown away. A fragment might not be visible if it is hidden behind other opaque objects that show up on the screen. In such a case, the fragment need not progress any further down the graphics pipeline, as it won’t be made visible on the display device. All of these functions of the fragment stage use buffers that are part of the memory (i.e. the frame buffer) located on the graphics card. Some of these buffers are the color buffer (used for color blending), the stencil buffer (for creating shadows), and the z buffer (for performing the depth test). At the fragment stage, textures can also be combined and overlaid onto each fragment, and fog can be generated.

Another important module that resides in the fragment stage of the graphics pipeline is the fragment shader, also known as the pixel shader or pixel processor. The fragment shader can uniquely transform the appearance of each individual fragment that makes up the shape. The appearance of each fragment is calculated independently of all of the other fragments. A vertex shader can’t shade pixels at the level of detail that a fragment shader can: the vertex shader only operates on the vertices themselves, so the appearance of the pixels between adjacent vertices is merely interpolated, whereas a fragment shader can shade each individual pixel independently of the others in its vicinity. Modern fragment shaders are also programmable. This means that the graphics programmer can alter a shape’s appearance and color at a per-pixel level. This fine-grained control over the level of detail allows for enhanced photorealism in real time.

The output of the fragment stage is the set of individual pixels that make up the shape. At this point, the pixels are in their final viewable form. They are sent to the frame buffer so that they can be displayed on the screen. The output of the fragment stage of the graphics pipeline is input to the display stage, the final stage of the graphics pipeline. The display stage is responsible for reading the contents of the frame buffer, performing digital-to-analog conversion if necessary, and sending the output to the display device. If the display device is a CRT or a projector, then gamma correction will also be performed. The end result is the 2D arrangement of pixels in one frame of the image. The frames are flashed in front of our eyes at a high refresh rate, and our own persistence of vision allows us to piece these individual frames together into a moving animation sequence.
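The depth test and the final write into the frame buffer can be pictured with one last sketch: a toy software z-buffer in C++. This is a simplification of what the fragment stage’s hardware does (the resolution, the packed color format, and the convention that smaller z means closer are all arbitrary choices for the example).

#include <cstdint>
#include <cstdio>
#include <vector>

// A toy frame buffer: one packed color value and one depth value per pixel.
struct FrameBuffer {
    int width, height;
    std::vector<uint32_t> color;  // color buffer (packed RGBA)
    std::vector<float>    depth;  // z buffer, initialized to the far plane

    FrameBuffer(int w, int h)
        : width(w), height(h),
          color(w * h, 0x00000000u),
          depth(w * h, 1.0f) {}

    // The z-compare (depth) test: keep the fragment only if it is closer to the
    // viewer than whatever has already been written at this pixel.
    void writeFragment(int x, int y, float z, uint32_t rgba) {
        int idx = y * width + x;
        if (z < depth[idx]) {   // the fragment is in front, so it becomes visible
            depth[idx] = z;
            color[idx] = rgba;
        }
        // Otherwise the fragment is hidden behind an opaque surface and is discarded.
    }
};

int main() {
    FrameBuffer fb(640, 480);

    // Two fragments land on the same pixel; only the nearer one survives the depth test.
    fb.writeFragment(100, 100, 0.8f, 0xFF0000FFu);  // farther fragment (red)
    fb.writeFragment(100, 100, 0.3f, 0x00FF00FFu);  // nearer fragment (green) wins

    std::printf("pixel (100,100) holds color 0x%08X\n",
                static_cast<unsigned>(fb.color[100 * 640 + 100]));
    return 0;
}

Now that we’ve taken a trip down the graphics pipeline, let’s look at the architecture of modern GPUs.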

