A graphics processing unit (GPU), also occasionally called a visual processing unit (VPU), is a specialised electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics and image processing, and their highly parallel structure makes them more efficient than general-purpose CPUs for algorithms that process large blocks of data in parallel. In a personal computer, a GPU can be present on a video card, or it can be embedded on the motherboard or, in certain CPUs, on the CPU die.
The term GPU was popularised by Nvidia in 1999, which marketed the GeForce 256 as "the world's first GPU", or Graphics Processing Unit. It was presented as a "single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that are capable of processing a minimum of 10 million polygons per second". Rival ATI Technologies coined the term "visual processing unit" or VPU with the release of the Radeon 9700 in 2002.
Arcade system boards have used specialised graphics chips since the 1970s. The key to understanding early video game hardware is that RAM for frame buffers was too expensive, so video chips composited data together as the display was being scanned out on the monitor.
Fujitsu's MB14241 video shifter was used to accelerate the drawing of sprite graphics for numerous 1970s arcade games from Taito and Midway, like Gun Fight (1975), Sea Wolf (1976) and Space Invaders (1978). The Namco Galaxian arcade system in 1979 used specialised graphics hardware supporting RGB color, multi-colored sprites and tilemap backgrounds. The Galaxian hardware was widely used throughout the golden age of arcade video games, by game companies like Namco, Centuri, Gremlin, Irem, Konami, Midway, Nichibutsu, Sega and Taito.
In the home market, the Atari 2600 in 1977 used a video shifter called the Television Interface Adaptor. The Atari 8-bit computers (1979) had ANTIC, a video processor which interpreted instructions describing a "display list": a description of how the scan lines map to specific bitmapped or character modes and where the memory is stored (so there didn't need to be a contiguous frame buffer). 6502 machine code subroutines could be triggered on scan lines by setting a bit on a display list instruction. ANTIC also supported smooth vertical and horizontal scrolling independent of the CPU.
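The display list was, in effect, a tiny program in memory. As a rough illustration (a hedged sketch using ANTIC's documented opcode values but hypothetical addresses, expressed as a C-style byte array, not actual Atari system code):

```cuda
/* Sketch of an ANTIC display list for the standard 40x24 text screen.
 * 0x70 = 8 blank scan lines; 0x42 = character mode 2 plus a "load
 * memory scan" (LMS) operand giving the screen RAM address; 0x02 = one
 * more mode-2 line; 0x41 = jump and wait for vertical blank (JVB).
 * ORing 0x80 into an instruction requests a display list interrupt,
 * which is how a 6502 subroutine is triggered on a given scan line. */
unsigned char display_list[] = {
    0x70, 0x70, 0x70,   /* 24 blank lines for the overscan margin */
    0x42, 0x00, 0x40,   /* first text line; screen RAM at 0x4000  */
    0x02, 0x02, 0x02,   /* ...23 more mode-2 lines in total...    */
    0x41, 0x00, 0x20    /* JVB: restart the list (here at 0x2000) */
};
```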
In 1985, the Commodore Amiga featured a custom graphics chip, with a blitter unit accelerating bitmap manipulation, line drawing, and area fill functions. Also included was a coprocessor with its own primitive instruction set, capable of manipulating graphics hardware registers in sync with the video beam (e.g. for per-scanline palette switches, sprite multiplexing, and hardware windowing), or driving the blitter.
In 1986, Texas Instruments released the TMS34010, the first microprocessor with on-chip graphics capabilities. It could run general-purpose code, but it had a heavily graphics-oriented instruction set. In 1990–1991, this chip became the basis of the Texas Instruments Graphics Architecture ("TIGA") Windows accelerator cards.
In 1987, the IBM 8514 graphics system was released as one of the first video cards for IBM PC compatibles to implement fixed-function 2D primitives in hardware. The same year, Sharp released the X68000, which used a custom graphics chipset that was powerful for a home computer at the time, with a 65,536-colour palette and hardware support for sprites, scrolling, and multiple playfields, eventually serving as a development machine for Capcom's CP System arcade board. Fujitsu later competed with the FM Towns computer, released in 1989 with support for a full 16,777,216-colour palette.
In 1991, S3 Graphics introduced the S3 86C911, which its designers named after the Porsche 911 to suggest the performance increase it promised. The 86C911 spawned a host of imitators: by 1995, all major PC graphics chip makers had added 2D acceleration support to their chips. By this time, fixed-function Windows accelerators had surpassed costly general-purpose graphics coprocessors in Windows performance, and these coprocessors faded away from the PC market.
Throughout the 1990s, 2D GUI acceleration continued to evolve. As manufacturing capabilities improved, so did the level of integration of graphics chips. Additional application programming interfaces (APIs) arrived for a variety of tasks, such as Microsoft's WinG graphics library for Windows 3.x and its later DirectDraw interface for hardware acceleration of 2D games within Windows 95 and later.
In the early and mid-1990s, CPU-assisted real-time 3D graphics were becoming increasingly common in arcade, computer, and console games, which led to increasing public demand for hardware-accelerated 3D graphics. Early examples of mass-market 3D graphics hardware can be found in arcade system boards such as the Sega Model 1, Namco System 22, and Sega Model 2, and in the fifth-generation video game consoles such as the Saturn, PlayStation, and Nintendo 64. Arcade systems such as the Sega Model 2 and Namco Magic Edge Hornet Simulator in 1993 were capable of hardware T&L (transform, clipping, and lighting) years before it appeared in consumer graphics cards. Other systems used DSPs to accelerate transformations. Fujitsu, which worked on the Sega Model 2 arcade system, began working on integrating T&L into a single LSI solution for use in home computers in 1995; the Fujitsu Pinolite, the first 3D geometry processor for personal computers, was released in 1997. The first hardware T&L GPU on home video game consoles was the Nintendo 64's Reality Coprocessor, released in 1996. In 1997, Mitsubishi released the 3Dpro/2MP, a fully featured GPU capable of transformation and lighting, for workstations and Windows NT desktops; it was used for the FireGL 4000 graphics card, released in 1997.
In the PC world, notable failed first attempts at low-cost 3D graphics chips were the S3 ViRGE, ATI Rage, and Matrox Mystique. These chips were essentially previous-generation 2D accelerators with 3D features bolted on. Many were even pin-compatible with the earlier-generation chips for ease of implementation and minimal cost. Initially, performance 3D graphics were possible only with discrete boards dedicated to accelerating 3D functions (and lacking 2D GUI acceleration entirely) such as the PowerVR and the 3dfx Voodoo. However, as manufacturing technology continued to progress, video, 2D GUI acceleration, and 3D functionality were all integrated into one chip. Rendition's Vérité chipsets were among the first to do this well enough to be worthy of note. In 1997, Rendition went a step further by collaborating with Hercules and Fujitsu on a "Thriller Conspiracy" project which combined a Fujitsu FXG-1 Pinolite geometry processor with a Vérité V2200 core to create a graphics card with a full T&L engine years before Nvidia's GeForce 256. This card, designed to reduce the load placed upon the system's CPU, never made it to market.
OpenGL appeared in the early '90s as a professional graphics API, but originally suffered from performance issues which allowed the Glide API to step in and become a dominant force on the PC in the late '90s. However, these issues were overcome and the Glide API fell by the wayside. Software implementations of OpenGL were common throughout this time, although the influence of OpenGL eventually led to widespread hardware support. Over time, a parity emerged between features offered in hardware and those offered in OpenGL. DirectX became popular amongst Windows game developers throughout the late '90s. Unlike OpenGL, Microsoft insisted on providing strict one-to-one support of hardware. The approach made DirectX less popular as a standalone graphics API initially, since many GPUs provided their own specific features, which existing OpenGL applications were already able to benefit from, leaving DirectX often one generation behind. (See: Comparison of OpenGL and Direct3D.)
Over time, Microsoft began to work more closely with hardware developers, and started to target the releases of DirectX to coincide with those of the supporting graphics hardware. Direct3D 5.0 was the first version of the burgeoning API to gain widespread adoption in the gaming market, and it competed directly with many hardware-specific, often proprietary graphics libraries, while OpenGL maintained a strong following. Direct3D 7.0 introduced support for hardware-accelerated transform and lighting (T&L), a capability OpenGL had exposed from its inception. 3D accelerator cards moved beyond being simple rasterizers to add another significant hardware stage to the 3D rendering pipeline. The Nvidia GeForce 256 (also known as NV10) was the first consumer-level card released on the market with hardware-accelerated T&L, while professional 3D cards already had this capability. Hardware transform and lighting, both already existing features of OpenGL, came to consumer-level hardware in the late '90s and set the precedent for later pixel shader and vertex shader units, which were far more flexible and programmable.
2000 to 2006
Nvidia was first to produce a chip capable of programmable shading, the GeForce 3 (code named NV20). Each pixel could now be processed by a short program that could include additional image textures as inputs, and each geometric vertex could likewise be processed by a short program before it was projected onto the screen. Used in the Xbox console, it competed with the PlayStation 2, which used a custom vector DSP for hardware-accelerated vertex processing.
By October 2002, with the introduction of the ATI Radeon 9700 (also known as R300), the world's first Direct3D 9.0 accelerator, pixel and vertex shaders could implement looping and lengthy floating point math, and were becoming as flexible as CPUs, yet orders of magnitude faster for image-array operations. Pixel shading is often used for bump mapping, which adds texture to make an object look shiny, dull, rough, or even round or extruded.
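Conceptually, a pixel shader of this era was a short function run once per pixel. The following is a minimal hedged sketch of what a bump-mapping pixel program computes (illustrative C-style CUDA, not a real shading language of the period; all names are invented): sample a perturbed normal from a normal map and use it in the lighting calculation.

```cuda
/* Illustrative sketch of a bump-mapping pixel computation: light the
 * pixel with a normal perturbed by a normal-map sample, so a flat
 * surface shades as if it were rough or extruded. */
__device__ float3 normalize3(float3 v) {
    float inv = rsqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
    return make_float3(v.x * inv, v.y * inv, v.z * inv);
}

__device__ float shade_pixel(float3 normal_sample, /* from the normal map */
                             float3 light_dir) {   /* towards the light   */
    float3 n = normalize3(normal_sample);
    float3 l = normalize3(light_dir);
    float diffuse = n.x * l.x + n.y * l.y + n.z * l.z; /* Lambertian N.L */
    return fmaxf(diffuse, 0.0f);    /* clamp surfaces facing away */
}
```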
2006 to present
With the introduction of Nvidia's GeForce 8 series and subsequent generic stream processing units, GPUs became more generalised computing devices. Today, parallel GPUs have begun making computational inroads against the CPU, and a subfield of research, dubbed GPU computing or GPGPU (general-purpose computing on GPUs), has found its way into fields as diverse as machine learning, oil exploration, scientific image processing, linear algebra, statistics, 3D reconstruction, and even stock options pricing. Over the years, the energy consumption of GPUs has increased, and several techniques have been proposed to manage it.
Nvidia's CUDA platform was the earliest widely adopted programming model for GPU computing. More recently, OpenCL has become broadly supported. OpenCL is an open standard defined by the Khronos Group which allows for the development of code for both GPUs and CPUs with an emphasis on portability. OpenCL solutions are supported by Intel, AMD, Nvidia, and ARM, and according to a report by Evans Data, OpenCL is the GPGPU development platform most widely used by developers in both the US and Asia Pacific.
Many companies have produced GPUs under a number of brand names. In 2009, Intel, Nvidia and AMD/ATI were the market share leaders, with 49.4%, 27.8% and 20.6% market share respectively. However, those numbers include Intel's integrated graphics solutions as GPUs. Excluding those, Nvidia and ATI controlled nearly 100% of the market as of 2008. In addition, S3 Graphics (owned by VIA Technologies) and Matrox produce GPUs.
Modern GPUs use most of their transistors to do calculations related to 3D computer graphics. They were initially used to accelerate the memory-intensive work of texture mapping and rendering polygons, later adding units to accelerate geometric calculations such as the rotation and translation of vertices into different coordinate systems. Recent developments in GPUs include support for programmable shaders which can manipulate vertices and textures with many of the same operations supported by CPUs, oversampling and interpolation techniques to reduce aliasing, and very high-precision color spaces. Because most of these computations involve matrix and vector operations, engineers and scientists have increasingly studied the use of GPUs for non-graphical calculations; GPUs are especially suited to other embarrassingly parallel problems.
In addition to the 3D hardware, today's GPUs include basic 2D acceleration and framebuffer capabilities (usually with a VGA compatibility mode). Newer cards, such as the AMD/ATI HD5000-HD7000 series, even lack dedicated 2D acceleration; it has to be emulated by the 3D hardware.
GPU accelerated video decoding
Most GPUs made after 1995 support the YUV color space and hardware overlays, important for digital video playback, and many GPUs made after 2000 also support MPEG primitives such as motion compensation and iDCT. This process of hardware-accelerated video decoding, where portions of the video decoding process and video post-processing are offloaded to the GPU hardware, is commonly referred to as "GPU accelerated video decoding", "GPU assisted video decoding", "GPU hardware accelerated video decoding" or "GPU hardware assisted video decoding".
More recent graphics cards even decode high-definition video on the card, offloading the central processing unit. The most common APIs for GPU-accelerated video decoding are DxVA for Microsoft Windows operating systems and VDPAU, VAAPI, XvMC, and XvBA for Linux-based and UNIX-like operating systems. All except XvMC are capable of decoding videos encoded with the MPEG-1, MPEG-2, MPEG-4 ASP (MPEG-4 Part 2), MPEG-4 AVC (H.264 / DivX 6), VC-1, WMV3/WMV9, Xvid / OpenDivX (DivX 4), and DivX 5 codecs, while XvMC is only capable of decoding MPEG-1 and MPEG-2.
Video decoding processes that can be accelerated
The video decoding processes that can be accelerated by modern GPU hardware are:
- Motion compensation (mocomp)
- Inverse discrete cosine transform (iDCT; see the sketch after this list)
- Inverse telecine 3:2 and 2:2 pull-down correction
- Inverse modified discrete cosine transform (iMDCT)
- In-loop deblocking filter
- Intra-frame prediction
- Inverse quantization (IQ)
- Variable-length decoding (VLD), more commonly known as slice-level acceleration
- Spatial-temporal deinterlacing and automatic interlace/progressive source detection
- Bitstream processing (Context-adaptive variable-length coding/Context-adaptive binary arithmetic coding) and perfect pixel positioning.
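As a concrete example of why these stages map well to GPUs, here is a hedged sketch of the 8×8 inverse DCT used in MPEG-1/2 decoding, written directly from the IDCT definition with one CUDA thread per output pixel (real decoders use fast separable transforms; the names and memory layout are illustrative assumptions):

```cuda
/* Naive 8x8 inverse DCT: each thread computes one output pixel, and
 * each thread block handles one 8x8 coefficient block. */
__global__ void idct8x8(const float *coeffs, float *pixels) {
    const float PI = 3.14159265f;
    int x = threadIdx.x, y = threadIdx.y;       /* output pixel position */
    const float *F = coeffs + 64 * blockIdx.x;  /* this block's coefficients */
    float sum = 0.0f;
    for (int v = 0; v < 8; ++v) {               /* vertical frequency   */
        for (int u = 0; u < 8; ++u) {           /* horizontal frequency */
            float cu = (u == 0) ? 0.70710678f : 1.0f;  /* C(u) = 1/sqrt(2) */
            float cv = (v == 0) ? 0.70710678f : 1.0f;
            sum += cu * cv * F[v * 8 + u]
                 * cosf((2 * x + 1) * u * PI / 16.0f)
                 * cosf((2 * y + 1) * v * PI / 16.0f);
        }
    }
    pixels[64 * blockIdx.x + y * 8 + x] = 0.25f * sum;
}
/* Launch with e.g.: idct8x8<<<num_blocks, dim3(8, 8)>>>(d_in, d_out); */
```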
Dedicated graphics cards
GPUs of the most powerful class typically interface with the motherboard by means of an expansion slot such as PCI Express (PCIe) or Accelerated Graphics Port (AGP) and can usually be replaced or upgraded with relative ease, assuming the motherboard is capable of supporting the upgrade. A few graphics cards still use Peripheral Component Interconnect (PCI) slots, but their bandwidth is so limited that they're typically used only when a PCIe or AGP slot isn't available.
A dedicated GPU isn't necessarily removable, nor does it necessarily interface with the motherboard in a standard fashion. The term "dedicated" refers to the fact that dedicated graphics cards have RAM that's dedicated to the card's use, not to the fact that most dedicated GPUs are removable. Dedicated GPUs for portable computers are most commonly interfaced through a non-standard and often proprietary slot due to size and weight constraints. Such ports might still be considered PCIe or AGP in terms of their logical host interface, even if they aren't physically interchangeable with their counterparts.
Integrated graphics solutions
Integrated graphics solutions, shared graphics solutions, or integrated graphics processors (IGPs) utilise a portion of a computer's system RAM rather than dedicated graphics memory. IGPs can be integrated onto the motherboard as part of the chipset, or on the same die as the CPU (as in AMD APUs or Intel HD Graphics). On certain motherboards, AMD's IGPs can use dedicated sideport memory: a separate fixed block of high-performance memory dedicated for use by the GPU. In early 2007, computers with integrated graphics accounted for about 90% of all PC shipments. These solutions are less costly to implement than dedicated graphics solutions, but tend to be less capable. Historically, integrated solutions were often considered unfit to play 3D games or run graphically intensive programs but could run less intensive programs such as Adobe Flash. Examples of such IGPs would be offerings from SiS and VIA circa 2004. However, modern integrated graphics processors such as the AMD Accelerated Processing Unit and Intel HD Graphics are more than capable of handling 2D graphics or low-stress 3D graphics.
As a GPU is extremely memory intensive, an integrated solution may find itself competing with the CPU for the relatively slow system RAM, since it has minimal or no dedicated video memory. IGPs can have up to 29.856 GB/s of memory bandwidth from system RAM, whereas a discrete graphics card may have up to 264 GB/s of bandwidth between its RAM and GPU core. This memory bus bandwidth can limit the performance of the GPU. Older integrated graphics chipsets lacked hardware transform and lighting, but newer ones include it.
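For illustration, those two bandwidth figures follow from standard transfer-rate arithmetic if one assumes (the article does not name the parts) dual-channel DDR3-1866 system memory on the integrated side and a 384-bit GDDR5 interface at 5.5 GT/s on the discrete side:

\[ 1866 \times 10^{6}\ \text{transfers/s} \times 8\ \text{B/transfer} \times 2\ \text{channels} = 29.856\ \text{GB/s} \]
\[ 5.5 \times 10^{9}\ \text{transfers/s} \times (384/8)\ \text{B/transfer} = 264\ \text{GB/s} \]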
Hybrid graphics cards are somewhat more costly than integrated graphics, but much less costly than dedicated graphics cards. They share memory with the system and have a small dedicated memory cache to make up for the high latency of the system RAM. Technologies within PCI Express make this possible. While these solutions are at times advertised as having as much as 768 MB of RAM, this refers to how much can be shared with the system memory.
Stream Processing and General Purpose GPUs (GPGPU)
It is becoming increasingly common to use a general-purpose graphics processing unit (GPGPU) as a modified form of stream processor (or vector processor), running compute kernels. This concept turns the massive computational power of a modern graphics accelerator's shader pipeline into general-purpose computing power, as opposed to being hard-wired solely to do graphical operations. In certain applications requiring massive vector operations, this can yield several orders of magnitude higher performance than a conventional CPU. The two largest discrete GPU designers (see "Dedicated graphics cards" above), ATI and Nvidia, are beginning to pursue this approach with an array of applications. Both Nvidia and ATI have teamed with Stanford University to create a GPU-based client for the Folding@home distributed computing project, for protein folding calculations. In certain circumstances, the GPU calculates forty times faster than the conventional CPUs traditionally used by such applications.
GPGPU can be used for many types of embarrassingly parallel tasks, including ray tracing. GPUs are generally suited to high-throughput computations that exhibit data parallelism, exploiting the wide vector-width SIMD architecture of the GPU.
Furthermore, GPU-based high performance computers are starting to play a significant role in large-scale modelling. Three of the 10 most powerful supercomputers in the world take advantage of GPU acceleration.
NVIDIA cards support API extensions to the C programming language such as CUDA and OpenCL. CUDA is specific to NVIDIA GPUs, whilst OpenCL is designed to work across a multitude of architectures including GPU, CPU and DSP (using vendor-specific SDKs). These technologies allow specified functions (kernels) from a normal C program to run on the GPU's stream processors. This makes C programs capable of taking advantage of a GPU's ability to operate on large buffers in parallel, while still making use of the CPU when appropriate. CUDA was also the first API to allow CPU-based applications to directly access the resources of a GPU for more general-purpose computing without the limitations of using a graphics API.
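A minimal sketch of this model in CUDA (illustrative only; the array sizes and names are arbitrary): the `__global__` kernel runs on the GPU's stream processors, one thread per element, while the rest of the program remains ordinary C running on the CPU.

```cuda
#include <stdio.h>
#include <stdlib.h>

/* Kernel: each GPU thread adds one pair of elements. */
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;                      /* one million elements */
    size_t bytes = n * sizeof(float);
    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    float *da, *db, *dc;                        /* device (GPU) buffers */
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    vec_add<<<(n + 255) / 256, 256>>>(da, db, dc, n); /* 256 threads/block */

    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);                /* expect 3.0 */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}
```

An OpenCL version is structured the same way, except the kernel is compiled at run time from a source string, which is part of what makes it portable across GPU, CPU, and DSP back ends.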
Since 2005 there has been interest in using the performance offered by GPUs for evolutionary computation in general, and for accelerating fitness evaluation in genetic programming in particular. Most approaches compile linear or tree programs on the host PC and transfer the executable to the GPU to be run. Typically the performance advantage is only obtained by running the single active program simultaneously on many example problems in parallel, using the GPU's SIMD architecture. However, substantial acceleration can also be obtained by not compiling the programs, and instead transferring them to the GPU to be interpreted there. Acceleration can then be obtained by interpreting multiple programs simultaneously, running multiple example problems simultaneously, or both. A modern GPU (e.g. 8800 GTX or later) can readily interpret hundreds of thousands of very small programs simultaneously.
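A hedged sketch of the interpreted approach (the instruction set and encoding are invented for illustration): each thread evaluates one (program, fitness case) pair, so many cases, and via a second grid dimension many programs, are interpreted simultaneously. Threads that share a program follow the same code path, which suits the SIMD architecture.

```cuda
/* Hypothetical linear-GP interpreter: one thread per (program, case). */
enum Op { OP_LOADX, OP_ADD, OP_SUB, OP_MUL, OP_END };

__global__ void eval_programs(const int *programs, int prog_len,
                              const float *cases, int n_cases,
                              float *results) {
    int prog = blockIdx.y;                          /* which program      */
    int c = blockIdx.x * blockDim.x + threadIdx.x;  /* which fitness case */
    if (c >= n_cases) return;

    const int *code = programs + prog * prog_len;
    float x = cases[c];       /* this case's input value     */
    float acc = 0.0f;         /* single accumulator register */
    for (int pc = 0; pc < prog_len; ++pc) {         /* interpret */
        switch (code[pc]) {
            case OP_LOADX: acc  = x; break;
            case OP_ADD:   acc += x; break;
            case OP_SUB:   acc -= x; break;
            case OP_MUL:   acc *= x; break;
            case OP_END:   pc = prog_len; break;    /* halt */
        }
    }
    results[prog * n_cases + c] = acc;
}
/* Launch with e.g.:
 * eval_programs<<<dim3((n_cases + 127) / 128, n_programs), 128>>>(...); */
```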
External GPU (eGPU)
An external GPU is a graphics processor located outside of the housing of the computer. External graphics processors are at times used with laptop computers. Laptops might have a substantial amount of RAM and a sufficiently powerful central processing unit (CPU), but often lack a powerful graphics processor (and instead have a less powerful but more energy-efficient on-board graphics chip). On-board graphics chips are often not powerful enough for playing the latest games, or for other tasks (video editing, etc.).
Therefore, it is desirable to be able to attach a GPU to some external bus of a notebook. PCI Express is the only bus commonly used for this purpose. The port might be, for example, an ExpressCard or mPCIe port (PCIe ×1, up to 5 or 2.5 Gbit/s respectively) or a Thunderbolt 1, 2, or 3 port (PCIe ×4, up to 10, 20, or 40 Gbit/s respectively). Those ports are only available on certain notebook systems.
In 2013, 438.3 million GPUs were shipped globally and the forecast for 2014 was 414.2 million.
See also
- Comparison of AMD graphics processing units
- Comparison of Nvidia graphics processing units
- Comparison of Intel graphics processing units
- Intel GMA
- Nvidia PureVideo - the bit-stream technology from Nvidia used in their graphics chips to accelerate video decoding in hardware with DXVA.
- UVD (Unified Video Decoder) - the video decoding bit-stream technology from ATI Technologies to support hardware (GPU) decoding with DXVA.
- OpenGL API
- DirectX Video Acceleration (DxVA) API for Microsoft Windows operating systems.
- Mantle (API)
- Vulkan (API)
- Video Acceleration API (VA API)
- VDPAU (Video Decode and Presentation API for Unix)
- X-Video Bitstream Acceleration (XvBA), the X11 equivalent of DXVA for MPEG-2, H.264, and VC-1
- X-Video Motion Compensation, the X11 equivalent for MPEG-2 video codec only
- GPU cluster
- Mathematica includes built-in support for CUDA and OpenCL GPU execution
- MATLAB acceleration using the Parallel Computing Toolbox and MATLAB Distributed Computing Server, as well as third party packages like Jacket.
- Molecular modelling on GPU
- Deeplearning4j, open-source, distributed deep learning for Java. Machine vision and textual topic modelling toolkit.