GPU Speedup


In our last blog, Concurrency and Parallelism, we demonstrated the speedup of large operations through an example: initializing a two-dimensional array with concurrency and limited parallelism across the multiple cores of a CPU.

In this blog we will move the entire operation to the GPU and utilize its tremendous capacity for parallelism.


A form of parallelism known as single instruction, multiple data (SIMD) refers to the ability of most modern microprocessors to perform a mathematical operation on multiple data items in parallel, using a single machine instruction. The combination of SIMD and multithreading forms a kind of parallelism known as single instruction, multiple threads (SIMT), the basis of all modern GPUs.
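As a tiny illustration of the SIMD idea, Swift's standard library exposes SIMD vector types; the single `*` below conceptually operates on four lanes of data at once. This is just a sketch of the concept on the CPU, not GPU code:

```swift
// Element-wise multiply of four floats with one vector operation.
// SIMD4 is part of the Swift standard library; no import needed.
let a = SIMD4<Float>(1, 2, 3, 4)
let b = SIMD4<Float>(10, 20, 30, 40)
let r = a * b // one operation across all four lanes: (10, 40, 90, 160)
```

The same elementwise independence is exactly what makes the kernel below a good fit for the GPU.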

GPUs are designed specifically to perform data-parallel computations on very large datasets. For computational tasks to be well suited for execution on a GPU, the computations performed on any one element of the dataset must be independent of the results of computation on other elements. The Metal kernel example below shows that the result matrix r at a particular index is independent of all other indices in the same matrix:

kernel void multiply(device float* r [[buffer(0)]],
                     constant float* a [[buffer(1)]],
                     constant float* b [[buffer(2)]],
                     uint pid [[thread_position_in_grid]]) {
    r[pid] = a[pid] * b[pid];
}


One might ask why there is such a fundamental difference between the GPU and the CPU in terms of parallel performance. The answer lies in the design philosophies of the two types of processors.

A GPU must be capable of moving extremely large amounts of data in and out of its main DRAM because of graphics frame-buffer requirements; this capability is called throughput. A CPU, in contrast, has to satisfy requirements from a legacy OS, applications, and I/O operations, which make memory bandwidth more difficult to increase; CPUs are therefore designed to minimize the execution latency of a single thread.

An important observation is that reducing latency is much more expensive than increasing throughput in terms of power and chip area. Therefore, the prevailing solution is to optimize for the execution throughput of massive numbers of threads.

One can conclude that an ideal application is designed for both low latency and high throughput: low latency allows the CPU to execute commands sequentially without the system becoming slow or unresponsive, while high throughput lets the GPU process data in parallel.


Finally the fun part, coding. Many APIs exist for parallel programming, such as CUDA by NVIDIA, OpenCL by the Khronos Group, DirectX by Microsoft, and, most importantly for us, Metal by Apple. As usual on this site, we will focus on Apple's Metal API.

We will start by learning the essential tasks of Metal programming for the purpose of GPGPU; remember, graphics programming is out of the scope of this post.

Metal Initialization

Communicating with the GPU requires the following line:

var device = MTLCreateSystemDefaultDevice()!

The device object is responsible for creating, directly or indirectly, the other Metal objects, which are usable only with that device object. Apps that use multiple GPUs use multiple device objects and create a similar hierarchy of Metal objects for each.

  1. CommandQueue: responsible for creating and organizing the command buffers used to send and schedule tasks to the GPU.
  2. Library: like the commandQueue, it is created by the device; it holds pointers to the Metal functions: vertex, fragment, and kernel.
  3. Pipeline: specifies the steps the GPU performs to complete a specific task, by converting a function into executable code on the GPU. Because we are focusing on GPGPU, we use a kernel function.
  4. Buffer: holds the data and commands for the GPU.
  5. Encoder: the commandEncoder writes commands into the buffer object.

Summing it up: you start with one MTLDevice object, use it to create one MTLCommandQueue object and one MTLLibrary object in your app, and you will have at least one MTLComputePipelineState object and at least one MTLBuffer object.


Massive parallelism calls for massive numbers of threads! In Metal, threads are organized in 1D, 2D, or 3D grids. Metal subdivides a grid into threadgroups, themselves up to three-dimensional. Threads in a threadgroup share the same memory space.

Thread counts are specified when work is dispatched, within limits that are part of the pipeline state.
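To make the grid/threadgroup split concrete, here is a small sketch in plain Swift arithmetic, with hypothetical numbers; in real Metal code, the pipeline state's `threadExecutionWidth` suggests a good threadgroup width:

```swift
// Hypothetical sizes, for illustration only.
let totalThreads = 30000 * 4000   // one thread per matrix element
let threadsPerGroup = 512         // e.g. a multiple of threadExecutionWidth
// Ceiling division: enough threadgroups to cover every element,
// even when totalThreads is not an exact multiple of the group size.
let threadgroups = (totalThreads + threadsPerGroup - 1) / threadsPerGroup
print(threadgroups) // 234375
```

The code below takes the simplest possible approach instead, one thread per threadgroup, which works but leaves performance on the table.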


The code on the CPU side:

import Foundation
import MetalKit

let row : uint = 30000
var column : uint = 4000
// Metal buffers expect contiguous memory, so the matrix is stored as a flat, row-major array.
var array = [Float](repeating: 0, count: Int(row * column))

let start = DispatchTime.now() // <<<<<<<<<< Start time

var device = MTLCreateSystemDefaultDevice()!
var commandQueue = device.makeCommandQueue()!
var library = device.makeDefaultLibrary()
let commandBuffer = commandQueue.makeCommandBuffer()
let computeEncoder = commandBuffer?.makeComputeCommandEncoder()
var computeFunction = library?.makeFunction(name: "kernel_main")!
var computePipelineState = try! device.makeComputePipelineState(function: computeFunction!)
var matrixBuffer = device.makeBuffer(bytes: &array, length: Int(row*column) * MemoryLayout<Float>.stride, options: [])
computeEncoder?.setComputePipelineState(computePipelineState)
computeEncoder?.setBuffer(matrixBuffer, offset: 0, index: 0)
computeEncoder?.setBytes(&column, length: MemoryLayout<uint>.stride, index: 1)

let threadsPerThreadGrid = MTLSizeMake(Int(row * column), 1, 1)
computeEncoder?.dispatchThreadgroups(threadsPerThreadGrid, threadsPerThreadgroup: MTLSizeMake(1, 1, 1))
computeEncoder?.endEncoding()
commandBuffer?.commit()
commandBuffer?.waitUntilCompleted() // wait for the GPU so the timing below is meaningful

let end = DispatchTime.now() // <<<<<<<<<< end time

let nanoTime = end.uptimeNanoseconds - start.uptimeNanoseconds // <<<<< Difference in nano seconds (UInt64)
let timeInterval = Double(nanoTime) / 1_000_000_000 // Technically could overflow for long running tests

print("Time to execute: \(timeInterval) seconds")

let contents = matrixBuffer?.contents()
let pointer = contents?.bindMemory(to: Float.self, capacity: Int(row * column))

Going through the code

  1. Creating a device object, which directly creates the commandQueue and library objects. These objects remain the same throughout the lifetime of the application.
  2. Creating a commandBuffer to hold the data and commands the CPU will set up for the commandQueue object.
  3. Creating the library object that holds a pointer to the file where the function is defined. Next, the pipeline object is created; it will only execute the function on the current buffer object.
  4. The matrixBuffer object is a GPU memory allocation that is initialized with the array variable.
  5. The encoder object sets the pipeline, all objects that need to be sent to GPU memory, and the number of threads the GPU needs to create and execute.
  6. Finally, we end the encoding and commit the buffer to the GPU for execution. For timing purposes, we wait for the GPU to finish.

The following illustration demonstrates the code above:

The code on the GPU side:

#include <metal_stdlib>
using namespace metal;

kernel void kernel_main(device float* factors [[buffer(0)]],
                        constant uint& column [[buffer(1)]],
                        uint pid [[thread_position_in_grid]]){
    factors[pid] = (pid / column) * (pid % column);
}

Going through the parameters of our function:

  1. The device memory space allows read and write operations; the factors array will store the result of the multiplication. The buffer attribute is at location 0, where we set it on the CPU side.
  2. The column variable is in the constant memory space since it is only read. The buffer attribute is at location 1, where it was set on the CPU side.
  3. The last parameter is the thread position in the grid; here we put every thread in its own threadgroup because no cooperation between threads is needed. Each thread independently computes the value for its location in the array.

The body of the function is a single line. Our two-dimensional array on the CPU side has been passed as a one-dimensional array; that is why we need the number of columns to know where each row begins.
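The same row/column recovery the kernel performs can be sketched in plain Swift: for a flattened row-major array, dividing the linear index by the column count gives the row, and the remainder gives the column. Here with a small hypothetical 3×4 matrix:

```swift
// Recover 2D coordinates from a linear index in a row-major 3x4 matrix.
let columns = 4
let pid = 9                        // linear index of one "thread"
let rowIndex = pid / columns       // 9 / 4 = 2
let colIndex = pid % columns       // 9 % 4 = 1
// The kernel stores rowIndex * colIndex at each position:
let value = Float(rowIndex) * Float(colIndex) // 2.0
```

Every GPU thread runs this same computation, just with its own `pid`.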

Metal is low level, very low: all memory follows C-style semantics. Note that all kernel functions in Metal must return void.


From our previous blog, using the CPU with GCD and setting 30000 rows with 4000 columns, we get the following output:

Time to execute: 31.385106194 seconds
Program ended with exit code: 0

Compared to utilizing the GPU, the output of the above code is:

Time to execute: 0.488734338 seconds
Program ended with exit code: 0

A massive speedup, about 64.2 times faster! That’s more than 6400%, with far less stress on the CPU!


The performance and efficiency gained from the GPU make it a must-use in modern applications. In our next blog we will demonstrate more features and how to utilize the GPU further.

Hopefully you found this post useful; please consider following and pressing the like button. Recommending it to others is greatly appreciated. Please follow me on Medium.

Thank you and til next time, Happy Coding!

