Nvidia CUDA in Jupyter¶

What is CUDA?¶

It's Nvidia's GPU computing platform. I don't like a lot of things about Nvidia or what they do, but there aren't a lot of alternatives. It works by compiling C or C++ code into PTX assembly, then passing that to the GPU, where it can run across tons of threads. A lot of machine-learning papers are built around GPU-related foolery and require close-to-the-metal access, so it'd be nice to have nvcc working in Jupyter (where I do most of my notes and re-implementations of papers). I'll also probably use it just as a way to compile C code in Jupyter, because all the other plugins I've tried don't really work.

Setting It Up¶

First install this random plugin from the internet without verifying that it isn't malware.

In [2]:
!pip install nvcc4jupyter
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: nvcc4jupyter in /home/jarlold/.sage/local/lib/python3.10/site-packages (1.2.1)

Then blindly execute it with full user permissions in my home folder.

In [2]:
%load_ext nvcc4jupyter
Source files will be saved in "/tmp/tmp63l0jrw0".

Host vs Device Code¶

CUDA is pretty much just good old-fashioned C, except for a few things:

  • We can call some functions using a special kind of multi-threading for GPUs.
  • Every function we write has to be designated as device, host, or global (and that last one comes with some restrictions).
  • When we allocate, copy, or free memory, we need to be mindful of where that memory lives (on the CPU or the GPU).

To get data to and from the GPU, we pretty much always have to allocate memory on both sides and then copy it over. As mentioned, there are three kinds of functions in CUDA: host, device, and global. host really means CPU and device really means GPU. In general, host functions can only call host and global functions, and device functions can only be called from global and device functions. That means getting work from the CPU onto the GPU (or back) has to go through a global function. But global functions must always return void, so we just pass them pointers to arrays for the output.
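
To make that concrete, here's a minimal sketch of the three qualifiers (the function names are made up for illustration):

__device__ int cube(int v) { return v * v * v; } //# GPU-only helper, callable from device and global functions

__global__ void fill_cubes(int* out) { //# the entry point: launched from the host, must return void
    out[threadIdx.x] = cube(threadIdx.x);
}

int main() { //# plain host code (__host__ is the default, so we can leave it off)
    int* device_out;
    cudaMalloc(&device_out, 32 * sizeof(int));
    fill_cubes<<<1, 32>>>(device_out);
    cudaFree(device_out);
    return 0;
}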

Cookbook¶

Hello World¶

In [4]:
%%cuda
#include <iostream>
int main() {
    std::cout << "CUDA is set up right!\n";
    return 0;
}
CUDA is set up right!

Sharing Memory¶

In [23]:
%%cuda
#include <stdio.h>

__global__ void operate(int* x, int size_x) {
    //# We'll just cube every number in the chunk
    for (int i = 0; i < size_x; i++) {
        x[i] = x[i] * x[i] * x[i];
    }
}

int main(void) {
    //# Create an array of 2^20 integers
    long int N = 1<<20;
    int *device_x, *host_x;
    cudaMalloc(&device_x, N*sizeof(int));
    host_x = (int*)malloc(N*sizeof(int));

    //# Initialize the host x as all 2s
    for (int i = 0; i < N; i++) {
        host_x[i] = 2;
    }

    //# Then copy it over to the device list
    cudaMemcpy(device_x, host_x, N*sizeof(int), cudaMemcpyHostToDevice);

    //# Run our cool function on them
    operate<<<1, 1>>>(device_x, N);

    //# Then copy them over again
    cudaMemcpy(host_x, device_x, N*sizeof(int), cudaMemcpyDeviceToHost);

    //# And we'll just print the first one since the results will all be the same
    printf("Value is %d\n", host_x[0]);

    //# We can now deallocate it 
    cudaFree(device_x);
    free(host_x);
}
Value is 8
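
One thing worth noting: none of the CUDA calls above are checked for errors, and they fail silently. A minimal sketch of checking the kernel launch in the cell above, using the runtime's cudaGetLastError and cudaGetErrorString:

//# After the launch, ask the runtime whether anything went wrong
operate<<<1, 1>>>(device_x, N);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
}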

Assembly Code (PTX)¶

A full list of the assembly instructions is available in Nvidia's PTX ISA documentation, but for the code below we'll only use mul.
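
If you want to see the PTX that nvcc generates for your own code, you can ask it to stop at that stage (the file names here are just for illustration):

nvcc -ptx kernel.cu -o kernel.ptx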

The mul instruction is a good one for demonstrating operand modes; the general syntax goes something like this:

mul.mode.type d, a, b

Where mode is one of the following:

  • .hi which means "multiply a and b, and keep the upper half of the double-width product in a register the same size as a and b."
  • .lo which is like the above, but keeps the lower half.
  • .wide which means "multiply a and b, and write the full double-width product to a register twice the size of a or b."

Then type is one of the following, which specifies the type of the two input operands a and b (they must both be the same type):

  • u16, u32, u64
  • s16, s32, s64
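
For instance, the difference between .lo and .wide looks something like this (a quick sketch; the register names are illustrative):

mul.lo.u32 %r2, %r0, %r1;    // 32x32 multiply, keep the low 32 bits in a 32-bit register
mul.wide.u32 %rd2, %r0, %r1; // 32x32 multiply, keep the full 64-bit product in a 64-bit register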

So for the example below, we want to multiply two unsigned 32-bit integers and keep the lower half of the product. That means we want to use mul.lo.u32.

In [6]:
%%cuda
#include <stdio.h>
#include <time.h>

__device__ int do_assembly(unsigned int x) {
    //# We'll allocate a normal integer y to be our output register
    int y;

    //# Then run this inline assembly, which uses mul.lo.u32 twice (since we're cubing)
    asm(
        //# y = x * x
        " mul.lo.u32 %0, %1, %1;"
        //# y = y * x
        " mul.lo.u32 %0, %0, %1;"        
        //# This specifies that %0 is our output register, and that it should be bound to y.
        //# and that %1 is our input register, and it should be bound to x.
        //# (if we had another input it would be %2)
        : "=r"(y) : "r" (x));     
    return y;
}

__global__ void operate(int* x, int size_x) {
    //# We'll just cube every number in the chunk
    for (int i = 0; i < size_x; i++) {
        //# Then in this example we'll just call our assembly code on it
        x[i] = do_assembly(x[i]);
    }
}

int main(void) {
    //# Create an array of 2^20 integers
    long int N = 1<<20;
    int *device_x, *host_x;
    cudaMalloc(&device_x, N*sizeof(int));
    host_x = (int*)malloc(N*sizeof(int));

    //# Initialize the host x as all 2s
    for (int i = 0; i < N; i++) {
        host_x[i] = 2;
    }

    //# Then copy it over to the device list
    cudaMemcpy(device_x, host_x, N*sizeof(int), cudaMemcpyHostToDevice);

    //# Run our cool function on them, timing it of course
    printf("Time is %lu\n", time(NULL));
    operate<<<1, 1>>>(device_x, N);
    printf("Time is %lu\n", time(NULL));

    //# Then copy them over again
    cudaMemcpy(host_x, device_x, N*sizeof(int), cudaMemcpyDeviceToHost);

    //# And we'll just print the first one since the results will all be the same
    printf("Value is %d\n", host_x[0]);

    //# We can now deallocate it 
    cudaFree(device_x);
    free(host_x);
}
Time is 1728776155
Time is 1728776155
Value is 8

Temporary Registers in PTX¶

If we want to allocate and use temporary registers, we're almost always better off giving them some scope with { and } (just like in any C-like language). That way the register names can't collide if the compiler expands the assembly in more than one place.

In [5]:
%%cuda
#include <stdio.h>

__device__ int do_assembly(unsigned int x) {
    int y;
    asm("{"
        //# Declare a temporary 32-bit register, scoped to this block
        ".reg .u32 t1;"
        //# t1 = x * x
        " mul.lo.u32 t1, %1, %1;"
        //# y = t1
        " mov.u32 %0, t1;"
        "}"
        : "=r"(y) : "r" (x));
    return y;
}

__global__ void operate(int* x, int size_x) {
    //# We'll just square every number in the chunk (this version only multiplies once)
    for (int i = 0; i < size_x; i++) {
        //# Then in this example we'll just call our assembly code on it
        x[i] = do_assembly(x[i]);
    }
}

int main(void) {
    //# Create an array of 2^20 integers
    long int N = 1<<20;
    int *device_x, *host_x;
    cudaMalloc(&device_x, N*sizeof(int));
    host_x = (int*)malloc(N*sizeof(int));

    //# Initialize the host x as all 2s
    for (int i = 0; i < N; i++) {
        host_x[i] = 2;
    }

    //# Then copy it over to the device list
    cudaMemcpy(device_x, host_x, N*sizeof(int), cudaMemcpyHostToDevice);

    //# Run our cool function on them
    operate<<<1, 1>>>(device_x, N);

    //# Then copy them over again
    cudaMemcpy(host_x, device_x, N*sizeof(int), cudaMemcpyDeviceToHost);

    //# And we'll just print the first one since the results will all be the same
    printf("Value is %d\n", host_x[0]);

    //# We can now deallocate it 
    cudaFree(device_x);
    free(host_x);
}
Value is 4

Matrix Multiplication¶

This time we launch a real 2D grid: each thread computes a single entry of C (one inner product), working out its row and column from blockIdx, blockDim, and threadIdx.

In [22]:
%%cuda
#include <stdio.h>
#include <stdlib.h> //# for random()
#include <cuda_runtime.h>

__global__ void inner_product(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if(row < N && col < N) {
        float sum = 0;
        for(int k = 0; k < N; ++k) sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

void print_matrix(float* m, int n) {
    for (int i = 0; i < n*n; i++) {
        if (i % n == 0) printf("\n");
        printf("%06.2f ", m[i]);
    }
}

int main() {
    int N = 8;
    float *A, *B, *C;
    float *d_A, *d_B, *d_C;

    //# Make 3 matrices of size N*N
    A = (float *)malloc(N * N * sizeof(float));
    B = (float *)malloc(N * N * sizeof(float));
    C = (float *)malloc(N * N * sizeof(float));

    //# Then initialize all their values to random floating-point numbers (except C of course)
    for (int i = 0; i < N*N; i++) {
        A[i] = random()/(float)random();
        B[i] = random()/(float)random();
    }

    //# Allocate space for them on the GPU
    cudaMalloc((void **)&d_A, N * N * sizeof(float));
    cudaMalloc((void **)&d_B, N * N * sizeof(float));
    cudaMalloc((void **)&d_C, N * N * sizeof(float));

    //# Then copy them over to the GPU
    cudaMemcpy(d_A, A, N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * N * sizeof(float), cudaMemcpyHostToDevice);

    //# Then finally we can call the matrix multiplication
    dim3 threadsPerBlock(N, N);
    //# Round up so we'd still cover every row/column if N weren't a multiple of the block size
    dim3 blocksPerGrid((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    inner_product<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    //# Then copy the answer back
    cudaMemcpy(C, d_C, N * N * sizeof(float), cudaMemcpyDeviceToHost);

    //# And print the result
    print_matrix(C, N);

    //# And free up all the memory we allocated
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    free(A);
    free(B);
    free(C);

    return 0;
}
038.71 020.86 622.17 025.69 038.73 083.74 048.12 012.90 
005.49 017.14 043.33 008.36 013.66 013.23 009.11 005.72 
012.90 028.07 136.42 014.53 015.16 021.71 037.36 007.02 
014.61 018.56 163.75 016.92 022.81 027.45 033.80 010.13 
008.39 010.43 057.83 008.36 019.13 014.72 025.88 005.55 
018.82 264.59 082.10 043.10 059.06 038.45 070.76 019.03 
021.69 044.17 088.10 019.55 027.25 024.69 090.19 014.09 
017.12 039.01 302.73 023.22 015.97 038.42 021.50 007.87 

Timing Functions¶

For one reason or another, the time(NULL); way of getting the system time (like you'd usually use in C) is not terribly useful in CUDA. It's accurate when you're just running host functions on the CPU, but kernel launches are asynchronous: the CPU queues the kernel and moves on immediately, so timestamps taken around a launch only measure the launch itself, not the GPU work (which is why the two Time is prints earlier came out identical). According to Nvidia's documentation the right way to time this is using CUDA Events. For our example (and for most uses) we'll only need the following utilities:

  • cudaEvent_t event; the type CUDA uses to store recorded events.
  • cudaEventCreate(event_pointer); we have to initialize the events using this.
  • cudaEventRecord(event); records when an event happens.
  • cudaEventSynchronize(event); stops the CPU from executing any code until the GPU records event.
  • cudaEventElapsedTime(output_float, event1, event2); computes the milliseconds between event1 and event2 and stores the result in output_float.

In [35]:
%%cuda
#include <stdio.h>
__global__ void operate(int* x, int size_x) {
    //# We'll just cube every number in the chunk
    for (int i = 0; i < size_x; i++) {
        x[i] = x[i] * x[i] * x[i];
    }
}

int main(void) {
    //# Create an array as long as THE MAX VALUE OF AN UNSIGNED 32-BIT INT
    unsigned long int N = 4294967295;
    int *device_x, *host_x;
    cudaMalloc(&device_x, N*sizeof(int));
    host_x = (int*)malloc(N*sizeof(int));

    //# Create some CUDA events so we can keep track of time
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    //# Initialize the host x as all 2s
    for (unsigned long int i = 0; i < N; i++) {
        host_x[i] = 2;
    }

    //# Then copy it over to the device list
    cudaMemcpy(device_x, host_x, N*sizeof(int), cudaMemcpyHostToDevice);

    //# Record the start time
    cudaEventRecord(start);

    //# Run our cool function
    operate<<<1, 1>>>(device_x, N);

    //# Record the stop time
    cudaEventRecord(stop);

    //# This function stops the CPU execution until the "stop" event is recorded
    cudaEventSynchronize(stop);

    //# Then we'll compute the elapsed time using another CUDA function
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);

    printf("Operation completed in %fms\n", milliseconds);
    
    //# Then copy them over again
    cudaMemcpy(host_x, device_x, N*sizeof(int), cudaMemcpyDeviceToHost);

    //# And we'll just print the first one since the results will all be the same
    printf("Value is %d\n", host_x[0]);

    //# We can now deallocate it 
    cudaFree(device_x);
    free(host_x);
}
Operation completed in 0.006144ms
Value is 2

(Note the value is still 2 rather than 8: N doesn't fit in the kernel's int size_x parameter, so it gets truncated to -1 on typical systems and the loop body never runs, which is also why the "operation" finishes in microseconds.)
