Compiling and Linking CUDA with C

Managing complexity and modularity becomes important as your project grows, so being able to compile CUDA code separately and link it with C code is a must-have. This post shows how to compile your CUDA code on its own and link it with your C object code.

Example Files

As an example, we will look at a stencil computation (a nearest-neighbor computation). Let’s start with the main program, which first initializes two arrays, a and b. Array a will hold the results and array b holds the input values for the computation. The extern keyword in

extern void perform_stencil(float * a, float * b, const int n)

declares that perform_stencil is defined in a different object file and will be resolved at link time. In our case, that is the object code nvcc generates from the CUDA source.

#include <stdio.h>
#include <stdlib.h>

#define N 256
#define MIN 0 
#define MAX 1000

void initialize_matrices(float * a, float * b);
extern void perform_stencil(float * a, float * b, const int n);

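int main() {
    float * a = (float *)malloc(N * N * N * sizeof(float));
    float * b = (float *)malloc(N * N * N * sizeof(float));
    if (a == NULL || b == NULL) {
        fprintf(stderr, "Host allocation failed\n");
        return 1;
    }

    initialize_matrices(a, b);
    perform_stencil(a, b, N);

    free(a);
    free(b);
    return 0;
}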

void initialize_matrices(float * a, float * b) {
    for (int i = 0; i < N * N * N; i ++) {
        a[i] = 0.0;
        b[i] = MIN + (MAX - MIN) * (rand() / (float)RAND_MAX);
    }
}
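
As the project grows, the extern declaration above could live in one shared header instead of being repeated in every C file. The following is a minimal sketch of that idea, not something from the original example: the file name stencil.h and the include guard STENCIL_H are hypothetical. The __cplusplus guard is there because nvcc compiles .cu files as C++, so the same header can be included from both main.c and the CUDA source and still give the function C linkage.

```c
/* stencil.h: hypothetical shared header (an assumption, not part of the original post) */
#ifndef STENCIL_H
#define STENCIL_H

#ifdef __cplusplus
extern "C" {   /* seen only by C++ compilers such as nvcc: use C linkage */
#endif

/* Defined in the CUDA source file; callable from plain C. */
void perform_stencil(float * a, float * b, const int n);

#ifdef __cplusplus
}
#endif

#endif /* STENCIL_H */
```

With such a header, main.c would include it instead of declaring perform_stencil by hand, and the CUDA file would include it so the compiler can check that the definition matches the prototype.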

Next, let’s look at the CUDA code, where the actual computation happens. nvcc compiles .cu files as C++, so function names are mangled by default. The extern "C" specifier gives perform_stencil C linkage, that is, an unmangled symbol name that the linker can match to the declaration in the C program. If the calling code were C++, you would not need the specifier.

#include <stdio.h>

__global__ void kernel(float * a, float * b, const int N) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    size_t j = blockIdx.y * blockDim.y + threadIdx.y;
    size_t k = blockIdx.z * blockDim.z + threadIdx.z;

    /* Update interior points only, so that all six neighbors are inside the grid. */
    if (0 < i && 0 < j && 0 < k && i < N - 1 && j < N - 1 && k < N - 1) {
        /* Flattened 3D index: element (i, j, k) lives at i * N * N + j * N + k. */
        size_t idx = i * N * N + j * N + k;
        a[idx] = 0.8 * (b[idx - N * N] + b[idx + N * N] + b[idx - N] +
                        b[idx + N] + b[idx - 1] + b[idx + 1]);
    }
}


extern "C" void perform_stencil(float * a, float * b, const int N) {
    float * d_a;
    float * d_b;

    cudaEvent_t start, stop;
    float       elapsedTime;

    /* begin timing */
    cudaEventCreate(&start);
    cudaEventRecord(start, 0);

    cudaMalloc(&d_a, sizeof(float) * N * N * N);
    cudaMalloc(&d_b, sizeof(float) * N * N * N);

    cudaMemcpy(d_a, a, sizeof(float) * N * N * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(float) * N * N * N, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(8, 8, 8);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y, N / threadsPerBlock.z);

    kernel <<<numBlocks, threadsPerBlock>>>(d_a, d_b, N);
    cudaMemcpy(a,d_a, sizeof(float) * N * N * N, cudaMemcpyDeviceToHost);
    /* end timing */
    cudaEventCreate(&stop);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("Execution time: %f seconds\n", elapsedTime / 1000);
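    /* release the timing events and the device buffers */
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    cudaFree(d_b);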
}
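
When a computation moves to the GPU, it helps to have something on the host to compare against. Below is a minimal sketch of a CPU reference for the same six-neighbor stencil in plain C. It assumes a row-major layout where element (i, j, k) of an n*n*n grid lives at i * n * n + j * n + k, and it reuses the 0.8 coefficient from the kernel. The names idx3, stencil_reference and max_abs_diff are hypothetical helpers introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: flattened row-major index of (i, j, k) in an n*n*n grid. */
static size_t idx3(size_t i, size_t j, size_t k, size_t n) {
    return (i * n + j) * n + k;
}

/* Hypothetical host-side reference: each interior point of a becomes
   0.8 times the sum of its six face neighbors in b. Boundary points
   of a are left untouched. */
void stencil_reference(float *a, const float *b, int n) {
    assert(a != NULL && b != NULL && n > 0);
    size_t m = (size_t)n;
    for (size_t i = 1; i + 1 < m; i++) {
        for (size_t j = 1; j + 1 < m; j++) {
            for (size_t k = 1; k + 1 < m; k++) {
                float sum = b[idx3(i - 1, j, k, m)] + b[idx3(i + 1, j, k, m)] +
                            b[idx3(i, j - 1, k, m)] + b[idx3(i, j + 1, k, m)] +
                            b[idx3(i, j, k - 1, m)] + b[idx3(i, j, k + 1, m)];
                a[idx3(i, j, k, m)] = (float)(0.8 * sum);
            }
        }
    }
}

/* Hypothetical helper: largest absolute element-wise difference between x and y. */
float max_abs_diff(const float *x, const float *y, size_t count) {
    float worst = 0.0f;
    for (size_t t = 0; t < count; t++) {
        float d = x[t] - y[t];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

One way to use it: after perform_stencil(a, b, N) returns, run stencil_reference on a second host buffer with the same b, then compare the two results with max_abs_diff against a small tolerance rather than expecting bit-identical values, since CPU and GPU floating-point results need not match exactly.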

Compiling and Linking CUDA

This section walks through the steps required for compiling and linking CUDA with C. First, you need to compile the CUDA code. Because the final link is done separately, the build also needs a device code linking step.

Two nvcc flags are involved in device code linking.

  • -dc tells nvcc to compile to an object file that contains relocatable device code for later linking. You also have to specify a target architecture explicitly (for example -arch=sm_20 or, as in this post, sm_35), because not all SM code variants support device linking and nvcc must know that it’s a compatible SM architecture. If you do the final link with nvcc, it performs the device link for you.
  • -dlink tells nvcc to link the CUDA device code into a separate object file, so that a host linker such as gcc can link the result like any other object.

We will run the -dlink step explicitly because we are using gcc in the final step.

Step                       Command
Compile CUDA object file   nvcc -O3 -dc kernal.cu -o target.o -gencode arch=compute_35,code=sm_35
Device link                nvcc -O3 target.o -o dlink.o -gencode arch=compute_35,code=sm_35 -dlink
Compile C code             gcc -std=c99 -c main.c -o main.o
Final link                 gcc -std=c99 dlink.o main.o target.o -o stencil -lcudadevrt -lcudart

Here’s an example Makefile containing all of the above steps.

SM := 35

CC := gcc
NVCC := nvcc

CFLAGS = -std=c99
NVCCFLAGS = -O3

GENCODE_FLAGS = -gencode arch=compute_$(SM),code=sm_$(SM)
LIB_FLAGS = -lcudadevrt -lcudart

BUILDDIR = build

TARGET = stencil
all: $(TARGET)

$(TARGET): $(BUILDDIR)/dlink.o $(BUILDDIR)/main.o $(BUILDDIR)/$(TARGET).o
	$(CC) $(CFLAGS) $^ -o $@ $(LIB_FLAGS)

$(BUILDDIR)/dlink.o: $(BUILDDIR)/$(TARGET).o 
	$(NVCC) $(NVCCFLAGS) $^ -o $@ $(GENCODE_FLAGS) -dlink

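$(BUILDDIR)/main.o: main.c
	@mkdir -p $(BUILDDIR)
	$(CC) $(CFLAGS) -c $< -o $@

# -dc compiles to an object file with relocatable device code (no linking)
$(BUILDDIR)/$(TARGET).o: kernal.cu
	@mkdir -p $(BUILDDIR)
	$(NVCC) $(NVCCFLAGS) -dc $< -o $@ $(GENCODE_FLAGS)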

clean:
	rm -f $(BUILDDIR)/*.o $(TARGET) 

That’s all you need to know! Hope this blog post enabled you to organize your project better!
