As a project grows, managing complexity and keeping the code modular becomes important. Compiling CUDA code separately and linking it with C object code is therefore a must-have. This post shows how to compile your CUDA code on its own and link it with your C object code.
Example Files
As an example, we will look at a stencil computation (a nearest-neighbor computation). Let's start with the main program. It first initializes two arrays, A and B: array A will hold the results and array B holds the input values for the computation. The extern keyword in
extern void perform_stencil(float * a, float * b, const int n)
declares that perform_stencil is defined in another translation unit, so its object code will be linked in separately. In our case, that is the object code nvcc generates from the CUDA code.
#include <stdio.h>
#include <stdlib.h>

#define N 256
#define MIN 0
#define MAX 1000

void initialize_matrices(float * a, float * b);
/* defined in the CUDA source file and linked in separately */
extern void perform_stencil(float * a, float * b, const int n);

int main() {
    float * a = (float *)malloc(N * N * N * sizeof(float));
    float * b = (float *)malloc(N * N * N * sizeof(float));
    if (a == NULL || b == NULL) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    initialize_matrices(a, b);
    perform_stencil(a, b, N);
    free(a);
    free(b);
    return 0;
}

/* a starts at zero; b gets random values in [MIN, MAX] */
void initialize_matrices(float * a, float * b) {
    for (int i = 0; i < N * N * N; i ++) {
        a[i] = 0.0;
        b[i] = MIN + (MAX - MIN) * (rand() / (float)RAND_MAX);
    }
}
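The prototype in main.c has to match the definition in the CUDA file exactly. One common way to keep the two in sync is a small shared header. The sketch below assumes a hypothetical file name, stencil.h, which is not part of the original example. The `__cplusplus` guard lets the same header be included from main.c (compiled as C by gcc) and from the CUDA file (compiled as C++ by nvcc).

```c
/* stencil.h (hypothetical file name, not part of the original example) */
#ifndef STENCIL_H
#define STENCIL_H

#ifdef __cplusplus
extern "C" {   /* only seen by C++ compilers such as nvcc */
#endif

/* Same signature as the declaration in main.c. */
void perform_stencil(float *a, float *b, const int n);

#ifdef __cplusplus
}
#endif

#endif /* STENCIL_H */
```

With such a header, main.c could `#include "stencil.h"` in place of its own extern line, and the compiler would check that both sides agree on the signature.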
Next, let's look at the CUDA code, where the actual computation happens. nvcc compiles .cu files as C++, so function names are mangled by default. The extern "C" specifier gives perform_stencil C linkage, which lets the linker match it to the declaration in the C program. If the calling code were C++, you would not need the specifier.
#include <stdio.h>
/* one thread per grid point; only interior points (1 .. N - 2) are updated */
__global__ void kernel(float * a, float * b, const int N) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    size_t j = blockIdx.y * blockDim.y + threadIdx.y;
    size_t k = blockIdx.z * blockDim.z + threadIdx.z;
    if (0 < i && 0 < j && 0 < k && i < N - 1 && j < N - 1 && k < N - 1) {
        /* flattened 3D index: (i, j, k) -> i * N * N + j * N + k */
        a[i * N * N + j * N + k] = 0.8 * (b[(i - 1) * N * N + j * N + k] + b[(i + 1) * N * N + j * N + k] +
            b[i * N * N + (j - 1) * N + k] + b[i * N * N + (j + 1) * N + k] +
            b[i * N * N + j * N + k - 1] + b[i * N * N + j * N + k + 1]);
    }
}
extern "C" void perform_stencil(float * a, float * b, const int N) {
    float * d_a;
    float * d_b;
    cudaEvent_t start, stop;
    float elapsedTime;
    size_t bytes = sizeof(float) * N * N * N;

    /* begin timing (includes device allocation and host/device copies) */
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    /* assumes N is a multiple of the block edge (8) */
    dim3 threadsPerBlock(8, 8, 8);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y, N / threadsPerBlock.z);
    kernel <<<numBlocks, threadsPerBlock>>>(d_a, d_b, N);
    cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);

    /* end timing; cudaEventElapsedTime reports milliseconds */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("Execution time: %f seconds\n", elapsedTime / 1000);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    cudaFree(d_b);
}
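To sanity-check the GPU result, it can help to have a plain C version of the same computation. The sketch below is not part of the original post: `stencil_reference` and `idx3` are hypothetical names, and it assumes the arrays are row-major n x n x n volumes in which point (i, j, k) lives at offset (i * n + j) * n + k. Each interior point becomes 0.8 times the sum of its six nearest neighbors, and boundary points are left untouched.

```c
#include <stddef.h>

/* Offset of point (i, j, k) in a flattened n x n x n array (assumed layout). */
static size_t idx3(size_t i, size_t j, size_t k, size_t n) {
    return (i * n + j) * n + k;
}

/* CPU reference for the six-neighbor stencil; writes interior points only. */
void stencil_reference(float *a, const float *b, int n) {
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            for (int k = 1; k < n - 1; k++) {
                float sum = b[idx3(i - 1, j, k, n)] + b[idx3(i + 1, j, k, n)]
                          + b[idx3(i, j - 1, k, n)] + b[idx3(i, j + 1, k, n)]
                          + b[idx3(i, j, k - 1, n)] + b[idx3(i, j, k + 1, n)];
                a[idx3(i, j, k, n)] = 0.8f * sum;
            }
        }
    }
}
```

Running `stencil_reference` on a copy of the input and comparing it, within a small tolerance, against the array filled by `perform_stencil` gives a quick correctness check.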
Compiling and Linking CUDA
This section walks through the steps required to compile the CUDA code and link it with C. First, you compile the CUDA code, telling nvcc that the device code will be linked in a separate, later step.
Two nvcc flags are involved in device code linking.
The -dc flag tells nvcc to generate relocatable device code for later linking. You also have to specify a target architecture explicitly (for example -arch=sm_20), because not all SM code variants support device linking and nvcc must know that it is targeting a compatible one. On its own, -dc leaves the device code unlinked, so passing these objects straight to gcc in the final step will not work: either nvcc performs the final link, or you add an explicit device link step.
The -dlink flag tells nvcc to link the CUDA device code into a separate object file that the host (CPU) linker knows how to handle.
We will use the -dlink flag, because gcc performs the final link in our case.
Step | Command |
Compile CUDA object file | nvcc -O3 -dc kernal.cu -o target.o -gencode arch=compute_35,code=sm_35 |
Device Link | nvcc -O3 target.o -o dlink.o -gencode arch=compute_35,code=sm_35 -dlink |
Compile C code | gcc -std=c99 -c main.c -o main.o |
Final Link | gcc -std=c99 dlink.o main.o target.o -o stencil -lcudadevrt -lcudart |
Here’s an example Makefile containing all of the above steps.
SM := 35
CC := gcc
NVCC := nvcc
CFLAGS = -std=c99
NVCCFLAGS = -O3
GENCODE_FLAGS = -gencode arch=compute_$(SM),code=sm_$(SM)
LIB_FLAGS = -lcudadevrt -lcudart
BUILDDIR = build
TARGET = stencil
all: $(TARGET)

$(TARGET): $(BUILDDIR)/dlink.o $(BUILDDIR)/main.o $(BUILDDIR)/$(TARGET).o
	$(CC) $(CFLAGS) $^ -o $@ $(LIB_FLAGS)

# device link step: produces an object the host linker can consume
$(BUILDDIR)/dlink.o: $(BUILDDIR)/$(TARGET).o
	$(NVCC) $(NVCCFLAGS) $^ -o $@ $(GENCODE_FLAGS) -dlink

$(BUILDDIR)/main.o: main.c
	@mkdir -p $(BUILDDIR)
	$(CC) $(CFLAGS) -c $< -o $@

# -dc compiles to an object file with relocatable device code
$(BUILDDIR)/$(TARGET).o: kernal.cu
	@mkdir -p $(BUILDDIR)
	$(NVCC) $(NVCCFLAGS) -dc $< -o $@ $(GENCODE_FLAGS)

clean:
	rm -f $(BUILDDIR)/*.o $(TARGET)
That’s all you need to know! Hope this blog post enabled you to organize your project better!