Friday, 3 July 2026

Real Life CUDA Programming - Part 1: A gentle introduction to the GPU

 

  • Introduction (why care about the hardware)
  • Your new best friends (common tools)
  • GPU deep dive (useful terms)
Source: http://furryball.aaa-studio.eu/images/multiplecores.jpg

Introduction

If you’re also interested in the motivation behind writing all this, check out part 0.

Unlike most programming problems, coding for the GPU requires some understanding of the basic terms and concepts of the hardware itself. This is necessary for two main reasons:

  1. If you’re coding in CUDA, you’re looking for speed. While the GPU, generally speaking, will give you a lot of speed, you must know how to utilize the device properly to get it.
  2. Many of the effects and restrictions you see when programming in CUDA are caused by the hardware underneath, and understanding that hardware will also improve your ability to program successfully.

Beyond those two reasons, programming in CUDA, as in any craft, requires knowing the tools available to you.

Your new best friends

Let me start from the end and introduce you to two of the most powerful tools you will use for GPU programming.

NVCC

NVCC (NVIDIA CUDA Compiler) is the compiler for any CUDA program. It behaves like any other C/C++ compiler, and it also handles compiling and linking the regular, non-CUDA code in your project (behind the scenes it passes that code to your usual host compiler). What makes NVCC special is that it can compile CUDA functions and symbols, which a regular compiler can’t. Other than that, you can generally swap your current compiler for NVCC and everything should work as before.

This also means that you can use NVCC in any build system you used before, for example, a makefile.
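As a rough illustration of what NVCC adds on top of a regular compiler, here is a minimal sketch of a single source file that mixes ordinary host code with one device function. The file name hello.cu, the kernel fillKernel, and the helper fillOnGpu are made up for this example, and the build command in the comments assumes nvcc is on your PATH.

```cuda
// hello.cu (hypothetical file name)
// Compile to an object file, assuming nvcc is on your PATH:
//   nvcc -c hello.cu -o hello.o
#include <cuda_runtime.h>

// Device code: only NVCC understands __global__ and the <<<...>>> launch syntax.
__global__ void fillKernel(int* out, int value) {
    out[threadIdx.x] = value;
}

// Host code: plain C++ that NVCC passes on to the host compiler.
// Fills `count` ints (1 <= count <= 1024) with `value` on the GPU and copies
// them back into hostOut. Returns true on success.
bool fillOnGpu(int* hostOut, int count, int value) {
    int* devOut = nullptr;
    if (cudaMalloc((void**)&devOut, count * sizeof(int)) != cudaSuccess) return false;

    fillKernel<<<1, count>>>(devOut, value);   // one block of `count` threads
    cudaError_t err = cudaGetLastError();      // did the launch itself fail?
    if (err == cudaSuccess)
        err = cudaMemcpy(hostOut, devOut, count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(devOut);
    return err == cudaSuccess;
}
```

A regular C++ compiler would reject both the __global__ qualifier and the triple-angle-bracket launch, which is the part of the job that only NVCC can do.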

NVProf

NVProf (NVIDIA Profiler) is the second most common tool you’ll find. While profiling (measuring the time each part of your program takes) is commonplace in all kinds of programming, it is especially important on the GPU. The reason is simple: as I said before (and as I’ll probably say again), we have a need for speed. This drives us to always improve on our time, and the profiler is a tremendous help in that regard. The downside of this tool is that it focuses on what runs on the GPU (“device code”, plus the CUDA calls that drive it) and by default tells you almost nothing about the rest of the CPU code (“host code”).
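To make this concrete, here is a small sketch of a program worth profiling: two hypothetical kernels, one of which does far more work per element than the other. The names cheapKernel, heavyKernel, runBoth and the file name profile_me.cu are assumptions for illustration only. The comments show how such a binary would typically be run under nvprof. Note that NVIDIA has since deprecated nvprof in favor of the Nsight tools, so it may not be available on recent GPUs and toolkits.

```cuda
// profile_me.cu (hypothetical file name)
// Build and profile, assuming a separate main.cu whose main() calls runBoth:
//   nvcc profile_me.cu main.cu -o profile_me
//   nvprof ./profile_me
#include <cuda_runtime.h>

// Cheap kernel: one addition per element.
__global__ void cheapKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Heavy kernel: a loop of multiply-adds per element, so for large `iterations`
// it should take most of the GPU time in the profile.
__global__ void heavyKernel(float* data, int n, int iterations) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iterations; ++k) x = x * 0.5f + 1.0f;
        data[i] = x;
    }
}

// Runs both kernels on n zero-initialised floats (n > 0) and returns the first
// result element, or -1.0f if a CUDA call fails.
float runBoth(int n, int iterations) {
    float* dev = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess) return -1.0f;
    cudaMemset(dev, 0, bytes);                 // all-zero bytes represent 0.0f

    int block = 256;
    int grid = (n + block - 1) / block;
    cheapKernel<<<grid, block>>>(dev, n);
    heavyKernel<<<grid, block>>>(dev, n, iterations);

    float first = -1.0f;
    cudaError_t err = cudaMemcpy(&first, dev, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess ? first : -1.0f;
}
```

In the profiler report you would look for one line per kernel name and compare their times; that comparison tells you which kernel deserves optimization effort first.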

GPU Deep Dive

So, now we must face the huge wall that is the GPU architecture.

Source: https://pascalprecht.github.io/posts/why-i-use-vim/

(Full disclosure, that learning curve is the Vim learning curve, but I think it applies just the same).


The architecture and capabilities of the GPU are difficult to understand at first, but once you get the hang of them, the rest of the way is extremely easy. There are three main ideas that need to be understood:

  1. GPU Cores
  2. GPU Threads & thread blocks
  3. GPU Memory

Once you are confident in those three ideas, you’ll be ready to program!

Let’s tackle them one at a time.

GPU cores

First, the architecture itself.

A GPU, unlike a CPU, is designed for massively parallel computing. As a result, a GPU is built from hundreds of cores, which together can handle thousands of threads running in parallel.

It is important to remember that on the CPU the number of parallel operations is limited by the number of cores (which is usually no more than 8). The GPU does not restrict you in the same way: you are able, and even encouraged, to schedule thousands of parallel threads, so as to keep the GPU cores busy and maximize their usage.

This view is important because, as we will see, we will schedule many more threads than there are cores.
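If you are curious what your own GPU offers, the CUDA runtime can report it. The following is a minimal sketch, assuming device 0 is the GPU you care about; the function name maxResidentThreads is invented for this example. Note that the runtime reports streaming multiprocessors (SMs), each of which contains many of the cores discussed above, rather than a single core count.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints a few hardware limits of GPU `device` and returns how many threads
// can be resident on the whole device at once, or -1 if the query fails.
int maxResidentThreads(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    printf("%s: %d multiprocessors, warp size %d, up to %d threads per block\n",
           prop.name, prop.multiProcessorCount, prop.warpSize,
           prop.maxThreadsPerBlock);

    // Each multiprocessor can keep this many threads in flight at the same time.
    return prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
}
```

The returned number is typically in the tens of thousands, which gives a feel for why launching thousands of threads is normal on a GPU.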

GPU Threads

As I said, the GPU is capable of running thousands of threads in parallel, but in order to launch that many threads, we must group them into smaller units called thread blocks. All blocks in a launch have the same fixed size, which should be a multiple of the warp size (32 on current NVIDIA GPUs) and cannot exceed the hardware limit (1024 threads per block on current GPUs). We can then launch as many blocks as we need in what is called a grid. These two terms will appear every time you want to run something on the GPU, as you must specify both the block size (which is usually static and hardcoded) and the grid size (which is usually determined at runtime so that there is at least one thread for each element of the computation).

For example, when adding together two arrays (an example we’ll analyze in depth in the next post), we run a separate thread for each element in the array.
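The block size and grid size arithmetic can be sketched in a few lines. This is only an illustration under assumptions: the names THREADS_PER_BLOCK, blocksFor, scaleKernel and scaleOnDevice are hypothetical, and 256 is just one common choice of block size.

```cuda
#include <cuda_runtime.h>

// Hardcoded block size, a multiple of the warp size (32).
const int THREADS_PER_BLOCK = 256;

// Number of blocks needed so that blocks * blockSize >= n (integer round-up).
int blocksFor(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Each thread computes its global index and handles at most one element.
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // threads past the end of the array do nothing
}

// Launches one thread per element of the device array devData of length n.
void scaleOnDevice(float* devData, float factor, int n) {
    int gridSize = blocksFor(n, THREADS_PER_BLOCK);   // decided at runtime from n
    scaleKernel<<<gridSize, THREADS_PER_BLOCK>>>(devData, factor, n);
}
```

With n = 1000 and a block size of 256, blocksFor returns 4, so 1024 threads are launched and the last 24 fail the i < n check and simply do nothing. That bounds check is what makes it safe to round the grid size up.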

GPU Memory

Source: https://devblogs.nvidia.com/unified-memory-in-cuda-6/

The last element that must be understood before we can start the actual coding is the memory that can be accessed by the GPU.

Generally speaking, there are three types of memory that are in play when running GPU code.

  1. Regular CPU memory (“host memory”). This is the memory you get when allocating in standard code (i.e. when using the new keyword or the malloc function). This memory is only accessible by the CPU and not the GPU.
  2. GPU-only memory (“device memory”). This memory exists on the GPU and is accessible by the GPU and not the CPU.
  3. Unified Memory. This memory is accessible by both the CPU and the GPU. In practice, unified memory is automatically managed by CUDA and transferred as needed between the CPU and the GPU. Using unified memory simplifies the programming process and is discussed in a later post.

The immediate implication of having two separate memories is that data must be copied between them: the CPU has to copy the inputs to the device before the GPU can work on them, and copy the results back afterwards (because, as we said, the GPU has no access to the memory on the host). These copies are often a bottleneck in the program.
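To show how the memory types differ in code, here is a hedged sketch that doubles an array twice: once with explicit copies between host and device memory, and once with unified memory. The function and kernel names are invented for this example, and error handling is kept to a minimum.

```cuda
#include <cuda_runtime.h>

__global__ void doubleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Host + device memory: the data is copied to the GPU and back explicitly.
// Doubles hostData[0..n) in place (n > 0); returns true on success.
bool doubleWithExplicitCopies(float* hostData, int n) {
    size_t bytes = n * sizeof(float);
    float* devData = nullptr;
    if (cudaMalloc((void**)&devData, bytes) != cudaSuccess) return false;

    cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);   // host -> device
    doubleKernel<<<(n + 255) / 256, 256>>>(devData, n);
    cudaError_t err =
        cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(devData);
    return err == cudaSuccess;
}

// Unified memory: one pointer is valid on both sides and CUDA moves the data.
// Fills n floats (n > 0) with `value`, doubles them on the GPU, and returns
// the first element, or -1.0f on failure.
float doubleWithUnifiedMemory(int n, float value) {
    float* data = nullptr;
    if (cudaMallocManaged((void**)&data, n * sizeof(float)) != cudaSuccess) return -1.0f;

    for (int i = 0; i < n; ++i) data[i] = value;        // written by the CPU
    doubleKernel<<<(n + 255) / 256, 256>>>(data, n);    // updated by the GPU
    cudaError_t err = cudaDeviceSynchronize();          // wait before the CPU reads again

    float first = (err == cudaSuccess) ? data[0] : -1.0f;
    cudaFree(data);
    return first;
}
```

The second version has no cudaMemcpy calls at all, which is the simplification unified memory buys you. The data still has to travel between the two memories, only now CUDA decides when.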

Summary

So what have we learned today?
We looked at the common tools that are used with every CUDA project (NVCC, NVProf).

We’ve also looked at the basic terms surrounding GPU and CUDA programming and saw some of their implications on how we should write code for the GPU.

Next time — our first CUDA program!


 

My latest research paper from klu@klef

 


CUDA (Compute Unified Device Architecture) is a proprietary parallel computing platform and programming model created by Nvidia. It allows software developers to use a compatible graphics processing unit (GPU) for general-purpose processing, vastly accelerating compute-intensive tasks like artificial intelligence, scientific simulations, and video rendering.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Matrix dimensions (N x N) and threads per block along each axis
#define N 1024
#define BLOCK_SIZE 16

// Abort with a readable message if a CUDA runtime call fails
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// CUDA kernel for element-wise matrix addition
__global__ void matrixAddKernel(const float* A, const float* B, float* C, int n) {
    // Calculate global row and column index for the thread
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Boundary check to prevent accessing out-of-bounds memory
    if (row < n && col < n) {
        // Map 2D coordinate to a flattened 1D index
        int index = row * n + col;
        C[index] = A[index] + B[index];
    }
}

int main() {
    int numElements = N * N;
    size_t size = numElements * sizeof(float);

    // 1. Allocate memory on the Host (CPU)
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);
    if (h_A == NULL || h_B == NULL || h_C == NULL) {
        fprintf(stderr, "Host memory allocation failed.\n");
        return EXIT_FAILURE;
    }

    // 2. Initialize host matrices with known data
    for (int i = 0; i < numElements; i++) {
        h_A[i] = 1.0f; // Fill A with 1.0
        h_B[i] = 2.0f; // Fill B with 2.0
    }

    // 3. Allocate memory on the Device (GPU)
    float *d_A = NULL;
    float *d_B = NULL;
    float *d_C = NULL;
    CUDA_CHECK(cudaMalloc((void**)&d_A, size));
    CUDA_CHECK(cudaMalloc((void**)&d_B, size));
    CUDA_CHECK(cudaMalloc((void**)&d_C, size));

    // 4. Copy data from Host to Device memory
    CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice));

    // 5. Configure Thread Blocks and Grid Dimensions
    // dim3 values define 2D shapes for blocks and grids; the grid size is
    // rounded up so that every matrix element gets a thread
    dim3 threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 numBlocks((N + BLOCK_SIZE - 1) / BLOCK_SIZE, (N + BLOCK_SIZE - 1) / BLOCK_SIZE);

    // 6. Launch the CUDA Kernel on the GPU
    matrixAddKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
    CUDA_CHECK(cudaGetLastError()); // Catches an invalid launch configuration

    // Wait for the GPU to finish before accessing results on CPU;
    // this also reports errors raised while the kernel was running
    CUDA_CHECK(cudaDeviceSynchronize());

    // 7. Copy the final result from Device back to Host memory
    CUDA_CHECK(cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost));

    // 8. Verify the result (every element must equal 1.0 + 2.0)
    int success = 1;
    for (int i = 0; i < numElements; i++) {
        if (h_C[i] != 3.0f) {
            success = 0;
            break;
        }
    }

    if (success) {
        printf("Success! Matrix addition completed correctly on the GPU.\n");
        printf("Sample Element C[0]: %f (Expected: 3.000000)\n", h_C[0]);
    } else {
        printf("Error! Matrix addition validation failed.\n");
    }

    // 9. Free Device and Host memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}


Console output:

Running in FUNCTIONAL mode...
Compiling...
Executing...
Success! Matrix addition completed correctly on the GPU.
Sample Element C[0]: 3.000000 (Expected: 3.000000)
Exit status: 0