Mastering GPU Parallel Programming with CUDA: (HW & SW) [Udemy] [Hamdy egy]

Bot · 10 Apr 2026

Performance Optimization and Analysis for High-Performance Computing
This hands-on course teaches you how to unlock the huge parallel-processing power of modern GPUs with CUDA. You’ll start with the fundamentals of GPU hardware, trace the evolution of flagship architectures (Fermi → Pascal → Volta → Ampere → Hopper), and learn—through code-along labs—how to write, profile, and optimize high-performance kernels.
This is an independent training resource. It is not sponsored by, endorsed by, or otherwise affiliated with NVIDIA Corporation. “CUDA”, “Nsight”, and the architecture codenames are trademarks of NVIDIA and are used here only as factual references.

What you'll learn

Comprehensive Understanding of GPU vs CPU Architecture
Learn the history of the graphics processing unit (GPU) up to the most recent products
Understand the internal structure of the GPU
Understand the different types of memory and how they affect performance
Understand the most recent technologies in the GPU's internal components
Understand the basics of CUDA programming on the GPU
Start programming the GPU with CUDA on both Windows and Linux
Understand the most efficient ways to parallelize code
Profiling and Performance Tuning
Leveraging Shared Memory

What you’ll master

GPU vs. CPU fundamentals – why GPUs dominate data-parallel workloads.
Generational design advances – the hardware features that matter most for performance.
CUDA toolkit installation – Windows, Linux, and WSL, plus first-run sanity checks.
Core CUDA concepts – threads, blocks, grids, and the memory hierarchy, built up with labs such as vector addition.
Profiling & tuning with Nsight Compute / nvprof – measure occupancy, hide latency, and break bottlenecks.
2-D indexing for matrices – write efficient kernels for real-world linear-algebra tasks.
Optimization playbook – handle non-power-of-two data, leverage shared memory, maximize bandwidth, and minimize warp divergence.
Robust debugging & error handling – use runtime-API checks to ship production-ready code.

By the end, you’ll be able to design, analyze, and fine-tune CUDA kernels that run efficiently on today’s GPUs—equipping you to tackle demanding scientific, engineering, and AI workloads.
Who this course is for:

Anyone interested in GPUs and CUDA, such as engineering students, researchers, and others

Requirements

C and C++ basics
Linux and Windows basics
Computer architecture basics

Course content
12 sections • 58 lectures • 23h 3m total length

Introduction to NVIDIA GPU hardware
12 lectures • 2hr 52min

GPU vs CPU (very important)
NVIDIA's history (how NVIDIA started dominating the GPU sector)
Architectures and Generations relationship [Hopper, Ampere, GeForce and Tesla]
How to know the Architecture and Generation
The difference between the GPU and the GPU Chip
The architectures and the corresponding chips
NVIDIA GPU architectures from Fermi to Hopper
Parameters required to compare different architectures
Please don't skip this video. It is pivotal for the whole course.
Half, single and double precision operations
Compute capability and utilization of the GPUs
Before reading any whitepapers, look at this!
Volta+Ampere+Pascal+SIMD (Don't skip)

Installing CUDA and other programs
4 lectures • 22min

What features are installed with the CUDA toolkit?
Installing CUDA on Windows
Installing WSL to use Linux on Windows
Installing the CUDA toolkit on Linux

Introduction to CUDA programming
8 lectures • 1hr 52min

The course GitHub repo
Mapping SW from CUDA to HW + introducing CUDA.
001 Hello World program (threads - Blocks)
Compiling CUDA on Linux
002 Hello World program (Warp_IDs)
003: Vector addition + the steps for any CUDA project
004: Vector addition + block and thread indexing + GPU performance
005: Levels of parallelization - vector addition with extra-large vectors

Profiling
9 lectures • 4hr 18min

Query the device properties using the Runtime APIs
nvidia-smi and its configurations (Linux users)
The GPU's Occupancy and Latency hiding
Allocated active blocks per SM (important)
How many blocks can we run concurrently per SM?
Starting with Nsight Compute (first issue)
All profiling tools from NVIDIA (Nsight Systems, Nsight Compute, nvprof ...)
Error checking APIs
Nsight Compute performance using command line analysis
Graphical Nsight Compute (Windows and Linux)

Performance analysis for the previous applications
2 lectures • 45min

Performance analysis
Vector addition with a size that is not a power of 2 (important)

2D Indexing
2 lectures • 1hr 16min

Matrix addition using 2D blocks and threads
Why is the L1 hit rate zero?

Shared Memory + Warp Divergence
2 lectures • 50min

The shared memory
Quiz 1
Warp Divergence

Debugging tools
1 lecture • 40min

Debugging using Visual Studio (important)

Vector Reduction
7 lectures • 4hr 30min

Vector Reduction using global memory only (baseline)
Understanding the code and the profiling of the vector reduction
Optimizing the vector reduction (removing the filter)
The Race Condition and the debugging option
Optimizing thread utilization in vector reduction
Optimization using shared memory and unrolling
Shuffle operations optimizations

Roofline model
1 lecture • 43min

Roofline Analysis (compute-bound and memory-bound apps)

About the author:
Hamdy egy is a Research Assistant and a Ph.D. student. He graduated from the Computer and System Engineering Department in 2012 and was ranked second in his class. After graduation, he worked as a teaching assistant in the same department for about 10 years. He also worked as an embedded systems instructor for 5 years.
