
Wmma 3 versions

I am attempting to use the tensor cores efficiently in a custom DL inference kernel, but I am getting very poor performance. In the hope that someone here can help me understand what I am doing wrong, I will post a small repro case here.

The kernel simply multiplies two matrices, one from global memory and the other from shared memory, and stores the result in shared memory. (For simpler timing, no results are written back to global memory, but I have verified that the SASS code is correct and that nothing is optimized away.) The relevant parts of the repro:

    __global__ void test_wmma( __half* d_A, __half* d_B, __half* d_C )
    {
        // ...
        wmma::load_matrix_sync( a_frag, d_A, 16 );
        // ...
        wmma::mma_sync( acc_frag, a_frag, b_frag, acc_frag );
        // ...
        wmma::store_matrix_sync( d_C, acc_frag, 16, wmma::mem_row_major );
    }

Host-side setup:

    for ( auto& a : A ) a = float( rand() ) / RAND_MAX;
    for ( auto& b : B ) b = float( rand() ) / RAND_MAX;

    cudaMalloc( &d_A, 16 * 16 * sizeof( dtype ) );
    cudaMalloc( &d_B, 16 * 16 * sizeof( dtype ) );
    cudaMalloc( &d_C, 16 * 16 * sizeof( dtype ) );

    cudaMemcpy( d_A, A.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, B.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );
    cudaMemcpy( d_C, C.data(), 16 * 16 * sizeof( dtype ), cudaMemcpyHostToDevice );

I would have expected this to run very quickly, since the global memory access is always cached (which I have verified in Nsight: 99.98% L1 hit rate). However, when profiling the code in Nsight, the kernel takes 145 us, which is very approximately 2 TFLOPS if I did not mess up my calculations. Looking at the "Scheduler Statistics" in Nsight, I find that 6 warps are active, but only 0.45 are eligible and 0.18 are issued per cycle.
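For completeness, here is a minimal, self-contained sketch of how such a single-tile 16x16x16 WMMA kernel is usually put together. The fragment declarations, the __half accumulator, the fill_fragment call, the dtype alias, and the single-warp launch are my assumptions added so the fragments above compile; they are not taken from the original code.

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    using dtype = __half;   // assumption: 'dtype' in the cudaMalloc calls is __half

    // One warp computes a single 16x16x16 half-precision MMA on the tensor cores.
    __global__ void test_wmma( __half* d_A, __half* d_B, __half* d_C )
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, __half> acc_frag;

        wmma::fill_fragment( acc_frag, __float2half( 0.0f ) );   // start from C = 0
        wmma::load_matrix_sync( a_frag, d_A, 16 );                // load A tile, leading dimension 16
        wmma::load_matrix_sync( b_frag, d_B, 16 );                // load B tile, leading dimension 16
        wmma::mma_sync( acc_frag, a_frag, b_frag, acc_frag );     // acc = A * B + acc
        wmma::store_matrix_sync( d_C, acc_frag, 16, wmma::mem_row_major );
    }

    // Launch with exactly one warp (requires sm_70 or newer):
    // test_wmma<<<1, 32>>>( d_A, d_B, d_C );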

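As a sanity check on the throughput figure: a 16x16x16 mma_sync performs 2 * 16 * 16 * 16 = 8192 FLOP, so the achieved rate is (number of mma operations * 8192) / kernel time. The post does not state how many mma operations the full kernel issues, so the count below is purely illustrative:

    FLOP per 16x16x16 mma_sync :  2 * 16 * 16 * 16 = 8192
    achieved throughput        :  (#mma ops * 8192 FLOP) / kernel time
    e.g. 35,000 mma ops        :  35,000 * 8192 FLOP / 145 us  ≈  2.0 TFLOPS   (illustrative count)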

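Regarding the check that nothing is optimized away: one common way to confirm the tensor-core path survives compilation is to dump the SASS and look for HMMA instructions (the object file name here is hypothetical):

    cuobjdump -sass test_wmma.o | grep HMMA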