Profiling CUDA Using Nsight Systems: A Numba Example | by Carlos Costa, Ph.D. | May 2024



Learn profiling by inspecting concurrent and parallel Numba CUDA code in Nsight Systems

Optimization is a crucial part of writing high-performance code, no matter whether you are writing a web server or computational fluid dynamics simulation software. Profiling allows you to make informed decisions regarding your code. In a sense, optimization without profiling is like flying blind: mostly fine for seasoned professionals with expert knowledge and fine-tuned intuition, but a recipe for disaster for almost everyone else.

Photo by Rafa Sanfilippo on Unsplash

Following my initial series CUDA by Numba Examples (see parts 1, 2, 3, and 4), we will study a comparison between unoptimized, single-stream code and a slightly better version which uses stream concurrency and other optimizations. We will learn, from the ground up, how to use NVIDIA Nsight Systems to profile and analyze CUDA code. All the code in this tutorial can also be found in the repo cako/profiling-cuda-nsight-systems.

NVIDIA recommends as best practice to follow the APOD framework (Assess, Parallelize, Optimize, Deploy). There is a variety of proprietary, open-source, free, and commercial software for different types of assessments and profiling. Veteran Python users may be familiar with basic profilers such as cProfile, line_profiler, memory_profiler (unfortunately, unmaintained as of 2024) and more advanced tools like PyInstrument and Memray. These profilers target specific aspects of the "host" such as CPU and RAM usage.
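As a quick illustration, a host-side profile with the built-in cProfile module is a one-liner (a minimal sketch; your_script.py is just a placeholder for any Python entry point):

$ python -m cProfile -s cumtime your_script.py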

However, profiling "device" (e.g., GPU) code, and its interactions with the host, requires specialized tools provided by the device vendor. For NVIDIA GPUs, Nsight Systems, Nsight Compute, and Nsight Graphics are available for profiling different aspects of computation. In this tutorial we will focus on Nsight Systems, which is a system-wide profiler. We will use it to profile Python code which interacts with the GPU via Numba CUDA.

To get started, you will need the Nsight Systems CLI and GUI. The CLI can be installed separately and will be used to profile the code on a GPGPU-capable system. The full version includes both the CLI and the GUI. Note that both versions can be installed on a system without a GPU. Grab the version(s) you need from the NVIDIA website.
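Once installed, it is worth confirming that the CLI is on your path and that your system supports profiling; both of the checks below are standard nsys subcommands:

$ nsys --version
$ nsys status --environment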

To make it easier to visualize code sections in the GUI, NVIDIA also provides the pip- and conda-installable Python library nvtx, which we will use to annotate sections of our code. More on this later.
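If nvtx is not already in your environment, it can be installed with either package manager (the conda-forge channel is one common source; adjust to your setup):

$ pip install nvtx
# or, for conda users
$ conda install -c conda-forge nvtx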

In this section we will set up our development and profiling environment. Below are two very simple Python scripts: kernels.py and run_v1.py. The former will contain all CUDA kernels, and the latter will serve as the entry point to run the example. In this example we follow the "reduce" pattern introduced in the article CUDA by Numba Examples Part 3: Streams and Events to compute the sum of an array.

#%%writefile kernels.py
import numba
from numba import cuda

THREADS_PER_BLOCK = 256
BLOCKS_PER_GRID = 32 * 40


@cuda.jit
def partial_reduce(array, partial_reduction):
    # Grid-stride loop: each thread accumulates its share of the input
    i_start = cuda.grid(1)
    threads_per_grid = cuda.blockDim.x * cuda.gridDim.x
    s_thread = numba.float32(0.0)
    for i_arr in range(i_start, array.size, threads_per_grid):
        s_thread += array[i_arr]

    # Shared-memory tree reduction within each block
    s_block = cuda.shared.array((THREADS_PER_BLOCK,), numba.float32)
    tid = cuda.threadIdx.x
    s_block[tid] = s_thread
    cuda.syncthreads()

    i = cuda.blockDim.x // 2
    while i > 0:
        if tid < i:
            s_block[tid] += s_block[tid + i]
        cuda.syncthreads()
        i //= 2

    if tid == 0:
        partial_reduction[cuda.blockIdx.x] = s_block[0]


@cuda.jit
def single_thread_sum(partial_reduction, sum):
    # Sum the per-block partial results with a single thread
    sum[0] = numba.float32(0.0)
    for element in partial_reduction:
        sum[0] += element


@cuda.jit
def divide_by(array, val_array):
    # Element-wise division by a scalar stored on the device
    i_start = cuda.grid(1)
    threads_per_grid = cuda.gridsize(1)
    for i in range(i_start, array.size, threads_per_grid):
        array[i] /= val_array[0]

#%%writefile run_v1.py
import argparse
import warnings

import numpy as np
from numba import cuda
from numba.core.errors import NumbaPerformanceWarning

from kernels import (
    BLOCKS_PER_GRID,
    THREADS_PER_BLOCK,
    divide_by,
    partial_reduce,
    single_thread_sum,
)

# Ignore NumbaPerformanceWarning
warnings.simplefilter("ignore", category=NumbaPerformanceWarning)


def run(size):
    # Define host array
    a = np.ones(size, dtype=np.float32)
    print(f"Old sum: {a.sum():.3f}")

    # Array copy to device and array creation on the device.
    dev_a = cuda.to_device(a)
    dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
    dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)

    # Launching kernels to normalize array
    partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
    single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
    divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)

    # Array copy to host
    dev_a.copy_to_host(a)
    cuda.synchronize()
    print(f"New sum: {a.sum():.3f}")


def main():
    parser = argparse.ArgumentParser(description="Simple Example v1")
    parser.add_argument(
        "-n",
        "--array-size",
        type=int,
        default=100_000_000,
        metavar="N",
        help="Array size",
    )

    args = parser.parse_args()
    run(size=args.array_size)


if __name__ == "__main__":
    main()

This is a simple script that can be run with:

$ python run_v1.py
Old sum: 100000000.000
New sum: 1.000

We also run this code through our profiler, which just entails calling nsys with some options before the call to our script:

$ nsys profile \
 --trace cuda,osrt,nvtx \
 --gpu-metrics-device=all \
 --cuda-memory-usage true \
 --force-overwrite true \
 --output profile_run_v1 \
 python run_v1.py
GPU 0: General Metrics for NVIDIA TU10x (any frequency)
Old sum: 100000000.000
New sum: 1.000
Generating '/tmp/nsys-report-fb78.qdstrm'
[1/1] [========================100%] profile_run_v1.nsys-rep
Generated:
    /content/profile_run_v1.nsys-rep

You can consult the Nsight CLI docs for all the available options to the nsys CLI. For this tutorial we will always use the ones above. Let's dissect this command:

  • profile puts nsys in profile mode. There are many other modes, such as export and launch.
  • --trace cuda,osrt,nvtx ensures we "listen" to all CUDA calls (cuda), OS runtime library calls (osrt) and nvtx annotations (none in this example). There are many more trace options such as cublas, cudnn, mpi, dx11 and several others. Check the docs for all options.
  • --gpu-metrics-device=all records GPU metrics for all GPUs, including Tensor Core utilization.
  • --cuda-memory-usage tracks GPU memory usage of kernels. It may considerably slow down execution and requires --trace=cuda. We use it because our scripts are pretty fast anyway.

If the command exited successfully, we will have a profile_run_v1.nsys-rep in the current folder. We open this file by launching the Nsight Systems GUI, File > Open. The initial view is slightly confusing, so we will start by decluttering: resize the Events View pane to the bottom, and minimize CPU, GPU, and Processes under the Timeline View pane. Now expand only Processes > python > CUDA HW. See Figures 1a and 1b.

Figure 1a: Opening an nsys report and decluttering the interface. Credit: Own work. CC BY-SA 4.0.
Figure 1b: nsys report showing host-to-device memory operations (green), device-to-host memory operations (red) and CUDA kernels (blue). Credit: Own work. CC BY-SA 4.0.
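As an aside, the same report can also be summarized directly from the terminal with the nsys stats subcommand, which prints tables such as kernel and memory-operation summaries (the exact tables shown depend on your nsys version):

$ nsys stats profile_run_v1.nsys-rep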

First up, let's find our kernels. On the CUDA HW line, you will find green and red blobs, and very small slivers of light blue (see Figure 1b). If you hover over these, you will see tooltips saying "CUDA Memory operations in progress" for red and green, and "CUDA Kernel Running (89.7%)" for the light blues. These are going to be the bread and butter of our profiling. On this line, we can tell when and how memory is being transferred (red and green) and when and how our kernels are running (light blue).

Let's dig in a little bit more on our kernels. You should see three very small blue slivers, each representing a kernel call. We will zoom into the region by clicking and dragging the mouse from just before the start of the first kernel call to just after the end of the last one, and then pressing Shift + Z. See Figure 2.

Figure 2: Navigating an nsys report and zooming into an area of interest. Credit: Own work. CC BY-SA 4.0.

Now that we have found our kernels, let's see some metrics. We open the GPU > GPU Metrics tabs for that. In this panel, we can find "Warp Occupancy" (beige) for compute kernels. One way to optimize CUDA code is to ensure that the warp occupancy is as close to 100% as possible for as long as possible, meaning that our GPU is not idling. We notice that this is happening for the first and last kernels but not the middle kernel. That is expected, since the middle kernel launches a single thread. One final thing to note in this section is the GPU > GPU Metrics > SMs Active > Tensor Active / FP16 Active line. This line will show whether the tensor cores are being used. In this case you should verify that they are not.

Now let's briefly look at the Events View. Right click Processes > python > CUDA HW and click "Show in Events View". Then sort the events by descending duration. In Figure 3, we see that the slowest events are two pageable memory transfers. We have seen in CUDA by Numba Examples Part 3: Streams and Events that pageable memory transfers can be suboptimal, and we should prefer page-locked or "pinned" memory transfers. If we have slow memory transfers due to the use of pageable memory, the Events View can be a great place to identify where these slow transfers occur. Pro tip: you can isolate memory transfers by right clicking Processes > python > CUDA HW > XX% Memory instead.

Figure 3. Events View in Nsight Systems showing a pageable (non-pinned) memory transfer. Credit: Own work. CC BY-SA 4.0.

In this section we learned how to profile a Python program which uses CUDA, and how to visualize basic information about this program in the Nsight Systems GUI. We also noticed that in this simple program we are using pageable instead of pinned memory, that one of our kernels is not occupying all warps, that the GPU is idle for quite some time between kernels being run, and that we are not using tensor cores.

In this section we will learn how to improve our profiling experience by annotating sections in Nsight Systems with NVTX. NVTX allows us to mark different regions of the code. It can mark ranges and instantaneous events. For a deeper look, check the docs. Below we create run_v2.py, which, in addition to annotating run_v1.py, also changes this line:

a = np.ones(size, dtype=np.float32)

to these:

a = cuda.pinned_array(size, dtype=np.float32)
a[...] = 1.0

Therefore, in addition to the annotations, we are now using pinned memory. If you want to learn more about the different types of memory that CUDA supports, see the CUDA C++ Programming Guide. It is worth noting that this is not the only way to pin an array in Numba. A previously created Numpy array can also be pinned with a context manager, as explained in the Numba documentation.
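For reference, a minimal sketch of that context-manager approach looks like the following; the array is only page-locked while the with block is active:

import numpy as np
from numba import cuda

a = np.ones(100_000_000, dtype=np.float32)  # regular, pageable NumPy array
with cuda.pinned(a):                        # temporarily page-lock `a`
    dev_a = cuda.to_device(a)               # this copy can use the faster pinned path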

#%%writefile run_v2.py
import argparse
import warnings

import numpy as np
import nvtx
from numba import cuda
from numba.core.errors import NumbaPerformanceWarning

from kernels import (
    BLOCKS_PER_GRID,
    THREADS_PER_BLOCK,
    divide_by,
    partial_reduce,
    single_thread_sum,
)

# Ignore NumbaPerformanceWarning
warnings.simplefilter("ignore", category=NumbaPerformanceWarning)


def run(size):
    with nvtx.annotate("Compilation", color="red"):
        dev_a = cuda.device_array((BLOCKS_PER_GRID,), dtype=np.float32)
        dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
        dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)
        partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
        single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
        divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)

    # Define host array
    a = cuda.pinned_array(size, dtype=np.float32)
    a[...] = 1.0
    print(f"Old sum: {a.sum():.3f}")

    # Array copy to device and array creation on the device.
    with nvtx.annotate("H2D Memory", color="yellow"):
        dev_a = cuda.to_device(a)
        dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
        dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)

    # Launching kernels to normalize array
    with nvtx.annotate("Kernels", color="green"):
        partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
        single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
        divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)

    # Array copy to host
    with nvtx.annotate("D2H Memory", color="orange"):
        dev_a.copy_to_host(a)

    cuda.synchronize()
    print(f"New sum: {a.sum():.3f}")


def main():
    parser = argparse.ArgumentParser(description="Simple Example v2")
    parser.add_argument(
        "-n",
        "--array-size",
        type=int,
        default=100_000_000,
        metavar="N",
        help="Array size",
    )

    args = parser.parse_args()
    run(size=args.array_size)


if __name__ == "__main__":
    main()

Comparing the two files, you can see that it is as simple as wrapping some GPU kernel calls with

with nvtx.annotate("Area Title", shade="purple"):
...

Pro tip: you can also annotate functions by placing the @nvtx.annotate decorator above their definition, automatically annotate everything by calling your script with python -m nvtx run_v2.py, or apply the auto-annotator selectively in your code by enabling or disabling nvtx.Profile(). See the docs!
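For instance, the decorator form looks like the sketch below (normalize is a hypothetical function name used only for illustration):

import nvtx

@nvtx.annotate("normalize", color="blue")
def normalize(dev_array, dev_sum):
    # every call to normalize() now shows up as an NVTX range in the timeline
    ...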

Let's run this new script and open the results in Nsight Systems.

$ nsys profile \
 --trace cuda,osrt,nvtx \
 --gpu-metrics-device=all \
 --cuda-memory-usage true \
 --force-overwrite true \
 --output profile_run_v2 \
 python run_v2.py
GPU 0: General Metrics for NVIDIA TU10x (any frequency)
Old sum: 100000000.000
New sum: 1.000
Generating '/tmp/nsys-report-69ab.qdstrm'
[1/1] [========================100%] profile_run_v2.nsys-rep
Generated:
    /content/profile_run_v2.nsys-rep

Again, we start by minimizing everything, leaving only Processes > python > CUDA HW open. See Figure 4. Notice that we now have a new line, NVTX. On this line in the timeline window we should see differently colored blocks corresponding to the annotation regions that we created in the code. These are Compilation, H2D Memory, Kernels and D2H Memory. Some of these may be too small to read, but will be legible if you zoom into the region.

Figure 4. Example of NVTX annotations and an Events View with pinned memory. Credit: Own work. CC BY-SA 4.0.

The profiler confirms that this memory is pinned, ensuring that our code is really using pinned memory. In addition, H2D Memory and D2H Memory are now taking less than half of the time they were taking before. Generally, we can expect better performance using pinned memory or prefetched mapped arrays (not supported by Numba).

Now we will investigate whether we can improve this code by introducing streams. The idea is that while memory transfers are occurring, the GPU can start processing the data. This allows a level of concurrency, which hopefully ensures that we are occupying our warps as fully as possible.

Figure 5. Using different streams may allow for concurrent execution. Credit: Zhang et al. 2021 (CC BY 4.0).

In the code below we will split the processing of our array into roughly equal parts. Each part will run in a separate stream, including transferring the data and computing the sum of the array. Then, we synchronize all streams and sum their partial sums. At this point we can launch the normalization kernels for each stream independently.

We want to answer a few questions:

  1. Will the code below really create concurrency? Might we be introducing a bug?
  2. Is it faster than the code which uses a single stream?
  3. Is the warp occupancy better?
#%%writefile run_v3_bug.py
import argparse
import warnings
from math import ceil

import numpy as np
import nvtx
from numba import cuda
from numba.core.errors import NumbaPerformanceWarning

from kernels import (
    BLOCKS_PER_GRID,
    THREADS_PER_BLOCK,
    divide_by,
    partial_reduce,
    single_thread_sum,
)

# Ignore NumbaPerformanceWarning
warnings.simplefilter("ignore", category=NumbaPerformanceWarning)


def run(size, nstreams):
    with nvtx.annotate("Compilation", color="red"):
        dev_a = cuda.device_array((BLOCKS_PER_GRID,), dtype=np.float32)
        dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
        dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)
        partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
        single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
        divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)

    # Define host array
    a = cuda.pinned_array(size, dtype=np.float32)
    a[...] = 1.0

    # Define regions for streams
    step = ceil(size / nstreams)
    starts = [i * step for i in range(nstreams)]
    ends = [min(s + step, size) for s in starts]
    print(f"Old sum: {a.sum():.3f}")

    # Create streams
    streams = [cuda.stream()] * nstreams

    cpu_sums = [cuda.pinned_array(1, dtype=np.float32) for _ in range(nstreams)]
    devs_a = []
    with cuda.defer_cleanup():
        for i, (stream, start, end) in enumerate(zip(streams, starts, ends)):
            cpu_sums[i][...] = np.nan

            # Array copy to device and array creation on the device.
            with nvtx.annotate(f"H2D Memory Stream {i}", color="yellow"):
                dev_a = cuda.to_device(a[start:end], stream=stream)
                dev_a_reduce = cuda.device_array(
                    (BLOCKS_PER_GRID,), dtype=dev_a.dtype, stream=stream
                )
                dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype, stream=stream)
            devs_a.append(dev_a)

            # Launching kernels to sum array
            with nvtx.annotate(f"Sum Kernels Stream {i}", color="green"):
                for _ in range(50):  # Make it spend more time in compute
                    partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK, stream](
                        dev_a, dev_a_reduce
                    )
                single_thread_sum[1, 1, stream](dev_a_reduce, dev_a_sum)
            with nvtx.annotate(f"D2H Memory Stream {i}", color="orange"):
                dev_a_sum.copy_to_host(cpu_sums[i], stream=stream)

        # Ensure all streams are caught up
        cuda.synchronize()

        # Aggregate all 1D arrays into a single 1D array
        a_sum_all = sum(cpu_sums)

        # Send it to the GPU
        with cuda.pinned(a_sum_all):
            with nvtx.annotate("D2H Memory Default Stream", color="orange"):
                dev_a_sum_all = cuda.to_device(a_sum_all)

        # Normalize via streams
        for i, (stream, start, end, dev_a) in enumerate(
            zip(streams, starts, ends, devs_a)
        ):
            with nvtx.annotate(f"Divide Kernel Stream {i}", color="green"):
                divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK, stream](
                    dev_a, dev_a_sum_all
                )

            # Array copy to host
            with nvtx.annotate(f"D2H Memory Stream {i}", color="orange"):
                dev_a.copy_to_host(a[start:end], stream=stream)

        cuda.synchronize()
    print(f"New sum: {a.sum():.3f}")


def main():
    parser = argparse.ArgumentParser(description="Simple Example v3")
    parser.add_argument(
        "-n",
        "--array-size",
        type=int,
        default=100_000_000,
        metavar="N",
        help="Array size",
    )
    parser.add_argument(
        "-s",
        "--streams",
        type=int,
        default=4,
        metavar="N",
        help="Number of streams",
    )

    args = parser.parse_args()
    run(size=args.array_size, nstreams=args.streams)


if __name__ == "__main__":
    main()

Let's run the code and collect the results.

$ nsys profile \
 --trace cuda,osrt,nvtx \
 --gpu-metrics-device=all \
 --cuda-memory-usage true \
 --force-overwrite true \
 --output profile_run_v3_bug_4streams \
 python run_v3_bug.py -s 4
GPU 0: General Metrics for NVIDIA TU10x (any frequency)
Old sum: 100000000.000
New sum: 1.000
Generating '/tmp/nsys-report-a666.qdstrm'
[1/1] [========================100%] profile_run_v3_bug_4streams.nsys-rep
Generated:
    /content/profile_run_v3_bug_4streams.nsys-rep

The program ran and yielded the correct answer. But when we open the profiling file (see Figure 6), we find that there are two streams instead of four! And one is basically completely idle! What is going on here?

Figure 6. Example of buggy multi-stream code. Credit: Own work. CC BY-SA 4.0.

There is a bug in the creation of the streams. By doing

streams = [cuda.stream()] * nstreams

we are actually creating a single stream and repeating it nstreams times. So why are we seeing two streams instead of one? The fact that one is not doing much computation should be an indicator that there is a stream that we are not using. This stream is the default stream, which we are not using at all in our code, since all GPU interactions are given a stream, the stream we created.

We can fix this bug with:

streams = [cuda.stream() for _ in range(nstreams)]
# Ensure they are all different
assert all(s1.handle != s2.handle for s1, s2 in zip(streams[:-1], streams[1:]))

The code above will also ensure that they are really different streams, so it would have caught the bug had we had it in the code. It does so by checking the stream pointer value.

Now we can run the fixed code with 1 stream and 8 streams for comparison. See Figures 7 and 8, respectively.

$ nsys profile \
 --trace cuda,osrt,nvtx \
 --gpu-metrics-device=all \
 --cuda-memory-usage true \
 --force-overwrite true \
 --output profile_run_v3_1stream \
 python run_v3.py -s 1
GPU 0: General Metrics for NVIDIA TU10x (any frequency)
Old sum: 100000000.000
New sum: 1.000
Generating '/tmp/nsys-report-de65.qdstrm'
[1/1] [========================100%] profile_run_v3_1stream.nsys-rep
Generated:
    /content/profile_run_v3_1stream.nsys-rep

$ nsys profile \
 --trace cuda,osrt,nvtx \
 --gpu-metrics-device=all \
 --cuda-memory-usage true \
 --force-overwrite true \
 --output profile_run_v3_8streams \
 python run_v3.py -s 8
GPU 0: General Metrics for NVIDIA TU10x (any frequency)
Old sum: 100000000.000
New sum: 1.000
Generating '/tmp/nsys-report-1fb7.qdstrm'
[1/1] [========================100%] profile_run_v3_8streams.nsys-rep
Generated:
    /content/profile_run_v3_8streams.nsys-rep
Figure 7. Example of single-stream code. Credit: Own work. CC BY-SA 4.0.
Figure 8. Example of correct multi-stream code. Credit: Own work. CC BY-SA 4.0.

Again, both give correct results. By opening the one with 8 streams we see that yes, the bug has been fixed (Figure 8). Indeed, we now see 9 streams (8 created + default). In addition, we see that they are running at the same time! So we have achieved concurrency!

Unfortunately, if we dig a bit deeper we find that the concurrent code is not necessarily faster. On my machine the critical section of both versions, from the start of the memory transfer to the last GPU-to-CPU copy, takes around 160 ms.

A probable culprit is the warp occupancy. We notice that the warp occupancy is significantly better in the single-stream version. The gains in compute that we are getting in this example are likely being lost by not occupying our GPU as efficiently. This is probably related to the structure of the code, which (artificially) calls way too many kernels. In addition, if all threads are filled by a single stream, there is no gain in concurrency, since the other streams have to sit idle until resources free up.

This example is important because it shows that our preconceived notions of performance are just hypotheses. They need to be verified.

At this point of APOD, we have assessed, parallelized (both through threads and concurrency), and so the next step would be to deploy. We also noticed a slight performance regression with concurrency, so for this example a single-stream version would likely be the one deployed. In production, the next step would be to move on to the next piece of code best suited for parallelization and restart APOD.

In this article we saw how to set up, use, and interpret results from profiling Python code in NVIDIA Nsight Systems. C and C++ code can be analyzed very similarly, and indeed most of the material out there uses C and C++ examples.

We also showed how profiling can allow us to catch bugs and performance-test our programs, ensuring that the features we introduce really do improve performance, and if they do not, why.
