Replaced globalized variable with X bytes of shared memory. [OMP111]
This optimization occurs when a globalized variable's data is shared between multiple threads but requires a constant amount of memory that can be determined at compile time. This is the case when a single thread creates the memory, which is then shared with every other thread. The memory can then be placed in a static buffer of shared memory on the device. This optimization allows users to declare shared memory on the device without using OpenMP's custom allocators.
Globalization occurs when a pointer to a thread-local variable escapes the
current scope. If a single thread is known to be responsible for creating and
sharing the data, it can instead be mapped directly to the device's shared
memory. Checking whether only a single thread can execute an instruction requires
that the parent functions have internal linkage. Otherwise, an external caller
could invalidate this analysis by having multiple threads call that function.
The optimization pass makes internal copies of each function for this
reason, but it is still recommended to mark functions as internal using keywords
like static whenever possible.
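As a minimal sketch of the linkage recommendation above, the helper below is marked static so that every call site is visible within the translation unit. The names fill_tile, tile_sum, and TILE are hypothetical and chosen only for illustration. Whether the remark is actually emitted for this code depends on the compiler version and flags.

```cpp
// Hypothetical tile size, chosen only for illustration.
constexpr int TILE = 8;

// `static` gives the helper internal linkage, so no caller outside this
// translation unit can exist and every call site is visible to the compiler.
static void fill_tile(double *tile, double value) {
#pragma omp parallel for
  for (int i = 0; i < TILE; ++i)
    tile[i] = value;
}

// Entry point with external linkage; returns the sum of the filled tile.
double tile_sum(double value) {
  double sum = 0.0;
#pragma omp target map(tofrom : sum)
  {
    // Created by a single thread in the target region; its address escapes
    // into fill_tile, where it is shared with the threads of the parallel
    // region.
    double tile[TILE];
    fill_tile(tile, value);
    for (int i = 0; i < TILE; ++i)
      sum += tile[i];
  }
  return sum;
}
```

When compiled without OpenMP enabled, the pragmas are ignored and the code runs sequentially on the host, so tile_sum(2.0) returns 16.0 either way.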
Example
This optimization should apply to any variable declared in an OpenMP target region that is then shared with every thread in a parallel region. This allows the user to declare shared memory without using custom allocators. A simple stencil calculation shows how this can be used.
// Assumed tile sizes and grid spacing (not part of the original listing); 32 * 32 * sizeof(double) = 8192 bytes.
constexpr int MC = 32;
constexpr int NC = 32;
constexpr double dX = 1.0;
void stencil(int M, int N, double *X, double *Y) {
#pragma omp target teams distribute collapse(2) \
  map(to : X [0:M * N]) map(tofrom : Y [0:M * N])
  for (int i0 = 0; i0 < M; i0 += MC) {
    for (int j0 = 0; j0 < N; j0 += NC) {
      double sX[MC][NC];
      // Load one tile of X into the team-shared buffer.
#pragma omp parallel for collapse(2) shared(sX) default(firstprivate)
      for (int i1 = 0; i1 < MC; ++i1)
        for (int j1 = 0; j1 < NC; ++j1)
          sX[i1][j1] = X[(i0 + i1) * N + (j0 + j1)];
      // Apply the five-point stencil to the interior of the tile.
#pragma omp parallel for collapse(2) shared(sX) default(firstprivate)
      for (int i1 = 1; i1 < MC - 1; ++i1)
        for (int j1 = 1; j1 < NC - 1; ++j1)
          Y[(i0 + i1) * N + (j0 + j1)] = (sX[i1 + 1][j1] + sX[i1 - 1][j1] +
                                          sX[i1][j1 + 1] + sX[i1][j1 - 1] -
                                          4.0 * sX[i1][j1]) / (dX * dX);
    }
  }
}
$ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass=openmp-opt -fopenmp-version=51 omp111.cpp
omp111.cpp:10:14: remark: Replaced globalized variable with 8192 bytes of shared memory. [OMP111]
      double sX[MC][NC];
             ^
The default data-sharing attribute for variables captured in an OpenMP parallel
region is shared. This means a pointer to the object is taken, which ultimately
results in globalization: the variable is mapped to shared memory when it could
have been placed in registers. To avoid this, make sure each variable that can be
copied into the region is marked firstprivate, either explicitly or using the
OpenMP 5.1 feature default(firstprivate).
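The explicit form of this advice can be sketched as follows. The function scale, the constant CHUNK, and the local names lo and hi are hypothetical and used only for illustration. The scalars and the pointer are only read inside the parallel region, so each is listed in a firstprivate clause and copied by value instead of being captured by address.

```cpp
// Hypothetical chunk size handled by one team per iteration.
constexpr int CHUNK = 4;

// Multiplies every element of `data` by `factor`, in place.
void scale(double *data, int n, double factor) {
#pragma omp target teams distribute map(tofrom : data[0:n])
  for (int i0 = 0; i0 < n; i0 += CHUNK) {
    int lo = i0;
    int hi = (i0 + CHUNK < n) ? i0 + CHUNK : n;
    // Without the firstprivate clause, lo, hi, factor, and data would default
    // to shared and be captured by address in the parallel region.
#pragma omp parallel for firstprivate(lo, hi, factor, data)
    for (int i = lo; i < hi; ++i)
      data[i] *= factor;
  }
}
```

Listing the variables explicitly avoids relying on default(firstprivate), which requires OpenMP 5.1 support. The explicit list must be kept in sync with the variables used in the loop body.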
Diagnostic Scope
OpenMP target offloading optimization remark.