tensorflow: TensorBoard callback without profile_batch setting causes errors: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES and CUPTI_ERROR_INVALID_PARAMETER

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Stateless LSTM from the Keras tutorial, using the tf backend
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.1.0
  • Python version: 3.7.4
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: MX150 10GB

Describe the current behavior

When tf.keras.callbacks.TensorBoard() is used without the profile_batch setting, it emits CUPTI_ERROR_INSUFFICIENT_PRIVILEGES and CUPTI_ERROR_INVALID_PARAMETER errors from tensorflow/core/profiler/internal/gpu/cupti_tracer.cc.

Describe the expected behavior

With profile_batch = 0 the two errors disappear, but they come back when profile_batch = 1 or any other non-zero value.
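
Below is a minimal sketch of the profile_batch behaviour described above (the log_dir value is only illustrative): setting profile_batch=0 disables the profiler entirely, so no CUPTI calls are made, while any non-zero value re-enables it.

import tensorflow as tf

# Profiler disabled: no CUPTI calls are made, so the privilege errors do not appear.
tb_no_profile = tf.keras.callbacks.TensorBoard(log_dir='callback_tests', profile_batch=0)

# Profiler enabled: CUPTI is initialized and the errors come back unless the
# current user is allowed to access the GPU performance counters.
tb_with_profile = tf.keras.callbacks.TensorBoard(log_dir='callback_tests', profile_batch=1)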

Code to reproduce the issue


from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM


input_len = 1000
tsteps = 2
lahead = 1
batch_size = 1
epochs = 5

print("*" * 33)
if lahead >= tsteps:
    print("STATELESS LSTM WILL ALSO CONVERGE")
else:
    print("STATELESS LSTM WILL NOT CONVERGE")
print("*" * 33)

np.random.seed(1986)

print('Generating Data...')


def gen_uniform_amp(amp=1, xn=10000):

    data_input = np.random.uniform(-1 * amp, +1 * amp, xn)
    data_input = pd.DataFrame(data_input)
    return data_input


to_drop = max(tsteps - 1, lahead - 1)
data_input = gen_uniform_amp(amp=0.1, xn=input_len + to_drop)

expected_output = data_input.rolling(window=tsteps, center=False).mean()

if lahead > 1:
    data_input = np.repeat(data_input.values, repeats=lahead, axis=1)
    data_input = pd.DataFrame(data_input)
    for i, c in enumerate(data_input.columns):
        data_input[c] = data_input[c].shift(i)

expected_output = expected_output[to_drop:]
data_input = data_input[to_drop:]


def create_model(stateful):
    model = Sequential()
    model.add(LSTM(20,
              input_shape=(lahead, 1),
              batch_size=batch_size,
              stateful=stateful))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    return model

print('Creating Stateful Model...')
model_stateful = create_model(stateful=True)


def split_data(x, y, ratio=0.8):
    to_train = int(input_len * ratio)
    to_train -= to_train % batch_size
    x_train = x[:to_train]
    y_train = y[:to_train]
    x_test = x[to_train:]
    y_test = y[to_train:]

    # tweak to match with batch_size
    to_drop = x.shape[0] % batch_size
    if to_drop > 0:
        x_test = x_test[:-1 * to_drop]
        y_test = y_test[:-1 * to_drop]

    # some reshaping
    reshape_3 = lambda x: x.values.reshape((x.shape[0], x.shape[1], 1))
    x_train = reshape_3(x_train)
    x_test = reshape_3(x_test)

    reshape_2 = lambda x: x.values.reshape((x.shape[0], 1))
    y_train = reshape_2(y_train)
    y_test = reshape_2(y_test)

    return (x_train, y_train), (x_test, y_test)


(x_train, y_train), (x_test, y_test) = split_data(data_input, expected_output)
print('x_train.shape: ', x_train.shape)
print('y_train.shape: ', y_train.shape)
print('x_test.shape: ', x_test.shape)
print('y_test.shape: ', y_test.shape)

print('Creating Stateless Model...')
model_stateless = create_model(stateful=False)

import os
import datetime
ROOT_DIR = os.getcwd()
log_dir = os.path.join('callback_tests')
if not os.path.exists(log_dir):
    os.makedirs(log_dir)
print(log_dir)
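# Note: no profile_batch argument is passed here, so profiling stays enabled by
# default and CUPTI is initialized when training starts.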
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
                                       
print('Training')
history = model_stateless.fit(x_train,
                    y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test),
                    shuffle=False,
                    callbacks=[tensorboard_callback]
                    )


Other info / logs

Train on 800 samples, validate on 200 samples
2020-01-14 21:30:27.591905: I tensorflow/core/profiler/lib/profiler_session.cc:225] Profiler session started.
2020-01-14 21:30:27.594743: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1259] Profiler found 1 GPUs
2020-01-14 21:30:27.599172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cupti64_101.dll
2020-01-14 21:30:27.704083: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1307] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-01-14 21:30:27.716790: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1346] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
Epoch 1/5
2020-01-14 21:30:28.370429: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-01-14 21:30:28.651767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-01-14 21:30:29.662864: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1329] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-01-14 21:30:29.670282: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:88] GpuTracer has collected 0 callback api events and 0 activity events.
800/800 [==============================] - 5s 6ms/sample - loss: 0.0011 - val_loss: 0.0011
Epoch 2/5
800/800 [==============================] - 3s 4ms/sample - loss: 8.5921e-04 - val_loss: 0.0010
Epoch 3/5
800/800 [==============================] - 3s 3ms/sample - loss: 8.5613e-04 - val_loss: 0.0010
Epoch 4/5
800/800 [==============================] - 3s 4ms/sample - loss: 8.5458e-04 - val_loss: 9.9713e-04
Epoch 5/5
800/800 [==============================] - 3s 4ms/sample - loss: 8.5345e-04 - val_loss: 9.8825e-04

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 23
  • Comments: 56 (6 by maintainers)

Most upvoted comments

Adding options nvidia "NVreg_RestrictProfilingToAdminUsers=0" to /etc/modprobe.d/nvidia-kernel-common.conf and rebooting should resolve the permission issue.

Anyone with the same issue on Windows 10? The two offered solutions only work for Linux.

This solved the issue for me: right-click on the desktop for quick access to the NVIDIA Control Panel. Step 1: open the NVIDIA Control Panel, select ‘Desktop’, and ensure ‘Enable Developer Settings’ is checked. Step 2: under ‘Developer’ > ‘Manage GPU Performance Counters’, select ‘Allow access to the GPU performance counters to all users’ to enable unrestricted profiling [1]

@tamaramiteva I resolved the problem with the docker run option --privileged=true. There are no more errors such as CUPTI_ERROR_INSUFFICIENT_PRIVILEGES.

For people using Docker on Linux: instead of running the container with --privileged=true, just pass --cap-add=CAP_SYS_ADMIN.


Hey @trisolaran, thanks for the brief intro. The thing is, I do not have the file /etc/modprobe.d/nvidia-kernel-common.conf. I am using a conda environment.

Can confirm that the following fixed the problem for me. It’s possible that only a subset is strictly necessary.

(note: Docker config, using official 2.2.0 image)

  • updating host machine to Ubuntu 20.04
  • adding options nvidia "NVreg_RestrictProfilingToAdminUsers=0" to /etc/modprobe.d/nvidia-kernel-common.conf and running update-initramfs -u
  • adding export CUDA_VERSION="10.1", export LD_LIBRARY_PATH="/usr/local/cuda-${CUDA_VERSION}/lib64:/usr/local/cuda-${CUDA_VERSION}/extras/CUPTI/lib64" and export LD_INCLUDE_PATH="/usr/local/cuda-${CUDA_VERSION}/include:/usr/local/cuda-${CUDA_VERSION}/extras/CUPTI/include" to the host machine’s .zshrc
  • adding ENV LD_INCLUDE_PATH="/usr/local/cuda/include:/usr/local/cuda/extras/CUPTI/include:$LD_INCLUDE_PATH" to the Dockerfile
  • running the Docker container with --privileged

@gawain-git-code, I tried running the code in Colab and I was able to run it successfully. Please find the gist for reference. Thanks!

In order to run docker: nvidia-docker run --privileged=true -d -it --name retina_net -v /home/readib/Experiments/:/home -p 8000:8888 -v /tmp/.X11-unix/:/tmp/.X11-unix -e DISPLAY=$DISPLAY retina_net:latest /bin/bash

Now I can profile with --profile_steps=1000,1005 (for example, 5 steps), but if I increase it to 10 steps, a non-deterministic segfault appears. Has this happened to anyone else?

Yes, I get that segfault too. I think it happens because the overhead of profiling, on top of the regular GPU computations, causes GPU memory to overflow.
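
If GPU memory pressure really is the cause, one thing that may be worth trying (this is my assumption, not a confirmed fix for the segfault) is enabling memory growth, so TensorFlow allocates GPU memory on demand instead of reserving it all up front and leaves some headroom for the profiler:

import tensorflow as tf

# Must run before any GPU is initialized (i.e. before building the model).
# Memory is then allocated on demand rather than pre-allocated in full,
# which may leave room for the profiler's own buffers.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)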

@SarfarazHabib Hi, I am using a conda environment too, and I solved this problem by adding
options nvidia "NVreg_RestrictProfilingToAdminUsers=0" to /etc/modprobe.d/nvidia-kernel-common.conf. I did not have the file either, so you should create it.

I am having the same error in an Anaconda environment. None of the solutions posted above work for me. Does anyone have any ideas about what can be done? Also, what does this error actually mean, if someone is kind enough to explain it to a noob?

Anyone with the same issue on Windows 10? The two offered solutions only work for Linux.

This solved the issue for me: right-click on the desktop for quick access to the NVIDIA Control Panel. Step 1: open the NVIDIA Control Panel, select ‘Desktop’, and ensure ‘Enable Developer Settings’ is checked. Step 2: under ‘Developer’ > ‘Manage GPU Performance Counters’, select ‘Allow access to the GPU performance counters to all users’ to enable unrestricted profiling [1]

This seems to be a global setting, so it works with monkey-patched Anaconda environments.