cudf: [BUG] thrust::system::system_error what(): for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
The script below just reads randomly created JSON files using Dask, with no heavy processing.
The Dask worker logs show errors like the ones below, which eventually cause workers to restart repeatedly and lead to connection issues between the scheduler and workers.
NOTE: If I do not use Dask, the processing goes through without failures.
Worker logs:
terminate called after throwing an instance of 'thrust::system::system_error' what(): for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called recursively
distributed.nanny - INFO - Worker process 13050 was killed by signal 6
I used the following commands:
- Start Scheduler:
nohup dask-scheduler --host localhost &> scheduler.out &
- Start Workers:
CUDA_VISIBLE_DEVICES=0 nohup dask-worker localhost:8786 --nprocs 2 --nthreads 2 --memory-limit="16GB" --resources "process=1" >& worker.out &
Logs can be seen in scheduler.out and worker.out.
Random JSON files producer script:
# Creates 25 JSON files, 2*120MB each
from random import randrange,seed
import json
import math
import time
import random
num_columns = 40
def column_names(size):
    base_cols = ["AppId{}", "LoggedTime{}", "timestamp{}"]
    cols = []
    mult = math.ceil(size / len(base_cols))
    for i in range(mult):
        for c in base_cols:
            cols.append(c.format(i))
            if len(cols) == size:
                break
    return cols

def generate_json(num_columns):
    dict_out = {}
    cols = column_names(num_columns)
    for col in cols:
        if col.startswith("AppId"):
            dict_out[col] = randrange(1, 50000)
        elif col.startswith("LoggedTime"):
            dict_out[col] = randrange(1, 50000)
        else:
            dict_out[col] = randrange(1, 50000)
    return json.dumps(dict_out)

for i in range(0, 25):
    count = 0
    f = open("json_files/json-%i.txt" % i, "w+")
    while count < 2 * 150000:
        f.write(generate_json(num_columns) + "\n")
        count = count + 1
    f.close()
Processing script:
from distributed import Client, LocalCluster
import cudf
client = Client("localhost:8786")
client.get_versions(check=True)
def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)
batch_arr = [i for i in range(1,25)]
res = client.map(func_json, batch_arr)
print(client.gather(res))
Can someone please help? I only started seeing this kind of failure within the last week.
I am using a fresh conda environment with this being the only installation command:
conda install -y -c rapidsai-nightly -c nvidia -c conda-forge -c defaults custreamz python=3.7 cudatoolkit=10.2
I am using a T4 GPU with CUDA 10.2.
P.S. This seems similar to #5897.
About this issue
- State: closed
- Created 4 years ago
- Comments: 38 (35 by maintainers)
Got a local repro with multithreaded JSON reads. It repros fairly consistently.
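Roughly, the read pattern looks like the sketch below (hypothetical; the actual repro snippet isn't shown here, and the thread count and file paths are assumptions based on the scripts above):
# Hypothetical repro sketch: several host threads issue cudf JSON reads on the same GPU.
# Thread count and file naming are assumptions that follow the producer script above.
import threading
import cudf

def read_one(batch):
    df = cudf.read_json(f"json_files/json-{batch}.txt", lines=True, engine="cudf")
    print(batch, len(df))

threads = [threading.Thread(target=read_one, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()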
When using GPUs with Dask, the current working assumption is that there should be one worker and one thread per GPU. This is mainly for proper CUDA context creation, but it also helps with resource management. We built dask-cuda to make this setup trivial for users.
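For reference, a minimal sketch of that setup with dask-cuda's LocalCUDACluster (a sketch, not the exact commands used in this issue):
# Sketch: one worker process per visible GPU, one thread each, via dask-cuda.
from dask_cuda import LocalCUDACluster
from distributed import Client

cluster = LocalCUDACluster()  # spawns one single-threaded worker per GPU
client = Client(cluster)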
I made a significant change to the JSON reader two weeks ago that could affect this.
I suspect synchronization issue(s) that got exposed by GPU saturation from concurrent reads. Digging into the repro, I found a few places where the synchronization is iffy. I need to look into it some more to find the root cause.
Okay, so I thought of using CSV files instead of JSON. I converted the existing JSON files to CSV (sketched below) and then updated the repro script to call read_csv.
It seems to run fine with 2 processes and 2 threads. So is this specifically happening with the JSON reader?
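The conversion was along these lines (a sketch; the exact snippet isn't shown, and the output file names are assumptions):
# Hypothetical conversion sketch: read each line-delimited JSON file with cudf
# and write it back out as CSV; output paths are assumptions.
import cudf

for i in range(25):
    df = cudf.read_json(f"json_files/json-{i}.txt", lines=True, engine="cudf")
    df.to_csv(f"json_files/csv-{i}.csv", index=False)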
Does this reproduce with a ThreadPoolExecutor? Maybe something like the sketch below. Edit: it may be worth playing with max_workers here.
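A minimal sketch of what such a ThreadPoolExecutor repro might look like (hypothetical; max_workers and file paths are assumptions):
# Hypothetical sketch: drive the same cudf JSON reads through a thread pool
# and vary max_workers; the worker count and paths are assumptions.
from concurrent.futures import ThreadPoolExecutor
import cudf

def read_one(batch):
    df = cudf.read_json(f"json_files/json-{batch}.txt", lines=True, engine="cudf")
    return len(df)

with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(read_one, range(25))))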
cc @harrism, as we're seeing a threading-related issue and there were substantial changes with regard to RMM and threading.
I’m running the repro locally, will update once the script is done.
Filed an MRE here ( https://github.com/rapidsai/dask-cuda/issues/364 ).
Just to add to this: in other words, this is an issue related to PR https://github.com/rapidsai/rmm/pull/466. We are discussing this in other contexts as well.
If it is an OOM issue, it's possible this is related to an RMM/Dask-CUDA/Dask issue where device 0 is the only device being used even though multiple GPUs are requested.