cudf: [BUG] thrust::system::system_error what(): for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

The script below just reads randomly created JSON files using Dask, with no heavy processing.

Dask worker logs show errors like the ones below, which cause the workers to restart repeatedly and eventually lead to connection issues between the scheduler and the workers.

NOTE: If I do not use Dask, the processing seems to go through without failures.
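
For reference, the non-Dask path is essentially the sketch below (same read_json call as the processing script further down), reading the files sequentially in one process; the filename pattern is the one produced by the generator script.

# Minimal sketch of the non-Dask path: read the same files sequentially in a
# single process, which completes without the illegal-access error.
import cudf

for i in range(25):
    df = cudf.read_json(f"json_files/json-{i}.txt", lines=True, engine="cudf")
    print(i, len(df))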

Worker logs:

terminate called after throwing an instance of 'thrust::system::system_error' what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called recursively
distributed.nanny - INFO - Worker process 13050 was killed by signal 6

I used the following commands:

  1. Start Scheduler: nohup dask-scheduler --host localhost &> scheduler.out &
  2. Start Workers: CUDA_VISIBLE_DEVICES=0 nohup dask-worker localhost:8786 --nprocs 2 --nthreads 2 --memory-limit="16GB" --resources "process=1" >& worker.out &

Logs can be seen in scheduler.out and worker.out.

Random JSON files producer script:

# Creates 25 JSON-lines files, 2*120MB each

import json
import math
import os
from random import randrange

num_columns = 40

# Make sure the output directory exists before writing.
os.makedirs("json_files", exist_ok=True)

def column_names(size):
    base_cols = ["AppId{}", "LoggedTime{}", "timestamp{}"]
    cols = []
    mult = math.ceil(size / len(base_cols))
    for i in range(mult):
        for c in base_cols:
            cols.append(c.format(i))
            if len(cols) == size:
                break
    return cols

def generate_json(num_columns):
    dict_out = {}
    cols = column_names(num_columns)
    for col in cols:
        # All column types currently get a random integer value.
        if col.startswith("AppId"): dict_out[col] = randrange(1, 50000)
        elif col.startswith("LoggedTime"): dict_out[col] = randrange(1, 50000)
        else: dict_out[col] = randrange(1, 50000)
    return json.dumps(dict_out)

for i in range(25):
    with open("json_files/json-%i.txt" % i, "w") as f:
        for _ in range(2 * 150000):
            f.write(generate_json(num_columns) + "\n")

Processing script:

from distributed import Client
import cudf

client = Client("localhost:8786")
client.get_versions(check=True)

def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)

batch_arr = [i for i in range(1,25)]
res = client.map(func_json, batch_arr)
print(client.gather(res))

Can someone please help? I only started seeing this kind of failure within the last week.

I am using a fresh conda environment with this being the only installation command: conda install -y -c rapidsai-nightly -c nvidia -c conda-forge -c defaults custreamz python=3.7 cudatoolkit=10.2.

I am using a T4 GPU with CUDA 10.2.

P.S. This seems similar to #5897.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 38 (35 by maintainers)

Most upvoted comments

Got local repro with multithreaded JSON reads:

TEST_F(JsonReaderTest, Repro)
{
  auto read_all = [&]() {
    cudf_io::read_json_args in_args{cudf_io::source_info{""}};
    in_args.lines = true;
    for (int i = 0; i < 25; ++i) {
      in_args.source =
        cudf_io::source_info{"/home/vukasin/cudf/json-" + std::to_string(i) + ".txt"};
      auto df = cudf_io::read_json(in_args);
    }
  };

  auto th1 = std::async(std::launch::async, read_all);
  auto th2 = std::async(std::launch::async, read_all);
}

Reproduces fairly consistently.

When using GPUs with Dask, the current working assumption is that there should be one worker and one thread per GPU. This is mainly for proper CUDA context creation, but it is also useful for resource management. We built dask-cuda to make this setup trivial for users; a sketch follows.
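
A minimal sketch of that setup, assuming dask-cuda is installed (the LocalCUDACluster defaults shown here are an assumption, not part of the original report):

# Sketch of the suggested dask-cuda setup: LocalCUDACluster starts one worker
# process with one thread per visible GPU, replacing the --nprocs 2 --nthreads 2
# configuration used above.
from dask_cuda import LocalCUDACluster
from distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
print(client)

There is also a dask-cuda-worker CLI that can stand in for the dask-worker command from the original report.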

I made a significant change to the JSON reader two weeks ago that could affect this.

I suspect synchronization issues that got exposed by GPU saturation from the concurrent reads. Digging into the repro, I found a few places where the synchronization is iffy. I need to look into it some more to find the root cause.
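
One way to probe this hypothesis from the Python side (a sketch, not something from the report) is to serialize the reads inside a worker and see whether the failures disappear; func_json_locked is a hypothetical drop-in for func_json above.

# Hedged sketch: guard cudf.read_json with a lock so only one thread per
# worker process reads at a time. If the errors go away, that supports the
# concurrent-read synchronization hypothesis. Note the lock only serializes
# threads within a single process, not the two worker processes.
import threading

import cudf

_read_lock = threading.Lock()

def func_json_locked(batch):
    file = f"json_files/json-{batch}.txt"
    with _read_lock:
        df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)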

Okay, so I thought of using CSV files instead of JSON, so I used

import os
import cudf

os.makedirs("csv_files", exist_ok=True)  # ensure the output directory exists
for i in range(20):
    file = f"json_files/json-{i}.txt"
    cudf.read_json(file, lines=True, engine="cudf").to_csv(f"csv_files/csv-{i}.csv")

to convert the existing JSON files to CSV, and then updated the repro script to call read_csv:

def func_csv(batch):
    file = f"csv_files/csv-{batch}.csv"
    df = cudf.read_csv(file)
    return len(df)

It seems to run fine with 2 processes and 2 threads. So is this happening specifically with the JSON reader?

Does this reproduce with a ThreadPoolExecutor? Maybe something like this:

from concurrent.futures import ThreadPoolExecutor
import cudf


def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)


with ThreadPoolExecutor(max_workers=1) as executor:
    batch_arr = [i for i in range(1, 25)]
    res = executor.map(func_json, batch_arr)
    for e in res:
        print(e)

Edit: It may be worth playing with max_workers here; see the sketch below.
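
A sketch of that sweep (same func_json as above; the particular worker counts are arbitrary):

from concurrent.futures import ThreadPoolExecutor

import cudf

def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)

# Increase the thread count to see where the failure starts to reproduce.
for workers in (1, 2, 4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(func_json, range(1, 25)))
    print(workers, results)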

cc @harrism as we’re seeing a threading-related issue and there were substantial changes with regard to RMM and threading.

@jakirkham I’m still seeing the same issues with the latest nightlies (0.16.0a200812). Can you try to reproduce them locally so that I can make sure I’m not doing anything differently?

I’m running the repro locally, will update once the script is done.

If it is an OOM issue it’s possible this is related to an RMM/Dask-CUDA/Dask issue where device 0 is the only device being used even though multiple GPUs are requested

Just to add to this: in other words, this is an issue related to PR https://github.com/rapidsai/rmm/pull/466. We are discussing this in other contexts as well.
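
One hedged way to check the device-0 hypothesis above is to ask every worker which GPU it is bound to; this sketch assumes numba is importable on the workers (it ships as a cudf dependency) and uses the scheduler address from the original report.

from distributed import Client

def which_gpu():
    # Report the CUDA device this worker process ends up using.
    import os
    from numba import cuda
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "device_id": cuda.get_current_device().id,
    }

client = Client("localhost:8786")
# client.run executes the function on every worker and returns a dict keyed
# by worker address, so the devices in use are visible at a glance.
print(client.run(which_gpu))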
