cudf: [BUG] thrust::system::system_error what(): for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

The script below just reads randomly created JSON files using Dask, with no heavy processing.

Dask worker logs show errors like the ones below, which cause the workers to restart repeatedly and eventually lead to connection issues between the scheduler and the workers.

NOTE: If I do not use Dask, the processing seems to go through without failures.
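
For reference, the non-Dask path is essentially the sketch below (same read_json call as the processing script further down), reading the files sequentially in one process; the filename pattern is the one produced by the generator script.

# Minimal sketch of the non-Dask path: read the same files sequentially in a
# single process, which completes without the illegal-access error.
import cudf

for i in range(25):
    df = cudf.read_json(f"json_files/json-{i}.txt", lines=True, engine="cudf")
    print(i, len(df))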

Worker logs:

terminate called after throwing an instance of 'thrust::system::system_error' what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called recursively
distributed.nanny - INFO - Worker process 13050 was killed by signal 6

I used the following commands:

  1. Start Scheduler: nohup dask-scheduler --host localhost &> scheduler.out &
  2. Start Workers: CUDA_VISIBLE_DEVICES=0 nohup dask-worker localhost:8786 --nprocs 2 --nthreads 2 --memory-limit="16GB" --resources "process=1" >& worker.out &

Logs can be seen in scheduler.out and worker.out.

Random JSON files producer script:

# Creates 25 JSON-lines files, 2*120MB each

import json
import math
import os
from random import randrange

num_columns = 40

# Make sure the output directory exists before writing.
os.makedirs("json_files", exist_ok=True)

def column_names(size):
    base_cols = ["AppId{}", "LoggedTime{}", "timestamp{}"]
    cols = []
    mult = math.ceil(size / len(base_cols))
    for i in range(mult):
        for c in base_cols:
            cols.append(c.format(i))
            if len(cols) == size:
                break
    return cols

def generate_json(num_columns):
    dict_out = {}
    cols = column_names(num_columns)
    for col in cols:
        # All column types currently get a random integer value.
        if col.startswith("AppId"): dict_out[col] = randrange(1, 50000)
        elif col.startswith("LoggedTime"): dict_out[col] = randrange(1, 50000)
        else: dict_out[col] = randrange(1, 50000)
    return json.dumps(dict_out)

for i in range(25):
    with open("json_files/json-%i.txt" % i, "w") as f:
        for _ in range(2 * 150000):
            f.write(generate_json(num_columns) + "\n")

Processing script:

from distributed import Client
import cudf

client = Client("localhost:8786")
client.get_versions(check=True)

def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)

batch_arr = [i for i in range(1,25)]
res = client.map(func_json, batch_arr)
print(client.gather(res))

Can someone please help? I only started seeing this kind of failure within the last week.

I am using a fresh conda environment with this being the only installation command: conda install -y -c rapidsai-nightly -c nvidia -c conda-forge -c defaults custreamz python=3.7 cudatoolkit=10.2.

I am using a T4 GPU with CUDA 10.2.

P.S. This seems similar to #5897.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 38 (35 by maintainers)

Most upvoted comments

Got local repro with multithreaded JSON reads:

TEST_F(JsonReaderTest, Repro)
{
  auto read_all = [&]() {
    cudf_io::read_json_args in_args{cudf_io::source_info{""}};
    in_args.lines = true;
    for (int i = 0; i < 25; ++i) {
      in_args.source =
        cudf_io::source_info{"/home/vukasin/cudf/json-" + std::to_string(i) + ".txt"};
      auto df = cudf_io::read_json(in_args);
    }
  };

  auto th1 = std::async(std::launch::async, read_all);
  auto th2 = std::async(std::launch::async, read_all);
}

Reproduces fairly consistently.

When using GPUs with Dask, the current working assumption is that there should be one worker and one thread per GPU. This is mainly for proper CUDA context creation, but it is also useful for resource management. We built dask-cuda to make this setup trivial for users; a sketch follows.
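
A minimal sketch of that setup, assuming dask-cuda is installed (the LocalCUDACluster defaults shown here are an assumption, not part of the original report):

# Sketch of the suggested dask-cuda setup: LocalCUDACluster starts one worker
# process with one thread per visible GPU, replacing the --nprocs 2 --nthreads 2
# configuration used above.
from dask_cuda import LocalCUDACluster
from distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
print(client)

There is also a dask-cuda-worker CLI that can stand in for the dask-worker command from the original report.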

I made a significant change to the JSON reader two weeks ago that could affect this.

I suspect synchronization issues that got exposed by GPU saturation from the concurrent reads. Digging into the repro, I found a few places where the synchronization is iffy. I need to look into it some more to find the root cause.
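
One way to probe this hypothesis from the Python side (a sketch, not something from the report) is to serialize the reads inside a worker and see whether the failures disappear; func_json_locked is a hypothetical drop-in for func_json above.

# Hedged sketch: guard cudf.read_json with a lock so only one thread per
# worker process reads at a time. If the errors go away, that supports the
# concurrent-read synchronization hypothesis. Note the lock only serializes
# threads within a single process, not the two worker processes.
import threading

import cudf

_read_lock = threading.Lock()

def func_json_locked(batch):
    file = f"json_files/json-{batch}.txt"
    with _read_lock:
        df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)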

Okay, so I thought of using CSV files instead of JSON, so I used

import os
import cudf

os.makedirs("csv_files", exist_ok=True)  # ensure the output directory exists
for i in range(20):
    file = f"json_files/json-{i}.txt"
    cudf.read_json(file, lines=True, engine="cudf").to_csv(f"csv_files/csv-{i}.csv")

to convert the existing JSON files to CSV, and then updated the repro script to call read_csv:

def func_csv(batch):
    file = f"csv_files/csv-{batch}.csv"
    df = cudf.read_csv(file)
    return len(df)

It seems to run fine with 2 processes and 2 threads. So is this happening specifically with the JSON reader?

Does this reproduce with a ThreadPoolExecutor? Maybe something like this:

from concurrent.futures import ThreadPoolExecutor
import cudf


def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)


with ThreadPoolExecutor(max_workers=1) as executor:
    batch_arr = [i for i in range(1, 25)]
    res = executor.map(func_json, batch_arr)
    for e in res:
        print(e)

Edit: It may be worth playing with max_workers here; see the sketch below.
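
A sketch of that sweep (same func_json as above; the particular worker counts are arbitrary):

from concurrent.futures import ThreadPoolExecutor

import cudf

def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)

# Increase the thread count to see where the failure starts to reproduce.
for workers in (1, 2, 4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(func_json, range(1, 25)))
    print(workers, results)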

cc @harrism as we’re seeing a threading-related issue and there were substantial changes with regard to RMM and threading.

@jakirkham I’m still seeing the same issues with the latest nightlies (0.16.0a200812). Can you try to reproduce them locally so that I can make sure I’m not doing anything differently?

I’m running the repro locally, will update once the script is done.

If it is an OOM issue it’s possible this is related to an RMM/Dask-CUDA/Dask issue where device 0 is the only device being used even though multiple GPUs are requested

Just to add to this: in other words, this is an issue related to PR https://github.com/rapidsai/rmm/pull/466. We are discussing this in other contexts as well.
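
One hedged way to check the device-0 hypothesis above is to ask every worker which GPU it is bound to; this sketch assumes numba is importable on the workers (it ships as a cudf dependency) and uses the scheduler address from the original report.

from distributed import Client

def which_gpu():
    # Report the CUDA device this worker process ends up using.
    import os
    from numba import cuda
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "device_id": cuda.get_current_device().id,
    }

client = Client("localhost:8786")
# client.run executes the function on every worker and returns a dict keyed
# by worker address, so the devices in use are visible at a glance.
print(client.run(which_gpu))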
