datatable: FTRL algo does not work properly on views

Hi,

I’m trying to use datatable’s FTRL-Proximal algorithm on a dataset and it behaves strangely: LogLoss increases with the number of epochs.

Here is the code I use:

import numpy as np
import datatable as dt
from datatable.models import Ftrl
from sklearn.metrics import log_loss

train_dt = dt.fread('dt_ftrl_test_set.csv.gz')
features = [f for f in train_dt.names if f not in ['HasDetections']]
for n in range(10):
    ftrl = Ftrl(nepochs=n + 1)
    ftrl.fit(train_dt[:, features], train_dt[:, 'HasDetections'])
    # trn_ is a row selector for my training subset, defined earlier
    print(log_loss(np.array(train_dt[trn_, 'HasDetections'])[:, 0],
                   np.array(ftrl.predict(train_dt[trn_, features]))))

The output is:

0.6975873940617929
0.7004277294410224
0.7030339011892597
0.705290424565774
0.7072685897773024
0.7091474008277487
0.7108282513596036
0.7123130263929156
0.713890830846544
0.7151695514165213

My own version of FTRL trains correctly, with the following output:

time_used:0:00:01.026606	epoch: 0   rows:10001	t_logloss:0.59638
time_used:0:00:01.715622	epoch: 1   rows:10001	t_logloss:0.52452
time_used:0:00:02.436984	epoch: 2   rows:10001	t_logloss:0.48113
time_used:0:00:03.158367	epoch: 3   rows:10001	t_logloss:0.44260
time_used:0:00:03.851369	epoch: 4   rows:10001	t_logloss:0.39633
time_used:0:00:04.553488	epoch: 5   rows:10001	t_logloss:0.38197
time_used:0:00:05.264179	epoch: 6   rows:10001	t_logloss:0.35380
time_used:0:00:05.973398	epoch: 7   rows:10001	t_logloss:0.32839
time_used:0:00:06.688121	epoch: 8   rows:10001	t_logloss:0.32057
time_used:0:00:07.394217	epoch: 9   rows:10001	t_logloss:0.29917
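
For reference, here is roughly the per-coordinate update my implementation follows (a minimal sketch of the standard FTRL-Proximal update from McMahan et al. for binary/hashed features; the hyperparameter names alpha, beta, l1, l2 are the paper’s, not necessarily datatable’s):

import math

class FtrlSketch:
    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-coordinate adjusted gradient sums
        self.n = {}  # per-coordinate sums of squared gradients

    def _weight(self, i):
        # Lazy weight computation with L1-induced sparsity
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / \
               ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict_one(self, x):
        # x is an iterable of active (binary) feature indices
        wx = sum(self._weight(i) for i in x)
        return 1.0 / (1.0 + math.exp(-max(min(wx, 35.0), -35.0)))

    def fit_one(self, x, y):
        p = self.predict_one(x)
        g = p - y  # logloss gradient for an active binary feature
        for i in x:
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g
        return p

With this update, the training logloss should decrease across epochs, which is what my numbers above show.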
  • Your environment? I’m on Ubuntu 16.04 with clang+llvm-7.0.0-x86_64-linux-gnu-ubuntu-16.04 and Python 3.6; datatable is compiled from source.

Let me know if you need more information.

I guess I’m missing something, but I could not find anything relevant in the unit tests.

Thanks for your help.

P.S.: the make test results and the dataset I use are attached: datatable_make_test_results.txt and dt_ftrl_test_set.csv.gz

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 19 (9 by maintainers)

Most upvoted comments

@goldentom42 Olivier, if your original Python code works, and you clarify points (1) and (2), I can also try to reproduce the problem on my side. I suspect the difference may be caused by the fact that I measure logloss for all the rows (the full slice :), while you use only a given subset trn_.
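
Concretely, the comparison I have in mind looks like this (a sketch only; trn_ stands for whatever row selector your code defines, which is not shown in the snippet above):

import numpy as np
from sklearn.metrics import log_loss

# Logloss over all rows (what I measure)
y_all = np.array(train_dt[:, 'HasDetections'])[:, 0]
p_all = np.array(ftrl.predict(train_dt[:, features]))
print('all rows:', log_loss(y_all, p_all))

# Logloss over the trn_ subset only (what your snippet measures);
# note that train_dt[trn_, features] is a view of the original frame
y_sub = np.array(train_dt[trn_, 'HasDetections'])[:, 0]
p_sub = np.array(ftrl.predict(train_dt[trn_, features]))
print('subset:  ', log_loss(y_sub, p_sub))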

Olivier, the OMP error you are seeing is caused by several OpenMP libraries being loaded at runtime. Here’s what this means:

  • When datatable is imported, it loads the dynamic library libomp.so (unless one is already loaded);
  • If another library is imported afterwards (such as numpy or scikit-learn), it may also want OpenMP support and try to load its own OpenMP library. Normally this works smoothly, provided that other library loads OpenMP dynamically. However, if the library has OpenMP compiled in statically, problems occur: OpenMP detects that two copies of itself are present and aborts execution.

The right solution to this problem is to recompile all libraries with dynamic loading of OpenMP, although I understand this might not be easy to do. Possible workarounds are:

  • Load the “bad” library first, so that it brings its own OpenMP, and only then import datatable, which will use the OpenMP runtime that is already present;
  • Set the environment variable KMP_DUPLICATE_LIB_OK=TRUE, although I don’t know what the consequences might be. (Both workarounds are sketched below.)
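
In code, the two workarounds look roughly like this (a sketch; sklearn stands in here for whichever library statically links OpenMP):

import os
# Workaround 2: must be set before any OpenMP runtime is loaded
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

# Workaround 1: import the statically-linked library first, so that
# datatable reuses the OpenMP runtime that is already loaded
import sklearn
import datatable as dt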