tensorflow: 4x slowdown in feed_dict in tf-nightly-gpu

Upgrading from pip install tensorflow (1.5) to pip install tf-nightly-gpu slows down feeding by about 4x.

Feeding a 100MB array used to take 15ms; after the change it takes 60ms. I think this is due to a change in alignment requirements for the AVX-compiled binary. I want to start this thread to figure out how to regain the performance before these changes make it into an official release.
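
I'm not sure exactly what alignment the new binary expects, but it's easy to see which alignments a freshly allocated numpy array happens to satisfy; a minimal check (the array size and candidate alignments are just illustrative):

import numpy as np

# ~100 MB of float32, allocated the default way
a = np.empty(25 * 2**20, dtype=np.float32)

# report which alignments the data pointer happens to satisfy
for align in (16, 32, 64, 128):
    print("aligned to %3d bytes: %s" % (align, a.ctypes.data % align == 0))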

benchmark: align_feed_bug.py

# version: 1.5.0
python align_feed_bug.py
feed-cpu-variable   : min: 17.17, median: 19.95, mean: 19.82

# After upgrading to tf nightly
# version: 1.7.0-dev20180221
python align_feed_bug.py
feed-cpu-variable   : min: 53.97, median: 57.15, mean: 66.60

I've tried using @eamartin's recipe (https://github.com/numpy/numpy/issues/5312#issuecomment-299533915) to make sure the numpy arrays are 128-byte aligned, but that didn't make any difference in speed:

python align_feed_bug.py --align=1
feed-cpu-variable   : min: 49.54, median: 50.50, mean: 60.06
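
For reference, that recipe boils down to over-allocating a raw byte buffer and slicing it at an aligned offset; a minimal sketch of the idea (the helper name and sizes are mine, not from align_feed_bug.py):

import numpy as np

def empty_aligned(shape, dtype=np.float32, align=128):
    # Over-allocate by `align` bytes, then slice so the data pointer
    # is a multiple of `align` (sketch of the trick from numpy#5312).
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = np.empty(nbytes + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)

x = empty_aligned((25 * 2**20,))   # ~100 MB of float32
assert x.ctypes.data % 128 == 0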

cc @martinwicke

Most upvoted comments

@niklas88 A GPU at low utilization while TF uses only one core is consistent with a copy problem: a memcpy of a 1GB array takes about a second and uses a single core. You could check whether align_numpy_tf is working for you by feeding a 1GB aligned array and seeing whether it takes ~1 sec or ~1 ms.
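
A rough sketch of such a check, using the TF 1.x API from this issue (the plain np.zeros below would be swapped for the aligned allocation to compare the two cases; sizes are illustrative):

import time
import numpy as np
import tensorflow as tf

data = np.zeros(2**28, dtype=np.float32)   # 2**28 * 4 bytes = 1 GB

x = tf.placeholder(tf.float32, shape=data.shape)
y = tf.reduce_sum(x)   # fetch a scalar so the timing is dominated by the feed copy

with tf.Session() as sess:
    sess.run(y, feed_dict={x: data})       # warm-up
    start = time.time()
    sess.run(y, feed_dict={x: data})
    print("feed took %.1f ms" % ((time.time() - start) * 1000))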

@tatianashp The problem occurs under nvidia-docker with gcr.io/tensorflow/tensorflow:1.6.0-gpu-py3 on NVIDIA TITAN X cards (both Maxwell and Pascal), both inside Kubernetes and directly on the Ubuntu 16.04 LTS host with Docker CE. The driver version is 390.25, but the same happened with an older version (we tried updating today during our investigation). As for the Docker image, there are almost no other dependencies (gensim + joblib), and those stay the same when switching from 1.5 to 1.6.

I have spent some time trying to get @yaroslavvb's align_numpy_tf() hack to work, but it's a bit hairy because the embeddings' size is taken into account when building the graph, and at that point there is no session yet. I got it working with a workaround but couldn't really see a performance difference. Still, I think the very large NumPy array feed is a likely culprit, since other models by colleagues don't show a similar degradation, but they also don't feed large NumPy arrays.

Note that this is with the embeddings created under the 'cpu:0' device; when creating them on the GPU it takes 1 minute per batch even on 1.5 (I'm not sure about 1.6). The loss develops almost identically. Interestingly, with 1.5 I see 60-70% GPU utilization in nvidia-smi, but with 1.6 it drops to 0-2%. CPU load is 100% with both versions, though with 1.5 and no GPU it doesn't go beyond 100%, while 1.6 uses just one core.
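
For context, the setup is roughly along these lines: the embedding matrix lives in a NumPy array that is passed through feed_dict, with the lookup pinned to 'cpu:0' (a minimal sketch; names and sizes are illustrative, not our actual model):

import numpy as np
import tensorflow as tf

vocab_size, dim = 1000000, 256                        # illustrative sizes, ~1 GB
emb_matrix = np.zeros((vocab_size, dim), np.float32)  # the large array that gets fed

with tf.device('/cpu:0'):
    emb = tf.placeholder(tf.float32, shape=[vocab_size, dim])
    ids = tf.placeholder(tf.int32, shape=[None])
    vectors = tf.nn.embedding_lookup(emb, ids)        # lookup stays on the CPU

with tf.Session() as sess:
    # every step copies the whole embedding matrix through feed_dict,
    # which is where a feed slowdown between 1.5 and 1.6 would bite
    sess.run(vectors, feed_dict={emb: emb_matrix, ids: [0, 1, 2]})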

I can try to create a reproducer tomorrow though it will likely need a large download for the embeddings…