neptune-client: BUG: Neptune v1.3.2 is slow when using forked data loaders on Linux

Describe the bug

I have a training script that uses the Trainer class from the HuggingFace Transformers library to train a text classification model. The Trainer class uses a NeptuneCallback to log training metrics to Neptune.

I noticed that my training script is inexplicably slow, so I profiled it using py-spy and found that neptune/internal/threading/daemon.py is taking more than 90% of the execution time of my script. I tried disabling the Neptune callback and saw a dramatic improvement in the speed of the training script.

Reproduction

Here’s my training code. I have the Neptune client installed, and the project name and API key are set as environment variables, so the Trainer class automatically sends logs to Neptune using the NeptuneCallback class.

import logging

import hydra
from transformers import Trainer, TrainingArguments

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
    )

    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()
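
For reference, here is roughly how the credentials are provided; a minimal sketch, assuming the standard Neptune environment variable names NEPTUNE_PROJECT and NEPTUNE_API_TOKEN (the values below are placeholders):

import os

# Placeholders only: the NeptuneCallback picks these up automatically if set.
os.environ["NEPTUNE_PROJECT"] = "my-workspace/my-project"
os.environ["NEPTUNE_API_TOKEN"] = "<your-api-token>"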

Expected behavior

I expect logging to Neptune not to impact the speed of my training script.

Environment

The output of pip list:

Package                    Version
-------------------------- ---------
accelerate                 0.20.3
aiobotocore                2.5.2
aiohttp                    3.8.4
aiohttp-retry              2.8.3
aioitertools               0.11.0
aiosignal                  1.3.1
amqp                       5.1.1
annotated-types            0.5.0
antlr4-python3-runtime     4.9.3
appdirs                    1.4.4
appnope                    0.1.3
arrow                      1.2.3
asn1crypto                 1.5.1
asttokens                  2.2.1
async-timeout              4.0.2
asyncssh                   2.13.2
atpublic                   4.0
attrs                      23.1.0
backcall                   0.2.0
backoff                    2.2.1
beautifulsoup4             4.12.2
billiard                   4.1.0
boto3                      1.26.161
botocore                   1.29.161
bravado                    11.0.3
bravado-core               5.17.1
celery                     5.3.1
certifi                    2023.5.7
cffi                       1.15.1
charset-normalizer         3.2.0
click                      8.1.4
click-didyoumean           0.3.0
click-plugins              1.1.1
click-repl                 0.3.0
colorama                   0.4.6
comm                       0.1.3
configobj                  5.0.8
cryptography               40.0.2
debugpy                    1.6.7
decorator                  5.1.1
dictdiffer                 0.9.0
diskcache                  5.6.1
distro                     1.8.0
dpath                      2.1.6
dulwich                    0.21.5
dvc                        3.5.1
dvc-data                   2.5.0
dvc-http                   2.30.2
dvc-objects                0.23.0
dvc-render                 0.5.3
dvc-s3                     2.23.0
dvc-studio-client          0.11.0
dvc-task                   0.3.0
executing                  1.2.0
filelock                   3.12.2
flatten-dict               0.4.2
flufl.lock                 7.1.1
fqdn                       1.5.1
frozenlist                 1.3.3
fsspec                     2023.6.0
funcy                      2.0
future                     0.18.3
gitdb                      4.0.10
GitPython                  3.1.32
grandalf                   0.8
huggingface-hub            0.16.4
hydra-core                 1.3.2
idna                       3.4
ipykernel                  6.24.0
ipython                    8.14.0
isoduration                20.11.0
iterative-telemetry        0.0.8
jedi                       0.18.2
Jinja2                     3.1.2
jmespath                   1.0.1
joblib                     1.3.1
jsonpointer                2.4
jsonref                    1.1.0
jsonschema                 4.18.0
jsonschema-specifications  2023.6.1
jupyter_client             8.3.0
jupyter_core               5.3.1
kombu                      5.3.1
markdown-it-py             3.0.0
MarkupSafe                 2.1.3
matplotlib-inline          0.1.6
mdurl                      0.1.2
monotonic                  1.6
mpmath                     1.3.0
msgpack                    1.0.5
multidict                  6.0.4
nanotime                   0.5.2
neptune                    1.3.2
nest-asyncio               1.5.6
networkx                   3.1
numpy                      1.25.1
oauthlib                   3.2.2
omegaconf                  2.3.0
orjson                     3.9.2
oscrypto                   1.3.0
packaging                  23.1
pandas                     2.0.3
parso                      0.8.3
pathspec                   0.11.1
pexpect                    4.8.0
pickleshare                0.7.5
Pillow                     10.0.0
pip                        23.1.2
platformdirs               3.8.1
prompt-toolkit             3.0.39
protobuf                   3.19.6
psutil                     5.9.5
ptyprocess                 0.7.0
pure-eval                  0.2.2
pyarrow                    10.0.1
pycparser                  2.21
pycryptodomex              3.18.0
pydantic                   2.0.2
pydantic_core              2.1.2
pydot                      1.4.2
pygit2                     1.12.2
Pygments                   2.15.1
pygtrie                    2.5.0
PyJWT                      2.7.0
pyOpenSSL                  23.2.0
pyparsing                  3.1.0
python-dateutil            2.8.2
pytz                       2023.3
PyYAML                     6.0
pyzmq                      25.1.0
referencing                0.29.1
regex                      2023.6.3
requests                   2.31.0
requests-oauthlib          1.3.1
rfc3339-validator          0.1.4
rfc3987                    1.3.8
rich                       13.4.2
rpds-py                    0.8.10
ruamel.yaml                0.17.32
ruamel.yaml.clib           0.2.7
s3fs                       2023.6.0
s3transfer                 0.6.1
safetensors                0.3.1
scikit-learn               1.3.0
scipy                      1.11.1
scmrepo                    1.0.4
sentencepiece              0.1.99
setuptools                 68.0.0
shortuuid                  1.0.11
shtab                      1.6.2
simplejson                 3.19.1
six                        1.16.0
smmap                      5.0.0
snowflake-connector-python 3.0.4
sortedcontainers           2.4.0
soupsieve                  2.4.1
sqltrie                    0.7.0
stack-data                 0.6.2
swagger-spec-validator     3.0.3
sympy                      1.12
tabulate                   0.9.0
threadpoolctl              3.1.0
tokenizers                 0.13.3
tomlkit                    0.11.8
torch                      2.0.1
tornado                    6.3.2
tqdm                       4.65.0
traitlets                  5.9.0
transformers               4.30.2
typing_extensions          4.7.1
tzdata                     2023.3
uri-template               1.3.0
urllib3                    1.26.16
vine                       5.0.0
voluptuous                 0.13.1
wcwidth                    0.2.6
webcolors                  1.13
websocket-client           1.6.1
wheel                      0.40.0
wrapt                      1.15.0
yarl                       1.9.2
zc.lockfile                3.0.post1

The operating system you’re using: Linux

The output of python --version: Python 3.11.3

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 34 (18 by maintainers)

Most upvoted comments

We just released v1.4.1 which fixes this 🥳

I’ll be closing this issue now, but please feel free to let us know if you are still facing issues.

Hello all 👋

We have identified the root cause of this issue and have pushed a fix v1.4.1rc1 🎉

With this fix, you will not need to resort to any of the above-mentioned workarounds. ✅

We will be publishing the stable version soon, but if you want to test the pre-release version, you can update neptune by running pip install neptune==1.4.1rc1.

Please let us know if this works 🙏

We believe that changes introduced to better support forking processes in neptune==1.3.2 might be responsible for this. Can you please try downgrading to neptune==1.3.1 and let us know if it solves the issue?

Yeah, I can confirm this occurs to me with 1.3.2 but not with 1.3.1.

Hey all,

We have found a workaround that should work for most cases.

Instead of using fork for multiprocessing, we recommend using spawn or forkserver. spawn is already the default on macOS and Windows; forkserver is not supported on Windows.

To use spawn/forkserver, you will need to have a single entry point for your training script:

import multiprocessing

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")  # or "forkserver"
    train_fn()  # or trainer.train(), etc.

Here is a minimal example of how you can do this --> Neptune slowdown issue workaround

Note that using spawn/forkserver comes with some limitations. You can check all the limitations here.

If you do not want to make these changes, you can update the Neptune connection mode to “offline”, and then upload the data to Neptune servers using neptune sync once training is complete.
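
As a minimal sketch of the offline approach (assuming neptune.init_run and the run parameter of NeptuneCallback):

import neptune
from transformers.integrations import NeptuneCallback

# In offline mode, data is written to the local .neptune folder instead of
# being sent to the Neptune servers during training.
run = neptune.init_run(mode="offline")

# Pass the offline run to the callback and add it to the Trainer's callbacks.
neptune_callback = NeptuneCallback(run=run)

# Once training is complete, upload the locally stored data by running:
#     neptune sync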

While we work on a more permanent solution, please let me know if either of these suggestions seems like a viable workaround for now 🙏

@SiddhantSadangi I confirm that the issue does not happen when using neptune==1.3.1

Hello all,

We are having some trouble reproducing this issue. Would it be possible for anyone to share a reproducible example? Code, data sample, hardware details, etc. would help a lot 🙏

You can send these to support@neptune.ai.

I have sent a reproduction of the issue in Colab to this email.

Hello @dscarmo ,

You can use neptune clear to delete all unsynced information in the .neptune folder.

Since you also upgraded from neptune-client to neptune, I am sharing an upgrade guide that might be useful.

Please let me know if this helps.

It seems like I fixed my issue by clearing with neptune clear, uninstalling neptune, and installing neptune-client 0.16.16.

Note that neptune clear did not fix anything while still using 1.0.

I think others on this thread might fix their issues by using 0.16.16 too.

As additional information, here are screenshots of GPU usage for the exact same debug run (1 step in training and validation) with 1.0 and with 0.16.16:

[Screenshot: GPU usage with 1.0]

[Screenshot: GPU usage with 0.16.16]

Clearly something is bottlenecking the main training loop hard; maybe intensive disk operations are interfering with I/O? It might be the same bottleneck the issue OP found in their profiling.

I don’t think any checkpoints are being logged, because I always stop the training script before it reaches 3000 steps (it’s too slow). When I look at the Artifacts tab, it doesn’t show any artifacts.

Here’s the run.

Thanks @SiddhantSadangi, I tried setting report_to to none as you suggested. Here’s the code and config:

import logging

import hydra
from transformers import Trainer, TrainingArguments
from transformers.integrations import NeptuneCallback

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
        callbacks=[NeptuneCallback(log_checkpoints="best")],
    )

    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()

Config:

model:
  freeze_backbone: false
  checkpoint_path: null


dataset:
  train:
    csv_file: ${hydra:runtime.cwd}/data/train.csv
  eval:
    csv_file: ${hydra:runtime.cwd}/data/eval.csv


tokenizer:
  init:
    pretrained_model_name_or_path: "Unbabel/xlm-roberta-comet-small"
    truncation_side: "left"
  call:
    max_length: 512
    padding: longest
    truncation: true
    return_tensors: "pt"


training:
  output_dir: ${hydra:runtime.cwd}/checkpoints/${now:%Y-%m-%d}/${now:%H-%M}
  num_train_epochs: 50
  per_device_train_batch_size: 48
  per_device_eval_batch_size: 48
  learning_rate: 5e-5
  warmup_ratio: 0.1
  dataloader_num_workers: 8
  lr_scheduler_type: linear
  greater_is_better: true
  metric_for_best_model: "eval_global/f1-score"
  full_determinism: false
  no_cuda: false
  save_strategy: steps
  save_steps: 3000
  evaluation_strategy: steps
  eval_steps: 3000
  logging_strategy: steps
  logging_steps: 100
  optim: adamw_torch
  remove_unused_columns: false
  report_to: none
  load_best_model_at_end: true

The training script still seems to be slow. Here’s the output from the profiler: [profiler screenshot]

Hey @issamemari ,

I see that you have used report_to: null in your config file. It should be none instead of null. Can you update this and check if the issue is resolved?

Here’s the relevant API reference: https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/trainer#transformers.TrainingArguments.report_to
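
As a minimal sketch of the difference (output_dir here is just a placeholder):

from transformers import TrainingArguments

# The string "none" disables all integration callbacks, including Neptune.
# A YAML null becomes Python None, which transformers treats as "report to
# all installed integrations" (with a warning), not as "none".
args = TrainingArguments(output_dir="out", report_to="none")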

I am not logging model checkpoints to Neptune; I’m just saving them to disk. I tried what you suggested; here’s the code:

import logging

import hydra
from transformers import Trainer, TrainingArguments
from transformers.integrations import NeptuneCallback

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
        callbacks=[NeptuneCallback(log_checkpoints="best")],
    )

    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()

And here’s the config file:

model:
  freeze_backbone: false
  checkpoint_path: null


dataset:
  train:
    csv_file: ${hydra:runtime.cwd}/data/train.csv
  eval:
    csv_file: ${hydra:runtime.cwd}/data/eval.csv


tokenizer:
  init:
    pretrained_model_name_or_path: "Unbabel/xlm-roberta-comet-small"
    truncation_side: "left"
  call:
    max_length: 512
    padding: longest
    truncation: true
    return_tensors: "pt"


training:
  output_dir: ${hydra:runtime.cwd}/checkpoints/${now:%Y-%m-%d}/${now:%H-%M}
  num_train_epochs: 50
  per_device_train_batch_size: 48
  per_device_eval_batch_size: 48
  learning_rate: 5e-5
  warmup_ratio: 0.1
  dataloader_num_workers: 8
  lr_scheduler_type: linear
  greater_is_better: true
  metric_for_best_model: "eval_global/f1-score"
  full_determinism: false
  no_cuda: false
  save_strategy: steps
  save_steps: 3000
  evaluation_strategy: steps
  eval_steps: 3000
  logging_strategy: steps
  logging_steps: 100
  optim: adamw_torch
  remove_unused_columns: false
  report_to: null
  load_best_model_at_end: true

Here’s the output of my profiler with the Neptune callback: [profiler screenshot]

And here’s the output without the Neptune callback (I explicitly called trainer.remove_callback(NeptuneCallback)): [profiler screenshot]
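
For reference, a minimal sketch of how I removed the callback for the comparison run, reusing the trainer from the code above (Trainer.remove_callback accepts a callback class or an instance):

from transformers.integrations import NeptuneCallback

# Strip the Neptune callback before training to get the no-Neptune baseline.
trainer.remove_callback(NeptuneCallback)
trainer.train()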