neptune-client: BUG: Neptune v1.3.2 is slow when using forked data loaders on Linux
Describe the bug
I have a training script that uses the Trainer class from the HuggingFace Transformers library to train a text classification model. The Trainer class uses a NeptuneCallback to log training metrics to Neptune.
I noticed that my training script was inexplicably slow, so I profiled it with py-spy and found that neptune/internal/threading/daemon.py accounts for more than 90% of my script's execution time. When I tried disabling the Neptune callback, I saw a dramatic improvement in training speed.
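For reference, this is roughly how the comparison run without Neptune was set up (a sketch, not my exact script; ./out is a placeholder output directory):

from transformers import TrainingArguments
from transformers.integrations import NeptuneCallback

# Option 1: turn off all reporting integrations (including Neptune) up front.
training_args = TrainingArguments(output_dir="./out", report_to="none")

# Option 2: keep the default arguments but drop the auto-added callback
# from an existing Trainer instance before training:
#   trainer.remove_callback(NeptuneCallback)
#   trainer.train()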
Reproduction
Here’s my training code. I have the Neptune client installed, with the project name and API key set in environment variables, so the Trainer automatically sends logs to Neptune via the NeptuneCallback class.
import logging

import hydra
from transformers import Trainer, TrainingArguments

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
    )
    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()
Expected behavior
I expect logging to Neptune to not impact the speed of my training script.
Environment
The output of pip list:
Package Version
-------------------------- ---------
accelerate 0.20.3
aiobotocore 2.5.2
aiohttp 3.8.4
aiohttp-retry 2.8.3
aioitertools 0.11.0
aiosignal 1.3.1
amqp 5.1.1
annotated-types 0.5.0
antlr4-python3-runtime 4.9.3
appdirs 1.4.4
appnope 0.1.3
arrow 1.2.3
asn1crypto 1.5.1
asttokens 2.2.1
async-timeout 4.0.2
asyncssh 2.13.2
atpublic 4.0
attrs 23.1.0
backcall 0.2.0
backoff 2.2.1
beautifulsoup4 4.12.2
billiard 4.1.0
boto3 1.26.161
botocore 1.29.161
bravado 11.0.3
bravado-core 5.17.1
celery 5.3.1
certifi 2023.5.7
cffi 1.15.1
charset-normalizer 3.2.0
click 8.1.4
click-didyoumean 0.3.0
click-plugins 1.1.1
click-repl 0.3.0
colorama 0.4.6
comm 0.1.3
configobj 5.0.8
cryptography 40.0.2
debugpy 1.6.7
decorator 5.1.1
dictdiffer 0.9.0
diskcache 5.6.1
distro 1.8.0
dpath 2.1.6
dulwich 0.21.5
dvc 3.5.1
dvc-data 2.5.0
dvc-http 2.30.2
dvc-objects 0.23.0
dvc-render 0.5.3
dvc-s3 2.23.0
dvc-studio-client 0.11.0
dvc-task 0.3.0
executing 1.2.0
filelock 3.12.2
flatten-dict 0.4.2
flufl.lock 7.1.1
fqdn 1.5.1
frozenlist 1.3.3
fsspec 2023.6.0
funcy 2.0
future 0.18.3
gitdb 4.0.10
GitPython 3.1.32
grandalf 0.8
huggingface-hub 0.16.4
hydra-core 1.3.2
idna 3.4
ipykernel 6.24.0
ipython 8.14.0
isoduration 20.11.0
iterative-telemetry 0.0.8
jedi 0.18.2
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.3.1
jsonpointer 2.4
jsonref 1.1.0
jsonschema 4.18.0
jsonschema-specifications 2023.6.1
jupyter_client 8.3.0
jupyter_core 5.3.1
kombu 5.3.1
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
mdurl 0.1.2
monotonic 1.6
mpmath 1.3.0
msgpack 1.0.5
multidict 6.0.4
nanotime 0.5.2
neptune 1.3.2
nest-asyncio 1.5.6
networkx 3.1
numpy 1.25.1
oauthlib 3.2.2
omegaconf 2.3.0
orjson 3.9.2
oscrypto 1.3.0
packaging 23.1
pandas 2.0.3
parso 0.8.3
pathspec 0.11.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.0.0
pip 23.1.2
platformdirs 3.8.1
prompt-toolkit 3.0.39
protobuf 3.19.6
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 10.0.1
pycparser 2.21
pycryptodomex 3.18.0
pydantic 2.0.2
pydantic_core 2.1.2
pydot 1.4.2
pygit2 1.12.2
Pygments 2.15.1
pygtrie 2.5.0
PyJWT 2.7.0
pyOpenSSL 23.2.0
pyparsing 3.1.0
python-dateutil 2.8.2
pytz 2023.3
PyYAML 6.0
pyzmq 25.1.0
referencing 0.29.1
regex 2023.6.3
requests 2.31.0
requests-oauthlib 1.3.1
rfc3339-validator 0.1.4
rfc3987 1.3.8
rich 13.4.2
rpds-py 0.8.10
ruamel.yaml 0.17.32
ruamel.yaml.clib 0.2.7
s3fs 2023.6.0
s3transfer 0.6.1
safetensors 0.3.1
scikit-learn 1.3.0
scipy 1.11.1
scmrepo 1.0.4
sentencepiece 0.1.99
setuptools 68.0.0
shortuuid 1.0.11
shtab 1.6.2
simplejson 3.19.1
six 1.16.0
smmap 5.0.0
snowflake-connector-python 3.0.4
sortedcontainers 2.4.0
soupsieve 2.4.1
sqltrie 0.7.0
stack-data 0.6.2
swagger-spec-validator 3.0.3
sympy 1.12
tabulate 0.9.0
threadpoolctl 3.1.0
tokenizers 0.13.3
tomlkit 0.11.8
torch 2.0.1
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.30.2
typing_extensions 4.7.1
tzdata 2023.3
uri-template 1.3.0
urllib3 1.26.16
vine 5.0.0
voluptuous 0.13.1
wcwidth 0.2.6
webcolors 1.13
websocket-client 1.6.1
wheel 0.40.0
wrapt 1.15.0
yarl 1.9.2
zc.lockfile 3.0.post1
The operating system you’re using: Linux
The output of python --version:
Python 3.11.3
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 34 (18 by maintainers)
We just released v1.4.1, which fixes this 🥳
I’ll be closing this issue now, but please feel free to let us know if you are still facing issues.
Hello all 👋
We have identified the root cause of this issue and have pushed a fix in v1.4.1rc1 🎉 With this fix, you will not need to resort to any of the above-mentioned workarounds. ✅
We will be publishing the stable version soon, but if you want to test the pre-release version, you can update neptune by running pip install neptune==1.4.1rc1. Please let us know if this works 🙏
Yeah, I can confirm this happens for me with 1.3.2 but not with 1.3.1.
Hey all, we have found a workaround that should work for most cases.
Instead of using fork for multiprocessing, we recommend using spawn or forkserver. spawn is already the default on macOS and Windows; forkserver is not supported on Windows.
To use spawn/forkserver, you will need a single entry point for your training script. Here is a minimal example of how you can do this --> Neptune slowdown issue workaround (an illustrative sketch also follows below).
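An illustrative sketch of such an entry point (not the linked example; it assumes the forked processes are PyTorch DataLoader workers started from the training script):

import multiprocessing as mp

def main():
    # Build the datasets, model, and Trainer exactly as in the
    # reproduction script above, then call trainer.train().
    ...

if __name__ == "__main__":
    # Switch the default start method from Linux's "fork" to "spawn"
    # (or "forkserver") before any worker processes are created.
    mp.set_start_method("spawn", force=True)
    main()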
Note that using spawn/forkserver comes with some limitations. You can check all the limitations here.
If you do not want to make these changes, you can switch the Neptune connection mode to “offline” and then upload the data to the Neptune servers using neptune sync once training is complete (a sketch follows below).
While we work on a more permanent solution, please let me know if either of these suggestions seems like a viable workaround for now 🙏
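A sketch of the offline-mode variant (assuming the NeptuneCallback is created explicitly and handed a run rather than auto-added by the Trainer; setting the NEPTUNE_MODE environment variable to offline should achieve the same thing):

import neptune
from transformers.integrations import NeptuneCallback

# Create the run in offline mode; data is written to local storage only.
run = neptune.init_run(mode="offline")
neptune_callback = NeptuneCallback(run=run)

# Pass the callback explicitly, e.g.
#   Trainer(..., callbacks=[neptune_callback], args=TrainingArguments(..., report_to="none"))
# Once training is complete, upload the local data with the CLI:
#   neptune sync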
@SiddhantSadangi I confirm that the issue does not happen when using neptune==1.3.1
I have sent a reproduction of the issue in Colab to this e-mail address.
It seems like I fixed my issue by clearing with neptune clear, uninstalling Neptune, and installing neptune-client 0.16.16.
Note that neptune clear on its own did not fix anything while still on 1.0.
I think others on this thread might fix their issues by downgrading to 0.16.16 too.
As additional information, here are screenshots of GPU usage from the exact same debug run (1 step in training and validation) with 1.0 and with 0.16.16:
[screenshots: GPU usage with 1.0 and with 0.16.16]
Clearly something is heavily bottlenecking the main training loop, maybe intensive disk operations interfering with I/O? It might be the same bottleneck the issue OP found in their profiling.
I don’t think any checkpoints are being logged, because I always stop the training script before it reaches 3000 steps (it’s too slow). When I look at the Artifacts tab, it doesn’t show me any artifacts.
Here’s the run.
Thanks @SiddhantSadangi, I tried setting report_to to none as you suggested. Here’s the code and config:
Config:
The training script still seems to be slow; here’s the output from the profiler:
Hey @issamemari,
I see that you have used report_to: null in your config file. It should be none instead of null: YAML null is loaded as Python None, which TrainingArguments treats as “report to all installed integrations”, whereas the string none disables reporting. Can you update this and check if the issue is resolved?
Here’s the relevant API reference: https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/trainer#transformers.TrainingArguments.report_to
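For context, here is a minimal sketch of how the two values behave (based on the TrainingArguments reference linked above; ./out is just a placeholder):

from transformers import TrainingArguments

# YAML `null` is loaded as Python None, which TrainingArguments expands to
# "all installed integrations" -- so Neptune logging stays enabled.
args_null = TrainingArguments(output_dir="./out", report_to=None)

# The string "none" is what actually disables the reporting integrations.
args_none = TrainingArguments(output_dir="./out", report_to="none")

print(args_null.report_to)  # e.g. ["neptune", ...], depending on what is installed
print(args_none.report_to)  # []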
I am not logging model checkpoints to Neptune; I’m just saving them to disk. I tried what you suggested; here’s the code:
And here’s the config file:
Here’s the output of my profiler with the Neptune callback:
And here’s the output without the Neptune callback (I explicitly called trainer.remove_callback(NeptuneCallback)):