datasets: ImportError: cannot import name 'SchemaInferenceError' from 'datasets.arrow_writer' (/opt/conda/lib/python3.10/site-packages/datasets/arrow_writer.py)

Describe the bug

I get the error below while running the following imports.

Code:

import os
import torch
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
from huggingface_hub import login
import pandas as pd

Error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[5], line 14
      4 from transformers import (
      5     AutoModelForCausalLM,
      6     AutoTokenizer,
   (...)
     11     logging
     12 )
     13 from peft import LoraConfig, PeftModel
---> 14 from trl import SFTTrainer
     15 from huggingface_hub import login
     16 import pandas as pd

File /opt/conda/lib/python3.10/site-packages/trl/__init__.py:21
      8 from .import_utils import (
      9     is_diffusers_available,
     10     is_npu_available,
   (...)
     13     is_xpu_available,
     14 )
     15 from .models import (
     16     AutoModelForCausalLMWithValueHead,
     17     AutoModelForSeq2SeqLMWithValueHead,
     18     PreTrainedModelWrapper,
     19     create_reference_model,
     20 )
---> 21 from .trainer import (
     22     DataCollatorForCompletionOnlyLM,
     23     DPOTrainer,
     24     IterativeSFTTrainer,
     25     PPOConfig,
     26     PPOTrainer,
     27     RewardConfig,
     28     RewardTrainer,
     29     SFTTrainer,
     30 )
     33 if is_diffusers_available():
     34     from .models import (
     35         DDPOPipelineOutput,
     36         DDPOSchedulerOutput,
     37         DDPOStableDiffusionPipeline,
     38         DefaultDDPOStableDiffusionPipeline,
     39     )

File /opt/conda/lib/python3.10/site-packages/trl/trainer/__init__.py:44
     42 from .ppo_trainer import PPOTrainer
     43 from .reward_trainer import RewardTrainer, compute_accuracy
---> 44 from .sft_trainer import SFTTrainer
     45 from .training_configs import RewardConfig

File /opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:23
     21 import torch.nn as nn
     22 from datasets import Dataset
---> 23 from datasets.arrow_writer import SchemaInferenceError
     24 from datasets.builder import DatasetGenerationError
     25 from transformers import (
     26     AutoModelForCausalLM,
     27     AutoTokenizer,
   (...)
     33     TrainingArguments,
     34 )

ImportError: cannot import name 'SchemaInferenceError' from 'datasets.arrow_writer' (/opt/conda/lib/python3.10/site-packages/datasets/arrow_writer.py)

transformers version: 4.36.2, Python version: 3.10.12, datasets version: 2.16.1

Steps to reproduce the bug

  1. Install packages
!pip install -U datasets trl accelerate peft bitsandbytes transformers trl huggingface_hub
  2. Import packages
import os
import torch
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
from huggingface_hub import login
import pandas as pd

Expected behavior

The imports should complete without any error.

Environment info

  • datasets version: 2.16.0
  • Platform: Linux-5.15.133+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.20.1
  • PyArrow version: 11.0.0
  • Pandas version: 2.1.4
  • fsspec version: 2023.10.0

About this issue

  • Original URL
  • State: closed
  • Created 6 months ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

Can you try re-installing datasets?

I tried re-installing. Still getting the same error.

In Kaggle I used:

  • %pip install -U datasets, then restarted the runtime, and then everything works fine.
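
A quick way to confirm that the upgrade actually took effect (an illustrative sketch, not part of the original thread; it assumes a notebook whose runtime has already been restarted) is to check that the loaded datasets provides the symbol trl needs:

# Illustrative check, not from the original report: after upgrading datasets and
# restarting the runtime, verify that the loaded version actually provides the
# symbol that trl's SFTTrainer imports.
import datasets

print("datasets version loaded in this runtime:", datasets.__version__)

try:
    from datasets.arrow_writer import SchemaInferenceError  # same import trl performs
    print("SchemaInferenceError found; importing trl.SFTTrainer should now work.")
except ImportError:
    print("Still on a stale datasets install; upgrade and restart the runtime.")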

Yes, this works. When I restart the runtime after installing the packages, it works perfectly. Thank you so much. But why do we need to restart the runtime every time after installing packages?

I have the same issue with datasets version 2.16.1. This is also on a Kaggle notebook; other people with the same issue also seem to be hitting it on Kaggle.

Regarding why we need to restart the runtime after installing packages: for some packages it is required. https://stackoverflow.com/questions/57831187/need-to-restart-runtime-before-import-an-installed-package-in-colab
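
As a rough illustration of why the restart is needed (a sketch based on general Python behavior, not something stated in the thread): once a package has been imported, the interpreter caches it in sys.modules, so upgrading it on disk does not change what the running kernel already loaded.

# Sketch of the version mismatch that a restart resolves. If datasets was
# imported before "pip install -U datasets" ran, the in-memory copy stays old
# even though a newer release is installed on disk.
import sys
import importlib.metadata

import datasets  # possibly the old, pre-upgrade module if it was imported earlier

print("version loaded in memory:", datasets.__version__)
print("version installed on disk:", importlib.metadata.version("datasets"))
print("cached in sys.modules:", "datasets" in sys.modules)

# When the two versions differ, restarting the runtime (a fresh Python process)
# is the reliable way to make the interpreter load the upgraded package.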