private-gpt: python ingest.py error: install pypandoc wheels with included pandoc.

macOS 13.4 / Intel i7

$ python -V
Python 3.10.11

$ pip list

Package                 Version
----------------------- -----------
aiohttp                 3.8.4
aiosignal               1.3.1
anyio                   3.6.2
argilla                 1.7.0
async-timeout           4.0.2
attrs                   23.1.0
backoff                 2.2.1
beautifulsoup4          4.12.2
certifi                 2023.5.7
cffi                    1.15.1
chardet                 5.1.0
charset-normalizer      3.1.0
chromadb                0.3.22
click                   8.1.3
clickhouse-connect      0.5.24
colorclass              2.2.2
commonmark              0.9.1
compressed-rtf          1.0.6
cryptography            40.0.2
dataclasses-json        0.5.7
Deprecated              1.2.13
duckdb                  0.7.1
easygui                 0.98.3
ebcdic                  1.1.1
et-xmlfile              1.1.0
extract-msg             0.41.1
fastapi                 0.95.1
filelock                3.12.0
frozenlist              1.3.3
fsspec                  2023.5.0
greenlet                2.0.2
h11                     0.14.0
hnswlib                 0.7.0
httpcore                0.16.3
httptools               0.5.0
httpx                   0.23.3
huggingface-hub         0.14.1
idna                    3.4
IMAPClient              2.3.1
Jinja2                  3.1.2
joblib                  1.2.0
langchain               0.0.166
lark-parser             0.12.0
llama-cpp-python        0.1.48
lxml                    4.9.2
lz4                     4.3.2
Markdown                3.4.3
MarkupSafe              2.1.2
marshmallow             3.19.0
marshmallow-enum        1.5.1
monotonic               1.6
mpmath                  1.3.0
msg-parser              1.2.0
msoffcrypto-tool        5.0.1
multidict               6.0.4
mypy-extensions         1.0.0
networkx                3.1
nltk                    3.8.1
numexpr                 2.8.4
numpy                   1.23.5
olefile                 0.46
oletools                0.60.1
openapi-schema-pydantic 1.2.4
openpyxl                3.1.2
packaging               23.1
pandas                  1.5.3
pandoc                  2.3
pcodedmp                1.2.6
pdfminer.six            20221105
Pillow                  9.5.0
pip                     23.1.2
plumbum                 1.8.1
ply                     3.11
posthog                 3.0.1
pycparser               2.21
pydantic                1.10.7
Pygments                2.15.1
pygpt4all               1.1.0
pygptj                  2.0.3
pyllamacpp              2.1.3
pypandoc                1.11
pyparsing               2.4.7
python-dateutil         2.8.2
python-docx             0.8.11
python-dotenv           1.0.0
python-magic            0.4.27
python-pptx             0.6.21
pytz                    2023.3
pytz-deprecation-shim   0.1.0.post0
PyYAML                  6.0
red-black-tree-mod      1.20
regex                   2023.5.5
requests                2.30.0
rfc3986                 1.5.0
rich                    13.0.1
RTFDE                   0.0.2
scikit-learn            1.2.2
scipy                   1.10.1
sentence-transformers   2.2.2
sentencepiece           0.1.99
setuptools              67.7.2
six                     1.16.0
sniffio                 1.3.0
soupsieve               2.4.1
SQLAlchemy              2.0.13
starlette               0.26.1
sympy                   1.12
tabulate                0.9.0
tenacity                8.2.2
threadpoolctl           3.1.0
tokenizers              0.13.3
torch                   2.0.1
torchvision             0.15.2
tqdm                    4.65.0
transformers            4.29.1
typer                   0.9.0
typing_extensions       4.5.0
typing-inspect          0.8.0
tzdata                  2023.3
tzlocal                 4.2
unstructured            0.6.5
urllib3                 2.0.2
uvicorn                 0.22.0
uvloop                  0.17.0
watchfiles              0.19.0
websockets              11.0.3
wheel                   0.40.0
wrapt                   1.14.1
XlsxWriter              3.1.0
yarl                    1.9.2
zstandard               0.21.0

$ python ingest.py

Creating new vectorstore
Loading documents from source_documents
Loading new documents:   2%|▎                   | 2/131 [00:03<03:12,  1.50s/it][nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/51pwn/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
Loading new documents:   2%|▎                   | 2/131 [00:04<05:13,  2.43s/it]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/epub.py", line 22, in _get_elements
    return partition_epub(filename=self.file_path, **self.unstructured_kwargs)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/partition/epub.py", line 24, in partition_epub
    return convert_and_partition_html(
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/partition/html.py", line 124, in convert_and_partition_html
    html_text = convert_file_to_html_text(source_format=source_format, filename=filename, file=file)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/file_utils/file_conversion.py", line 44, in convert_file_to_html_text
    html_text = convert_file_to_text(
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/file_utils/file_conversion.py", line 12, in convert_file_to_text
    text = pypandoc.convert_file(filename, target_format, format=source_format)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pypandoc/__init__.py", line 168, in convert_file
    return _convert_input(discovered_source_files, format, 'path', to, extra_args=extra_args,
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pypandoc/__init__.py", line 324, in _convert_input
    _ensure_pandoc_path()
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pypandoc/__init__.py", line 750, in _ensure_pandoc_path
    raise OSError("No pandoc was found: either install pandoc and add it\n"
OSError: No pandoc was found: either install pandoc and add it
to your PATH or or call pypandoc.download_pandoc(...) or
install pypandoc wheels with included pandoc.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
OSError: No pandoc was found: either install pandoc and add it
to your PATH or or call pypandoc.download_pandoc(...) or
install pypandoc wheels with included pandoc.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17

Most upvoted comments

Problem solved with: pip install pypandoc-binary

(No need to worry about the PATH with this command.)
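
To confirm that pypandoc can actually locate a pandoc binary before re-running ingest.py, a minimal check along these lines may help. It is only a sketch: describe_pandoc is a hypothetical helper name, and it assumes pypandoc's get_pandoc_version() and get_pandoc_path() raise OSError when no binary is found, as the traceback above suggests.

```python
def describe_pandoc():
    """Report which pandoc binary pypandoc would use, if any."""
    try:
        import pypandoc
    except ImportError:
        return "pypandoc is not installed"
    try:
        version = pypandoc.get_pandoc_version()
        path = pypandoc.get_pandoc_path()
    except OSError as exc:
        # Same error family as the "No pandoc was found" message above.
        return f"pandoc not found: {exc}"
    return f"pandoc {version} at {path}"


print(describe_pandoc())
```

If this prints a version and a path, the "No pandoc was found" error should no longer occur in the same environment.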

I ran into the same issue. You have to install pandoc and add it to your PATH.

I'm on Win10. I ran the following Python script, which downloads and runs the pandoc installer. Then add "C:\Users\Username\AppData\Local\Pandoc" to your PATH. That's where mine got installed; yours might be different.

from pypandoc.pandoc_download import download_pandoc
# see the documentation how to customize the installation path
# but be aware that you then need to include it in the `PATH`
download_pandoc()
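
If editing the system PATH is inconvenient, the directory can also be prepended for the current Python process only. The following is a sketch under that assumption: prepend_to_path is a hypothetical helper, and the directory is the example path from the comment above, so it needs to be replaced with the real install location.

```python
import os
import shutil


def prepend_to_path(directory, environ=os.environ):
    """Put directory at the front of PATH unless it is already listed."""
    parts = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        environ["PATH"] = os.pathsep.join([directory] + parts)
    return environ.get("PATH", "")


# Example location taken from the comment above; adjust to your machine.
pandoc_dir = r"C:\Users\Username\AppData\Local\Pandoc"
prepend_to_path(pandoc_dir)

# Prints the resolved pandoc executable, or None if it is still not visible.
print(shutil.which("pandoc"))
```

This only affects the running process and its children; a new terminal session would still need the PATH change made permanently.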

It runs normally for me on macOS (Intel i7). However, I found that using it to build an AI search engine still falls far short of expectations.

These are the steps I used, shared for everyone:

conda remove --name privateGPT --all -y
conda create -n privateGPT -y python=3.10
conda activate privateGPT
conda init zsh
export PATH="$HOME/anaconda3/envs/privateGPT/bin:$PATH"
which pip python
python -V
cat requirements.txt|xargs -I % pip install "%" -i https://mirror.baidu.com/pypi/simple
export ARCHFLAGS="-arch x86_64"
pip install langchain llama-cpp-python chromadb unstructured  -i https://mirror.baidu.com/pypi/simple
conda install -c conda-forge pypandoc
brew install pandoc

I've found a solution: I needed to install pandoc with brew first.

brew install pandoc

More details: https://pandoc.org/installing.html

$ python ingest.py
Creating new vectorstore
Loading documents from source_documents
Loading new documents:   3%|▌                   | 4/131 [00:06<02:43,  1.29s/it][nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/51pwn/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/51pwn/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/51pwn/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
Loading new documents:   3%|▌                   | 4/131 [00:09<04:57,  2.34s/it]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/epub.py", line 22, in _get_elements
    return partition_epub(filename=self.file_path, **self.unstructured_kwargs)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/partition/epub.py", line 24, in partition_epub
    return convert_and_partition_html(
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/partition/html.py", line 124, in convert_and_partition_html
    html_text = convert_file_to_html_text(source_format=source_format, filename=filename, file=file)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/file_utils/file_conversion.py", line 44, in convert_file_to_html_text
    html_text = convert_file_to_text(
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/file_utils/file_conversion.py", line 12, in convert_file_to_text
    text = pypandoc.convert_file(filename, target_format, format=source_format)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pypandoc/__init__.py", line 164, in convert_file
    format = _identify_format_from_path(discovered_source_files[0], format)
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
IndexError: list index out of range
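
The failing frame indexes discovered_source_files[0], so pypandoc apparently found no file for the path it was given. One possible cause, not confirmed in this thread, is an EPUB file name containing glob metacharacters such as [ or ], assuming pypandoc expands the path with glob.glob. The sketch below uses only the standard library to show that effect; find_literal and the file name are hypothetical.

```python
import glob
import os
import tempfile


def find_literal(path):
    """Return glob matches for path with glob metacharacters escaped."""
    return glob.glob(glob.escape(path))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name containing glob metacharacters.
    path = os.path.join(tmp, "book [1].epub")
    open(path, "w").close()

    # Unescaped: "[1]" is read as a character class, so the file is missed.
    print(glob.glob(path))
    # Escaped: the literal file name is matched.
    print(find_literal(path))
```

If this turns out to be the cause, renaming the offending files in source_documents so they contain no brackets, asterisks, or question marks would be the simplest workaround.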