private-gpt: Cannot ingest Chinese text file

Hi,

I’ve tried the following command: python ingest someTChinese.txt

However, it returns the error shown below. Any suggestions? I tried modifying the parameters (chunk_size=500, chunk_overlap=50), but nothing worked.
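One likely culprit (an assumption, not confirmed in this thread): `chunk_size` in the splitter counts *characters*, while llama.cpp's limit (`n_ctx = 512` in the log below) counts *tokens*, and a single CJK character often maps to 2–3 llama tokens. A 500-character Chinese chunk can therefore overflow the token buffer, which is consistent with `llama_tokenize` reporting a negative `n_tokens`. A minimal sketch of a token-budget-aware splitter, using a pessimistic tokens-per-character estimate (the function and the 3-tokens-per-CJK-char figure are illustrative, not part of privateGPT):

```python
def split_by_token_budget(text, max_tokens=512, tokens_per_cjk_char=3):
    """Greedily split text so each chunk stays under an estimated token budget.

    Assumes (pessimistically) that every CJK character costs
    `tokens_per_cjk_char` llama tokens and everything else costs 1.
    """
    chunks, current, budget = [], [], 0
    for ch in text:
        cost = tokens_per_cjk_char if ord(ch) > 0x2E7F else 1
        if budget + cost > max_tokens and current:
            chunks.append("".join(current))
            current, budget = [], 0
        current.append(ch)
        budget += cost
    if current:
        chunks.append("".join(current))
    return chunks

# With the 3x estimate, chunks of Chinese text stay at ~170 characters,
# safely inside a 512-token context window.
chunks = split_by_token_budget("密宗" * 400, max_tokens=512)
```

The same effect can be had by simply dropping `chunk_size` to ~150–170 for Chinese text, which is the quickest thing to try before swapping models.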

llama.cpp: loading model from ./models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size  =  512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
Using embedded DuckDB with persistence: data will be stored in: db
llama_tokenize: too many tokens
Traceback (most recent call last):
  File "/Users/jmosx/devp/allAboutGPT/privateGPT/ingest.py", line 22, in <module>
    main()
  File "/Users/jmosx/devp/allAboutGPT/privateGPT/ingest.py", line 17, in main
    db = Chroma.from_documents(texts, llama, persist_directory=persist_directory)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 412, in from_documents
    return cls.from_texts(
           ^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 380, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 158, in add_texts
    embeddings = self._embedding_function.embed_documents(list(texts))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/embeddings/llamacpp.py", line 111, in embed_documents
    embeddings = [self.client.embed(text) for text in texts]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/embeddings/llamacpp.py", line 111, in <listcomp>
    embeddings = [self.client.embed(text) for text in texts]
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/llama_cpp/llama.py", line 514, in embed
    return list(map(float, self.create_embedding(input)["data"][0]["embedding"]))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/llama_cpp/llama.py", line 478, in create_embedding
    tokens = self.tokenize(input.encode("utf-8"))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/llama_cpp/llama.py", line 200, in tokenize
    raise RuntimeError(f'Failed to tokenize: text="{text}" n_tokens={n_tokens}')
RuntimeError: Failed to tokenize: text="b'\xe7\x8b\x82\xe5\xaf\x86\xe8\x88\x87\xe7\x9c\x9f\xe5\xaf\x86\xe7\xac\xac\xe4\xb8\x80\xe8\xbc\xaf\n\n\xe3\x80\x80\n\n\xe5\xb9\xb3\xe5\xaf\xa6\xe5\xb1\x85\xe5\xa3\xab\xe8\x91\x97\n\n \xe6\x9c\x89\xe5\xb8\xab\xe4\xba\x91\xef\xbc\x9a\n\n   \xe3\x80\x8c\xe5\xaf\x86\xe5\xae\x97\xe6\x98\xaf\xe4\xb8\x80\xe5\x80\x8b\xe9\x87\x91\xe5\x89\x9b\xe9\x91\xbd\xe5\xa4\x96\xe5\x9c\x8d\xe6\x93\xba\xe6\xbb\xbf\xe4\xba\x86\xe9\x8d\x8d\xe9\x87\x91\xe5\x9e\x83\xe5\x9c\xbe\xe7\x9a\x84\xe5\xae\x97\xe6\x95\x99\xe3\x80\x82\xe3\x80\x8d\n\n\xe3\x80\x80\n\n \xe5\xb9\xb3\xe5\xaf\xa6\xe6\x96\xbc\xe6\xad\xa4\xe8\xa8\x80\xe5\xbe\x8c\xe6\x9b\xb4\xe5\x8a\xa0\xe8\xa8\xbb\xe8\x85\xb3\xef\xbc\x9a\n\n   \xe3\x80\x8c\xe9\x82\xa3\xe9\xa1\x86\xe9\x87\x91\xe5\x85\x89\xe9\x96\x83\xe9\x96\x83\xe7\x9a\x84\xe9\x91\xbd\xe7\x9f\xb3\xef\xbc\x8c\xe5\x8d\xbb\xe6\x98\xaf\xe7\x8e\xbb\xe7\x92\x83\xe6\x89\x93\xe7\xa3\xa8\xe8\x80\x8c\xe6\x88\x90\xef\xbc\x8c\xe4\xb8\x8d\xe5\xa0\xaa\xe6\xaa\xa2\xe9\xa9\x97\xe3\x80\x82\xe3\x80\x8d\n\n\xe3\x80\x80\n\n\xe8\xb7\x8b\n\n\xe6\x95\xb8\xe5\x8d\x81\xe5\xb9\xb4\xe4\xbe\x86\xef\xbc\x8c\xe8\xa5\xbf\xe8\x97\x8f\xe5\xaf\x86\xe5\xae\x97\xe7\xb3\xbb\xe7\xb5\xb1\xe8\xab\xb8\xe6\xb3\x95\xe7\x8e\x8b\xe8\x88\x87\xe8\xab\xb8\xe4\xb8\x8a\xe5\xb8\xab\xef\xbc\x8c\xe6\x82\x89\xe7\x9a\x86\xe5\x92\x90\xe5\x9b\x91\xe8\xab\xb8\xe5\xbc\x9f\xe5\xad\x90\xef\xbc\x9a\xe3\x80\x8c\xe8\x8b\xa5\xe6\x9c\x89\xe5\xa4\x96\xe4\xba\xba\xe8\xa9\xa2\xe5\x95\x8f\xe5\xaf\x86\xe5\xae\x97\xe6\x98\xaf\xe5\x90\xa6\xe6\x9c\x89\xe9\x9b\x99\xe8\xba\xab\xe4\xbf\xae\xe6\xb3\x95\xef\xbc\x9f\xe6\x87\x89\xe4\xb8\x80\xe5\xbe\x8b\xe7\xad\x94\xe5\xbe\xa9\xef\xbc\x9a\xe3\x80\x8e\xe5\x8f\xa4\xe6\x99\x82\xe5\xaf\x86\xe5\xae\x97\xe6\x9c\x89\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe4\xbd\x86\xe7\x8f\xbe\xe5\x9c\xa8\xe5\xb7\xb2\xe7\xb6\x93\xe5\xbb\xa2\xe6\xa3\x84\xe4\xb8\x8d\xe7\x94\xa8\xe3\x80\x82\xe7\x8f\xbe\xe5\x9c\xa8\xe4\xb8\x8a\xe5\xb8\xab\xe9\x83\xbd\xe5\x9a\xb4\xe7\xa6\x81\xe5\xbc\x9f\x
e5\xad\x90\xe5\x80\x91\xe4\xbf\xae\xe5\xad\xb8\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe6\x89\x80\xe4\xbb\xa5\xe7\x8f\xbe\xe5\x9c\xa8\xe7\x9a\x84\xe5\xaf\x86\xe5\xae\x97\xe5\xb7\xb2\xe7\xb6\x93\xe6\xb2\x92\xe6\x9c\x89\xe4\xba\xba\xe5\xbc\x98\xe5\x82\xb3\xe5\x8f\x8a\xe4\xbf\xae\xe5\xad\xb8\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe4\xba\x86\xe3\x80\x82\xe3\x80\x8f\xe3\x80\x8d\n\n\xe7\x84\xb6\xe8\x80\x8c\xe5\xaf\xa6\xe9\x9a\x9b\xe4\xb8\x8a\xef\xbc\x8c\xe8\xa5\xbf\xe8\x97\x8f\xe5\xaf\x86\xe5\xae\x97\xe4\xbb\x8d\xe6\x9c\x89\xe7\x94\x9a\xe5\xa4\x9a\xe4\xb8\x8a\xe5\xb8\xab\xe7\xb9\xbc\xe7\xba\x8c\xe5\xbc\x98\xe5\x82\xb3\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe4\xb8\xa6\xe9\x9d\x9e\xe6\x9c\xaa\xe5\x82\xb3\xef\xbc\x8c\xe4\xba\xa6\xe9\x9d\x9e\xe7\xa6\x81\xe5\x82\xb3\xef\xbc\x8c\xe8\x80\x8c\xe6\x98\xaf\xe7\xb9\xbc\xe7\xba\x8c\xe5\x9c\xa8\xe7\x89\xa9\xe8\x89\xb2\xe9\x81\xa9\xe5\x90\x88\xe4\xbf\xae\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe4\xb9\x8b\xe5\xbc\x9f\xe5\xad\x90\xef\xbc\x8c\xe7\x82\xba\xe5\xbd\xbc\xe7\xad\x89\xe8\xab\xb8\xe4\xba\xba\xe6\x9a\x97\xe4\xb8\xad\xe4\xbd\x9c\xe7\xa7\x98\xe5\xaf\x86\xe7\x81\x8c\xe9\xa0\x82\xef\xbc\x8c\xe4\xbb\xa4\xe5\x85\xb6\xe6\x88\x90\xe7\x82\xba\xe5\x8b\x87\xe7\x88\xb6\xe8\x88\x87\xe7\xa9\xba\xe8\xa1\x8c\xe6\xaf\x8d\xef\xbc\x8c\xe7\x84\xb6\xe5\xbe\x8c\xe5\x90\x88\xe4\xbf\xae\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe3\x80\x82\xe4\xbb\x8a\xe6\x99\x82\xe5\xaf\x86\xe5\xae\x97\xe5\xa6\x82\xe6\x98\xaf\xe7\xb9\xbc\xe7\xba\x8c\xe6\x9a\x97\xe4\xb8\xad\xe5\xbc\x98\xe5\x82\xb3\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe4\xbb\xa5\xe5\xbb\xb6\xe7\xba\x8c\xe5\x85\xb6\xe5\xaf\x86\xe6\xb3\x95\xe4\xb9\x8b\xe5\x91\xbd\xe8\x84\x88\xef\xbc\x9b\xe7\x9d\xbd\xe4\xba\x8e\xe9\x9b\xbb\xe8\xa6\x96\xe6\x96\xb0\xe8\x81\x9e\xe4\xb9\x8b\xe5\xa0\xb1\xe5\xb0\x8e\xe9\x99\xb3\xe5\xb1\xa5\xe5\xae\x89\xe5\x85\xac\xe5\xad\x90\xe9\x99\xb3\xe5\xae\x87\xe5\xbb\xb7\xef\xbc\x8c\xe6\xb1\x82\xe8\x97\x8f\xe5\xa5\xb3\xe7\x82\xba\xe5\x85\xb6\xe7\xa9\xba\xe8\xa1\x8c\xe6\x
af\x8d\xe7\xad\x89\xe8\xa8\x80\xef\xbc\x8c\xe4\xba\xa6\xe5\x8f\xaf\xe7\x9f\xa5\xe7\x9f\xa3\xe3\x80\x82\xe6\x98\xaf\xe6\x95\x85\xe5\xaf\x86\xe5\xae\x97\xe5\xbc\x9f\xe5\xad\x90\xe8\x88\x87\xe4\xb8\x8a\xe5\xb8\xab\xe7\xad\x89\xe4\xba\xba\xef\xbc\x8c\xe9\x9b\x96\xe7\x84\xb6\xe5\xb0\x8d\xe5\xa4\x96\xe5\x8f\xa3\xe5\xbe\x91\xe4\xb8\x80\xe8\x87\xb4\xef\xbc\x8c\xe6\x82\x89\xe7\x9a\x86\xe5\x80\xa1\xe8\xa8\x80\xef\xbc\x9a\xe3\x80\x8c\xe5\xaf\x86\xe5\xae\x97\xe7\x8f\xbe\xe5\x9c\xa8\xe5\xb7\xb2\xe6\x8d\xa8\xe6\xa3\x84\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe4\xbb\x8a\xe5\xb7\xb2\xe7\x84\xa1\xe4\xba\xba\xe5\xbc\x98\xe5\x82\xb3\xe6\x88\x96\xe4\xbf\xae\xe5\xad\xb8\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe3\x80\x82\xe3\x80\x8d\xe5\x85\xb6\xe5\xaf\xa6\xe6\x98\xaf\xe6\xac\xba\xe7\x9e\x9e\xe7\xa4\xbe\xe6\x9c\x83\xe4\xb9\x8b\xe8\xa8\x80\xef\xbc\x8c\xe4\xbb\xa5\xe6\xad\xa4\xe4\xbb\xa4\xe7\xa4\xbe\xe6\x9c\x83\xe5\xa4\xa7\xe7\x9c\xbe\xe5\xb0\x8d\xe5\xaf\x86\xe5\xae\x97\xe4\xb8\x8d\xe8\x87\xb4\xe5\x8a\xa0\xe4\xbb\xa5\xe5\xa4\xaa\xe5\xa4\x9a\xe4\xb9\x8b\xe6\xb3\xa8\xe6\x84\x8f\xef\xbc\x8c\xe4\xbb\xa5\xe4\xbe\xbf\xe7\xb9\xbc\xe7\xba\x8c\xe6\x9a\x97\xe4\xb8\xad\xe5\xbc\x98\xe5\x82\xb3\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe3\x80\x82'" n_tokens=-723

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (2 by maintainers)

Most upvoted comments

GanymedeNil/text2vec-large-chinese

It’s a derivative of shibing624/text2vec-base-chinese that replaces MacBERT with LERT while keeping the other training conditions unchanged.

“shibing624/text2vec-base-chinese” is the current SOTA sentence-embedding model for Chinese.

Try FAISS and see if it works.

On Thu, 18 May 2023 at 01:18, hnuzhoulin @.***> wrote:

“shibing624/text2vec-base-chinese” is the SOTA sentence embedding model for Chinese vector so far.

When I use this, the last error line is:

chromadb.errors.InvalidDimensionException: Dimensionality of (768) does not match index dimensionality (384)
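A plausible reading of this error (an assumption based on the log above, where Chroma persists to `db`): the existing index was built with a 384-dim embedding model, while text2vec-base-chinese produces 768-dim vectors, and Chroma cannot mix the two in one collection. Rebuilding the index from scratch after switching models should clear it:

```shell
rm -rf db   # delete the stale 384-dim Chroma index (persist dir from the log)
# then re-ingest so every vector comes from the new 768-dim model, e.g.:
# python ingest.py someTChinese.txt
```

Note this discards everything previously ingested; all documents must be re-ingested with the new embedding model.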

— https://github.com/imartinez/privateGPT/issues/19#issuecomment-1551786347