langchain: Error "InvalidArgument: 400 Request" when following the tutorial for Document AI

System Info

I’m running this on a local Windows 10 machine with the Spyder 5.2.1 IDE, Anaconda package management, and Python 3.10.

Who can help?

@leo-gan @holtskinner

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

Hi,

I’ve just started learning to code with Python and to work with LLMs. I’m following the tutorial for setting up LangChain with Google Document AI, and I’m getting the error “InvalidArgument: 400 Request contains an invalid argument.” on this line of code: docs = list(parser.lazy_parse(blob))

Here are the things I’ve tried so far:

  • Setting up gcloud ADC so I can run this as an authorized session; the code wouldn’t work otherwise
  • Set the permissions on the GCS bucket to Storage Admin so I can read/write
  • Had ChatGPT write a test to check whether the current credentials are working; they are
  • Had ChatGPT write a test to check whether the DocAIParser object is working; it is

I think there’s some issue with the output path for “lazy_parse”, but I can’t get it to work. I’ve looked at the documentation, but I can’t tell whether I’m missing something. How do I get this working?

See full code and full error message below:


import pprint

from google.auth.transport.requests import AuthorizedSession
from google.auth import default
from google.cloud import documentai
from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.parsers import DocAIParser

PROJECT = "[replace with project name]"
GCS_OUTPUT_PATH = "gs://[replace with bucket path]"
PROCESSOR_NAME = "https://us-documentai.googleapis.com/v1/projects/[replace with processor name]"

# Get the credentials object using ADC.
credentials, _ = default()
session = AuthorizedSession(credentials=credentials)

# Create a Document AI client object.
client = documentai.DocumentProcessorServiceClient(credentials=credentials)

"""Tests if the current credentials are working in gcloud."""
import google.auth

def test_credentials():
  

  try:
    # Try to authenticate to the Google Cloud API.
    google.auth.default()
    print("Credentials are valid.")
  except Exception as e:
    print("Credentials are invalid:", e)

if __name__ == "__main__":
  test_credentials()

import logging
from google.cloud import documentai

# Set up logging
logging.basicConfig(level=logging.DEBUG) 

# Create DocumentAI client
client = documentai.DocumentProcessorServiceClient()

# Print out actual method call  
logging.debug("Calling client.batch_process_documents(%s, %s)", 
              PROCESSOR_NAME, GCS_OUTPUT_PATH)

"""Test of DocAIParser object is working"""
# Try to create a DocAIParser object.
try:
  parser = DocAIParser(
      processor_name=PROCESSOR_NAME,
      gcs_output_path=GCS_OUTPUT_PATH,
      client=client,
  )

  # If the DocAIParser object was created successfully, then Google is accepting the parameters.
  print("Google is accepting the parameters.")

except Exception as e:
  # If the DocAIParser object fails to be created, then Google is not accepting the parameters.
  print("Google is not accepting the parameters:", e)

parser = DocAIParser(
    processor_name=PROCESSOR_NAME,
    gcs_output_path=GCS_OUTPUT_PATH,
    client=client,
    )

blob = Blob(path="gs://foia_doc_bucket/input/goog-exhibit-99-1-q1-2023-19.pdf")
docs = list(parser.lazy_parse(blob))

print(len(docs))

Full error message:

DEBUG:google.auth._default:Checking None for explicit credentials as part of auth process...
DEBUG:google.auth._default:Checking Cloud SDK credentials as part of auth process...
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
Traceback (most recent call last):

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\grpc_helpers.py", line 75, in error_remapped_callable
    return callable_(*args, **kwargs)

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\grpc\_channel.py", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\grpc\_channel.py", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INVALID_ARGUMENT
	details = "Request contains an invalid argument."
	debug_error_string = "UNKNOWN:Error received from peer ipv4:142.250.31.95:443 {created_time:"2023-10-04T19:56:49.9162929+00:00", grpc_status:3, grpc_message:"Request contains an invalid argument."}"
>


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "C:\Users\Inspiron 15 amd 5505\Dropbox\[...]\local_doc_upload.py", line 80, in <module>
    docs = list(parser.lazy_parse(blob))

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\langchain\document_loaders\parsers\docai.py", line 91, in lazy_parse
    yield from self.batch_parse([blob], gcs_output_path=self._gcs_output_path)

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\langchain\document_loaders\parsers\docai.py", line 122, in batch_parse
    operations = self.docai_parse(blobs, gcs_output_path=output_path)

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\langchain\document_loaders\parsers\docai.py", line 268, in docai_parse
    operations.append(self._client.batch_process_documents(request))

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\cloud\documentai_v1\services\document_processor_service\client.py", line 786, in batch_process_documents
    response = rpc(

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\gapic_v1\method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\retry.py", line 366, in retry_wrapped_func
    return retry_target(

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\retry.py", line 204, in retry_target
    return target()

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\timeout.py", line 120, in func_with_timeout
    return func(*args, **kwargs)

  File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\grpc_helpers.py", line 77, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc

InvalidArgument: 400 Request contains an invalid argument.

Expected behavior

It is supposed to output “11”, based on the number of pages in this PDF, per the Document AI tutorial.


Most upvoted comments

🤖

Hello,

Thank you for providing detailed information about the issue you’re facing. The error message “InvalidArgument: 400 Request contains an invalid argument” suggests that one or more of the arguments you’re passing to the DocAIParser or batch_process_documents method is incorrect.

From the context provided, the processor_name parameter in the DocAIParser object is expected to be a string representing the name of a processor. The gcs_output_path parameter is also expected to be a string representing a path on Google Cloud Storage (GCS) where the parsing results will be stored.

Looking at your code, you have placeholders for these values:

PROJECT = "[replace with project name]"
GCS_OUTPUT_PATH = "gs://[replace with bucket path]"
PROCESSOR_NAME = "https://us-documentai.googleapis.com/v1/projects/[replace with processor name]"

Please ensure that you have replaced these placeholders with the actual values. The processor_name should be the name of the processor you have created in the Google Cloud Console. It should be in the format projects/{project_id}/locations/{location}/processors/{processor_id}. The gcs_output_path should be the path to a bucket in your Google Cloud Storage, in the format gs://{bucket_name}/{optional_subdirectory}.
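
For reference, correctly formatted values look roughly like this; the project number, processor ID, and bucket name below are hypothetical placeholders, not real resources. Note that the processor name is the resource path only, not the full https://... endpoint URL:

PROCESSOR_NAME = "projects/123456789012/locations/us/processors/abcdef0123456789"
GCS_OUTPUT_PATH = "gs://my-docai-bucket/docai-output/"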

If you have already replaced these placeholders with the correct values, then the issue might be with the permissions or the format of the PDF file you’re trying to parse. Please ensure that the service account you’re using has the necessary permissions to access the Document AI API, the GCS bucket, and the specific PDF file.
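
As a quick way to narrow this down, you could fetch the processor and the output bucket directly with the same credentials and see which call fails. This is only a diagnostic sketch, assuming ADC is configured, google-cloud-storage is installed, and the PROCESSOR_NAME and GCS_OUTPUT_PATH variables from your script are in scope:

from google.cloud import documentai, storage

# If the processor name is malformed (e.g. a full URL instead of a resource path),
# this call fails with InvalidArgument, which would pinpoint the problem.
docai_client = documentai.DocumentProcessorServiceClient()
processor = docai_client.get_processor(name=PROCESSOR_NAME)
print("Processor OK:", processor.display_name)

# If the bucket name is wrong or the credentials lack access, this prints False or raises.
bucket_name = GCS_OUTPUT_PATH.removeprefix("gs://").split("/", 1)[0]
print("Output bucket reachable:", storage.Client().bucket(bucket_name).exists())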

I hope this helps! If you’re still facing issues, please provide more information about the processor you’re using and the permissions of your service account.

Best, Dosu


🤖

Hello,

Thank you for providing a detailed description of your issue. The “InvalidArgument: 400 Request contains an invalid argument” error usually occurs when the request made to the Google Document AI service contains an argument that is not valid or not correctly formatted.

In your case, the error could be due to several reasons:

  1. The blob argument passed to the lazy_parse method is not correctly formatted or does not exist.
  2. The gcs_output_path provided during the initialization of the DocAIParser instance is not correctly formatted or does not exist.
  3. The processor_name provided during the initialization of the DocAIParser instance is not correctly formatted or does not exist.

To resolve this issue, you should:

  1. Ensure that the blob argument passed to the lazy_parse method is a valid Blob instance and that the file it points to exists (a quick existence check is sketched just after this list).
  2. Ensure that the gcs_output_path provided during the initialization of the DocAIParser instance is a valid Google Cloud Storage path and that you have the necessary permissions to write to this path.
  3. Ensure that the processor_name provided during the initialization of the DocAIParser instance is a valid processor name and that you have the necessary permissions to use this processor.
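
For point 1, a minimal existence check against GCS could look like the following. This is only a sketch, assuming google-cloud-storage is installed; the bucket and object names are taken from the path used in the Blob(...) call above:

from google.cloud import storage

# Check that the input PDF referenced by the Blob path actually exists and is readable
# with the current credentials.
client = storage.Client()
pdf = client.bucket("foia_doc_bucket").blob("input/goog-exhibit-99-1-q1-2023-19.pdf")
print("Input PDF exists:", pdf.exists())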

Here is an example of how to correctly initialize a DocAIParser instance and call the lazy_parse method:

from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.parsers import DocAIParser

# Initialize the Blob
blob = Blob(path="gs://vertex-pgt/examples/goog-exhibit-99-1-q1-2023-19.pdf")

# Initialize the DocAIParser
parser = DocAIParser(
    location="us",
    processor_name="projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID",
    gcs_output_path="gs://my-bucket/output/"
)

# Call the lazy_parse method
docs = list(parser.lazy_parse(blob))

In this example, replace "gs://vertex-pgt/examples/goog-exhibit-99-1-q1-2023-19.pdf", "gs://my-bucket/output/", and "projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID" with your actual file path, output path, and processor name, respectively.

Also, ensure that you have installed the necessary libraries (google-cloud-documentai and google-cloud-documentai-toolbox) and set up a GCS bucket and created your own OCR processor as mentioned in the context shared.

I hope this helps! If you have any other questions, feel free to ask.
