langchain: Error "InvalidArgument: 400 Request" by following tutorial for Document AI
System Info
I’m running this on a local machine Windows 10, Spyder 5.2.1 IDE with Anaconda package management, using python 3.10.
Who can help?
Information
- The official example notebooks/scripts
- My own modified scripts
Related Components
- LLMs/Chat Models
- Embedding Models
- Prompts / Prompt Templates / Prompt Selectors
- Output Parsers
- Document Loaders
- Vector Stores / Retrievers
- Memory
- Agents / Agent Executors
- Tools / Toolkits
- Chains
- Callbacks/Tracing
- Async
Reproduction
Hi,
I’ve just started learning to code with Python and to work with LLMs. I’m following the tutorial for setting up LangChain with Google Document AI, and I’m getting the error “InvalidArgument: 400 Request contains an invalid argument.” on this line of code:
docs = list(parser.lazy_parse(blob))
Here are the things I’ve tried so far:
- Setting up gcloud ADC so I can run this as an authorized session; the code wouldn’t work otherwise
- Setting the permission on the GCS bucket to Storage Admin so I can read/write
- ChatGPT wrote a test to see if the current credentials are working; they are
- ChatGPT wrote a test to see if the DocAIParser object is working; it is
I think there’s some issue with the output path for “lazy_parse” but I can’t get it to work. I’ve looked into the documentation but I can’t tell if I’m missing something or not. How do I get this working?
See full code and full error message below:
import pprint
from google.auth.transport.requests import AuthorizedSession
from google.auth import default
from google.cloud import documentai
from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.parsers import DocAIParser
PROJECT = "[replace with project name]"
GCS_OUTPUT_PATH = "gs://[replace with bucket path]"
PROCESSOR_NAME = "https://us-documentai.googleapis.com/v1/projects/[replace with processor name]"
# Get the credentials object using ADC.
credentials, _ = default()
session = AuthorizedSession(credentials=credentials)
# Create a Document AI client object.
client = documentai.DocumentProcessorServiceClient(credentials=credentials)
"""Tests if the current credentials are working in gcloud."""
import google.auth
def test_credentials():
    try:
        # Try to authenticate to the Google Cloud API.
        google.auth.default()
        print("Credentials are valid.")
    except Exception as e:
        print("Credentials are invalid:", e)
if __name__ == "__main__":
    test_credentials()
import logging
from google.cloud import documentai
# Set up logging
logging.basicConfig(level=logging.DEBUG)
# Create DocumentAI client
client = documentai.DocumentProcessorServiceClient()
# Print out actual method call
logging.debug("Calling client.batch_process_documents(%s, %s)",
              PROCESSOR_NAME, GCS_OUTPUT_PATH)
"""Test of DocAIParser object is working"""
# Try to create a DocAIParser object.
try:
    parser = DocAIParser(
        processor_name=PROCESSOR_NAME,
        gcs_output_path=GCS_OUTPUT_PATH,
        client=client,
    )
    # If the DocAIParser object was created successfully, then the Google is accepting the parameters.
    print("Google is accepting the parameters.")
except Exception as e:
    # If the DocAIParser object fails to be created, then the Google is not accepting the parameters.
    print("Google is not accepting the parameters:", e)
parser = DocAIParser(
processor_name=PROCESSOR_NAME,
gcs_output_path=GCS_OUTPUT_PATH,
client=client,
)
blob = Blob(path="gs://foia_doc_bucket/input/goog-exhibit-99-1-q1-2023-19.pdf")
docs = list(parser.lazy_parse(blob))
print(len(docs))
Full error message:
DEBUG:google.auth._default:Checking None for explicit credentials as part of auth process...
DEBUG:google.auth._default:Checking Cloud SDK credentials as part of auth process...
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
Traceback (most recent call last):
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\grpc_helpers.py", line 75, in error_remapped_callable
return callable_(*args, **kwargs)
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\grpc\_channel.py", line 1161, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\grpc\_channel.py", line 1004, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Request contains an invalid argument."
debug_error_string = "UNKNOWN:Error received from peer ipv4:142.250.31.95:443 {created_time:"2023-10-04T19:56:49.9162929+00:00", grpc_status:3, grpc_message:"Request contains an invalid argument."}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\Inspiron 15 amd 5505\Dropbox\[...]\local_doc_upload.py", line 80, in <module>
docs = list(parser.lazy_parse(blob))
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\langchain\document_loaders\parsers\docai.py", line 91, in lazy_parse
yield from self.batch_parse([blob], gcs_output_path=self._gcs_output_path)
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\langchain\document_loaders\parsers\docai.py", line 122, in batch_parse
operations = self.docai_parse(blobs, gcs_output_path=output_path)
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\langchain\document_loaders\parsers\docai.py", line 268, in docai_parse
operations.append(self._client.batch_process_documents(request))
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\cloud\documentai_v1\services\document_processor_service\client.py", line 786, in batch_process_documents
response = rpc(
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\gapic_v1\method.py", line 131, in __call__
return wrapped_func(*args, **kwargs)
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\retry.py", line 366, in retry_wrapped_func
return retry_target(
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\retry.py", line 204, in retry_target
return target()
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\timeout.py", line 120, in func_with_timeout
return func(*args, **kwargs)
File "C:\Users\Anaconda\envs\python3_10\lib\site-packages\google\api_core\grpc_helpers.py", line 77, in error_remapped_callable
raise exceptions.from_grpc_error(exc) from exc
InvalidArgument: 400 Request contains an invalid argument.
Expected behavior
The script is supposed to output “11”, the number of pages in this PDF, per the Doc AI tutorial.
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Comments: 18 (1 by maintainers)
Commits related to this issue
- integration test for DocAI parser (#11424) - **Description:** added an integration test - **Issue:** #11407 @baskaryan — committed to langchain-ai/langchain by lkuligin 9 months ago
- Added a better error description if processor name is wrong. (#11488) - **Description:** added a better error description for this error - **Issue:** #11407 ... — committed to langchain-ai/langchain by lkuligin 9 months ago
- fixed error message and a check for processor name (#12200) - **Description:** a small fix on error description / a check for processor name - **Issue:** the is... — committed to langchain-ai/langchain by lkuligin 8 months ago
- fixed error message and a check for processor name (#12200) - **Description:** a small fix on error description / a check for processor name - **Issue:** the is... — committed to schadem/langchain by lkuligin 8 months ago
- Added a better error description if processor name is wrong. (#11488) - **Description:** added a better error description for this error - **Issue:** #11407 ... — committed to hoanq1811/langchain by lkuligin 9 months ago
- fixed error message and a check for processor name (#12200) - **Description:** a small fix on error description / a check for processor name - **Issue:** the is... — committed to hoanq1811/langchain by lkuligin 8 months ago
🤖
Hello,
Thank you for providing detailed information about the issue you’re facing. The error message “InvalidArgument: 400 Request contains an invalid argument” suggests that one or more of the arguments you’re passing to the DocAIParser or batch_process_documents method is incorrect.
From the context provided, the processor_name parameter in the DocAIParser object is expected to be a string representing the name of a processor. The gcs_output_path parameter is also expected to be a string representing a path on Google Cloud Storage (GCS) where the parsing results will be stored.
Looking at your code, you have placeholders for these values (PROCESSOR_NAME and GCS_OUTPUT_PATH).
Please ensure that you have replaced these placeholders with the actual values. The processor_name should be the name of the processor you have created in the Google Cloud Console. It should be in the format projects/{project_id}/locations/{location}/processors/{processor_id}. The gcs_output_path should be the path to a bucket in your Google Cloud Storage, in the format gs://{bucket_name}/{optional_subdirectory}.
If you have already replaced these placeholders with the correct values, then the issue might be with the permissions or the format of the PDF file you’re trying to parse. Please ensure that the service account you’re using has the necessary permissions to access the Document AI API, the GCS bucket, and the specific PDF file.
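The processor_name format described above can be checked locally before any request is sent. The following is a minimal sketch, not part of LangChain or the Document AI client: the helper names is_processor_resource_name and resource_name_from_endpoint are hypothetical, the regular expression only mirrors the projects/.../locations/.../processors/... shape quoted above, and the example endpoint URL is made up for illustration.

```python
import re
from urllib.parse import urlparse

# Shape quoted in this thread: projects/{project_id}/locations/{location}/processors/{processor_id}
# The character classes are an assumption; they only reject "/" and ":" inside a segment.
PROCESSOR_NAME_RE = re.compile(r"projects/[^/]+/locations/[^/]+/processors/[^/:]+")


def is_processor_resource_name(name: str) -> bool:
    """Return True if `name` is a bare resource name with no URL prefix or suffix."""
    return PROCESSOR_NAME_RE.fullmatch(name) is not None


def resource_name_from_endpoint(url: str) -> str:
    """Extract the resource name from a full endpoint URL (hypothetical helper)."""
    path = urlparse(url).path  # e.g. "/v1/projects/.../processors/ID:process"
    match = PROCESSOR_NAME_RE.search(path)
    if match is None:
        raise ValueError(f"No processor resource name found in {url!r}")
    return match.group(0)


if __name__ == "__main__":
    endpoint = "https://us-documentai.googleapis.com/v1/projects/123/locations/us/processors/abc:process"
    print(is_processor_resource_name(endpoint))
    print(resource_name_from_endpoint(endpoint))
```

A check like this only validates the string format. It cannot tell whether the processor exists or whether the credentials are allowed to use it.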
I hope this helps! If you’re still facing issues, please provide more information about the processor you’re using and the permissions of your service account.
Best, Dosu
Sources
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
🤖
Hello,
Thank you for providing a detailed description of your issue. The “InvalidArgument: 400 Request contains an invalid argument” error usually occurs when the request made to the Google Document AI service contains an argument that is not valid or not correctly formatted.
In your case, the error could be due to several reasons:
- The blob argument passed to the lazy_parse method is not correctly formatted or does not exist.
- The gcs_output_path provided during the initialization of the DocAIParser instance is not correctly formatted or does not exist.
- The processor_name provided during the initialization of the DocAIParser instance is not correctly formatted or does not exist.
To resolve this issue, you should:
- Ensure that the blob argument passed to the lazy_parse method is a valid Blob instance and that the file it points to exists.
- Ensure that the gcs_output_path provided during the initialization of the DocAIParser instance is a valid Google Cloud Storage path and that you have the necessary permissions to write to this path.
- Ensure that the processor_name provided during the initialization of the DocAIParser instance is a valid processor name and that you have the necessary permissions to use this processor.
To correctly initialize a DocAIParser instance and call the lazy_parse method, construct the parser with processor_name and gcs_output_path, wrap the input file in a Blob, and pass that Blob to lazy_parse. Where an example uses values such as "gs://vertex-pgt/examples/goog-exhibit-99-1-q1-2023-19.pdf", "gs://my-bucket/output/", and "projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID", replace them with your actual file path, output path, and processor name, respectively.
Also, ensure that you have installed the necessary libraries (google-cloud-documentai and google-cloud-documentai-toolbox), set up a GCS bucket, and created your own OCR processor.
I hope this helps! If you have any other questions, feel free to ask.
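The path-format checks suggested above can be approximated locally with a small helper. This is a sketch under stated assumptions: split_gcs_uri is a hypothetical name, not a LangChain or Google Cloud function, and it only inspects the string. Whether the bucket or object exists, and whether the credentials can read or write it, still has to be verified against Cloud Storage itself.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split "gs://bucket/prefix" into (bucket, prefix); raise ValueError if malformed."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Expected a URI starting with {scheme!r}, got {uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the object name or prefix.
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in {uri!r}")
    return bucket, prefix


if __name__ == "__main__":
    # Example values taken from this thread.
    print(split_gcs_uri("gs://my-bucket/output/"))
    print(split_gcs_uri("gs://foia_doc_bucket/input/goog-exhibit-99-1-q1-2023-19.pdf"))
```

Running the same helper on both the Blob path and gcs_output_path catches a missing gs:// prefix or an empty bucket name before the batch request is sent.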