google-cloud-php: [spanner] : STATUS_UNAVAILABLE isn't retried in some cases in Result->rows()

Environment details

  • OS: linux/gke/gke-1254-gke2100-cos-101-17162-40-34-v221207-c-cgpv1-pre
  • PHP version: 8.1.12
  • Package name and version: v1.54.2, v1.54.0

Steps to reproduce

It happens very rarely (about once per day or week) on relatively loaded clusters, with simple random queries, and I can’t reproduce it on demand.

 PHP Notice: Google\Cloud\Core\Exception\ServiceException: {
    "message": "Broken pipe",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []
} in /var/www/vendor/google/cloud-core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 /var/www/vendor/google/cloud-core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException(Object(Google\ApiCore\ApiException))
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream(Object(Google\ApiCore\ServerStream))
#2 /var/www/vendor/google/cloud-spanner/src/Result.php(228): Generator->next()
#3 [internal function]: Google\Cloud\Spanner\Result->rows()

According to https://cloud.google.com/spanner/docs/error-codes, the call should be retried:

UNAVAILABLE | The server is currently unavailable | Retry using exponential backoff. Note that it is not always safe to retry non-idempotent operations.

However, according to the source at https://github.com/googleapis/google-cloud-php/blob/fda8c6c0061184e889d039eb4bbb2e88e35be9a7/Spanner/src/Result.php#L231, there are cases where the call is not retried:

                $hasResumeToken = $this->isSetAndTrue($result, 'resumeToken');
                if ($hasResumeToken || count($bufferedResults) >= self::BUFFER_RESULT_LIMIT) {
                ...
                    $shouldRetry = $hasResumeToken;
                ...    
                $generator->next();
                $valid = $generator->valid();
            } catch (ServiceException $ex) {
                if ($shouldRetry && $ex->getCode() === Grpc\STATUS_UNAVAILABLE) {
                    // Attempt to resume using our last stored resume token. If we
                    // successfully resume, flush the buffer.
                    $generator = $backoff->execute($call, [$this->resumeToken]);
                    $bufferedResults = [];

                    continue;
                }

That is, the call is retried only when $resumeToken is set, but it is never set for queries with small results:

array (
  'metadata' => 
  array (
    'rowType' => 
    array (
      'fields' => 
      array (
        0 => 
        array (
          'name' => '',
          'type' => 
          array (
            'code' => 2,
            'typeAnnotation' => 0,
          ),
        ),
      ),
    ),
    'transaction' => 
    array (
      'id' => '',
    ),
  ),
  'values' => 
  array (
    0 => '1',
  ),
  'chunkedValue' => false,
  'resumeToken' => '',
)   
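
To make the failure mode concrete, here is a minimal standalone sketch of the flag logic from the excerpt, applied to the single chunk dumped above. It assumes that isSetAndTrue behaves like isset plus a truthiness check, that $shouldRetry starts out false, and that the buffer limit is larger than one chunk. The helper and constant below are local stand-ins written for illustration, not the library's code.

```php
<?php
declare(strict_types=1);

// Local stand-in for the library's buffer limit (value assumed for illustration).
const BUFFER_RESULT_LIMIT = 10;

// Hypothetical re-implementation: true only if the key is set and truthy.
function isSetAndTrue(array $arr, string $key): bool
{
    return isset($arr[$key]) && $arr[$key];
}

// The only chunk returned for "SELECT 1" (trimmed from the dump above).
$result = ['values' => ['1'], 'chunkedValue' => false, 'resumeToken' => ''];

$bufferedResults = [$result];
$shouldRetry = false; // assumed initial value

$hasResumeToken = isSetAndTrue($result, 'resumeToken');
if ($hasResumeToken || count($bufferedResults) >= BUFFER_RESULT_LIMIT) {
    // Never reached here: the token is empty and the buffer holds one chunk.
    $shouldRetry = $hasResumeToken;
}

var_dump($shouldRetry); // bool(false): an UNAVAILABLE error at this point is not retried
```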

Sample code I used:

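        use Google\Cloud\Spanner\SpannerClient;

        // Fill in your own project, instance, and database IDs.
        $project = '';
        $instance = '';
        $database = '';
        $spanner = new SpannerClient([ 'projectId' => $project ]);
        $connection = $spanner->connect($instance, $database);
        $generator = $connection
            ->execute('SELECT 1')
            ->rows();

        // rows() returns a generator; iterating it is what reads the stream.
        iterator_to_array($generator);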

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 15 (10 by maintainers)

Most upvoted comments

(Caution, not a PHP expert here, so I might be reading something wrong in this code)

UNAVAILABLE errors that happen for a SELECT statement can generally be retried safely. The logic should be:

  1. If the initial call to ExecuteStreamingSql fails with an UNAVAILABLE error, the initial call to ExecuteStreamingSql can be retried without a resume token.
  2. The method that reads the stream that is returned by ExecuteStreamingSql should buffer all rows it receives until it sees a resume token, or until its buffer is full. Normally the first will happen. It then emits the rows it has seen so far. This is also the logic that is implemented in the PHP client library.
  3. If the stream fails halfway, the ExecuteStreamingSql call should be retried with the last seen resume token. If there is no resume token and the stream has emitted rows, then retrying is not safe. This is a very rare case. If the stream fails halfway and there is no resume token, but the method has also not emitted any rows, then it’s also safe to just retry the initial ExecuteStreamingSql call without a resume token.

From what I can see the only thing that needs to be modified here is this line: https://github.com/googleapis/google-cloud-php/blob/fda8c6c0061184e889d039eb4bbb2e88e35be9a7/Spanner/src/Result.php#L180

It is safe to retry the initial call as long as the method has not returned any rows to the caller.
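
The three rules above can be sketched as a small standalone generator. This is only an illustration under stated assumptions, not the library's implementation or the fix that was merged: readRows, its parameters, and the chunk shape (['values' => [...], 'resumeToken' => '...']) are hypothetical, a plain RuntimeException with code 14 stands in for ServiceException, and the backoff delay between attempts is left out for brevity.

```php
<?php
declare(strict_types=1);

const STATUS_UNAVAILABLE = 14; // gRPC status code for UNAVAILABLE

/**
 * Hypothetical reader. $call(?string $resumeToken) must return an iterable
 * of chunks shaped like ['values' => [...], 'resumeToken' => '...'].
 */
function readRows(callable $call, int $bufferLimit = 10, int $maxRetries = 3): Generator
{
    $resumeToken = null; // last resume token seen, null before the first one
    $safeToRetry = true; // false once rows were emitted past the last token
    $buffer = [];
    $retries = 0;

    while (true) {
        try {
            foreach ($call($resumeToken) as $chunk) {
                foreach ($chunk['values'] as $value) {
                    $buffer[] = $value;
                }
                $token = $chunk['resumeToken'] ?? '';
                if ($token !== '' || count($buffer) >= $bufferLimit) {
                    // Rule 2: emit on a resume token or when the buffer is full.
                    foreach ($buffer as $row) {
                        yield $row;
                    }
                    $buffer = [];
                    if ($token !== '') {
                        $resumeToken = $token;
                        $safeToRetry = true;
                    } else {
                        // Rows left the buffer without a token: a retry would replay them.
                        $safeToRetry = false;
                    }
                }
            }
            break; // stream ended normally
        } catch (RuntimeException $ex) {
            if ($ex->getCode() !== STATUS_UNAVAILABLE || !$safeToRetry || $retries >= $maxRetries) {
                throw $ex;
            }
            // Rules 1 and 3: restart from the last token (or from scratch if none)
            // and drop rows that were buffered but never emitted.
            $retries++;
            $buffer = [];
        }
    }

    foreach ($buffer as $row) {
        yield $row; // small results that never carried a resume token
    }
}

// Usage: a fake stream that fails once before returning any chunk.
$attempts = 0;
$flaky = function (?string $resumeToken) use (&$attempts): Generator {
    $attempts++;
    if ($attempts === 1) {
        throw new RuntimeException('Broken pipe', STATUS_UNAVAILABLE);
    }
    yield ['values' => ['1'], 'resumeToken' => ''];
};

$rows = iterator_to_array(readRows($flaky), false);
```

In this sketch the "SELECT 1" case succeeds on the second attempt because nothing had been handed to the caller when the first attempt failed.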

I raised a fix to retry when no results have been yielded, with or without a resume token.

@vishwarajanand Now that we understand how dotnet and js libraries handle this, would it be possible to get a fix for this?

Wow! Thank you so much for the quick response! 👏🥹

Hi @taka-oyama Thank you

Within transaction -> Retry the whole transaction through below.

For us, there weren’t any exceptions inside explicit transactions, just within transaction-less selects. I suspect transactions are somehow retried already.

The Node.js version seems to have fixed this back in 2020: https://github.com/googleapis/nodejs-spanner/pull/795

Dotnet has fixed a similar issue: https://github.com/googleapis/google-cloud-dotnet/issues/5977

I just checked the logs for our service and saw similar errors at a rate of 1 to 2 per hour.

    "message": "Connection reset by peer",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []

Can we do something like the following?

Within a transaction -> retry the whole transaction, via the code linked below.

https://github.com/googleapis/google-cloud-php/blob/104537e8664db09fc07d927633f70706b0a0c7a6/Spanner/src/Database.php#L845

Outside a transaction -> retry individual queries, via the code linked below.

https://github.com/googleapis/google-cloud-php/blob/104537e8664db09fc07d927633f70706b0a0c7a6/Spanner/src/Result.php#L175
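
For the "outside a transaction" case, a caller-side workaround could look roughly like the sketch below: wrap an idempotent read in a retry loop with exponential backoff, as the Spanner error-code table suggests for UNAVAILABLE. The function retryUnavailable and all of its parameters are hypothetical names chosen for this example, and a plain RuntimeException with code 14 stands in for the library's ServiceException. The wrapped operation has to consume the whole result itself (for example with iterator_to_array), since the error surfaces while rows are being read.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: run $operation, retrying on the given error codes
 * with exponential backoff plus jitter. Only use it for idempotent reads.
 */
function retryUnavailable(
    callable $operation,
    int $maxAttempts = 5,
    int $baseDelayMs = 100,
    ?callable $sleep = null,
    array $retryableCodes = [14] // 14 is the gRPC code for UNAVAILABLE
): mixed {
    // Sleeping is injectable so the delay can be recorded or skipped in tests.
    $sleep = $sleep ?? static function (int $ms): void {
        usleep($ms * 1000);
    };

    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (RuntimeException $ex) {
            $retryable = in_array($ex->getCode(), $retryableCodes, true);
            if (!$retryable || $attempt >= $maxAttempts) {
                throw $ex;
            }
            // Delay doubles each attempt: 100 ms, 200 ms, 400 ms, ...
            $delay = $baseDelayMs * (2 ** ($attempt - 1));
            // Add up to 50% random jitter so clients do not retry in lockstep.
            $sleep($delay + random_int(0, intdiv($delay, 2)));
        }
    }
}

// Usage: an operation that fails twice with UNAVAILABLE, then succeeds.
$calls = 0;
$delays = [];
$result = retryUnavailable(
    function () use (&$calls) {
        if (++$calls < 3) {
            throw new RuntimeException('Connection reset by peer', 14);
        }
        return 'ok';
    },
    5,
    100,
    function (int $ms) use (&$delays): void {
        $delays[] = $ms; // record the delay instead of actually sleeping
    }
);
```

Retrying the whole query this way is only safe because the caller receives nothing until the operation returns; it does not help once rows have already been streamed to the caller.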