google-cloud-php: [spanner] : STATUS_UNAVAILABLE isn't retried in some cases in Result->rows()
Environment details
- OS:linux/gke/gke-1254-gke2100-cos-101-17162-40-34-v221207-c-cgpv1-pre
- PHP version: 8.1.12
- Package name and version: v1.54.2, v1.54.0
Steps to reproduce
It happens very rarely, once per day or week, on relatively loaded clusters and with simple, random queries, and I can’t reproduce it on demand.
PHP Notice: Google\Cloud\Core\Exception\ServiceException: {
"message": "Broken pipe",
"code": 14,
"status": "UNAVAILABLE",
"details": []
} in /var/www/vendor/google/cloud-core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 /var/www/vendor/google/cloud-core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException(Object(Google\ApiCore\ApiException))
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream(Object(Google\ApiCore\ServerStream))
#2 /var/www/vendor/google/cloud-spanner/src/Result.php(228): Generator->next()
#3 [internal function]: Google\Cloud\Spanner\Result->rows()
According to https://cloud.google.com/spanner/docs/error-codes, the call should be retried:
UNAVAILABLE | The server is currently unavailable | Retry using exponential backoff. Note that it is not always safe to retry non-idempotent operations.
But according to the source at https://github.com/googleapis/google-cloud-php/blob/fda8c6c0061184e889d039eb4bbb2e88e35be9a7/Spanner/src/Result.php#L231, there are cases where the call won’t be retried:
$hasResumeToken = $this->isSetAndTrue($result, 'resumeToken');
if ($hasResumeToken || count($bufferedResults) >= self::BUFFER_RESULT_LIMIT) {
...
$shouldRetry = $hasResumeToken;
...
$generator->next();
$valid = $generator->valid();
} catch (ServiceException $ex) {
if ($shouldRetry && $ex->getCode() === Grpc\STATUS_UNAVAILABLE) {
// Attempt to resume using our last stored resume token. If we
// successfully resume, flush the buffer.
$generator = $backoff->execute($call, [$this->resumeToken]);
$bufferedResults = [];
continue;
}
That is, the call is retried only when $resumeToken is set, but it is never set for queries with small results:
array (
  'metadata' =>
  array (
    'rowType' =>
    array (
      'fields' =>
      array (
        0 =>
        array (
          'name' => '',
          'type' =>
          array (
            'code' => 2,
            'typeAnnotation' => 0,
          ),
        ),
      ),
    ),
    'transaction' =>
    array (
      'id' => '',
    ),
  ),
  'values' =>
  array (
    0 => '1',
  ),
  'chunkedValue' => false,
  'resumeToken' => '',
)
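To make the point concrete, here is a minimal sketch of that check applied to a trimmed copy of the dump above. The isSetAndTrue function below is a local stand-in that I am assuming behaves as a plain "key exists and is truthy" test; it is not copied from the library.

```php
<?php
declare(strict_types=1);

// Local stand-in for the library helper of the same name. Assumption: it is
// a simple "key exists and is truthy" check.
function isSetAndTrue(array $arr, string $key): bool
{
    return isset($arr[$key]) && $arr[$key];
}

// Trimmed copy of the partial result dumped above.
$result = [
    'values' => ['1'],
    'chunkedValue' => false,
    'resumeToken' => '',
];

// An empty resume token is falsy, so the flag that gates the retry is false.
$hasResumeToken = isSetAndTrue($result, 'resumeToken');
var_dump($hasResumeToken); // bool(false)
```

Since the quoted code assigns $shouldRetry from $hasResumeToken, a response like this one would never take the retry branch.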
Sample code I used:
use Google\Cloud\Spanner\SpannerClient;
$project = '';
$instance = '';
$database = '';
$spanner = new SpannerClient([ 'projectId' => $project ]);
$connection = $spanner->connect($instance, $database);
// Run a trivial query and drain the row generator.
$generator = $connection
    ->execute('SELECT 1')
    ->rows();
iterator_to_array($generator);
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 15 (10 by maintainers)
(Caution, not a PHP expert here, so I might be reading something wrong in this code)
UNAVAILABLE errors that happen for a SELECT statement can generally be retried safely. The logic should be:
- If ExecuteStreamingSql fails with an UNAVAILABLE error, the initial call to ExecuteStreamingSql can be retried without a resume token.
- The method calling ExecuteStreamingSql should buffer all rows it receives until it sees a resume token, or until its buffer is full. Normally the first will happen. It then emits the rows it has seen so far. This is also the logic that is implemented in the PHP client library.
- If the stream fails halfway, the ExecuteStreamingSql call should be retried with the last seen resume token. If there is no resume token and the stream has emitted rows, then retrying is not safe. This is a very rare case. If the stream fails halfway and there is no resume token, but the method has also not emitted any rows, then it’s also safe to just retry the initial ExecuteStreamingSql call without a resume token.
From what I can see the only thing that needs to be modified here is this line: https://github.com/googleapis/google-cloud-php/blob/fda8c6c0061184e889d039eb4bbb2e88e35be9a7/Spanner/src/Result.php#L180
It is safe to retry the initial call as long as the method has not returned any rows to the caller.
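A rough sketch of the retry rule described above, in case it helps the discussion. This is not the code in Result.php: streamRows, $open, BUFFER_LIMIT and the fake stream are all hypothetical names made up for illustration, the catch is on plain \Exception rather than ServiceException so the snippet runs without the library, and a real version would also sleep with exponential backoff between attempts.

```php
<?php
declare(strict_types=1);

const STATUS_UNAVAILABLE = 14; // gRPC code seen in the error report
const BUFFER_LIMIT = 10;       // illustrative value, not the library's

/**
 * Sketch only. $open is a hypothetical callable that starts (or resumes) the
 * stream: fn(?string $resumeToken): iterable of partial results shaped like
 * ['values' => [...], 'resumeToken' => '...'].
 */
function streamRows(callable $open, int $maxRetries = 3): \Generator
{
    $resumeToken = null;        // last token seen; null means "start over"
    $emittedSinceToken = false; // were rows handed out after that token?
    $retries = 0;

    while (true) {
        $buffer = [];
        try {
            foreach ($open($resumeToken) as $partial) {
                $buffer[] = $partial['values'];
                $token = $partial['resumeToken'] ?? '';
                if ($token !== '') {
                    // Everything up to this token is safe to emit.
                    foreach ($buffer as $row) {
                        yield $row;
                    }
                    $buffer = [];
                    $resumeToken = $token;
                    $emittedSinceToken = false;
                } elseif (count($buffer) >= BUFFER_LIMIT) {
                    // Buffer full without a token: emit anyway, but a later
                    // retry could now duplicate rows.
                    foreach ($buffer as $row) {
                        yield $row;
                    }
                    $buffer = [];
                    $emittedSinceToken = true;
                }
            }
            foreach ($buffer as $row) { // stream ended cleanly
                yield $row;
            }
            return;
        } catch (\Exception $e) {
            $canRetry = $e->getCode() === STATUS_UNAVAILABLE
                && !$emittedSinceToken
                && $retries < $maxRetries;
            if (!$canRetry) {
                throw $e;
            }
            $retries++; // unemitted buffered rows are dropped and re-read
        }
    }
}

// Fake stream: fails once with UNAVAILABLE before any resume token is sent.
$calls = 0;
$open = function (?string $token) use (&$calls): \Generator {
    $calls++;
    yield ['values' => ['1'], 'resumeToken' => ''];
    if ($calls === 1) {
        throw new \RuntimeException('Broken pipe', STATUS_UNAVAILABLE);
    }
    yield ['values' => ['2'], 'resumeToken' => ''];
};
$rows = iterator_to_array(streamRows($open), false);
```

The point of the $emittedSinceToken flag is that a small result with no resume token is still retried from the start, because nothing has reached the caller yet.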
I raised a fix to retry when no results have been yielded, with or without the resume token.
@vishwarajanand Now that we understand how dotnet and js libraries handle this, would it be possible to get a fix for this?
Wow! Thank you so much for the quick response! 👏🥹
Hi @taka-oyama Thank you
For us there weren’t any exceptions inside explicit transactions, just within transaction-less selects. I suspect transactions are somehow already retried.
The Node.js version seems to have fixed this back in 2020: https://github.com/googleapis/nodejs-spanner/pull/795
Dotnet has fixed a similar issue: https://github.com/googleapis/google-cloud-dotnet/issues/5977
I just checked the logs for our service and saw similar errors at the rate of 1~2 an hour.
Can we do something like the following?
Within a transaction -> retry the whole transaction through the code below.
https://github.com/googleapis/google-cloud-php/blob/104537e8664db09fc07d927633f70706b0a0c7a6/Spanner/src/Database.php#L845
Outside a transaction -> retry individual queries through the code below.
https://github.com/googleapis/google-cloud-php/blob/104537e8664db09fc07d927633f70706b0a0c7a6/Spanner/src/Result.php#L175
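Until a fix lands, a caller-side workaround for transaction-less reads could look roughly like this. retryOnUnavailable is a hypothetical helper, not part of google-cloud-php; it catches plain \Exception and checks for code 14 so the snippet runs standalone, whereas real code would probably catch ServiceException. It assumes the wrapped operation is an idempotent read that consumes all rows inside the callable, so no partial result leaks out before a retry.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: re-runs $operation when it fails with gRPC
 * UNAVAILABLE (code 14), sleeping with exponential backoff and jitter.
 */
function retryOnUnavailable(callable $operation, int $maxAttempts = 5, int $baseDelayMs = 100): mixed
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (\Exception $e) {
            if ($e->getCode() !== 14 || $attempt >= $maxAttempts) {
                throw $e;
            }
            // Wait a random time between 0 and base * 2^(attempt - 1) ms.
            $capMs = $baseDelayMs * (2 ** ($attempt - 1));
            usleep(random_int(0, $capMs) * 1000);
        }
    }
}

// Stand-in operation that fails twice with "Broken pipe" and then succeeds.
$attempts = 0;
$result = retryOnUnavailable(function () use (&$attempts) {
    if (++$attempts < 3) {
        throw new \RuntimeException('Broken pipe', 14);
    }
    return [['1']];
}, 5, 1);

// With the sample code from the report, the call would be along the lines of:
// $rows = retryOnUnavailable(
//     fn () => iterator_to_array($connection->execute('SELECT 1')->rows(), false)
// );
```

This should not be wrapped around individual statements inside a read-write transaction; there the whole transaction would need to be retried instead.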