apollo-client: Queries hang and are not issued on network requests

Intended outcome:

Queries should be dispatched to the network.

Actual outcome:

Queries are not dispatched to the network; they just hang.

While using apollo-angular@3.0.1 with @apollo/client@3.6.x it appears that some queries start hanging after a while.

I’m also using graphql-code-generator to generate an apollo-angular data service, so usage looks like this:

this.someQuery
  .fetch({
    myVariable: 1
  })
  .pipe(map((res) => res.data.someQuery))
  .subscribe(foo => console.log(foo));

After a while (within 1-2 minutes), the subscribe callback is no longer called. The same happens with rxjs’s await firstValueFrom(...): the promise never resolves.
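
One way to narrow down which calls hang is to record when each operation starts and finishes, then list the ones that have been pending for too long. The following is only a minimal sketch under that assumption: PendingTracker and its methods are hypothetical names, not part of Apollo Client or apollo-angular, and the timestamps are passed in explicitly so the example runs without timers.

```typescript
// Hypothetical debugging helper; none of these names come from Apollo.
class PendingTracker {
  private readonly startedAt = new Map<string, number>();

  // Record that an operation was started at time `now` (milliseconds).
  start(id: string, now: number): void {
    this.startedAt.set(id, now);
  }

  // Record that the operation emitted a result, errored, or completed.
  finish(id: string): void {
    this.startedAt.delete(id);
  }

  // List operations that have been pending for at least `thresholdMs`.
  stale(now: number, thresholdMs: number): string[] {
    const result: string[] = [];
    this.startedAt.forEach((started, id) => {
      if (now - started >= thresholdMs) {
        result.push(id);
      }
    });
    return result;
  }
}

const tracker = new PendingTracker();
tracker.start("someQuery#1", 0);
tracker.start("someQuery#2", 1000);
tracker.finish("someQuery#1"); // the first call resolved normally

// 30 seconds after the second call started, it is still pending.
const hanging = tracker.stale(31000, 30000);
console.log(hanging);
```

In a service like the one above, start would be called right before .fetch(...) and finish from the subscribe callback, so anything still listed by stale is a candidate for a hung query.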

Downgrading to ~~3.6.2~~ 3.5.10 seems to fix the problem.

How to reproduce the issue:

Unfortunately no repro, but hopefully it might be easy to pinpoint what would affect these by comparing ~~3.6.2 to 3.6.3.~~ 3.6.0 and 3.5.10.

Versions

  System:
    OS: macOS 12.3.1
  Binaries:
    Node: 16.14.2 - ~/.nvm/versions/node/v16.14.2/bin/node
    Yarn: 3.2.0-git.20220329.hash-0764215 - ~/.nvm/versions/node/v16.14.2/bin/yarn
  Browsers:
    Chrome: 101.0.4951.54
    Safari: 15.4
  npmPackages:
    @apollo/client: 3.6.3 => 3.6.3
    apollo-angular: ^3.0.1 => 3.0.1 
    apollo-link-scalars: ^4.0.1 => 4.0.1

About this issue

State: closed
Created 2 years ago
Reactions: 5
Comments: 45 (32 by maintainers)

Commits related to this issue

Add basic regression test for issues #7608 and #9690. — committed to apollographql/apollo-client by benjamn 2 years ago
Guarantee `Concast` cleanup without `Observable cancelled prematurely` rejection (#9701) Should fix/improve issues #7608 and #9690, and possibly others. — committed to apollographql/apollo-client by benjamn 2 years ago
Implement concast.beforeNext as replacement for concast.cleanup. Doing this right would potentially resolve issue #9690. — committed to apollographql/apollo-client by benjamn 2 years ago
Backport PR #9793 from apollographql/issue-9773-unbreak-BatchHttpLink PR #9793 was first released in v3.7.0-beta.3 for testing, and now (in this PR) will be backported to the `main` branch, to be rel... — committed to apollographql/apollo-client by benjamn 2 years ago

Most upvoted comments

@jdmoody I appreciate all the details, and while I share your uncertainty about how many different issues we’re talking about here, I think there must be something wrong with BatchHttpLink, probably a bug that’s been there for a while but was only revealed by other changes recently, so that’s what I’m currently looking into.

benjamn on May 31, 2022

Run some e2e tests on 3.7.0-beta.3 and they are all green. Thanks for the update @benjamn

andrew-hu368 on Jun 8, 2022

I am also having this issue.

Some queries hang and are never dispatched with 3.6.* - but only when using BatchHttpLink. The issue does not occur when using HttpLink. Just tried 3.7.0-alpha.5 - no joy.

Last version I can get to work with my BatchHttpLink setup is 3.5.10

Don’t have time to put together a repro - also it seems intermittent - but can answer questions if it helps…

bentron2000 on May 17, 2022

I also have hanging queries as described in this issue. Some details:

I’m upgrading from 3.2.2 to 3.6.5 (other issues prevent me from upgrading to versions between 3.2 and 3.6)
I had the same issue with 3.7.0-alpha.4
I’m currently using React 18 in legacy mode with v3.2.2 without any issues
I’m using BatchHttpLink and the issue goes away when I replace BatchHttpLink with apollo client’s HttpLink
It seems to always be the last useQuery call made that hangs (I can go into more detail about how I measured this if helpful).

This has been especially hard to debug because I haven’t been able to reproduce it in a dev environment. I’m only able to reproduce it when deploying my app to a production-like environment. Then, when I try to attach a debugger to the box, I’m no longer able to reproduce it 😵

Afaict, this is the only issue blocking me from upgrading to v3.6, which is the only thing blocking me from using certain React 18 features.

It’s also unclear to me whether all the behaviors described by folks in this issue are indeed the same issue. If it would be helpful for me to create a separate issue, or if there’s any other info I can provide, please let me know 🙏🏻

jdmoody on May 27, 2022

Alright, I believe this regression stems originally from PR #9248, which made the batch link capable of cancelling in-flight batched operations when the underlying observable is terminated.

This theory about the source of the regression is consistent with @andrew-hu368’s comment about the old apollo-link-batch-http version of BatchHttpLink still working (😂), since that version does not contain PR #9248, which was released more recently (first in v3.6.0-beta.4, and then officially in v3.6.0).

While I am open to making the changes from PR #9248 more purely opt-in, I believe my PR #9793 fixes the problem without completely abandoning #9248.
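
To make the cancellation idea concrete, here is a minimal sketch of a batch that tracks which of its operations still have a subscriber. It is an illustrative model only: InFlightBatch and its methods are hypothetical names, this is not the BatchHttpLink code, and it assumes the shared request is aborted only once every operation in the batch has lost its subscriber.

```typescript
// Illustrative model only: hypothetical names, not the BatchHttpLink implementation.
class InFlightBatch {
  aborted = false;
  private readonly active: Set<string>;

  constructor(operationNames: string[]) {
    this.active = new Set(operationNames);
  }

  // Called when the observable for one operation in the batch is terminated.
  unsubscribe(operationName: string): void {
    this.active.delete(operationName);
    // Assumption: the shared request is aborted only when no operation
    // in the batch has a subscriber left.
    if (this.active.size === 0) {
      this.aborted = true;
    }
  }
}

const batch = new InFlightBatch(["QueryA", "QueryB"]);

batch.unsubscribe("QueryA");
const abortedAfterFirst = batch.aborted; // QueryB is still subscribed

batch.unsubscribe("QueryB");
const abortedAfterSecond = batch.aborted; // nobody is listening any more

console.log(abortedAfterFirst, abortedAfterSecond);
```

The sketch only shows why per-operation bookkeeping matters: if an unsubscribe is recorded for an operation that is in fact still wanted, the model would abort a request that someone is still waiting on.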

@andreialecu @jdmoody @bentron2000 @doflo-dfa @nikhilgupta16 @vieira @andrew-hu368 (and anyone else I missed) Please try running npm i @apollo/client@beta to get version 3.7.0-beta.3, which includes #9793.

If that doesn’t work, please try npm i @apollo/client@3.7.0-beta.2 (note: 2 not 3) to see if the full reversion of PR #9248 (described in #9793) makes any difference for you. If you see any differences in behavior between these two versions, please describe the differences here in detail. I’m hoping the simpler changes in @apollo/client@3.7.0-beta.3 are enough to fix the regression, without completely undoing #9248.

benjamn on Jun 7, 2022

I am not sure if it is related. We’ve recently upgraded from apollo v2 to apollo v3 (latest version). If I import the old BatchHttpLink from apollo-link-batch-http the queries are successfully run, while the new version doesn’t send requests.

We downgraded to 3.5.10 and it seems to work as expected.

andrew-hu368 on Jun 7, 2022

Hello, I have tried 3.6.6 which potentially resolves a possible cause for this issue (#9718) but after some testing the issue is still present.

Like everyone else, we are also using BatchHttpLink, and the last version that works without this issue for us is 3.5.10.

vieira on Jun 7, 2022

With deduplication disabled, the initial query may still get stuck randomly, but subsequent ones will go through even if they use the same variables.

They’re still leaking and getting stuck here in inFlightLinkObservables (screenshot from 2022-05-11, 15:47:52, not reproduced here).

However, because some queries are still getting stuck, this causes all sorts of issues.

andreialecu on May 11, 2022

Once the bug starts happening any subsequent queries with the exact same variables get stuck.

They’re never removed and none of the promises/observables for that specific query ever resolve.

I have one particular query where I can reproduce it very easily (because it is issued very frequently). But I have seen it hang for other queries as well.

To clarify: those 41 queries are for the same operation and same variables.

andreialecu on May 10, 2022

@benjamn unfortunately the bug in the OP still exists in 3.7.0-alpha.3

Notice how observable contains something, is deduplicated and not issued. Also notice how the Concast got to 41 (stuck) observers.

As mentioned previously, the issue started between 3.5.10 and 3.6.0 but there are too many commits for me to be able to pinpoint one easily. So if you have any clue which one might’ve touched anything in this area, please let me know so I can try reverting it.

Alternatively, if you have any suggestions where to set any breakpoints and what to inspect, that would also be great.

andreialecu on May 9, 2022

Thanks for all the details @andreialecu! I think you’re on the right track, and the “Observable cancelled prematurely” error is a long-standing hard-to-pin-down issue, so it would be great to fix that finally as well. 🤞

I can do some digging/testing today with this information. I’ll report back when I have news.

benjamn on May 9, 2022

So it appears that there’s a callback being added to concast.cleanup() that is supposed to remove the observable from that inFlightLinkObservables map.

I think it’s not being called, most likely due to a race condition or teardown situation similar to the one causing the “Observable cancelled prematurely” error.

The problem seems to be related to Concast not cleaning up properly:

Possibly relevant: https://github.com/apollographql/apollo-client/blob/da3355ce794e105ad7f2652595fc33527a8a461b/src/core/QueryManager.ts#L991-L996

https://github.com/apollographql/apollo-client/blob/da3355ce794e105ad7f2652595fc33527a8a461b/src/core/QueryManager.ts#L1159-L1165

I can reproduce it consistently. If it would help to set up a screen sharing session at some point, feel free to DM me on Twitter.

andreialecu on May 7, 2022

Update: in QueryManager, getObservableFromLink seems to have the stuck query in inFlightLinkObservables_1 and because of deduplication it is not issued any more.

I think that’s what’s causing the issue.

Now as to why it gets stuck there, will need to investigate further.
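
The mechanism described above can be sketched with a simplified model of in-flight deduplication. This is not Apollo Client's actual code: DedupRegistry, keyFor, subscribe and settle are hypothetical names, and the model assumes requests are keyed by operation plus variables and that the entry is only removed when the request settles.

```typescript
// Simplified model of in-flight deduplication; not Apollo Client's actual code.
type Listener = (result: string) => void;

class DedupRegistry {
  private readonly inFlight = new Map<string, Listener[]>();
  networkCalls = 0;

  // Assumption: identical operation + variables map to the same key.
  static keyFor(operation: string, variables: object): string {
    return operation + ":" + JSON.stringify(variables);
  }

  subscribe(key: string, listener: Listener): void {
    const existing = this.inFlight.get(key);
    if (existing) {
      existing.push(listener); // deduplicated: no new network call
      return;
    }
    this.inFlight.set(key, [listener]);
    this.networkCalls += 1; // only the first subscriber triggers a request
  }

  // Deliver the result and clean up the in-flight entry.
  settle(key: string, result: string): void {
    const listeners = this.inFlight.get(key) || [];
    this.inFlight.delete(key);
    listeners.forEach((listener) => listener(result));
  }

  observerCount(key: string): number {
    const listeners = this.inFlight.get(key);
    return listeners ? listeners.length : 0;
  }
}

const registry = new DedupRegistry();
const received: string[] = [];
const key = DedupRegistry.keyFor("someQuery", { myVariable: 1 });

// Healthy case: the entry is removed on settle, so the next call hits the network.
registry.subscribe(key, (r) => received.push(r));
registry.settle(key, "first");

// Stuck case: this entry never settles and is never cleaned up, so every
// later identical query piles onto it and nothing new is sent.
registry.subscribe(key, (r) => received.push(r));
registry.subscribe(key, (r) => received.push(r));
registry.subscribe(key, (r) => received.push(r));

console.log(registry.networkCalls, registry.observerCount(key), received);
```

In this model only two requests ever reach the network, while three observers wait on an entry that will never deliver, which matches the symptom of identical queries hanging once one of them gets stuck.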

andreialecu on May 7, 2022

Actually the plot thickens.

I’ve changed some things so that those observables don’t resubscribe. This prevented the “Observable cancelled prematurely” from happening, but the queries still hang. 🤔

andreialecu on May 7, 2022

I have an update.

It appears this also happens on 3.6.2 (deployed it in production earlier), so it is not a recent regression in 3.6.3.

It seems to be related to peak times somehow (a lot of subscriptions chatter at least).

I have reverted to 3.4.17 for now where everything seems fine. I haven’t yet checked 3.5.x.

Since this happens in apollo angular I assume it’s an issue in the core and not the react part.

andreialecu on May 7, 2022

@benjamn somehow I’m not able to reproduce this earlier today, so I’m very confused.

Yesterday it was reproducible every single time in production, staging and development - but it was occurring during peak times.

We use graphql subscriptions and there’s a lot of chatter over them, that’s the only thing that could be relevant. One of the subscriptions then triggers a query on a certain condition.

I’ll try to reproduce it during the next peak.

The problems cleared up for all of our customers after deploying a downgrade to 3.4.17 initially as per my comment in https://github.com/apollographql/apollo-client/issues/9456#issuecomment-1119942293

I’ve then tried upgrading up until 3.6.2 and I couldn’t reproduce it. On 3.6.3 it was reproducible again.

Peak times were fading by that point, so it might be possible the problem isn’t exactly between 3.6.2 and 3.6.3 but could be earlier. I’ll continue monitoring this over the next few days.

andreialecu on May 7, 2022