fix(gateway): Prevent inoperable state on initial failure to load configuration #4277

trevor-scheer · 2020-06-16T22:33:19Z

Currently, if the gateway fails to compose successfully in managed mode, it stays in a state where polling never started and it can't serve requests.

After these changes, the gateway will:

- In managed mode, still kick off polling and keep trying to load config on the poll interval if the first attempt fails.
- In unmanaged mode with polling disabled, call `process.exit()` if it fails to load.

I'm not sure why I had marked this as a TODO and future breaking change. This change simplifies the `updateComposition` API and makes `load` (called once ever) responsible for setting config.

This commit introduces behavior such that the gateway will `process.exit(1)` in the event that it isn't polling and fails to compose a valid schema on startup.

glasser · 2020-06-16T22:52:35Z

(not doing a full review unless requested, just curious)

Does this state integrate with the Apollo Server built in health check at all (by default or opt in)? It really seems like it would be nice to be able to not consider a gateway server healthy until it has loaded its schema.

abernix · 2020-06-17T17:35:45Z

@glasser We aren't changing the existing default behavior of the Apollo Server health check, but we can provide a pre-written function that implements it. That would be opt-in via a documentation suggestion for now, and then the default mode of operation in AS3.

glasser · 2020-06-17T17:46:16Z

Great, I think we definitely want that for our own use. Let me know if that's something your team is working on or if we should write it.

abernix · 2020-06-17T17:51:30Z

@glasser Thoughts on this?
I wrote this some months back, but I think it works well, still? https://gist.github.com/abernix/50de4ce87278a3d8c23bd3d174ae0b6c

glasser · 2020-06-17T19:50:05Z

That sounds good, though I was actually thinking of just starting with the much simpler "if we have never successfully loaded the schema, unhealthy; otherwise healthy" (which honestly seems reasonable to me to be on by default, but I trust that your generally more conservative nature around these sorts of changes is based on far more experience with this project than mine).

I.e., base it on "can we even conceivably run any GraphQL query", not "at the moment, are backends happy".
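A minimal sketch of the simpler check described above, for illustration only. `SchemaReadiness`, `markLoaded`, and `isHealthy` are hypothetical names, not part of Apollo Server or the gateway. It assumes the server's health-check hook accepts a function that returns a promise and treats a rejection as unhealthy, which is how I understand Apollo Server 2's `onHealthCheck` option to behave.

```typescript
// Hypothetical helper: remembers whether a schema has EVER loaded successfully.
// It deliberately ignores the current state of downstream services.
class SchemaReadiness {
  private loadedOnce = false;

  // Call this from wherever the gateway reports a successful schema load.
  markLoaded(): void {
    this.loadedOnce = true;
  }

  // "Can we even conceivably run any GraphQL query?"
  isHealthy(): boolean {
    return this.loadedOnce;
  }

  // Shaped like a health-check callback: resolves when healthy, rejects otherwise.
  onHealthCheck = async (): Promise<void> => {
    if (!this.loadedOnce) {
      throw new Error("Gateway has not loaded a schema yet");
    }
  };
}
```

The idea would be to pass `readiness.onHealthCheck` as the server's health-check callback and call `readiness.markLoaded()` after the first successful load. Once healthy, this check never flips back, which matches the "never loaded vs. loaded at least once" framing.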

abernix

I realize this is WIP, but I thought I'd leave some thoughts about it since from what I understand, this has fixed the Gateway startup problem we've seen internally when there is downstream service unavailability! Put another way, I'm excited to ship it!


abernix · 2020-07-01T13:10:52Z

packages/apollo-gateway/src/__tests__/gateway/lifecycle-hooks.test.ts

+let logger: Logger;
+
+beforeEach(() => {
+  const warn = jest.fn();
+  const debug = jest.fn();
+  const error = jest.fn();
+  const info = jest.fn();
+
+  logger = {
+    warn,
+    debug,
+    error,
+    info,
+  };
+});
+


I have also been repeating this pattern in many places. Perhaps at some point soon (not now), we should just make a spyableLogger?
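One possible shape for such a helper, as a rough sketch. `makeSpyableLogger` and the `calls` array are hypothetical. To keep the example runnable outside Jest, it uses a hand-rolled recorder instead of `jest.fn()`; inside a Jest suite, each method could simply be a `jest.fn()` as in the diff above.

```typescript
// Minimal logger shape, mirroring the four methods stubbed in the test setup.
interface Logger {
  debug(message?: unknown): void;
  info(message?: unknown): void;
  warn(message?: unknown): void;
  error(message?: unknown): void;
}

type Level = keyof Logger;

interface SpyableLogger extends Logger {
  // Every call, in order, so tests can assert on what was logged.
  calls: Array<{ level: Level; message: unknown }>;
}

// Hypothetical factory: returns a fresh logger whose calls are recorded.
function makeSpyableLogger(): SpyableLogger {
  const calls: SpyableLogger["calls"] = [];
  const record = (level: Level) => (message?: unknown): void => {
    calls.push({ level, message });
  };
  return {
    calls,
    debug: record("debug"),
    info: record("info"),
    warn: record("warn"),
    error: record("error"),
  };
}
```

Calling `makeSpyableLogger()` in a `beforeEach` would replace the repeated four-mock setup with a single line per test file.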


glasser · 2020-07-27T21:33:43Z

packages/apollo-gateway/src/index.ts


-    if (isManagedConfig(this.config) || this.experimental_pollInterval) {
-      if (!this.pollingTimer) this.pollServices();
+    await this.updateComposition();


I'm not sure that this PR accomplishes its goal. If this call throws, no polling will happen.

Also note that even once this is fixed, there are a couple of issues to resolve:

- We need special logic to ensure that the `serverWillStart` plugins get called if they were skipped because the original `schemaDerivedData` threw.
- The `Gateway successfully loaded schema` message in `load` won't ever show up if the first update failed. It might make sense to move that log line from `load` into `updateComposition`, e.g. putting it between assigning to `this.schema` and notifying the listeners, if `!previousSchema`.
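To illustrate the first concern and the log-line suggestion, here is a simplified stand-in, not the gateway's actual implementation. `SketchGateway`, `fetchSchema`, `pollingStarted`, and `logs` are hypothetical; the sketch only shows catching a failed first update so polling still begins, and logging the first success from inside `updateComposition`. It does not address the `serverWillStart` issue.

```typescript
type Schema = string; // stand-in for a composed GraphQL schema

class SketchGateway {
  schema?: Schema;
  pollingStarted = false;
  logs: string[] = [];

  // fetchSchema is a hypothetical injection point for "compose the config".
  constructor(private fetchSchema: () => Promise<Schema>) {}

  // Called once. A failed first attempt must not prevent polling from starting.
  async load(): Promise<void> {
    try {
      await this.updateComposition();
    } catch (e) {
      this.logs.push(`Initial load failed: ${(e as Error).message}`);
    }
    // Real code would schedule a timer here; the flag keeps the sketch testable.
    this.pollingStarted = true;
  }

  // One update attempt, used both by load() and by each later poll.
  async updateComposition(): Promise<void> {
    const previousSchema = this.schema;
    this.schema = await this.fetchSchema();
    if (!previousSchema) {
      // Logged on the first successful update, even if earlier attempts failed.
      this.logs.push("Gateway successfully loaded schema");
    }
    // Schema-change listeners would be notified here.
  }
}
```

With this shape, a gateway whose first fetch throws still ends `load()` with polling enabled, and the success message appears on whichever later poll first produces a schema.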

…figuration (apollographql/apollo-server#4277) In managed mode, kick off polling and continue to try to load config on the poll interval even if it fails the first time. Apollo-Orig-Commit-AS: apollographql/apollo-server@16b7884

trevor-scheer added 4 commits June 16, 2020 14:52

Relocate engine config setting

f55688c

I'm not sure why I had marked this as a TODO and future breaking change. This change simplifies the `updateComposition` API and makes `load` (called once ever) responsible for setting config.

Gateway should crash in unmanaged mode

ca9911b

This commit introduces behavior such that the gateway will `process.exit(1)` in the event that it isn't polling and fails to compose a valid schema on startup.

await gateway load

add3cc7

REVERT ME: Skip flaky tests for now

34115ee

trevor-scheer changed the title ~~fix(gateway): Prevent inoperable state on initial failure to load configuration~~ WIP - fix(gateway): Prevent inoperable state on initial failure to load configuration Jun 16, 2020

Merge branch 'master' into trevor/gateway-exit-correctly

918c041

Base automatically changed from master to main June 24, 2020 18:18

abernix reviewed Jul 1, 2020


trevor-scheer and others added 5 commits July 1, 2020 09:54

Merge branch 'main' into trevor/gateway-exit-correctly

c863389

Merge branch 'main' into trevor/gateway-exit-correctly

1b73b80

Don't process.exit() on failure to load

afa6746

Revert "REVERT ME: Skip flaky tests for now"

992ba62

This reverts commit 34115ee.

Don't pass functions directly into setTimeout

0b8b46c

trevor-scheer added the 👩‍🚀 federation label Jul 27, 2020

Merge branch 'main' into trevor/gateway-exit-correctly

932aaa3

trevor-scheer changed the title ~~WIP - fix(gateway): Prevent inoperable state on initial failure to load configuration~~ fix(gateway): Prevent inoperable state on initial failure to load configuration Jul 27, 2020

trevor-scheer and others added 3 commits July 27, 2020 13:15

Skip flaky tests

bb83f17

Update changelog

dae1db8

Merge branch 'main' into trevor/gateway-exit-correctly

81ae428

trevor-scheer merged commit 16b7884 into main Jul 27, 2020

trevor-scheer deleted the trevor/gateway-exit-correctly branch July 27, 2020 20:38

glasser reviewed Jul 27, 2020


glasser mentioned this pull request Jan 15, 2021

Gateway behaves poorly when the first updateCompositionConfig call throws apollographql/federation#335

Closed

github-actions bot locked as resolved and limited conversation to collaborators Mar 16, 2023
