
Add proper categorization for client connection closing error #6844


Merged: 9 commits squash-merged into cadence-workflow:master on Apr 30, 2025

Conversation

timl3136 (Member)

What changed?

  • Added proper error categorization for gRPC client connection closing errors in the frontend metered wrapper
  • Changed the logging level from error to warn for these specific errors to reduce noise in logs

Why?
Previously, gRPC client connection closing errors were being treated as uncategorized errors, which:

  • Caused unnecessary noise in error logs, since these errors are expected during normal operation
  • Made it difficult to track actual uncategorized errors
  • Could lead to false alerts in monitoring systems

These errors are common during normal operation (e.g., when clients disconnect) and should be handled gracefully.

How did you test it?
Unit tests

Potential risks

Release notes

Documentation Changes


// Check for the gRPC connection-closing error and log it at warn level
if strings.Contains(err.Error(), constants.GRPCConnectionClosingError) {
	logger.Warn(constants.GRPCConnectionClosingError, tag.Error(err))
	return err
}
Member:

other failure types have a corresponding metric. let's define one for this to be able to keep track.
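
A minimal sketch of what that could look like; the counter name metrics.CadenceErrGRPCConnectionClosingCounter is assumed for illustration, not taken from the PR:

// Sketch only: emit a dedicated counter when categorizing this error,
// mirroring how other failure types are tracked in the metered wrapper.
// metrics.CadenceErrGRPCConnectionClosingCounter is a hypothetical name.
if strings.Contains(err.Error(), constants.GRPCConnectionClosingError) {
	scope.IncCounter(metrics.CadenceErrGRPCConnectionClosingCounter)
	logger.Warn(constants.GRPCConnectionClosingError, tag.Error(err))
	return err
}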

Member:

If this is normal, should it still be a warn and not info?

timl3136 (Member, Author):

Even though this error is not within our control, I think making it a warn level can help during events like a network outage.

Member:

Is there a reason we can't use errors.Is/As so we can handle wrapping?

timl3136 (Member, Author):

Thanks, I have changed the code to use errors.Is instead.

Member:

errors.Is expects the same error value. I'm not sure the actual error we receive will match your inline errors.New(constants.GRPCConnectionClosingError). Can you try to repro the issue locally to make sure this part works (stop matching while there's a long-poll request made by the frontend)?
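
To illustrate the concern with a self-contained example (not from the PR): errors.Is matches sentinel errors by identity, not by message, so a freshly constructed errors.New value never matches another error that merely carries the same text.

package main

import (
	"errors"
	"fmt"
)

func main() {
	received := errors.New("grpc: the client connection is closing") // stands in for the transport error
	sentinel := errors.New("grpc: the client connection is closing") // inline sentinel, as in the PR

	// false: errors.Is compares error values (or walks Unwrap/Is methods),
	// not message strings, so two distinct errors.New values never match.
	fmt.Println(errors.Is(received, sentinel))

	// true: the same value always matches itself.
	fmt.Println(errors.Is(received, received))
}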

Member:

After some discussion offline I agree it might be slightly fiddly to match it. I think it might be possible, but it is not worth blocking on. It's not 100% clear to me whether the error bubbling up is a YARPC error or a gRPC error, so any attempt to catch the wrapped variant would need to handle both.

shijiesheng (Member), Apr 28, 2025:

From the gRPC manual, it's a cancelled error. YARPC must have a mapping for the canceled error type, so it should be handled like:

if err.Code() == yarpcerrors.Canceled

https://grpc.io/docs/guides/error/#general-errors
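
A hedged sketch of that check, also guarding the raw gRPC status since the earlier comment notes it is unclear which variant bubbles up (assumed handling, not from the PR):

import (
	"go.uber.org/yarpc/yarpcerrors"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isConnectionClosing reports whether a non-nil err represents a canceled
// call (the code the gRPC docs associate with a closing client connection),
// whether it surfaced as a YARPC status or as a raw gRPC status.
func isConnectionClosing(err error) bool {
	if yarpcerrors.FromError(err).Code() == yarpcerrors.CodeCancelled {
		return true
	}
	if s, ok := status.FromError(err); ok && s.Code() == codes.Canceled {
		return true
	}
	return false
}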

// Check for gRPC connection closing error
if strings.Contains(err.Error(), constants.GRPCConnectionClosingError) {
	logger.Warn(constants.GRPCConnectionClosingError, tag.Error(err))
	return err
}
Member:

What is the implication of returning err as-is vs returning frontendInternalServiceError from this function? Does that change retry behavior on the caller side?

timl3136 (Member, Author):

frontendInternalServiceError is simply a wrapper around fmt.Errorf that makes sure the error is not a thrift error. It does not change any retry behavior on the caller side.
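
For context, a plausible shape of such a wrapper going by the description above (assumed, not copied from the repo):

// Assumed sketch: formats the message with fmt.Errorf so the caller sees a
// plain error value rather than a thrift type; carries no retry semantics.
func frontendInternalServiceError(format string, args ...interface{}) error {
	return fmt.Errorf("cadence internal error: "+format, args...)
}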

// Unit test driving handleErr with an error whose message matches the
// connection-closing constant.
metricsClient := metrics.NewClient(testScope, metrics.Frontend)
handler := &apiHandler{}

err := handler.handleErr(errors.New(constants.GRPCConnectionClosingError), metricsClient.Scope(0), logger)
Member:

This does not cover the yarpc canceled-error branch.
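
A sketch of a case that would cover it, assuming the handler inspects the YARPC code and reusing the names from the snippet above; yarpcerrors.CancelledErrorf constructs an error carrying the canceled status:

// Hypothetical additional case: drive handleErr with a YARPC canceled
// status so the yarpcerrors.CodeCancelled branch is exercised.
err = handler.handleErr(
	yarpcerrors.CancelledErrorf("grpc: the client connection is closing"),
	metricsClient.Scope(0),
	logger,
)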

@timl3136 timl3136 enabled auto-merge (squash) April 30, 2025 20:28
@timl3136 timl3136 merged commit bca03db into cadence-workflow:master Apr 30, 2025
23 checks passed