Skip to content

Config reload [WIP] #499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jun 6, 2017
Merged

Config reload [WIP] #499

merged 13 commits into from
Jun 6, 2017

Conversation

tylertreat
Copy link
Contributor

@tylertreat tylertreat commented May 31, 2017

This is a rough first pass at config reload (issue #338). This is just a tracer that implements reload support for the trace config option to prove out a working end-to-end example. If we like the overall approach, we can start implementing support for additional config options in a piecemeal fashion. This PR is mostly just to get early feedback.

This PR does a few things:

  • Adds a Reload function to Server which can be called to reload configuration from a file.
  • Adds support for reloading trace option. Attempting to change any other options will result in a reload failure.
  • If reload fails, the server continues using the previous config.
  • Adds a SIGHUP handler which invokes config reload.
  • Some small refactors to reuse code and minimize changes.

Open questions:

  • If the new config file is invalid, should we dump the last good config to a .bak file (or do something else)?
  • How should we signal config reload on Windows?

@nats-io/core

@tylertreat
Copy link
Contributor Author

tylertreat commented May 31, 2017

Anyone know what's up with the unit tests? go test is returning with exit status 1, but it looks like all the tests passed.

@kozlovic
Copy link
Member

All builds fail at "TestRouteTLSHandshakeError", you may want to look if your changes have introduced a call to os.Exit() or something that this test may trigger.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 91.003% when pulling 43ee0ce on config_reload into d606944 on master.

@tylertreat tylertreat mentioned this pull request May 31, 2017
2 tasks
server/log.go Outdated
@@ -15,9 +15,9 @@ var trace int32
var debug int32

var log = struct {
sync.Mutex
*sync.Mutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test is storing the value of log before the test and replacing it after to prevent side effects, so govet complains because it's copying the mutex value (motivation for #501).

server/server.go Outdated
@@ -148,6 +147,18 @@ func New(opts *Options) *Server {
return s
}

func (s *Server) getOpts() *Options {
s.optsMu.RLock()
defer s.optsMu.RUnlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be expensive for when getOpts is in a fast path. Defer is better, but still slow. Just explicitly unlock like setOpts does.

server/signal.go Outdated
return
}
c := make(chan os.Signal, 1)

signal.Notify(c, syscall.SIGINT, syscall.SIGUSR1)
signal.Notify(c, syscall.SIGINT, syscall.SIGUSR1, syscall.SIGHUP)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not add this in until we have programmatic tests via API for all options we care about etc. and good test coverage. We should not expose to end users too soon IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @derekcollison. let's build up the features for supporting it and then add in the reload once we have the critical items (Certs and Auth).

Derek, you agree we can merge all the bits to support in prior to having the signal and making the feature available in the server, correct?

Defer adds a bit of overhead which can affect fast-paths.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 91.023% when pulling c8280aa on config_reload into d606944 on master.

Copy link
Contributor

@petemiron petemiron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll approve once we have agreement on merge/release strategy and confirmation of symlink working xplatform.

server/reload.go Outdated
}
switch strings.ToLower(field.Name) {
case "trace":
opts = append(opts, &traceOption{newValue.(bool)})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be clearer to name opts something like diff_opts. took me a second to get why you were appending.

}
processOptions(golden)

if err := os.Symlink("./configs/reload/test.conf", config); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work as expected on windows? cc @ColinSullivan1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should (https://golang.org/src/os/file_windows.go#L522), though I don't have a Windows box to test on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works on windows so long as you have correct privileges - you'll want to add a note someplace, probably the README.md.

Without the right privileges:

reload_test.go:175: Error creating symlink: symlink .\configs\reload\test.conf g:\src\github.com\nats-io\gnatsd\server\tmp.conf: A required privilege is not held by the client.
FAIL

On a network drive (generally not allowed by default):

--- FAIL: TestConfigReload (0.00s)
        reload_test.go:175: Error creating symlink: symlink .\configs\reload\test.conf g:\src\github.com\nats-io\gnatsd\server\tmp.conf: Incorrect function.
FAIL

Locally, with privileges (simple case running on C: as admin), you'll succeed.

=== RUN   TestConfigReload
[3792] [INF] Reloaded server configuration
--- PASS: TestConfigReload (0.00s)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ColinSullivan1 Thanks for checking that. I can add a note to the README if we think this is acceptable. Since this is just for development purposes, this seems OK to me. What do you guys think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is OK; nothing unusual (at least for Windows).

server/signal.go Outdated
return
}
c := make(chan os.Signal, 1)

signal.Notify(c, syscall.SIGINT, syscall.SIGUSR1)
signal.Notify(c, syscall.SIGINT, syscall.SIGUSR1, syscall.SIGHUP)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @derekcollison. let's build up the features for supporting it and then add in the reload once we have the critical items (Certs and Auth).

Derek, you agree we can merge all the bits to support in prior to having the signal and making the feature available in the server, correct?

Tyler Treat added 2 commits June 1, 2017 14:11
Do not expose config reloading yet until we are ready.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 90.997% when pulling 677ff0a on config_reload into d606944 on master.

@ColinSullivan1
Copy link
Member

Regarding a config reload on windows. some other mechanism will be needed (that signal is not really supported). Either a REST API, or something else. We have the same issue with log rotation.

@tylertreat
Copy link
Contributor Author

@ColinSullivan1 Yep, since we're not exposing the signals yet, I think we can defer the decision for what we want to do on Windows for now. That can come at a later time when we're ready to expose the feature.

Add a note clarifying symlink behavior on Windows for developers.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 90.906% when pulling cef5a70 on config_reload into d606944 on master.

README.md Outdated
FAIL
```

Similarly, this can fail when creating a symlink on a network drive, which is typically now allowed by default:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo, "not"

@@ -0,0 +1,99 @@
package server
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a copyright?

// Copyright 2017 Apcera Inc. All rights reserved.

@@ -0,0 +1,210 @@
package server
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copyright.

processOptions(golden)

if err := os.Symlink("./configs/reload/test.conf", config); err != nil {
t.Fatalf("Error creating symlink: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn on this, but maybe skip the test here (and in others below), with some output? This makes assumptions about the privileges of the user the test is running under, where the test is running, etc. It's convenience vs ensuring a test runs locally. wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's better to fail loudly, otherwise the test might get skipped and the user wouldn't notice the warning message. We could add additional information to the Fatal message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can go with that... IMO an "ensure you have privileges" would go a long way.

server/server.go Outdated
@@ -584,7 +596,7 @@ func (s *Server) createClient(conn net.Conn) *client {
}
// If there is a max connections specified, check that adding
// this new client would not push us over the max
if s.opts.MaxConn > 0 && len(s.clients) >= s.opts.MaxConn {
if s.getOpts().MaxConn > 0 && len(s.clients) >= s.getOpts().MaxConn {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is worth optimizing in this function to snapshot MaxConn, TLS options, and AuthTimeout? I realize establishing the connection and TLS are already expensive, but am thinking of a thundering herd of clients reconnecting.

@derekcollison
Copy link
Member

@tylertreat Let's rebase to master. Keep signalling removed and make sure we cover Windows builds, etc. Then we are good to go.

Copy link
Member

@kozlovic kozlovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concerns are about making a deep copy of Options, or at least be sure of the way options are accessed, and the use of multiple getOpts() within the same function. Regardless of the safety of accessing a copy or the "live" options, it may be logically incorrect to get the user name and password from two calls to getOpts() when the options are changed in between.

main.go Outdated
@@ -149,13 +148,16 @@ func main() {
usage()
}

// Snapshot flag options.
*server.FlagSnapshot = *opts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to copy the values since they may change after? If so, then we need a clone function for options. This assignment simply copy some of the fields, but for instance, the array of users will still point to the original, same for the tls.Config, or if we later add maps, etc...

server/auth.go Outdated
@@ -30,13 +30,13 @@ func (s *Server) configureAuthorization() {
}
// Check for multiple users first
// This just checks and sets up the user map if we have multiple users.
if s.opts.Users != nil {
if s.getOpts().Users != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we have replaced all references of s.opts with s.getOpts(), is it because s.opts pointer can change during the reload? Again, think of some of the types in the options that are currently not copied. You are protecting against the value of the pointer not being accessed concurrently, but fields pointed by s.opts or s.getOpts would still point to objects that are shared and may incorrectly be concurrently accessed. Of course, if we clone and make sure that values set/reset during reload are on a copy and then swap the pointer, we are ok.

I also have an issue with the same function invoking getOpts(), not only for performance - which still may make a difference - but because if conceptually we assume that the options may change between 2 calls to getOpts(), then we should snapshot once per function. Again, this assumes that the getOpts() would return an immutable structure, otherwise, there is no reason to do that.

}
s.setOpts(newOpts)
s.applyOptions(changed)
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be cases that applying an option may fail - in the long run - so I would make applyOptions return an error, and therefore move the s.setOpts() only after we know that we have a value new options set to swap with the old one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with failing an option application is we could end up in an inconsistent state—some new options are applied, others are not—unless we also introduce a "revert" operation to revert any applied options.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because we apply the options to a temporary Options structure, and do not swap the server's options with the new one until after success.

Copy link
Contributor Author

@tylertreat tylertreat Jun 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two parts to swapping Options: 1) swapping the Options pointer itself and 2) mutating the server as necessary (e.g. for trace we need to update the server logger. 2) could result in an inconsistent state of the server if something fails.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, but if we apply the new options to the server's options that a caller has a pointer to (getOpts()), there is a risk that the caller accesses values that are being changed on the fly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, there is, but I'm not sure how we can apply a new config atomically.

server/reload.go Outdated
// error.
func (s *Server) diffOptions(newOpts *Options) ([]option, error) {
var (
oldConfig = reflect.ValueOf(s.opts).Elem()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either we make the entry API threadsafe (with special mutex) or we don't access s.opts directly here. I think it would be safer to make the whole Reload() API protected by its own mutex?

return diffOpts, nil
}

func (s *Server) applyOptions(opts []option) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As indicated above, I believe that there will be cases where applying an option may produce an error. We could add later of course, since those are internal APIs.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.5%) to 91.3% when pulling 1e79911 on config_reload into cba3170 on master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.6%) to 91.407% when pulling 1e79911 on config_reload into cba3170 on master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 90.925% when pulling 1e79911 on config_reload into cba3170 on master.

@tylertreat
Copy link
Contributor Author

I believe I've addressed everyone's feedback. @kozlovic can you take another look?

Copy link
Member

@kozlovic kozlovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerns addressed except for the question about when we switch the server's options when applying the changes, which I understand is a moot point since there is always the risk that a part of the code is accessing options while options are being updated.

server/log.go Outdated
s.SetLogger(fileLog, s.opts.Debug, s.opts.Trace)
fileLog := logger.NewFileLogger(opts.LogFile,
opts.Logtime, s.getOpts().Debug, opts.Trace, true)
s.SetLogger(fileLog, s.getOpts().Debug, opts.Trace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still using s.getOpts() here...

@coveralls
Copy link

Coverage Status

Coverage increased (+0.4%) to 91.193% when pulling 4f28bf7 on config_reload into cba3170 on master.

@tylertreat tylertreat dismissed ColinSullivan1’s stale review June 6, 2017 14:02

Feedback addressed

@tylertreat
Copy link
Contributor Author

@derekcollison I think this is good to go.

@derekcollison derekcollison merged commit 2b7ab35 into master Jun 6, 2017
@derekcollison derekcollison deleted the config_reload branch June 6, 2017 15:12
@tylertreat tylertreat mentioned this pull request Jun 27, 2017
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants