Forecast exogenous vars bug fix #510
Conversation
Looks great! Left some ideas to simplify the code a bit.
```python
if scenario is not None:
    sub_dict = {
        forecast_model[data_name]: pt.as_tensor_variable(
            x=np.atleast_2d(self._exog_data_info[data_name]["value"].T).T,
            name=data_name,
        )
        for data_name in self.data_names
    }

    # Will this always be named "data"?
    sub_dict[forecast_model["data"]] = pt.as_tensor_variable(
        np.atleast_2d(self._fit_data.T).T, name="data"
    )
else:
    # Same here: will it always be named "data"?
    sub_dict = {
        forecast_model["data"]: pt.as_tensor_variable(
            np.atleast_2d(self._fit_data.T).T, name="data"
        )
    }
```
You can simplify all this by using the model's `data_vars` property. PyMC models store all the symbolic variables created with `pm.Data` there. They are pytensor `SharedVariable`s, so the data lives inside them. You can get a `SharedVariable`'s data with `.get_value()`. That will also ensure the shapes are correct.
So you can do something like:

```python
sub_dict = {
    data_var: pt.as_tensor_variable(data_var.get_value(), name="data")
    for data_var in forecast_model.data_vars
}
```
We want to freeze absolutely all data, so this should always do what we want. When there's no `self.data_names`, it should still find the main data (even if we change the name later). When there is, it will freeze that as well.

If you want, you can add a sanity check afterwards that makes sure the names of the variables in the keys of `sub_dict` are in `self.data_names + ['data']`.
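For instance, a minimal sketch of that sanity check (purely illustrative; `sub_dict` and `self.data_names` are as in the surrounding code, while the variable names and error message here are just placeholders):

```python
# Hypothetical sanity check: every variable being frozen should be either a
# registered exogenous data variable or the main observed "data" variable.
allowed_names = set(self.data_names) | {"data"}
unexpected = {var.name for var in sub_dict} - allowed_names
if unexpected:
    raise ValueError(f"Unexpected data variables were frozen: {unexpected}")
```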
```python
mu, cov = graph_replace([mu, cov], replace=sub_dict, strict=True)
```
Maybe give these new names, like `frozen_mu` or `mu_given_training_data`? Makes it clear what we're doing here.
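For example, a one-line sketch of the suggested rename, applied to the `graph_replace` call quoted above:

```python
frozen_mu, frozen_cov = graph_replace([mu, cov], replace=sub_dict, strict=True)
```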
```python
if scenario is not None:
    for name in self.data_names:
        if name in scenario.keys():
            pm.set_data({name: scenario[name]})
```
`pm.set_data` can take a dictionary of multiple things, so there's no need to loop here; just directly do `pm.set_data(scenario)`. I think the scenario has already been validated by this point (by `self._validate_scenario_data`), so there's no need to make sure the keys are in `self.data_names`.

I'm pretty sure you can do something like:

```python
if scenario is not None:
    pm.set_data(
        scenario | {"data": dummy_obs_data},
        coords={"data_time": np.arange(len(forecast_index))},
    )
```
Thank you, Jesse! Your suggestion cleaned the code up nicely. I added in a sanity check for the graph replacements; let me know if that looks alright to you.
This is really great!
I guess the last thing I want to ask for is a test that makes sure this works. Basically we want to make sure that if we:
- Make a really simple model (LevelTrend + Regression)
- Fit it
- Run a forecast with two scenarios (it doesn't matter what they are)
- Check that the mean of all the data before the forecast starts is all the same (you can seed the forecast rng to avoid the issue of random draws)
I think this will test the issue we've been having?
If you have a clearer idea, please go for it. I just want some kind of regression test to make sure this issue is dead once and for all.
Yes, I will update the test.

When I write the test, is it okay to use lower-level components? Otherwise, if I just use the …
If you think the refactor makes sense, we can actually just call the test …

Good point about not being able to just pull out what we need from the …

We could then just use …
I did the refactor; could you see if it's enough to get a test going? It's a bit sloppy, so feel free to adjust.
The test is a good start. I think you're right that we probably only need one scenario, so let's do that just to simplify matters. You're going to need to test a few extra things:
```python
import numpy as np
import pymc as pm
from pytensor.compile import SharedVariable
from pytensor.graph.basic import graph_inputs

# No data should remain as a shared input to the frozen x0/P0 graphs
# (the only shared variables left should be random number generators).
frozen_shared_inputs = [
    inpt
    for inpt in graph_inputs([test_forecast_model.x0_slice, test_forecast_model.P0_slice])
    if isinstance(inpt, SharedVariable)
    and not isinstance(inpt.get_value(), np.random.Generator)
]
assert len(frozen_shared_inputs) == 0

# The forecast graph itself should still have exactly one shared data input:
# the exogenous data, which gets swapped out per scenario.
unfrozen_shared_inputs = [
    inpt
    for inpt in graph_inputs([test_forecast_model.forecast_combined])
    if isinstance(inpt, SharedVariable)
    and not isinstance(inpt.get_value(), np.random.Generator)
]
assert len(unfrozen_shared_inputs) == 1
assert unfrozen_shared_inputs[0].name == "data_exog"

with test_forecast_model:
    dummy_obs_data = np.zeros((len(forecast_index), exog_ss_mod.k_endog))
    pm.set_data(
        {"exog_data": scenario} | {"data": dummy_obs_data},
        coords={"data_time": np.arange(len(forecast_index))},
    )

# TODO: Why is the reshape necessary?
np.testing.assert_allclose(
    unfrozen_shared_inputs[0].get_value(),
    scenario["x1"].values.reshape((-1, 1)),
)
```

Another thought I had with multiple scenarios was to vary the starting value of the scenario, and make sure that …

To be clear, I had the idea of several scenarios because I wanted to vary the starting time …

Some other notes:
```python
import pytest
from pymc.testing import mock_sample_setup_and_teardown

mock_pymc_sample = pytest.fixture(scope="session")(mock_sample_setup_and_teardown)
```

Then, to every fixture or test that calls `pm.sample`, add `mock_pymc_sample` as an argument, for example:

```python
@pytest.fixture(scope="session")
def idata(pymc_mod, rng, mock_pymc_sample):
    with pymc_mod:
        idata = pm.sample(draws=10, tune=0, chains=1, random_seed=rng)
        idata_prior = pm.sample_prior_predictive(draws=10, random_seed=rng)
    idata.extend(idata_prior)
    return idata
```

This will monkeypatch `pm.sample` with `pm.sample_prior_predictive`, and rejigger the outputs to look like real sample outputs. This should increase the speed of the tests by a factor of roughly infinity.
Hey Jesse, I tried adding the mock sampling, but ran into an import issue with `mock_sample_setup_and_teardown`.
Is your pymc up to date? It was released in 5.23.0 (check the new features).
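(An illustrative way to check the installed version, not part of the PR:)

```python
import pymc as pm

# mock_sample_setup_and_teardown requires PyMC >= 5.23.0
print(pm.__version__)
```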
oh, duh! I tried updating the wrong module. Okay, updated the correct module and added in the mock sampling. |
Looks great! I think we've finally squashed this bug. As one last test, can you confirm that your hurricane notebook works on this PR without the coord hack, and that the forecasts are correct?
Once you do, please feel free to squash and merge :)
This PR fixes a bug in the statespace module where erroneous forecasts were produced when exogenous variables were present in the model. It addresses issue #491.

The issue stemmed from updating the model's data before running the model forward up to the user-specified initial time index of the forecast. As a result, forecasts were produced using incorrect x0 and P0 initializations at the forecast's initial time index.

The fix ensures that the model is run forward up to the forecast time index and that the states (filtered, predicted, or smoothed) at that point are frozen using all of the observed data; only afterwards are new data replacements made to generate the forecasts.
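A condensed sketch of that two-step pattern, pieced together from the snippets discussed in this thread (the names `forecast_model`, `sub_dict`, `mu`, `cov`, `scenario`, `dummy_obs_data`, and `forecast_index` are taken from the diffs and comments above; this is an illustration of the idea, not the literal implementation):

```python
import numpy as np
import pymc as pm
from pytensor.graph.replace import graph_replace

# Step 1: freeze the training data into the graph so that the state
# distribution at the forecast start (mu, cov) is computed from the data the
# model was actually fit on.
frozen_mu, frozen_cov = graph_replace([mu, cov], replace=sub_dict, strict=True)

# Step 2: only after freezing, swap in the scenario data (plus dummy observed
# data for the forecast horizon) to generate the forecasts.
with forecast_model:
    pm.set_data(
        scenario | {"data": dummy_obs_data},
        coords={"data_time": np.arange(len(forecast_index))},
    )
```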