Skip to content

Add support and unit test for PyOD models #34709

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 23, 2025

Conversation

shunping
Copy link
Collaborator

@shunping shunping commented Apr 22, 2025

Additionally, it includes:

  • A minor fix for error messages in the specifiable module.
  • Support for scoring offline detectors on a subset of input features.

Additionally, it includes:
- A minor fix for error messages in the `specifiable` module.
- Support for scoring offline detectors on a subset of features.
@shunping shunping force-pushed the anomaly-pyod-models branch from 0c1f2fc to b838cee Compare April 22, 2025 17:57
@shunping shunping self-assigned this Apr 22, 2025
@shunping shunping requested a review from claudevdm April 22, 2025 17:59
class PyODModelHandler(ModelHandler[beam.Row,
PredictionResult,
PyODBaseDetector]):
"""Implementation of the ModelHandler interface for PyOD [#]_ Models.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does [#]_ mean?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh does it create a footnote?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise unittest.SkipTest('PyOD dependencies are not installed')


class PyODIForestTest(unittest.TestCase):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these tests are skipped because you need to add the dependencies to

'ml_test': [

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Let me add pyod there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

from apache_beam.ml.inference.base import KeyedModelHandler
from apache_beam.ml.inference.base import ModelHandler
from apache_beam.ml.inference.base import PredictionResult
from apache_beam.ml.inference.base import _PostProcessingModelHandler
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove underscores from these (e.g. _PostProcessingModelHandler -> PostProcessingModelHandler) since they are being imported?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That one cannot be changed, because in the following code the class will have to be turned into specifiable so that we can create a spec (see https://github.com/apache/beam/blob/release-2.64/sdks/python/apache_beam/ml/anomaly/transforms_test.py#L336) for the whole model handler object without pickling it.

seed = 1234
model = IForest(random_state=seed)
model.fit(self.get_train_data())
self.tmp_fn = os.path.join(self.tmp_dir, 'iforest_pickled')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename tmp_fn pickled_model_uri?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

assert self._offline_detector._features is not None
k, v = elem
row_dict = v._asdict()
return k, beam.Row(**{k: row_dict[k] for k in self._offline_detector._features}) # pylint: disable=line-too-long
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we cant split this to two lines? Also here it looks a bit strange to access _features (with underscore) on _offline_detector.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow the formatter does not split the original line. Let me see if this can be fixed.

Copy link
Collaborator Author

@shunping shunping Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a pair of brackets seems to make the formatter render it differently.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here it looks a bit strange to access _features (with underscore) on _offline_detector.

Ack. I think it is fine as it is referred in the same package.

Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@shunping shunping force-pushed the anomaly-pyod-models branch from 047dac4 to 8f1f965 Compare April 23, 2025 00:26
@shunping
Copy link
Collaborator Author

@claudevdm: Tests are all good now. Could you please take another look? Thanks!

Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @tvalentyn for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@tvalentyn tvalentyn merged commit edd00b2 into apache:master Apr 23, 2025
96 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants