Skip to content

[ENH] Implementation of Extended Isolation Forest (EIF) anomaly detector #2679

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

Akhil-Jasson
Copy link
Contributor

@Akhil-Jasson Akhil-Jasson commented Mar 23, 2025

This PR implements the Extended Isolation Forest (EIF) algorithm.

Reference Issues/PRs

Fixes #2113

What does this implement/fix? Explain your changes.

Does your contribution introduce a new dependency? If yes, which one?

Yes, it introduces H20.ai as a new dependency

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors. Alternatively, you can use the @all-contributors bot to do this for you after the PR has been merged.
  • The PR title starts with either [ENH], [MNT], [DOC], [BUG], [REF], [DEP] or [GOV] indicating whether the PR topic is related to enhancement, maintenance, documentation, bugs, refactoring, deprecation or governance.
For new estimators and functions
  • I've added the estimator/function to the online API documentation.
  • (OPTIONAL) I've added myself as a __maintainer__ at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
For developers with write access
  • (OPTIONAL) I've updated aeon's CODEOWNERS to receive notifications about future changes to these files.

@aeon-actions-bot
Copy link
Contributor

Thank you for contributing to aeon

I did not find any labels to add based on the title. Please add the [ENH], [MNT], [BUG], [DOC], [REF], [DEP] and/or [GOV] tags to your pull requests titles. For now you can add the labels manually.
I have added the following labels to this PR based on the changes made: [ $\color{#6F6E8D}{\textsf{anomaly detection}}$ ]. Feel free to change these if they do not properly represent the PR.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run mypy typecheck tests
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Disable numba cache loading
  • Push an empty commit to re-run CI checks

@MatthewMiddlehurst
Copy link
Member

Please fill out the template and use the correct title format.

@MatthewMiddlehurst MatthewMiddlehurst added the enhancement New feature, improvement request or other non-bug code enhancement label Mar 23, 2025
@Akhil-Jasson Akhil-Jasson changed the title 2113 Implementation of Extended Isolation Forest (EIF) anomaly detector [ENH] Implementation of Extended Isolation Forest (EIF) anomaly detector Mar 23, 2025
@Akhil-Jasson
Copy link
Contributor Author

I've fixed the title format. Shall I proceed with adding EIF to the online API documentation?

@Ramana-Raja
Copy link
Contributor

Ramana-Raja commented Mar 24, 2025

Hi @Akhil-Jasson,

I just saw your code, and it looks great! That said, I don’t think using H2O is the best approach since Aeon doesn’t rely on it. It might be better to have our own implementation instead.

A few updates to consider:

1.Could you update the section "Does your contribution introduce a new dependency?" and mention H2O there?

2.The test cases seem to be missing—could you add them?

3.Instead of importing the entire H2O module, it’s better to import only what’s needed to keep things lightweight.

@MatthewMiddlehurst
Copy link
Member

New dependencies should be put in pyproject.toml otherwise this won't be tested, Still bits missing from the template

@Akhil-Jasson
Copy link
Contributor Author

I've added the h2o dependency to pyproject.toml, but I'm encountering errors when running the test files. The test attempts to import aeon.anomaly_detection._eif but fails, indicating that the module doesn’t exist yet.

Is there a step I'm missing for adding new modules to aeon? What could be the possible issue?

@MatthewMiddlehurst
Copy link
Member

your import is incorrect.

Copy link
Member

@SebastianSchmidl SebastianSchmidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI2O looks like a massive package. Do we actually want to include it as a dependency? The issue clearly states that we are looking for an implementation in aeon directly.

I know this is not mentioned in the corresponding issue, but I think it makes sense to work with sliding windows in EIF as well. We can always get the original behavior back by setting the window-size to 1.


return self

def _predict(self, X) -> np.ndarray:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the usage of EIF similar to our other models, we want it to be usable as a semi-supervised (as implemented already) and an unsupervised algorithm. The current implementation of _predict does not allow that.

Copy link
Member

@SebastianSchmidl SebastianSchmidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code does not look too complicated, so I think, it was a good decision to not include H2O.ai as a dependency!

Thanks for the implementation effort. Please address the issues below and then compare your implementation to the original implementation by Hariri et al. to demonstrate that your/our implementation produces the same results as theirs.

I think this estimator is a suitable candidate for improving the performance with Numba / JITs. But let us first focus on making the implementation correct, and then tackle the performance optimization in another PR.

Comment on lines 46 to 51
axis : int, default=1
The time point axis of the input series if it is 2D. If ``axis==0``, it is
assumed each column is a time series and each row is a time point. i.e. the
shape of the data is ``(n_timepoints, n_channels)``. ``axis==1`` indicates
the time series are in rows, i.e. the shape of the data is
``(n_channels, n_timepoints)``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different data layouts are handled in the base class. Please do not expose an additional axis-parameter here.

Comment on lines 91 to 93
y : np.ndarray, optional
Labels for semi-supervised learning. 0 for normal, 1 for anomalous.
If None, unsupervised learning is used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semi-supervised means training on normal data only, so just 0 is allowed here. Otherwise, this would be supervised training with normal data and known anomalies. Details: https://www.aeon-toolkit.org/en/stable/examples/anomaly_detection/anomaly_detection.html#Anomaly-Detection-in-aeon

Comment on lines 100 to 102
# Ensure X is 2D
if X.ndim == 1:
X = X.reshape(-1, 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be necessary anymore if you remove the axis-parameter from the class and use a single data layout within the estimator.

Comment on lines 152 to 156
# Calculate anomaly scores
scores = self._predict(X)

# Set threshold
self.threshold_ = float(np.percentile(scores, 100 * (1 - self.contamination)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks suspicious. Why do you predict in fit?

scores,
window_size=self.window_size,
stride=self.stride,
n_timepoints=self._n_samples_orig,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just works if either EIF is used in an unsupervised way or both training and prediction time series have the same length!

return 0
elif n == 2:
return 1
return 2 * (np.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the magic number from? Please at least add a comment.

@SebastianSchmidl
Copy link
Member

Tests are also still missing.

@Akhil-Jasson
Copy link
Contributor Author

Previously, the tests were triggered automatically on commit based on the old repo structure. Since then, the structure has changed significantly. I ran the tests locally using the earlier setup, and they passed.

To accommodate the requirements of the test files—such as the need for y in the fit function—I’ve made it an optional parameter. I’ll take a look into the other comments as well.

Given the extent of changes in the repo structure, is it easier to just create a new branch from the updated main branch or is there an easier alternative? Please let me know!

@MatthewMiddlehurst
Copy link
Member

I would merge main into your branch, the bulk of your code should not conflict. It is mainly that we have refactored the module so it has all moved around a bit. If you do not know which module to put your implementation in just put it anywhere for now and we can discuss after where is best.

@Akhil-Jasson
Copy link
Contributor Author

Akhil-Jasson commented May 1, 2025

Changes have been made to this branch to align with the newly refactored module structure. I have added the EIF model under outlier_detection.

I'll soon address the feedback to remove the axis hyperparameter and add the necessary comments as requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
anomaly detection Anomaly detection package enhancement New feature, improvement request or other non-bug code enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ENH] Implement Extended Isolation Forest
4 participants