-
Notifications
You must be signed in to change notification settings - Fork 209
[ENH] Implementation of Extended Isolation Forest (EIF) anomaly detector #2679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Thank you for contributing to
|
Please fill out the template and use the correct title format. |
I've fixed the title format. Shall I proceed with adding EIF to the online API documentation? |
Hi @Akhil-Jasson, I just saw your code, and it looks great! That said, I don’t think using H2O is the best approach since Aeon doesn’t rely on it. It might be better to have our own implementation instead. A few updates to consider: 1.Could you update the section "Does your contribution introduce a new dependency?" and mention H2O there? 2.The test cases seem to be missing—could you add them? 3.Instead of importing the entire H2O module, it’s better to import only what’s needed to keep things lightweight. |
New dependencies should be put in pyproject.toml otherwise this won't be tested, Still bits missing from the template |
…ile for the EIF implementation.
I've added the h2o dependency to pyproject.toml, but I'm encountering errors when running the test files. The test attempts to import aeon.anomaly_detection._eif but fails, indicating that the module doesn’t exist yet. Is there a step I'm missing for adding new modules to aeon? What could be the possible issue? |
your import is incorrect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AI2O looks like a massive package. Do we actually want to include it as a dependency? The issue clearly states that we are looking for an implementation in aeon directly.
I know this is not mentioned in the corresponding issue, but I think it makes sense to work with sliding windows in EIF as well. We can always get the original behavior back by setting the window-size to 1.
aeon/anomaly_detection/_eif.py
Outdated
|
||
return self | ||
|
||
def _predict(self, X) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the usage of EIF similar to our other models, we want it to be usable as a semi-supervised (as implemented already) and an unsupervised algorithm. The current implementation of _predict
does not allow that.
…d and unsupervised learning. Included Sliding Window Algorithm as requested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code does not look too complicated, so I think, it was a good decision to not include H2O.ai as a dependency!
Thanks for the implementation effort. Please address the issues below and then compare your implementation to the original implementation by Hariri et al. to demonstrate that your/our implementation produces the same results as theirs.
I think this estimator is a suitable candidate for improving the performance with Numba / JITs. But let us first focus on making the implementation correct, and then tackle the performance optimization in another PR.
aeon/anomaly_detection/_eif.py
Outdated
axis : int, default=1 | ||
The time point axis of the input series if it is 2D. If ``axis==0``, it is | ||
assumed each column is a time series and each row is a time point. i.e. the | ||
shape of the data is ``(n_timepoints, n_channels)``. ``axis==1`` indicates | ||
the time series are in rows, i.e. the shape of the data is | ||
``(n_channels, n_timepoints)``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different data layouts are handled in the base class. Please do not expose an additional axis
-parameter here.
aeon/anomaly_detection/_eif.py
Outdated
y : np.ndarray, optional | ||
Labels for semi-supervised learning. 0 for normal, 1 for anomalous. | ||
If None, unsupervised learning is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Semi-supervised means training on normal data only, so just 0
is allowed here. Otherwise, this would be supervised training with normal data and known anomalies. Details: https://www.aeon-toolkit.org/en/stable/examples/anomaly_detection/anomaly_detection.html#Anomaly-Detection-in-aeon
aeon/anomaly_detection/_eif.py
Outdated
# Ensure X is 2D | ||
if X.ndim == 1: | ||
X = X.reshape(-1, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be necessary anymore if you remove the axis-parameter from the class and use a single data layout within the estimator.
aeon/anomaly_detection/_eif.py
Outdated
# Calculate anomaly scores | ||
scores = self._predict(X) | ||
|
||
# Set threshold | ||
self.threshold_ = float(np.percentile(scores, 100 * (1 - self.contamination))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks suspicious. Why do you predict in fit
?
aeon/anomaly_detection/_eif.py
Outdated
scores, | ||
window_size=self.window_size, | ||
stride=self.stride, | ||
n_timepoints=self._n_samples_orig, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just works if either EIF is used in an unsupervised way or both training and prediction time series have the same length!
aeon/anomaly_detection/_eif.py
Outdated
return 0 | ||
elif n == 2: | ||
return 1 | ||
return 2 * (np.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the magic number from? Please at least add a comment.
Tests are also still missing. |
Previously, the tests were triggered automatically on commit based on the old repo structure. Since then, the structure has changed significantly. I ran the tests locally using the earlier setup, and they passed. To accommodate the requirements of the test files—such as the need for y in the fit function—I’ve made it an optional parameter. I’ll take a look into the other comments as well. Given the extent of changes in the repo structure, is it easier to just create a new branch from the updated main branch or is there an easier alternative? Please let me know! |
I would merge main into your branch, the bulk of your code should not conflict. It is mainly that we have refactored the module so it has all moved around a bit. If you do not know which module to put your implementation in just put it anywhere for now and we can discuss after where is best. |
Changes have been made to this branch to align with the newly refactored module structure. I have added the EIF model under outlier_detection. I'll soon address the feedback to remove the axis hyperparameter and add the necessary comments as requested. |
This PR implements the Extended Isolation Forest (EIF) algorithm.
Reference Issues/PRs
Fixes #2113
What does this implement/fix? Explain your changes.
Does your contribution introduce a new dependency? If yes, which one?
Yes, it introduces H20.ai as a new dependency
Any other comments?
PR checklist
For all contributions
For new estimators and functions
__maintainer__
at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.For developers with write access