-
Notifications
You must be signed in to change notification settings - Fork 1.7k
feat(sources): Add new ifile source with improved implementation #22803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat(sources): Add new ifile source with improved implementation #22803
Conversation
This commit modifies the file-source library to use the cross-platform filesystem notification library for Rust (notify-rs/notify) instead of polling, and without holding an active file handle for idle files. Summary of changes: 1. Added notify-rs/notify as a dependency: - Added the notify crate to Cargo.toml 2. Created a new file watcher implementation using notify: - Created a new module file_watcher/notify_watcher.rs to implement a watcher using notify-rs - Implemented a state machine that can transition between active and passive watching states 3. Modified the FileWatcher struct: - Added a new enum for watcher state (Active, Passive) - Added fields to track the notify watcher when in passive mode - Implemented methods to transition between active and passive states 4. Updated the FileServer implementation: - Modified the file server to handle the new passive watching state - Implemented logic to transition files between active and passive states based on activity - Added a new configuration option idle_timeout to control when files switch to passive mode 5. Updated the internal events system: - Added events for when files switch between active and passive modes - Updated the event emitters to handle these new events Benefits: - Reduced resource usage by not holding open file handles for idle files - Better scalability to handle a larger number of files - Improved performance by using filesystem notifications instead of polling - Maintained reliability by keeping checkpoints for all files
Move the 'use crate::file_watcher::WatcherState' statement from the middle of the code to the top of the file with the other imports for better code organization.
This commit fixes a runtime issue where the notify library was detecting when Vector's own file watcher opens a file, creating a feedback loop. The changes include: 1. Filtering events in the notify_watcher to only react to actual file changes (writes, moves, or renames) 2. Adding more detailed logging for better debugging 3. Configuring the notify watcher with optimized settings for our use case These changes ensure that the file watcher only activates when there are actual changes to the file content, not when Vector itself accesses the file.
This commit adds a new NotifyPathsProvider that uses filesystem notifications instead of continuous globbing for file discovery. This provides several benefits: 1. Reduced resource usage by avoiding expensive glob operations 2. Improved responsiveness by detecting new files immediately 3. Better scalability with large numbers of files or complex directory structures The implementation: - Uses notify-rs/notify for filesystem notifications - Watches directories that match the include patterns - Filters events to only react to file creation, moves, and renames - Maintains a cache of discovered files - Falls back to traditional globbing for initial discovery A new configuration option 'use_notify_for_discovery' is added to enable this feature.
This commit improves the file_server to be more event-driven when using notify-based file discovery. The changes include: 1. Added a flag to FileServer to indicate whether we're using notify-based discovery 2. Modified the main loop to be more event-driven when using notify-based discovery: - Reduced frequency of globbing when using notify-based discovery - Added adaptive sleep duration based on file activity - Longer sleep times when all files are in passive mode 3. Added a configurable checkpoint interval for notify-based discovery 4. Updated the file and kubernetes_logs sources to set the new fields These changes significantly reduce CPU usage and improve efficiency when using notify-based file discovery, especially for systems with many idle files.
Thank you @tamer-hassan, this PR looks very promising. I will try to review sometime this week. Hopefully we can include it in the next release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick peek. I will need to spend more time on this next week.
…d code This commit makes several improvements to the file source implementation: 1. Makes notify-based file discovery the default behavior: - Removes the configuration option - Updates the file source to always use NotifyPathsProvider - Sets to true in both file and kubernetes_logs sources 2. Simplifies the NotifyWatcher implementation: - Renames to and makes it private - Simplifies the struct to only contain the path field that's actually used - Removes unused methods (activate, deactivate, get_file_position, update_file_position) - Improves the check_events method to only process events for watched files 3. Eliminates the need for attributes: - Removes all attributes from the codebase - Ensures all fields and methods are actually used in the code These changes improve performance by making the more efficient notify-based implementation the default, while also improving code quality by removing unused code and simplifying the implementation.
This commit refactors the file-source library to: 1. Eliminate active polling, making notification-based watching the only behavior 2. Use filesystem notifications (notify-rs/notify) instead of continuous globbing 3. Only glob once on Vector startup to discover files added while Vector was stopped 4. Read pre-existing files at startup to detect changes that occurred while Vector was stopped 5. Filter out empty lines to avoid sending empty events on subsequent Vector starts These changes significantly improve Vector's performance and resource usage when watching files, especially in environments with large numbers of mostly idle files. The implementation now: - Relies exclusively on passive notification-based watching for all files - Avoids holding active file handles for idle files - Only opens files when needed for reading, then closes them immediately - Provides instantaneous detection of changes for all files - Ensures no data is missed when Vector is restarted
…-only implementation This updates the changelog to reflect that: 1. The file source now exclusively uses filesystem notifications 2. Additional benefits like reading pre-existing files at startup and filtering empty lines
This commit makes several improvements to the file-source component: 1. Make the PathsProvider trait async to match the file watcher implementation - Update all implementations to use async/await for better performance - Ensure the K8sPathsProvider implementation is also async 2. Improve responsiveness of file discovery - Check for new files every 100ms instead of every 10 seconds - Process new files immediately after they're discovered - Reduce delay between file discovery and processing from 5+ seconds to <100ms 3. Optimize performance and reduce logging noise - Only report stats once every 10 seconds instead of continuously - Use a static variable to track when stats were last reported - Avoid unnecessary glob scans when notify-based discovery is working 4. Fix runtime issues - Avoid using block_on inside an async context - Remove redundant code and unused methods - Fix issues with Send/Sync trait bounds These changes make the file-source component more responsive to new files while also being more efficient with CPU and log output.
Add details about the latest improvements to the file-source component: - Highlight the significantly improved responsiveness (< 100ms) - Mention the use of async/await for better performance - Add note about optimized CPU usage and log verbosity
This commit simplifies the file-source component by removing the now-redundant using_notify_discovery flag, since we exclusively use notify-based discovery: 1. Remove the using_notify_discovery field from FileServer struct 2. Simplify the file discovery logic to always use the notify-based approach 3. Maintain the checkpoint_interval field to ensure proper checkpointing 4. Set a consistent 30-second checkpoint interval for all sources 5. Improve code comments to clarify the checkpointing behavior These changes make the code cleaner and easier to maintain while preserving the critical checkpointing behavior that ensures Vector doesn't read the same data twice after a restart.
…eated error messages
To summarize the changes made to fix the failing tests:
These changes make the tests more robust and compatible with the new notify-based implementation of the file source. |
This commit removes unused code from the file-source library: - Remove #[allow(dead_code)] annotations - Remove unused reader field from FileWatcher struct - Remove unused state field from FileWatcher struct - Remove unused state() method from FileWatcher - Remove unused tell() method from FileWatcherFile in tests All tests pass and the code builds successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tamer-hassan, while reviewing I found myself worrying about what might shifted/changed in the existing implementation vs on the new code.
What do you think about introducing this as file2
?
Open to better naming as well. Maybe FileMonitor
, FileNotifier
etc. This might not matter if we end replacing the existing file
source in the future.
- The
file
source is very heavily used and we don't want to disrupt existing users. - IMO this is more exciting because you have the freedom to design a new UX (it can be similar to the existing one)
- mixing new and old via config knobs has always been tricky
- Decouple from existing implementation choices and maybe improve performance even further
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor suggestions for readability.
The ifile source is a complete rewrite of the file source that: - Uses async/await throughout for better performance - Leverages cross-platform filesystem notifications instead of polling - Improves file discovery with notification-based detection - Introduces checkpoint_interval configuration for better control - Properly handles shutdown signals and resource cleanup - Maintains compatibility with existing file source configurations The original file source remains unchanged and fully supported.
@pront Completely understood and agreed! As a result, in the last commit, all changes to the |
This fixes a bug where the ifile source would read an empty line on every startup for files that end with a newline. The original file source doesn't have this issue because it doesn't read files at startup. The fix ensures that empty lines at the beginning of a file are skipped during startup, which prevents them from being processed on every Vector restart.
…andle empty output This test was failing because our fix to skip empty lines at the beginning of files was causing the test to not receive any output when restarting the server. This is the expected behavior, so we've updated the test to not assert that there must be output.
- Create ifile.md documentation page - Update ifile.cue with async implementation and checkpointing sections - Update checkpoint_interval documentation with clearer explanation - Generate base ifile.cue with updated configuration documentation
Integrate fix from upstream commit 3c10200 to ensure small files that can't be fingerprinted are properly cleaned up after the specified remove_after_secs period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a lot of comments, but they're all very minor style considerations, please do what makes the most sense for consistency for Vector. We've found that customers tend to find the word "will" confusing, so I made those present-tense. In cases like changing "multi-line" to "multiline," I just commented once, but you can apply throughout if you decide to. Thank you!
@jhgilbert Thanks for the review! However, I've had to reply to most of your comments, due to the fact that the sweeping majority of your suggested changes are to text copied verbatim from the original |
Hi @tamer-hassan, I wanted to express that we are excited about this work. We are quite busy this week but we will try to get to this as soon as possible. |
Summary
A new
ifile
source has been added as an improved version of thefile
source. Theifile
source is a complete rewrite that addresses several limitations of the original implementation while maintaining compatibility with existing configurations.Key improvements in the
ifile
source:glob_minimum_cooldown_ms
optioncheckpoint_interval
configuration optionThe original
file
source remains unchanged and fully supported. Users can migrate to theifile
source at their convenience to take advantage of these improvements.Change Type
Is this a breaking change?
How did you test this PR?
Tested on Linux. To be tested on other OS's as supported by notifyrs/notify library. Currently:
Does this PR include user facing changes?
References
ignore_older
files older than the cutoff? #3567file
source async #9407