Add Python Virtualenv Operator Caching #33355
Conversation
Interesting approach. Glad that you thought about the lock and hash @jens-scheffler-bosch. I could think about the case where the venv is only partially created because it fails during creation.
There is also one other case that is difficult to handle: what happens if a virtualenv operator installs another dependency after the venv is created? I am not sure if we can do anything about it, but maybe we could store a kind of checksum that would invalidate the venv in case it has been updated after creating it?
Thanks for the review @potiuk! Try/except added - the leftover technical "risk" is that the cleanup itself also fails (for whatever reason, e.g. an IO problem); then a leftover venv might need manual intervention. I thought (for a moment) about how completion of the setup can be made safe; the best approach would be to install the venv in a separate folder and move it to the final folder when completed. But moving a venv does not seem to work - a small test showed that packages break if the folder is moved. If you have any better idea to detect a partly installed venv, let me know. (After a moment of thought: what do you think of a "marker" file inside the venv marking the install as complete?)
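A minimal sketch of the try/except cleanup plus marker-file idea discussed above, assuming a plain venv + pip install; the function and marker names are hypothetical and not taken from the PR:

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical marker file name; the real PR may use a different convention.
VENV_MARKER = "install_complete_marker"


def prepare_venv(venv_path: Path, requirements: list[str]) -> Path:
    """Create the venv; mark it usable only once setup fully succeeded."""
    marker = venv_path / VENV_MARKER
    if marker.exists():
        return venv_path  # a previous run completed, so the venv can be reused
    try:
        subprocess.check_call(["python", "-m", "venv", str(venv_path)])
        subprocess.check_call([str(venv_path / "bin" / "pip"), "install", *requirements])
        # Written last: the marker's presence implies a complete install.
        marker.write_text("complete")
    except Exception:
        # Remove partial installs so the next attempt starts clean; if this
        # cleanup itself fails (e.g. IO error), manual removal may be needed.
        shutil.rmtree(venv_path, ignore_errors=True)
        raise
    return venv_path
```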
Each cached virtualenv is created from the list of requirements. If different tasks use the same requirements, the hash will be stable and the venv can be re-used. Different requirements will produce a different hash.
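Purely for illustration (not the PR's actual code): hashing the requirement list yields a stable key, so equal lists map to the same cached venv and different lists get different keys:

```python
import hashlib


def requirements_hash(requirements: list[str]) -> str:
    # Identical requirement lists -> identical hash -> the cached venv is reused.
    return hashlib.sha256("\n".join(requirements).encode("utf-8")).hexdigest()[:16]


assert requirements_hash(["pandas==2.1.0"]) == requirements_hash(["pandas==2.1.0"])
assert requirements_hash(["pandas==2.1.0"]) != requirements_hash(["numpy==1.26.0"])
```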
Yes. A note in the docs should be sufficient (avoid modifying the venv after it is created, or similar). Checking the venv content is not a good idea - not only because it is expensive, but also because of things like .pyc caching and potential unintended small but harmless modifications. I also think we should have a way to clean such broken venvs - you know, "cache invalidation". Currently we have no easy/documented way of doing it. It does not have to be sophisticated, but one thing that works well - and is rather easy - is adding a separate unique ID/prefix to such a venv so that you can effectively "clean it" by changing the prefix. This is how it is done in GitHub Actions, for example: there you can specify a "key" alongside the cache definition, and it is used to determine whether this is "the same" or a different cache. In our case you could simply change the "path" where the venv is created, but maybe adding a "cache-key" (as a sub-folder in the path) might also make sense? This way you could store (and sync?) all the cache files using the same path, but then have a key that you could change in case you would like to invalidate the particular venv you use.
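A hypothetical sketch of the cache-key idea above: the user-chosen key becomes a sub-folder of the cache path, so bumping the key effectively invalidates all previously built venvs (names and layout are illustrative, not the PR's):

```python
from pathlib import Path


def cached_venv_dir(cache_path: str, cache_key: str, req_hash: str) -> Path:
    # Changing cache_key (e.g. "v1" -> "v2") points every task at a fresh
    # sub-folder, leaving the old venvs untouched but no longer used.
    return Path(cache_path) / cache_key / f"venv-{req_hash}"


print(cached_venv_dir("/opt/airflow/venv-cache", "v2", "a1b2c3d4"))
```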
Hi @potiuk, thanks for the second round of intensive review. It took a moment, but everything is now adjusted:
All my concerns are addressed. I'd love another pair of eyes to confirm it though.
@uranusjr as discussed during the summit, I added the hash data to the marker as well to detect a hash collision - looking forward to the next round of review :-D
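Illustrative only: besides acting as a completion flag, the marker can record the requirements themselves, so a rare hash collision is detected by comparing content rather than trusting the hash alone (names are hypothetical, not the PR's):

```python
import json
from pathlib import Path


def is_cache_hit(marker: Path, requirements: list[str]) -> bool:
    if not marker.exists():
        return False  # venv was never fully built
    recorded = json.loads(marker.read_text()).get("requirements")
    # Same hash but different recorded content -> collision -> rebuild.
    return recorded == requirements
```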
Some minor comments, no particular objections in general.
Woohoo! @jens-scheffler-bosch - can you rebase please, just in case?
PR apache#33355 added a caching capability to the Python Virtualenv Operator in order to improve the speed of task execution. However, the cache calculation did not include the "use_dill" parameter - use_dill modifies the list of requirements, so the hash should be different for it.
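In other words (hypothetical helper, not the actual fix's code), use_dill simply becomes part of the hash input, so venvs built with and without dill never share a cache entry:

```python
import hashlib


def venv_cache_hash(requirements: list[str], use_dill: bool) -> str:
    # use_dill changes what gets installed, so it must influence the cache key.
    payload = "\n".join(requirements) + f"\nuse_dill={use_dill}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


assert venv_cache_hash(["pandas"], True) != venv_cache_hash(["pandas"], False)
```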
Importing fcntl causes Airflow to be unable to run under Windows, because fcntl seems to be available only on Linux. I can only go back to 2.7.3 in order to debug DAGs.
@renpin - Airflow does NOT work on Windows. See the installation docs, prerequisites, etc. There are multiple things that won't work on Windows - not only that. And all the installation documentation points Windows users to use WSL2 on Windows. There is an open issue for that, #10388, and if you want to contribute to it and make Airflow compatible with Windows you are free to make it happen. But currently it's not compatible, and it will not work.
@potiuk
Yes, but as your case in question shows - even if it accidentally worked for SOME debugging, it's super easy and fragile to break, so you should not rely on it. This is precisely the point of "not supported": so that maintainers do not have to lose time handling such issues and "actively preventing" such breakages. It would slow us down immensely if we had to individually check every single change to see whether it might accidentally break someone's debug workflow on Windows. This is precisely why the right way of doing it is implementing #10388, which should include not only checking and fixing all the Windows incompatibilities but also running all the CI tests that are necessary to support development workflows on Windows, so that such issues are prevented from being merged rather than losing your and the maintainers' time on handling them post-factum. There are just a handful of people who would like to have Windows support for their development efforts now, and it's up to those people (maybe you would like to lead that) to implement Windows support (including CI tests and running locally everything that contributors would normally run). You are more than welcome to do so. But until the CI is done and we label Airflow supported on Windows (even for development) - please don't rely on it and expect things to break at any time.
100% agree with @potiuk, but I was scratching my head. Can you @renpin tell me (1) how you were able to make Airflow run with 2.7.3 (because I miserably failed to make it work even for simple setups), and (2) why you think this PR is breaking your setup?
This PR uses the fcntl package, which is simply not supported on Windows. Before 2.8 I could run pytest on my code, and now it fails as it can't find the library. I was also able to get Airflow installed and set up fine locally before this PR. Maybe the answer is for me to raise a pull request making this system-agnostic, but it isn't great to use libraries that are only supported by some operating systems.
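Not the PR's code - just a sketch of the kind of guard such a system-agnostic change might use, assuming it is acceptable to skip locking where fcntl is unavailable:

```python
try:
    import fcntl  # POSIX-only advisory file locking
except ImportError:  # e.g. Windows, where fcntl does not exist
    fcntl = None


def lock_exclusive(handle) -> None:
    """Take an exclusive lock where supported; no-op elsewhere.

    A real cross-platform fix would likely need an equivalent Windows lock
    (e.g. via msvcrt.locking) instead of silently skipping the lock.
    """
    if fcntl is not None:
        fcntl.flock(handle, fcntl.LOCK_EX)


def unlock(handle) -> None:
    if fcntl is not None:
        fcntl.flock(handle, fcntl.LOCK_UN)
```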
Maybe raise a PR. As mentioned above - until someone (maybe even you @mullenpaul) who cares about Windows support implements #10388 and gets all the tests to run and succeed on Windows, there is no way, with ~30 PRs merged a week, that we can check every single PR to see if Windows compatibility is broken. So it is pretty expected that even if some things accidentally work on Windows sometimes, they will fail equally often and new PRs will break them. You might ignore the reality, but that's it. This is the reality.

So if you really want to have support for Windows and it is important to you, I suggest you roll your sleeves up and implement proper support for it, including making sure that our CI verifies the things that are supposed to work on Windows. Airflow is created by >2800 people and you could be one of them. Complaining that it is "not great" to use libraries that do not work on Windows, when there are no CI checks for it, is just that - idle complaining that brings nothing to the conversation except angering maintainers who work in their free time, taking time away from their families and possibly paid jobs, so that other people can use the software they help maintain absolutely free and without contributing, yet demanding that THEIR needs are fulfilled. Yes, I suggest rolling your sleeves up and implementing Windows support properly, with CI tests and everything. That's the only way to all-but-guarantee it will not be broken by the next 120 or so PRs we are going to merge during the next month. Highly recommended.
This PR is a follow-up to, and a split of, PR #33017, adding virtualenv caching to the PythonVirtualEnvOperator.
FYI @AutomationDev85 @clellmann @wolfdn