Add Python Virtualenv Operator Caching #33355
Conversation
Interesting approach. Glad that you thought about the lock and hash @jens-scheffler-bosch. I could think about the case where the venv is only partially created because it fails during creation.
There is also one other case that is difficult to handle: what happens if a virtualenv operator installs another dependency after the venv is created? I am not sure if we can do anything about it, but maybe we could store a kind of checksum that would invalidate the venv in case it has been updated after creating it?
Thanks for the review @potiuk! Try/except added - the leftover technical "risk" is that the cleanup itself also fails (for whatever reason, e.g. an IO problem); then a leftover venv might need manual intervention. I thought (for a moment) about how completion of the setup can be made safe; the best approach would be to install the venv in a separate folder and move it to the final folder when completed. But moving a venv does not seem to work - a small test showed that packages break if the folder is moved. If you have any better idea to detect a partly installed venv, let me know. (After a moment of thought: what do you think of a "marker" file inside the venv marking the install as complete?)
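A minimal sketch of the try/except cleanup plus marker-file idea discussed above, assuming a plain venv + pip install; the function and marker names are hypothetical and not taken from the PR:

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical marker file name; the real PR may use a different convention.
VENV_MARKER = "install_complete_marker"


def prepare_venv(venv_path: Path, requirements: list[str]) -> Path:
    """Create the venv; mark it usable only once setup fully succeeded."""
    marker = venv_path / VENV_MARKER
    if marker.exists():
        return venv_path  # a previous run completed, so the venv can be reused
    try:
        subprocess.check_call(["python", "-m", "venv", str(venv_path)])
        subprocess.check_call([str(venv_path / "bin" / "pip"), "install", *requirements])
        # Written last: the marker's presence implies a complete install.
        marker.write_text("complete")
    except Exception:
        # Remove partial installs so the next attempt starts clean; if this
        # cleanup itself fails (e.g. IO error), manual removal may be needed.
        shutil.rmtree(venv_path, ignore_errors=True)
        raise
    return venv_path
```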
Each cached virtualenv is created from the list of requirements. If different tasks use the same requirements, the hash will be stable and the venv can be re-used. Different requirements will produce a different hash.
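Purely for illustration (not the PR's actual code): hashing the requirement list yields a stable key, so equal lists map to the same cached venv and different lists get different keys:

```python
import hashlib


def requirements_hash(requirements: list[str]) -> str:
    # Identical requirement lists -> identical hash -> the cached venv is reused.
    return hashlib.sha256("\n".join(requirements).encode("utf-8")).hexdigest()[:16]


assert requirements_hash(["pandas==2.1.0"]) == requirements_hash(["pandas==2.1.0"])
assert requirements_hash(["pandas==2.1.0"]) != requirements_hash(["numpy==1.26.0"])
```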
Yes. A note in the docs should be sufficient (avoid modifying the venv after it is created, or similar). Checking the venv content is not a good idea - not only because it is expensive, but also because of things like .pyc caching and potential unintended small but harmless modifications. I also think we should have a way to clean such broken venvs - you know, "cache invalidation". Currently we have no easy/documented way of doing it. It does not have to be sophisticated, but one thing that works well - and is rather easy - is adding a separate unique ID/prefix to such a venv so that you can effectively "clean it" by changing the prefix. This is how it is done in GitHub Actions, for example: there you can specify a "key" alongside the cache definition, and it is used to determine whether this is "the same" or a different cache. In our case you could simply change the "path" where the venv is created, but maybe adding a "cache-key" (as a sub-folder in the path) might also make sense? This way you could store (and sync?) all the cache files using the same path, but then have a key that you could change in case you would like to invalidate the particular venv you use.
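A hypothetical sketch of the cache-key idea above: the user-chosen key becomes a sub-folder of the cache path, so bumping the key effectively invalidates all previously built venvs (names and layout are illustrative, not the PR's):

```python
from pathlib import Path


def cached_venv_dir(cache_path: str, cache_key: str, req_hash: str) -> Path:
    # Changing cache_key (e.g. "v1" -> "v2") points every task at a fresh
    # sub-folder, leaving the old venvs untouched but no longer used.
    return Path(cache_path) / cache_key / f"venv-{req_hash}"


print(cached_venv_dir("/opt/airflow/venv-cache", "v2", "a1b2c3d4"))
```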
Hi @potiuk, thanks for the second round of intensive review. It took a moment, but everything is now adjusted:
All my concerns are addressed. I'd love another pair of eyes to confirm it though.
@uranusjr as discussed during the summit, I added the hash data to the marker as well to detect a hash collision - looking forward to the next round of review :-D
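Illustrative only: besides acting as a completion flag, the marker can record the requirements themselves, so a rare hash collision is detected by comparing content rather than trusting the hash alone (names are hypothetical, not the PR's):

```python
import json
from pathlib import Path


def is_cache_hit(marker: Path, requirements: list[str]) -> bool:
    if not marker.exists():
        return False  # venv was never fully built
    recorded = json.loads(marker.read_text()).get("requirements")
    # Same hash but different recorded content -> collision -> rebuild.
    return recorded == requirements
```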
Some minor comments, no particular objections in general.
Woohoo! @jens-scheffler-bosch - can you rebase please, just in case?
PR apache#33355 added a caching capability to the Python Virtualenv Operator in order to improve the speed of task execution. However, the cache calculation did not include the "use_dill" parameter - use_dill modifies the list of requirements, so the hash should be different for it.
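In other words (hypothetical helper, not the actual fix's code), use_dill simply becomes part of the hash input, so venvs built with and without dill never share a cache entry:

```python
import hashlib


def venv_cache_hash(requirements: list[str], use_dill: bool) -> str:
    # use_dill changes what gets installed, so it must influence the cache key.
    payload = "\n".join(requirements) + f"\nuse_dill={use_dill}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


assert venv_cache_hash(["pandas"], True) != venv_cache_hash(["pandas"], False)
```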
Importing fcntl causes Airflow to be unable to run under Windows, because fcntl seems to be available only on Linux. I can only go back to 2.7.3 in order to debug DAGs.
@renpin - Airflow does NOT work on Windows. See the installation docs, prerequisites, etc. There are multiple things that won't work on Windows - not only that. And all the installation documentation points Windows users to use WSL2 on Windows. There is an open issue for that, #10388, and if you want to contribute to it and make Airflow compatible with Windows you are free to make it happen. But currently it's not compatible, and it will not work.
@potiuk
Yes, but as your case in question shows - even if it accidentally worked for SOME debugging, it's super easy and fragile to break, so you should not rely on it. This is precisely the point of "not supported": so that maintainers do not have to lose time handling such issues and "actively preventing" such breakages. It would slow us down immensely if we had to individually check every single change to see whether it might accidentally break someone's debug workflow on Windows. This is precisely why the right way of doing it is implementing #10388, which should include not only checking and fixing all the Windows incompatibilities but also running all the CI tests that are necessary to support development workflows on Windows, so that such issues are prevented from being merged rather than losing your and the maintainers' time on handling them post-factum. There are just a handful of people who would like to have Windows support for their development efforts now, and it's up to those people (maybe you would like to lead that) to implement Windows support (including CI tests and running locally everything that contributors would normally run). You are more than welcome to do so. But until the CI is done and we label Airflow supported on Windows (even for development) - please don't rely on it and expect things to break at any time.
100% agree with @potiuk, but I was scratching my head. Can you @renpin tell me (1) how you were able to make Airflow run with 2.7.3 (because I miserably failed to make it work even for simple setups), and (2) why you think this PR is breaking your setup?
This PR uses the fcntl package, which is simply not supported on Windows. Before 2.8 I could run pytest on my code, and now it fails as it can't find the library. I was also able to get Airflow installed and set up fine locally before this PR. Maybe the answer is for me to raise a pull request making this system-agnostic, but it isn't great to use libraries that are only supported by some operating systems.
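Not the PR's code - just a sketch of the kind of guard such a system-agnostic change might use, assuming it is acceptable to skip locking where fcntl is unavailable:

```python
try:
    import fcntl  # POSIX-only advisory file locking
except ImportError:  # e.g. Windows, where fcntl does not exist
    fcntl = None


def lock_exclusive(handle) -> None:
    """Take an exclusive lock where supported; no-op elsewhere.

    A real cross-platform fix would likely need an equivalent Windows lock
    (e.g. via msvcrt.locking) instead of silently skipping the lock.
    """
    if fcntl is not None:
        fcntl.flock(handle, fcntl.LOCK_EX)


def unlock(handle) -> None:
    if fcntl is not None:
        fcntl.flock(handle, fcntl.LOCK_UN)
```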
Maybe raise a PR. As mentioned above - until someone (maybe even you @mullenpaul) who cares about Windows support implements #10388 and gets all the tests to run and succeed on Windows, there is no way, with ~30 PRs merged a week, that we can check every single PR to see if Windows compatibility is broken. So it is pretty expected that even if some things accidentally work on Windows sometimes, they will fail equally often and new PRs will break them. You might ignore the reality, but that's it. This is the reality.

So if you really want to have support for Windows and it is important to you, I suggest you roll your sleeves up and implement proper support for it, including making sure that our CI verifies the things that are supposed to work on Windows. Airflow is created by >2800 people and you could be one of them. Complaining that it is "not great" to use libraries that do not work on Windows, when there are no CI checks for it, is just that - idle complaining that brings nothing to the conversation except angering maintainers who work in their free time, taking time away from their families and possibly paid jobs, so that other people can use the software they help maintain absolutely free and without contributing, yet demanding that THEIR needs are fulfilled. Yes, I suggest rolling your sleeves up and implementing Windows support properly, with CI tests and everything. That's the only way to all-but-guarantee it will not be broken by the next 120 or so PRs we are going to merge during the next month. Highly recommended.
This PR is a follow-up to, and a split of, PR #33017, adding virtualenv caching to the PythonVirtualEnvOperator.
FYI @AutomationDev85 @clellmann @wolfdn