Skip to content

Fix multithreading s3 memory leak #1495

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

ddelange
Copy link

@ddelange ddelange commented Apr 1, 2025

storages.s3 has a memory leak

boto/boto3#1670
boto/botocore#3078

this PR adds up to two thread-safe boto3 resource per-storage (one signed one unsigned) instead of two per-thread, and adds a 1 hour time to live cache for the boto3 resources such that it gets periodically recreated to avoid this memory leak.

  • the memory leak on infra level (6 day view):
    image
  • memray --leaks profile consistently shows a long lived s3 client as the culprit (just a random snapshot from the entire profile):
    image

@jschneier
Copy link
Owner

Thanks for opening. I don't think we can collapse UNSIGNED down like this. The reason that was written at all is that if you want to upload something you need to use a signed version regardless. Maintaining a single storage API that can do both uploading and generating URLs requires hiding the complexity within the class.

@@ -691,10 +674,7 @@ def url(self, name, parameters=None, expire=None, http_method=None):
params["Bucket"] = self.bucket.name
params["Key"] = name

connection = (
self.connection if self.querystring_auth else self.unsigned_connection
)
Copy link
Author

@ddelange ddelange Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @jschneier 👋

I didn't see any point in code where self.querystring_auth gets altered after init, and I didn't see any other point in code where self.unsigned_connection is used, so this seemed like a safe refactor to me (the class instance only ever uses either connection, or unsigned_connection, so this can be collapsed). I didn't look at parent classes though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, but the same is not true for connection! you're right, will adjust 👍

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddelange
Copy link
Author

ddelange commented Apr 2, 2025

This is green now on my fork 👍

@ddelange
Copy link
Author

ddelange commented Apr 3, 2025

this has been live since April 2nd and as you can see, we are now consistently below 50% memory 🎉

pip install 'django-storages[s3]@https://github.com/jschneier/django-storages/archive/4623ca742f6d40d7aeb3d9b4d47f8bcbe3163fc5.zip'
image

Comment on lines 352 to 354
max_pool_connections=64, # thread-safe
tcp_keepalive=True,
retries={"max_attempts": 6, "mode": "adaptive"},
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi, these values have been empirically chosen using a c6 family EC2 machine talking directly to s3 from the AWS network using this code. increasing threads beyond 64 starts to become unstable and does not yield a total throughput increase.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea what the defaults are?

People run on all kinds of hardware and I'm mostly inclined to say this should be configured by the user if they want.

Copy link
Author

@ddelange ddelange Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default is no retries, no keepalive, 10 conns. the above is the sensible default for multithreaded applications like django servers imo:

  • 10 conns is a bottleneck for IO bound views (this used to not be an issue because, before this PR, each thread created its own boto3 resource).
  • tcp_keepalive is a huge speedup at no additional cost
  • (especially aws) s3 infrastructure is prone to intermittent failures at high throughput.

people can pass their own client_config if the need arises, but adding these doesn't put load on people's hardware and only makes things more robust and production-ready.

@ddelange ddelange force-pushed the memory-leak branch 4 times, most recently from 29907eb to 7a9a7d7 Compare April 3, 2025 11:54
@ddelange
Copy link
Author

ddelange commented Apr 3, 2025

Alright, this last commit took some trial and error. The test that the commit adds was essential, and prompted that change.

@ddelange ddelange force-pushed the memory-leak branch 5 times, most recently from 0dd808f to 348bee8 Compare April 4, 2025 07:10
@ddelange
Copy link
Author

ddelange commented Apr 4, 2025

Hi @jschneier 👋

This is ready for review

Comment on lines 39 to 46
try:
from functools import cached_property
except ImportError: # python_version<='3.7'
try:
from backports.cached_property import cached_property
except (ImportError, ModuleNotFoundError) as e:
msg = "Could not import backports.cached_property. Did you run 'pip install django-storages[s3]'?" # noqa: E501
raise ImproperlyConfigured(msg) from e
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jschneier is the plan to support python 3.7 much longer? since it's end of life for almost 2 years now

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was supported because I had not yet dropped Django3.2 but that is now done.

@jschneier
Copy link
Owner

Thanks for this PR. My main hesitation is taking a dependency on a 3rd party library, namely cachetools in this case. Generally speaking I prefer not to do that for all the typical reasons.

Is there no other way to solve the problem, even in a less good way?

…into memory-leak

* 'master' of https://github.com/jschneier/django-storages:
  Add moto5 support (jschneier#1464)
  Drop support for Django3.2 and Django4.1 (jschneier#1505)
  [ci] Update CI to use Ubuntu 22.04 for testing (jschneier#1502)
  Bump version for release (jschneier#1497)
  Release version 1.14.6 (jschneier#1496)
  [s3] Default `url_protocol` to `https:` if set to None (jschneier#1483)
@ddelange
Copy link
Author

Hi @jschneier 👋

I've added a commit to remove dependency on cachetools (and cached_property). ready for review again!

@ddelange ddelange requested a review from jschneier April 28, 2025 08:10
@ddelange
Copy link
Author

@jschneier kind reminder :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants