This is an incomplete implementation of a new randomized allocation s… #98167
Conversation
This is an incomplete implementation of a new randomized allocation sampling feature. This commit is only to gather feedback on the direction and is not intended to be checked in.

This is a resumption of some of the original work that occurred in #85750, although this is a different implementation approach. The overall goal is to do lightweight allocation sampling that can produce unbiased approximations of what was actually allocated. Our current ETW AllocationTick sampling is periodic but not randomized. Extrapolating from the AllocationTick samples can lead to unbounded estimation errors in theory, and substantial estimation error has been observed in practice.

This commit is primarily to get feedback on adding a new adjustable limit pointer on the coreclr EE side of allocation contexts. This allows breaking out of the fast path allocation helpers at an arbitrarily selected sampling point that need not align with the standard GC allocation quantum for SOH. With sampling off, fast_helper_alloc_limit would always be identical to alloc_limit; when sampling is on, they would occasionally diverge.

The key changes are:
gcheaputilities.h - defines a new ee_alloc_context which augments the existing gc_alloc_context with the extra fast helper limit field
gcheaputilities.h and threads.h - replace the global and per-thread storage of gc_alloc_context with the expanded ee_alloc_context to store the new field
jithelpers.cpp, jitinterfacex86.cpp, and the amd64 asm helpers - refactor the fast path alloc helpers to use the new limit field instead
gchelpers.cpp - the updated Alloc() function recognizes when the fast helper limit was exceeded, performs some kind of sampling callback, and updates the fast helper limit when needed

This commit doesn't contain any logic that actually turns the sampling on, determines the limit values needed for sampling, or issues callbacks for sampled objects. That would come later if folks think this is a reasonable approach to intercept allocations.

@Maoni0 @davidwr @jkotas @AaronRobinsonMSFT @chrisnas @davmason
cc @dotnet/dotnet-diag
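As a rough illustration of the dual-limit idea (the field layout and ordering below are my own sketch, not necessarily the PR's exact definitions; only the names gc_alloc_context, ee_alloc_context, alloc_ptr, alloc_limit, and fast_helper_alloc_limit come from the description above):

#include <cstdint>

// GC-owned portion, unchanged; other bookkeeping fields elided.
struct gc_alloc_context
{
    uint8_t* alloc_ptr;    // next free byte in the current allocation quantum
    uint8_t* alloc_limit;  // end of the quantum the GC handed out
};

// EE-owned wrapper: the fast path helpers compare against fast_helper_alloc_limit.
// With sampling off it equals gc.alloc_limit; with sampling on it can be pulled in
// so that crossing the next sampling point forces a trip through the slow path.
struct ee_alloc_context
{
    uint8_t* fast_helper_alloc_limit;
    gc_alloc_context gc;
};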
The current allocation tick event is implemented on the GC side. Why is it better to add this new allocation tick event on the VM side? It seems to be unnecessarily complicated and spread through a number of places.
@@ -458,12 +472,12 @@ void GCToEEInterface::GcEnumAllocContexts(enum_alloc_context_func* fn, void* par
     Thread * pThread = NULL;
     while ((pThread = ThreadStore::GetThreadList(pThread)) != NULL)
     {
-        fn(pThread->GetAllocContext(), param);
+        InvokeGCAllocCallback(pThread->GetEEAllocContext(), fn, param);
I don't get why we would want to trigger sampling logic while enumerating allocs. Perhaps I am missing the utility of GcEnumAllocContexts(); are we attempting to create some side effect?
This is enumerating alloc contexts. I assume that this wrapper is here to catch up the runtime-owned side of the alloc context with the GC-owned side of the alloc context.

Every time the GC updates the GC-owned side of the alloc context (which happens in a number of GC/EE interface calls), the runtime side needs to catch up and update the runtime-owned side. That is what I meant by "spread through many places".
Right, this isn't triggering a sampling callback directly, but we do need to keep the fast alloc limit coherent with the GC's limit in the event that the GC changed the limit field inside this callback.

> that happens in a number of GC/EE interface calls

So far I have only noticed limit updates happening in two calls: GCHeap::Alloc and this enumeration callback. Technically FixAllocationContext() also changes the limit, but it appears to only get used when zeroing out an AC, in which case I also zeroed out the fast alloc helper limit.
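To make that coherence requirement concrete, here is a guess at the shape of the wrapper. InvokeGCAllocCallback is the name from the diff above, but the body and the fallback policy of snapping back to the GC limit are my assumptions, and it reuses the ee_alloc_context sketch from the description.

typedef void enum_alloc_context_func(gc_alloc_context* acontext, void* param);

void InvokeGCAllocCallback(ee_alloc_context* eeContext, enum_alloc_context_func* fn, void* param)
{
    // The GC sees (and may modify) only the GC-owned portion.
    fn(&eeContext->gc, param);

    // Re-establish fast_helper_alloc_limit in [alloc_ptr, alloc_limit]. If the GC
    // moved or zeroed the context, snap back to the GC limit (no sample in this
    // quantum); a real implementation would recompute the sampling point instead.
    if (eeContext->fast_helper_alloc_limit < eeContext->gc.alloc_ptr ||
        eeContext->fast_helper_alloc_limit > eeContext->gc.alloc_limit)
    {
        eeContext->fast_helper_alloc_limit = eeContext->gc.alloc_limit;
    }
}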
At a high level, I think the potential approaches break down roughly like this:
(1) have the GC pick randomized alloc_context sizes so that sampling points fall on quantum boundaries;
(2a) add a second, EE-owned limit field alongside the GC's limit (dual limits);
(2b) keep a single limit field and track the sampling threshold in separate per-thread and global state.
This PR is in bucket (2a). The previous allocation tick approaches were in bucket (1). The previous AllocationTick work accepted whatever allocation quantum sizes it was given, which meant the sampled allocations tended to follow regular periodic patterns, which led to large statistical errors. To do randomized sampling via option (1) we need the option to pick a dynamically chosen maximum size for each AC returned by the GC. That constraint needs to interact with the GC's own rules for heap layout continuity, bookkeeping stats, and the existing logic for finding and sizing an alloc_context. I expect it is doable, but it comes with its own complexity costs in the GC allocation code path. For example, there are places where the GC is currently free to adjust an alloc_context; it would no longer be free to perform those adjustments.

The GC approach also makes sampling produce a different heap layout with slightly increased fragmentation, whereas dual limits preserve the heap layout exactly given the same object allocation sequence. Lastly, the GC approach has the disadvantage that a callback emitted from inside the GC doesn't have the object in an initialized state. For ETW events that doesn't really matter, but for an ICorProfiler callback I'd prefer to deliver the callback at the point where the object's backing memory has been initialized. We can work around it by recording a desire to emit a callback in the future and then delivering it later in the allocation sequence; it's just another example of added complexity.

Ultimately I suspect any of the approaches (1), (2a), (2b) can be implemented successfully and performantly; it's a question of which portion of the code bears the complexity burden of doing so.

[EDIT]: I forgot to mention that using a single limit field doesn't alleviate us from still needing to track and update a second sampling threshold in some other per-thread and global field. The fast alloc helpers can continue referring to the current limit field to decide whether an allocation fits, but the slow allocation path needs to be able to distinguish which allocs represent normal quantum exhaustion and which ones represent sampling points.
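For that last point, a minimal sketch of how the slow path can tell the two cases apart under the dual-limit scheme. The names are illustrative; the PR's real counterpart is the updated Alloc() in gchelpers.cpp, and this reuses the ee_alloc_context sketch from the description.

#include <cstddef>

enum class SlowPathReason
{
    QuantumExhausted, // the GC-owned quantum is genuinely out of space
    SamplingPoint     // space remains; we only crossed the EE-owned sampling limit
};

SlowPathReason ClassifySlowPath(const ee_alloc_context& ctx, size_t objectSize)
{
    // The fast helper already failed its check against fast_helper_alloc_limit,
    // so the only question is whether the object still fits under the GC limit.
    if (ctx.gc.alloc_ptr + objectSize <= ctx.gc.alloc_limit)
        return SlowPathReason::SamplingPoint;
    return SlowPathReason::QuantumExhausted;
}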
Thanks @noahfalk for starting this PR :^) There are a few topics that are not currently covered and that could impact the implementation.

From my limited understanding of the current AllocationTick implementation, the threshold detection is based not on the allocated object's size but on the cumulative size of the allocation contexts (which could contain only the single object that triggered the slow path). Is it expected to change that? If not, it means that some code will need to be added after the Alloc() call that returns the allocation context provided by the GC to do the threshold check/computation. This leads to my main question: how is

My initial idea was a little bit different:
@noahfalk already gave a great detailed explanation above; just to add a more high-level perspective, one of the biggest reasons I prefer this be done on the EE side is that the GC has always given out an allocation context to the EE side for SOH allocations and has had no need to be concerned with sub-allocation within an AC. I'd really like to keep it that way. And because the GC currently only gives out periodic info at AC granularity, doing more detailed accounting would require attention when doing sub-allocations in an AC, so this is better suited to the EE side, which is doing the sub-allocations.
@jkotas @AaronRobinsonMSFT - Given the info above, do you think it's reasonable that we are doing this on the EE side? If not, would you like to discuss it with @Maoni0 and come to a conclusion? IMO the EE approach seems marginally better, but I'm pretty sure we can find an implementation either way as long as there is agreement on it. @chrisnas - Responding to your questions...
The current PR anticipates handling this case by inserting some logic here. POH and LOH are the easy cases for sampling because we need only compare the object size against a pseudo-random number generated on the spot to make a sampling decision.

I'm a bit skeptical that different sampling intervals would be needed in the different heaps, but for the sake of argument, assuming it were needed, I believe the current PR would allow for it. The code here could make decisions differently for POH and UOH. The code here which handles the SOH case is not constrained to match either POH or LOH.
> During a GC

No.
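For the LOH/POH case mentioned above, the size-versus-random-number comparison could be as small as the sketch below. The exponential draw with a mean sampling interval is my assumption about how an unbiased, byte-weighted decision would typically be made; the PR itself doesn't specify the distribution.

#include <cstddef>
#include <random>

bool ShouldSampleLargeObject(size_t objectSize, double meanSampleIntervalBytes)
{
    static thread_local std::mt19937_64 rng{ std::random_device{}() };

    // Distance in bytes to the next sampled byte; an object that spans that byte
    // gets sampled. Larger objects are proportionally more likely to be picked,
    // which is what keeps the extrapolated totals unbiased.
    std::exponential_distribution<double> nextSampleDistance(1.0 / meanSampleIntervalBytes);
    return static_cast<double>(objectSize) >= nextSampleDistance(rng);
}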
I think it is best for the GC diagnostic and tracing smarts to be concentrated in the GC code, like the current allocation tick event is. It provides full control and allows us to do anything we want. Splitting the responsibility over an interface that we want to keep as stable as possible is likely to be prone to versioning issues as the design evolves.
Can you and @Maoni0 discuss and reach an agreement? Currently you are each pointing in a different direction.
Fwiw, there is no split and nothing is being coordinated across the interface. 100% of the responsibility for doing the sampling using this new mechanism falls on the EE side (this change doesn't touch a single file in the GC or in the interface). If we do the logic on the GC side, that will need to be communicated across the interface and require interface changes: the EE side will need to indicate that the mechanism is on and what the sampling rate is, and the GC side will need to communicate back which objects were sampled.
Sure, we can discuss next week.
Yes, it is for now, until we find that we want to integrate it with something that is only conveniently available on the GC side. Also, if we stay with this being on the EE side, please make sure that it works in native AOT too.
I chatted with @jkotas and the conclusion was we'd proceed with @noahfalk's proposal. Since this is allocation diagnostics and a very user-oriented event, and the GC plays a very small part currently (the EE side already has to get the type info), it's better to just not have it straddle the two sides. @jkotas did have some suggestions on the EE side's definition of the AC, so I'll let him comment on that.
{
    // fast_alloc_limit_ptr should be in the range [alloc_ptr, alloc_limit]
    // if it isn't, that means at least one of these things just happened:
    // 1) We allocated a new object that advanced alloc_ptr past fast_alloc_helper_limit_ptr
Also, the GC could have allocated a free object as padding. The space occupied by the free object should be excluded from the sampling distribution calculations.
Yeah, that makes sense. I didn't initially notice that the GC allocates a FreeObject inside the [ptr, limit) region for some allocations, but now that I am aware of it, I'm pretty sure we can detect it and handle the sampling math accurately with minimal overhead.
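A sketch of the accounting fix being discussed here, assuming a per-context "bytes until next sample" budget (the budget mechanism and names are hypothetical): bytes covered by a GC-inserted free object simply don't advance the budget.

#include <cstddef>
#include <cstdint>

// Advance the sampling budget only by bytes that belong to real objects; padding
// that the GC inserted as a free object inside [alloc_ptr, alloc_limit) is skipped.
// Returns true when a sampling point has been reached and needs to be re-armed.
bool AccountAllocatedBytes(size_t bytesAdvanced, size_t freeObjectBytes, uint64_t* samplingBudget)
{
    size_t accounted = bytesAdvanced - freeObjectBytes;
    if (*samplingBudget > accounted)
    {
        *samplingBudget -= accounted;
        return false;
    }
    *samplingBudget = 0;
    return true;
}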
While you are on it, you can also move the handling of double alignment for x86 to the GC side (the EE side would still be responsible for setting the GC_ALLOC_ALIGN8 flag) to make it simpler and more uniform. There is no reason for the Arm vs. x86 handling to be done differently.
That sounds fine to me. @Maoni0 fyi.
After a closer look, the structure in this PR looks fine. My idea was to separate the EE-owned and GC-owned parts of the alloc context into separate structs to allow us to version them more independently in the future. It was based on the assumption that the alloc context is opaque to the EE except for the ptr and limit fields. Unfortunately, that is not the case: we took a direct dependency on other alloc context fields from the EE without going through the GC/EE interface abstraction.
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.