This is an incomplete implementation of a new randomized allocation s… #98167


Closed
wants to merge 1 commit into from

Conversation

noahfalk
Member

@noahfalk noahfalk commented Feb 8, 2024

…ampling feature. This commit is only to gather feedback on the direction and is not intended to be checked in.

This is a resumption of some of the original work that occurred in #85750, although this is a different implementation approach. The overall goal is to do lightweight allocation sampling that can produce unbiased approximations of what was actually allocated. Our current ETW AllocationTick sampling is periodic but not randomized. Extrapolating from AllocationTick samples can lead to unbounded estimation errors in theory, and substantial estimation error has been observed in practice.

This commit is primarily to get feedback on adding a new adjustable limit pointer on the coreclr EE side of allocation contexts. This allows breaking out of the fast path allocation helpers at an arbitrarily selected sampling point that need not align with the standard GC allocation quantum for SOH. With sampling off, fast_helper_alloc_limit would always be identical to alloc_limit; with sampling on, the two would occasionally diverge.
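The dual-limit idea can be sketched roughly as follows. The struct and field names are taken from the PR text; the exact layout is an illustrative assumption, not the actual coreclr definition:

```cpp
#include <cstdint>
#include <cstddef>

// Sketch only: names from the PR description, layout assumed for illustration.
struct gc_alloc_context
{
    uint8_t* alloc_ptr;    // next free byte in the allocation context
    uint8_t* alloc_limit;  // GC-owned end of the allocation context
};

struct ee_alloc_context
{
    // EE-owned limit used by the fast path helpers. Equal to alloc_limit
    // when sampling is off; pulled in to the next sampling point otherwise.
    uint8_t* fast_helper_alloc_limit;
    gc_alloc_context gc;
};

// Fast path fit check: an allocation stays on the fast path only while it
// fits below the (possibly shortened) fast helper limit.
inline bool fits_fast_path(const ee_alloc_context& ctx, size_t size)
{
    return ctx.gc.alloc_ptr + size <= ctx.fast_helper_alloc_limit;
}
```

Exceeding fast_helper_alloc_limit forces the slow path even when the AC still has room, which is exactly the hook this PR uses to intercept a sampled allocation.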

The key changes are:
gcheaputilities.h - defines a new ee_alloc_context which augments the existing gc_alloc_context with the extra fast helper limit field
gcheaputilities.h and threads.h - replace the global and per-thread storage of gc_alloc_context with the expanded ee_alloc_context to store the new field
jithelpers.cpp, jitinterfacex86.cpp, and the amd64 asm helpers - refactor the fast path alloc helpers to use the new limit field instead
gchelpers.cpp - the updated Alloc() function recognizes when the fast helper limit was exceeded, performs a sampling callback, and updates the fast helper limit when needed

This commit doesn't contain any logic that actually turns the sampling on, determines the limit values needed for sampling, or issues callbacks for sampled objects. That would come later if folks think this is a reasonable approach to intercept allocations.

@Maoni0 @davidwr @jkotas @AaronRobinsonMSFT @chrisnas @davmason
cc @dotnet/dotnet-diag

@ghost ghost added the area-VM-coreclr label Feb 8, 2024
@ghost ghost assigned noahfalk Feb 8, 2024
@jkotas
Member

jkotas commented Feb 8, 2024

The current allocation tick event is implemented on the GC side. Why is it better to add this new allocation tick event on the VM side? It seems unnecessarily complicated and spread through a number of places.

@@ -458,12 +472,12 @@ void GCToEEInterface::GcEnumAllocContexts(enum_alloc_context_func* fn, void* par
Thread * pThread = NULL;
while ((pThread = ThreadStore::GetThreadList(pThread)) != NULL)
{
fn(pThread->GetAllocContext(), param);
InvokeGCAllocCallback(pThread->GetEEAllocContext(), fn, param);
Member

I don't get why we would want to trigger sampling logic while enumerating allocs. Perhaps I am missing the utility of the GcEnumAllocContexts(), are we attempting to create some side-effect?

Member

This is enumerating alloc contexts. I assume that this wrapper is here to catch up the runtime-owned side of the alloc context with the GC-owned side.

Every time the GC updates the GC-owned side of the alloc context (which happens in a number of GC/EE interface calls), the runtime needs to do a catch-up to update the runtime-owned side. That is what I meant by "spread through many places".

Member Author

Right, this isn't triggering a sampling callback directly but we do need to keep the fast alloc limit coherent with the GC's limit in the event that the GC changed the limit field inside this callback.

that happens in number of GC/EE interface calls

So far I have only noticed limit updates happening in two calls: GCHeap::Alloc and this enumeration callback. Technically FixAllocationContext() also changes limit, but it appears to only get used when zeroing out an AC, in which case I also zeroed out the fast alloc helper limit.

@noahfalk
Member Author

noahfalk commented Feb 8, 2024

High level I think this is a breakdown of different potential approaches:

  1. Approaches that use only a single limit field for allocation fit decisions
  2. Approaches that use the gc quantum limit and the sampling limit separately for allocation fit decisions
    a. Isolating the sampling limit on the EE side
    b. Making both the GC and EE aware of the dual limits.

This PR is in bucket (2a). The previous allocation tick approaches were in bucket (1).

Previous AllocationTick work accepted whatever allocation quantum sizes it was given, which meant the sampled allocations tended to follow regular periodic patterns, and that led to large statistical errors. To do randomized sampling via option (1) we need the option to pick a dynamically chosen maximum size for each AC returned by the GC. That constraint needs to interact with the GC's own rules for heap layout continuity, bookkeeping stats, and the existing logic for finding and sizing an alloc_context. I expect it is doable but comes with its own alternate complexity costs in the GC allocation code path. For example, there are places where the GC is currently free to adjust an alloc_context, but it would no longer be free to perform those kinds of adjustments. The GC approach also makes sampling produce a different heap layout with slightly increased fragmentation, whereas dual limits preserve the heap layout exactly given the same object allocation sequence. Last, the GC approach has the disadvantage that a callback emitted from inside the GC doesn't have the object in an initialized state. For ETW events that doesn't really matter, but for an ICorProfiler callback I'd prefer to deliver the callback at the point the object's backing memory has been initialized. We could work around it by recording a desire to emit a callback and then delivering it later in the allocation sequence; it's just another example of adding another chunk of complexity.

Ultimately I suspect any of the approaches (1), (2a), (2b) can be implemented successfully and performantly; it's a question of which portion of the code would bear the complexity burden of doing so.

[EDIT]: I forgot to mention that using a single limit field doesn't remove the need to track and update a 2nd sampling threshold in some other per-thread and global field. The fast alloc helpers can continue referring to the current limit field to decide whether an allocation fits, but the slow allocation path needs to be able to distinguish which allocs represent normal quantum exhaustion and which ones represent sampling points.

@chrisnas
Contributor

chrisnas commented Feb 9, 2024

Thanks @noahfalk for starting this PR :^)

There are a few topics that are not currently covered and that could impact the change implementation.

  • POH and LOH allocations do not use the allocation context; they only update the alloc_bytes_uoh field of the thread allocation context

  • 3 fixed 100 KB thresholds exist for SOH, LOH and POH: maybe the sampling needs to be different for each, especially for LOH where the distribution has a high minimum value close to the 100 KB mean target

  • I don't fully understand the role of fix_allocation_context called at the end of GCs, but it seems to "reset" the ptr/limit fields after having created a free object at the end (previous value of ptr). So what is supposed to happen to the new fast_alloc_helper_limit_ptr field?

From my limited understanding of the current AllocationTick implementation, the threshold detection is based not on the allocated object's size but on the cumulative size of the allocation contexts (which could contain only the 1 object that triggered the slow path). Is it expected to change that? If not, it means that some code will need to be added after the Alloc() call that returns the allocation context provided by the GC to do the threshold check/computation.

This leads to my main question: how is fast_alloc_helper_limit_ptr computed? (I'll discuss only the SOH case.)
If we go back to our goal (getting a "good" statistical sampling distribution), it requires being able to trigger AllocationTick within an allocation context without slowing down the fast path too much. This avoids missing the small allocations (below the allocation tick size, ~8 KB).
If it is decided to compute a new threshold (per thread, because per allocation context) each time the previous one is reached, two situations are possible:

  • the new one is inside the allocation context size: the fast_alloc_helper_limit_ptr field should work
  • the new one is beyond the allocation context size (which should be most of the time if the mean is 100 KB): it is complicated to use a pointer that is outside of the allocation context. Each time a new allocation context is needed (slow path), fast_alloc_helper_limit_ptr -= new allocation context size. Did I follow your reasoning?
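For concreteness, one standard way to compute such a per-thread threshold is to draw the byte distance to the next sampling point from an exponential distribution with the desired mean. This is my assumption, not something the PR prescribes, but the exponential's memoryless property is what makes byte-weighted sampling unbiased:

```cpp
#include <random>
#include <cstddef>

// Draw the byte distance to the next sampling point from an exponential
// distribution with a ~100 KB mean. The distribution choice is an assumption;
// its memoryless property keeps byte-weighted sampling unbiased.
size_t next_sample_offset(std::mt19937_64& rng, double mean_bytes = 100.0 * 1024)
{
    std::exponential_distribution<double> dist(1.0 / mean_bytes);
    return static_cast<size_t>(dist(rng));
}
```

Because the exponential is memoryless, carrying the leftover distance across allocation contexts and redrawing a fresh offset per AC yield the same sampling distribution.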

My initial idea was a little bit different:

  • a new field keeps track of the remaining size before the next threshold
  • a new field keeps track of the real end of the allocation context = current implementation of limit (so don't change the GC code)
  • do not touch the fast path assembly code: the role of the limit field stays the same (limit - ptr < size to allocate --> updated slow path)
  • in the updated slow path, check if (real limit - ptr < size to allocate --> current slow path). If not, there is still some room left, so update limit and return (but the allocation tick is emitted). That should avoid triggering the slow slow path too often.
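That alternative can be sketched like this (names and structure are hypothetical, for illustration only): the fast path keeps comparing against limit, and the slow path distinguishes a sampling trip from genuine AC exhaustion:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical sketch: fast path unchanged, slow path decides whether the
// trip was a sampling point or genuine allocation context exhaustion.
struct alt_alloc_context
{
    uint8_t* alloc_ptr;
    uint8_t* limit;       // may be pulled in to the next sampling point
    uint8_t* real_limit;  // the GC-provided end of the allocation context
};

// Called when the fast path failed (limit - alloc_ptr < size).
// Returns true if a new AC is really needed from the GC.
bool needs_new_alloc_context(alt_alloc_context& ctx, size_t size)
{
    if (static_cast<size_t>(ctx.real_limit - ctx.alloc_ptr) < size)
        return true;  // genuinely out of space: take the real slow path

    // There is still room: this trip was a sampling point. Emit the
    // allocation tick here, then widen limit (or advance it to the next
    // sampling point) and let the caller retry the fast path.
    ctx.limit = ctx.real_limit;
    return false;
}
```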

@Maoni0
Member

Maoni0 commented Feb 10, 2024

@noahfalk already gave a great detailed explanation above. Just to add a more high-level perspective: one of the biggest reasons I prefer this done on the EE side is that the GC has always given out an allocation context to the EE side for SOH allocations and had no need to be concerned with the sub-allocation within an AC. I'd really like to keep it that way. And because the GC currently only gives out periodic info at AC granularity, doing more detailed accounting would require attention when doing sub-allocations in an AC. So this is more suited to the EE side, which is doing the sub-allocations.

@noahfalk
Member Author

@jkotas @AaronRobinsonMSFT - Given the info above, do you think it's reasonable that we are doing this on the EE side? If not, would you like to discuss it with @Maoni0 and come to a conclusion? IMO the EE approach seems marginally better, but I'm pretty sure we can find an implementation either way as long as there is agreement on it.

@chrisnas - Responding back to your questions...

both POH and LOH allocations are not using allocation context; they only update the alloc_bytes_uoh field of the thread allocation context

The current PR anticipates handling this case by inserting some logic here. POH and LOH are the easy cases for sampling because we need only do a comparison between the object size and a pseudo-random number generated on the spot to make a sampling decision.
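Assuming an exponential sampling model (my assumption for concreteness, not something the PR specifies), that spot decision reduces to comparing a uniform draw against the probability that a sampling point lands inside the object:

```cpp
#include <random>
#include <cmath>
#include <cstddef>

// Per-object sampling decision for LOH/POH allocations, assuming an
// exponential sampling model with mean interval mean_bytes:
//   P(sample) = 1 - exp(-size / mean_bytes)
bool should_sample_uoh(std::mt19937_64& rng, size_t size,
                       double mean_bytes = 100.0 * 1024)
{
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < 1.0 - std::exp(-static_cast<double>(size) / mean_bytes);
}
```

Since every UOH allocation already takes a slow path, one cheap comparison per object like this adds negligible overhead.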

3 fixed 100 KB thresholds exist for SOH, LOH and POH: maybe the sampling needs to be different for each; especially for LOH where the distribution has a high minimum value close to the 100 KB mean target

I'm a bit skeptical different sampling intervals would be needed in the different heaps, but for sake of argument assuming it was needed I believe the current PR would allow for it. The code here could make decisions differently for both POH and UOH. The code here which handles the SOH case is not constrained to match either POH or LOH.

I don't fully understand the role of fix_allocation_context called at the end of GCs but is seems to "reset" the ptr/limit fields after having created a free object at the end (previous value of ptr). So what is supposed to happen to the new fast_alloc_helper_limit_ptr field?

During a GC, fix_allocation_context gets invoked via a callback passed to GCEnumAllocContexts here. fix_alloc_context calls GCHeap::FixAllocContext, which calls fix_allocation_context. The PR anticipates that GCEnumAllocContexts callbacks might change the limit pointer and recomputes fast_alloc_helper_limit_ptr here.

it is complicated to use a pointer that is outside of the allocation context. Each time a new allocation context is needed (slow path), fast_alloc_helper_limit_ptr -= new allocation context size. Did I follow your reasoning?

No, fast_alloc_helper_limit_ptr is not intended to be outside the range [alloc_ptr, alloc_limit]. Conveniently the math allows us to make sampling decisions on each AC independently, and pseudo-random number generation is cheap enough that I expect we'd do it once per AC without any need for additional bookkeeping. For the sake of argument, if additional bookkeeping were required, I wasn't suggesting doing it in the fast_alloc_helper_limit_ptr field. This PR is just focused on the mechanism that lets us intercept allocations at arbitrary places.
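A sketch of what "once per AC" could look like, again assuming an exponential draw for the sampling distance (helper and field names are hypothetical): clamp the randomly chosen point into [alloc_ptr, alloc_limit] so the fast limit never leaves the AC:

```cpp
#include <random>
#include <cstdint>
#include <cstddef>

// Hypothetical layouts for illustration only.
struct gc_alloc_context { uint8_t* alloc_ptr; uint8_t* alloc_limit; };
struct ee_alloc_context { uint8_t* fast_helper_alloc_limit; gc_alloc_context gc; };

// Pick a fresh sampling point whenever the GC hands out or adjusts an AC,
// clamped so the fast helper limit stays inside [alloc_ptr, alloc_limit].
void refresh_fast_limit(ee_alloc_context& ctx, std::mt19937_64& rng,
                        double mean_bytes = 100.0 * 1024)
{
    std::exponential_distribution<double> dist(1.0 / mean_bytes);
    size_t offset = static_cast<size_t>(dist(rng));
    size_t room = static_cast<size_t>(ctx.gc.alloc_limit - ctx.gc.alloc_ptr);
    ctx.fast_helper_alloc_limit = (offset < room)
        ? ctx.gc.alloc_ptr + offset   // sampling point falls inside this AC
        : ctx.gc.alloc_limit;         // no sampling point in this AC
}
```

The memoryless exponential is what makes redrawing per AC statistically equivalent to tracking one running threshold across ACs.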

@jkotas
Member

jkotas commented Feb 10, 2024

I think it is best for the GC diagnostic and tracing smarts to be concentrated in the GC code, like the current allocation tick event is. It provides full control and allows us to do anything we want. Splitting the responsibility over an interface that we want to keep stable as much as possible is likely to be prone to versioning issues as the design evolves.

@noahfalk
Member Author

noahfalk commented Feb 10, 2024

I think it is best for the GC diagnostic and tracing smarts to be concentrated in the GC code

Can you and @Maoni0 discuss and reach an agreement? Currently you are each pointing in a different direction.

Splitting the responsibility over an interface

FWIW, there is no split and nothing is being coordinated across the interface. 100% of the responsibility for doing the sampling using this new mechanism falls on the EE side (this change doesn't touch a single file in the GC or in the interface). If we do the logic on the GC side, that will need to be communicated across the interface and will require interface changes: the EE side will need to indicate that the mechanism is on and what the sampling rate is, and the GC side will need to communicate back which objects were sampled.

@jkotas
Member

jkotas commented Feb 10, 2024

Can you and @Maoni0 discuss and reach an agreement?

Sure, we can discuss next week.

the sampling using this new mechanism is falling on the EE side (this change doesn't touch a single file in the GC or in the interface

Yes, it is for now until we find that we want to integrate it with something that is only conveniently available on the GC side.

Also, if we stay with this being on the EE side, please make sure that it works in native AOT too.

@davidwr

davidwr commented Feb 11, 2024

@davidwrighton

@Maoni0
Member

Maoni0 commented Feb 14, 2024

I chatted with @jkotas and the conclusion was we'd proceed with @noahfalk's proposal. Since this is allocation diagnostics, a very user-oriented event, and the GC plays a very small part currently (the EE side already has to get the type info), it's better to just not have it straddle the two sides. @jkotas did have some suggestions on the EE side's definition of the AC, so I'll let him comment on that.

{
// fast_alloc_limit_ptr should be in the range [alloc_ptr, alloc_limit]
// if it isn't that means at least one of these things just happened:
// 1) We allocated a new object that advanced alloc_ptr past fast_alloc_helper_limit_ptr
Member

Also, GC could have allocated a free object as padding. The space occupied by the free object should be excluded in the sampling distribution calculations.

Member Author

Yeah, that makes sense. I didn't initially notice that the GC is allocating a FreeObject inside the [ptr,limit) region for some allocations, but I'm pretty sure we can detect it and handle the sampling math accurately with minimal overhead now that I am aware of it.

Member

While you are on it you can also move the handling of double alignment for x86 to the GC side (the EE side would still be responsible for setting the GC_ALLOC_ALIGN8 flag) to make it simpler and more uniform. There is no reason for the Arm vs. x86 handling to be done differently.

Member Author

That sounds fine to me. @Maoni0 fyi.

@jkotas
Member

jkotas commented Feb 14, 2024

@jkotas did have some suggestion on the EE's side def of the AC so I'll let him comment on that.

After closer look, the structure in this PR looks fine.

My idea was to separate the EE-owned and GC-owned parts of the alloc context into separate structs to allow us to version them more independently in the future. It was based on the assumption that the alloc context is opaque to the EE except for the ptr and limit fields. Unfortunately, that is not the case: we took a direct dependency on other alloc context fields from the EE without going through the GC/EE interface abstraction.

Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 15, 2024