openfl.display.Loader is significantly slower on native targets in Lime 8.2 (compared to Lime 8.1.3) #1895


Open
joshtynjala opened this issue Jan 22, 2025 · 16 comments

Comments

@joshtynjala
Member

The following code loads a lot of images simultaneously:

var images = ["image1.png", "image2.png", "image3.png"];

var xPos = 0;
var yPos = 0;
for (i in 0...40) {
	var loader = new Loader();
	loader.contentLoaderInfo.addEventListener(Event.COMPLETE, event -> {
		var loader = cast(event.currentTarget, LoaderInfo).loader;
		loader.width = 100;
		loader.height = 100;
	});
	loader.load(new URLRequest(images[i % images.length]));
	loader.x = xPos;
	loader.y = yPos;
	xPos += 100;
	if (xPos > stage.stageWidth) {
		xPos = 0;
		yPos += 100;
	}
	addChild(loader);
}

When I run it on a native target (neko, hl, cpp) with Lime 8.1.3, they all render basically instantaneously when the app starts up. With Lime 8.2, they seem to load one by one much more slowly.

I tried PR #1837, and it might help just a tiny bit. However, it is still clearly slower than Lime 8.1.3. Interestingly, in Lime 8.2, the images start rendering from last-to-first. With the PR, they render first-to-last.

@joshtynjala
Member Author

I found that I can make two changes to help.

In NativeHTTPRequest, I increased the max number of threads available to the pools:

localThreadPool = new ThreadPool(0, 50);
multiThreadPool = new ThreadPool(0, 50);

Additionally, I increased the max number of threads on FutureWork:

lime.app.Future.FutureWork.maxThreads = 50;

Increasing maxThreads for both ThreadPool and FutureWork appears to be required to achieve the same instantaneous loads as Lime 8.1.3. Increasing only one or the other is not enough.

For comparison, Adobe AIR appears to render the images basically immediately, like Lime 8.1.3.

I also forgot to mention that this slower loading behavior was first noticed by a user in one of their apps, and I'm reporting on their behalf.

@barisyild
Contributor

@joshtynjala @player-03 std::thread::hardware_concurrency could be a good default value for maxThreads.

https://en.cppreference.com/w/cpp/thread/thread/hardware_concurrency
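
For what it's worth, a hedged sketch of what that default could look like on the C++ target (the inline binding below is an illustration, not existing Lime code; hardware_concurrency() may also return 0 when the count is unknown):

// Hedged sketch, not existing Lime code. @:cppFileCode injects the
// include into the generated .cpp so __cpp__ can reach std::thread.
@:cppFileCode("#include <thread>")
class DefaultThreads {
	public static function get():Int {
		#if cpp
		var cores:Int = untyped __cpp__("(int)std::thread::hardware_concurrency()");
		return cores > 0 ? cores : 4; // 0 means "unknown" per the C++ spec
		#else
		return 4; // assumed fallback for targets without a query
		#end
	}
}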

@player-03
Contributor

I'd guess this is related to the switch away from Deque, as discussed on Discord a few months back. To summarize:

  • The old version pushed jobs into a Deque object, which threads would read from when they became idle (see the sketch after this list).
  • Web workers don't share memory, so they have no equivalent to Deque. Instead, I rewrote it so that the main thread handled all scheduling. Threads would phone home once finished, and when the main thread received that message it would send out new jobs (and it only checks for messages once a frame).
  • The biggest roadblock to restoring Deque is the cancelJob() function (also added in 8.2.0). Canceling a specific job requires knowing which thread (if any) is running that job, and Deque just doesn't allow that (AFAICT). Nor does Deque allow removing items from the middle.
  • A smaller roadblock is the added complexity. Two scheduling implementations are more than enough, and a third would make it that much harder to understand. I'd rather look at other options first.
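
For anyone who missed the Discord discussion, here's a minimal sketch of that Deque pattern, using a plain sys.thread.Deque (an illustration of the idea, not Lime's actual code):

import sys.thread.Deque;
import sys.thread.Thread;

// Workers block on pop(true) until a job arrives, so no main-thread
// round trip is needed between jobs.
var jobs = new Deque<() -> Void>();
for (_ in 0...4) {
	Thread.create(() -> {
		while (true) {
			var job = jobs.pop(true); // blocks until a job is available
			if (job != null) job();
		}
	});
}
for (i in 0...10) {
	jobs.add(() -> Sys.println('job $i'));
}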

localThreadPool = new ThreadPool(0, 50);

I'm almost certain that's more than the number of cores your machine has. Did 50 really work better than 8, 16, or however many cores you have?

If so, that suggests something interesting: your machine is really good at swapping between threads. It can't run them all at once, but it knows they exist, and the moment one returns, it can swap in another. This is sort of similar to the benefits of Deque, in that there's no need to wait for the main thread.

But I wouldn't recommend it, for two reasons. One, the last time I tested this, my machine was not that good at swapping, and worked fastest when the number of threads equaled the number of cores. Two, I don't think we have a way to mark the main thread as higher priority than the background ones. If there are lots of active jobs, they might block UI updates, which is the exact opposite of the point. (Or maybe Haxe already marks sys.thread.Threads as low priority. I honestly don't know.)

Interestingly, in Lime 8.2, the images start rendering from last-to-first. With the PR, they render first-to-last.

Looks like my old implementation used pop() to get the next job, and the new one uses shift(). I don't recall a specific reason for using pop(), so I assume it was out of habit.

Increasing maxThreads for both ThreadPool and FutureWork appears to be required

I haven't found the part where it calls new Future(), so I can't comment on this.

@player-03
Contributor

With all that out of the way, I have a rough idea for a new approach.

If we pass multiple jobs to each thread, we can avoid the delay for the back-and-forth message passing after each one. When a thread finishes a job, it already has the next one and can start at once. We don't do this normally because we don't know which jobs will finish first, or which ones will take huge amounts of time. Wouldn't want a 1 millisecond job to get stuck behind a 90 second job.

In the example above, the user is loading all 40 images from disk and knows they're all going to be done in milliseconds. Same with Baris's example from Discord (maybe microseconds in that case). Either way, the user knows that the jobs are supposed to be quick, and they could pass this information along, probably via a third argument to ThreadPool.run(). Mark a job as short, and ThreadPool will pre-schedule the one after it. Mark 40 jobs as short, and ThreadPool will pre-schedule them all, splitting them up among maxThreads threads.
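
A rough sketch of the batching half of that idea (the names are hypothetical, and the real change would live inside ThreadPool's scheduler):

// Hypothetical: split N "short" jobs across maxThreads threads up
// front, so each thread gets one message instead of one per job.
static function preSchedule<T>(jobs:Array<T>, maxThreads:Int):Array<Array<T>> {
	var batches:Array<Array<T>> = [for (_ in 0...maxThreads) []];
	for (i in 0...jobs.length) {
		batches[i % maxThreads].push(jobs[i]);
	}
	return batches; // each batch goes to one thread in a single message
}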

As a bonus, this approach ought to work in HTML5, with all the same benefits.

@dimensionscape
Member

I'm almost certain that's more than the number of cores your machine has. Did 50 really work better than 8, 16, or however many cores you have?

It's generally rare for every core to be in 100% use at the same time. On modern CPUs, context switching between threads is trivially fast (I assure you that your CPU is doing this constantly). Using more threads than you have logical cores is common, since each one is typically not always busy. Of course, this availability scales with your hardware capability, but the point is that it remains viable to use more threads than you have cores.

Looks like my old implementation used pop() to get the next job, and the new one uses shift(). I don't recall a specific reason for using pop(), so I assume it was out of habit.

We should always avoid shifting from an indexed collection (like the array here) when possible. Removing an element from the front forces a re-index of everything after it, which is O(n) and carries a lot of overhead, especially with large data structures; removing from the end requires no re-indexing. The same is true for adding to the beginning or middle of a data set. Double-ended queues (Deque) avoid this, providing O(1) time complexity at both ends. We can create a similarly efficient data structure that works on HTML5 by using doubly linked lists.
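
For reference, Haxe's standard haxe.ds.List already behaves this way for the FIFO case: it's a linked list with O(1) enqueue at the tail and O(1) dequeue at the head (a minimal illustration, not a proposed patch):

import haxe.ds.List;

// add() appends at the tail and pop() removes from the head, both
// O(1), so FIFO order costs no re-indexing.
var queue = new List<Int>();
queue.add(1);
queue.add(2);
var next = queue.pop(); // 1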

@player-03
Contributor

player-03 commented Feb 7, 2025

/* Note for anyone following along: arrays outperform linked lists in most cases, because cache misses are way more important than theoretical time complexity. Always use arrays by default, and only switch to linked lists if you're certain. Chris brought this up because shift() specifically is a worst-case scenario for arrays and a best-case scenario for linked lists, but even then, check on actual hardware. */

When talking about ThreadPool, we'd expect n to be small. How many people actually queue thousands of jobs at once?

I ran some quick-and-dirty tests, and at 1000 elements it took 0.0014s to shift them all. That's over 10x slower than a list, but an end user wouldn't notice it, especially not compared to everything else involved in processing 1000 jobs. Obviously it got WAY worse as I added zeroes, scaling at O(n²). (Except in JS, which presumably did some optimization under the hood.)

I also tried optimizing the array. Instead of shifting, I set the current index to null and incremented an integer. Once the array was about 2/3 empty, I moved the remaining values to the start. Results: about 2x slower than a list in Neko, 1.5x slower in HL, about equal in JS, ~20% faster in Eval, and over 10x faster in C++. I tested up to a million values, and it appeared to scale at O(n), just like the list.

Since realistically n<100, and since a list would slow down other ThreadPool functions that iterate over it, I think we should stick with arrays. shift() ought to be fine as-is, but I'm also happy to apply the optimization I described, or even go back to pop() if we don't think order matters.
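
To be concrete, here's a minimal sketch of that compacting-cursor approach (a reconstruction for illustration, not the actual test code):

// "Lazy shift": advance a cursor instead of shifting, then compact
// the array in one pass once ~2/3 of it is dead space.
class CursorQueue<T> {
	private var items:Array<Null<T>> = [];
	private var head:Int = 0;

	public function new() {}

	public function enqueue(item:T):Void {
		items.push(item);
	}

	public function dequeue():Null<T> {
		if (head >= items.length) return null;
		var item = items[head];
		items[head] = null; // release the reference without re-indexing
		head++;
		if (head * 3 >= items.length * 2) {
			items = items.slice(head); // single O(n) compaction
			head = 0;
		}
		return item;
	}
}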

@joshtynjala
Member Author

I'm almost certain that's more than the number of cores your machine has. Did 50 really work better than 8, 16, or however many cores you have?

I don't recall. I may have simply chosen a large number. I seem to have 8 cores. I just tried 8, 16, and 32 to see what happens. 8 was still kind of slow. 16 and 32 were both nearly instantaneous, but 50 was still clearly a bit faster, because all of the images rendered immediately on startup.

In the example above, the user is loading all 40 images from disk and knows they're all going to be done in milliseconds.
Either way, the user knows that the jobs are supposed to be quick, and they could pass this information along, probably via a third argument to ThreadPool.run().

I don't want to have to expose a new public property on openfl.display.Loader to let the user configure this behavior. ThreadPool should be an implementation detail of Loader that the user knows nothing about.

I'd go further and say that ThreadPool should also be considered an implementation detail of Lime's NativeHTTPRequest, and that Loader shouldn't need to know anything about it either. NativeHTTPRequest uses ThreadPool and Promise/Future, so if this behavior is to be configurable, then NativeHTTPRequest needs to gain some smarts to determine how to configure the ThreadPool, if that's going to be our solution. I guess it could determine whether the URL is a local file or a network location, since local files should be significantly faster. Of course, "local" files could be coming from a network drive, which might be harder to detect. Maybe that's okay as an edge case where things are a bit slower.
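
As a sketch of that heuristic (the scheme checks are an assumption on my part, not existing Lime code):

// Hypothetical: treat anything without a network scheme as a fast
// local job. Network drives mounted as local paths would slip through.
static function isProbablyLocal(url:String):Bool {
	return !StringTools.startsWith(url, "http://")
		&& !StringTools.startsWith(url, "https://");
}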

This all feels kind of messy. I hate to say it, but maybe that new cancelJob() API needs to be reverted so that we can go back to the faster Deque. It's not a good tradeoff, in my opinion.

@dimensionscape
Member

dimensionscape commented Feb 7, 2025

/* Note for anyone following along: arrays outperform linked lists in most cases, because cache misses are way more important than theoretical time complexity. Always use arrays by default, and only switch to linked lists if you're certain. Chris brought this up because shift() specifically is a worst-case scenario for arrays and a best-case scenario for linked lists, but even then, check on actual hardware. */

When talking about ThreadPool, we'd expect n to be small. How many people actually queue thousands of jobs at once?

I ran some quick-and-dirty tests, and at 1000 elements it took 0.0014s to shift them all. That's over 10x slower than a list, but an end user wouldn't notice it, especially not compared to everything else involved in processing 1000 jobs. Obviously it got WAY worse as I added zeroes, scaling at O(n²). (Except in JS, which presumably did some optimization under the hood.)

I also tried optimizing the array. Instead of shifting, I set the current index to null and incremented an integer. Once the array was about 2/3 empty, I moved the remaining values to the start. Results: about 2x slower than a list in Neko, 1.5x slower in HL, about equal in JS, ~20% faster in Eval, and over 10x faster in C++. I tested up to a million values, and it appeared to scale at O(n), just like the list.

Since realistically n<100, and since a list would slow down other ThreadPool functions that iterate over it, I think we should stick with arrays. shift() ought to be fine as-is, but I'm also happy to apply the optimization I described, or even go back to pop() if we don't think order matters.

I didn't say linked lists are inherently faster than an array; I just said shifting should be avoided when possible because of the cost of re-indexing, which is not just theoretically O(n), it is actually O(n). When you insert an element, whether in the middle or at the beginning, every element after the point of insertion needs to be re-indexed, with the beginning being the worst case.

For a queue like this, there shouldn't be any cache locality concerns. I was not aware we were iterating over the job queue in any way, considering we previously used a Deque. We originally used a Deque for a reason: it provided thread-safe push/pop operations without a global lock. Switching solely to an array-based queue has unintentionally removed that safety (although, to be fair, hxcpp's Deque implementation is based on a dynamic Haxe array with lock mechanisms).

Thread pool queues handle relatively few elements at a time (n is usually in the hundreds or thousands, not millions). Each job (in a linked list implementation) is just a reference (pointer), so the actual job execution happens elsewhere, and the queue itself doesn't need tight CPU cache optimizations (and we can provide fine-grained locks on sys targets). Job queues are accessed intermittently, so the cost of pointer chasing in a linked list matters far less than it would in a tight loop processing contiguous data (like in graphics rendering).

With that said, it was just an observation and a suggestion; I'm not trying to debate, just offering potential solutions. I would add that performance enhancements shouldn't be judged only on whether the user "notices" them. Every unnecessary CPU cycle increases power consumption, draining batteries faster in a mobile-first world. At scale, this affects millions of devices, increasing energy waste and environmental impact.

On another note: after reviewing the ThreadPool code, I've identified another critical issue. Getting rid of the Deque for jobs means adding and removing are no longer thread-safe mutations. Given this and other concerns, I strongly recommend we reconsider bringing back a Deque for sys targets, or another thread-safe queue implementation, to restore proper concurrency guarantees. However, I don't think just swapping a Deque back in is going to solve any performance-related issues; to address those, some architectural changes need to be made.

@player-03
Contributor

50 was still clearly a bit faster, because all of the images rendered immediately on startup.

Given Chris's explanation and the fact that there are only 40 images, this makes sense. Spin up 40 threads, and the CPU can handle swapping.

(Also, it turns out making multiple threads is a lot faster than I thought, and I've been measuring it wrong. Nice!)

I guess it could determine whether the URL is actually a local file vs whether it's coming from a network location, since local files should be significantly faster.

That had been my idea too. I don't have a good solution for the network drive problem, though if they all take a similar amount of time to load from there, that's also not bad. Mainly, we'd be trying to avoid letting a fast job get stuck behind a slow one while the rest of the threads sit idle.

I hate to say it, but maybe that new cancelJob() API needs to be reverted so that we can go back to the faster Deque. It's not a good tradeoff, in my opinion.

I'm starting to see it that way.

My advice has always been not to use threads in cases like this; the overhead is greater than the time saved. But Loader is a realistic example where that isn't feasible. The user doesn't control what Loader does under the hood, and it's absolutely not safe for Loader to ever do any loading on the main thread. It never knows which file will turn out to be on a network drive, taking a second or more to load.

I previously suggested adding conditional compilation for cancelJob(), to avoid pulling the rug out from anyone who started using it. Plus, it still works fine for HTML5 and green threads, so it makes sense to keep it for those. I'm not worried about the complexity there: the real complexity will come from using Deque on native, manual scheduling on HTML5, and green threads on both.

Since you brought up removing cancelJob(), I realized that maybe we could relax our requirements a bit. It isn't reasonable to remove a specific item from a Deque, but would it really be that bad if we just kind of...waited for the job to start before canceling it? The function still technically does what it claims to do, just on a delay, and we'd include a note about it. Meanwhile we aren't breaking anyone's code.
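
Something like this hedged sketch, with hypothetical names (the real cancelJob() signature may differ):

// Hypothetical: record the cancellation, and let the worker loop
// discard the job when it finally reaches the front of the queue.
private var canceledJobs:Map<Int, Bool> = new Map();

public function cancelJob(jobID:Int):Void {
	canceledJobs.set(jobID, true); // takes effect when the job surfaces
}

// Inside the dequeue step, before running a job:
// if (canceledJobs.remove(job.id)) continue; // skip it silently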

@player-03
Contributor

I didn't say Linked Lists are inherently faster than an Array

Yeah, I probably needed to tone down that first part, sorry. I was worried you were reinforcing a common misconception, not that you personally believed it.

Switching solely to an array-based queue has unintentionally removed that safety

After reviewing the ThreadPool code, I've also identified another critical issue. Getting rid of the Deque for jobs means adding and removing are no longer thread-safe mutations.

Only the main thread ever accesses these arrays, specifically for thread safety reasons. (And also for web worker compatibility, but web workers were designed to enforce thread safety, so it's the same thing.)

Every unnecessary CPU cycle increases power consumption, draining batteries faster in a mobile-first world. At scale, this affects millions of devices, increasing energy waste and environmental impact.

You may not be trying to debate, but you're still scoring points. 👍 I'll keep this one in mind.

@dimensionscape
Member

dimensionscape commented Feb 7, 2025

Oooh Ok, I see the check now:

if (!isMainThread())
But isn't this kind of an arbitrary limitation? If I wanted to add a job from another thread, I would need to create an additional queue mechanism on top of this. On the web, we wouldn't even be able to access the ThreadPool in either case, so it's kind of moot (there are just some irreconcilable differences here that will never be resolved between native threads and workers).

I don't know if we necessarily have to scale back on any intended functionality of the new ThreadPool here. I think there is a way to have our cake and eat it too, given the right compromise on the implementation specifics.

Edit: I've completely overshadowed the obvious point that we can't use a Deque-like mechanism on web, and it doesn't matter what the data structure is. Generally, I think we just need two separate underlying ThreadPool implementations with a single unified API, and it can be done. The implementation itself does not necessarily have to be unified.
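
Something like this shape, with hypothetical backend names (the real split would be an internal refactor):

// Hypothetical: one public ThreadPool API, two backends chosen at
// compile time; only the scheduling internals differ.
#if (js && html5)
typedef ThreadPoolBackend = WebWorkerBackend; // hypothetical class
#else
typedef ThreadPoolBackend = DequeBackend; // hypothetical class
#end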

@joshtynjala
Member Author

would it really be that bad if we just kind of...waited for the job to start before canceling it?

Seems like it could work. 👍

@dimensionscape
Member

would it really be that bad if we just kind of...waited for the job to start before canceling it?

Passively canceling this way might even be preferable.

@player-03
Contributor

player-03 commented Feb 7, 2025

If I wanted to add a job from another thread, I would need to create an additional queue mechanism on top of this.

Users can use haxe.MainLoop.runInMainThread(), so I'm not too worried.
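
For example (a sketch; threadPool, doWork, and state stand in for the user's own objects):

import haxe.MainLoop;

// From a background thread, hop to the main thread before queueing,
// since ThreadPool only accepts jobs from the main thread.
MainLoop.runInMainThread(() -> {
	threadPool.run(doWork, state);
});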

but isn't this kind of an arbitrary limitation?

A little bit, but I always figured it wasn't my responsibility. The old version didn't support it either (at most, it pretended to). I mean, look at how much thread-unsafe code there is here:

public function queue(state:Dynamic = null):Void
{
	#if (cpp || neko || webassembly)
	// TODO: Better way to handle this?

	if (Application.current != null && Application.current.window != null && !__synchronous)
	{
		__workIncoming.add(new ThreadPoolMessage(WORK, state));
		__workQueued++; //Race condition

		if (currentThreads < maxThreads && currentThreads < (__workQueued - __workCompleted)) //Race conditions
		{
			currentThreads++; //Race condition
			Thread.create(__doWork);
		}

		if (!Application.current.onUpdate.has(__update))
		{
			Application.current.onUpdate.add(__update); //Who even knows what will happen?
		}
	}
	else
	{
		__synchronous = true;
		runWork(state); //Not the intended use case, but maybe fine if the user knows what they're doing?
	}
	#else
	runWork(state);
	#end
}

On the web, we wouldn't even be able to access the ThreadPool in either case, so it's kind of moot (there are just some irreconcilable differences here that will never be resolved between native threads and workers).

Looks like web workers can create web workers, so we might be able to make a decent workaround. We have WorkOutput representing a subset of the ThreadPool API, and we could add a carefully constrained "spawn thread from thread" function to that.
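
A hedged sketch of the HTML5 side (the script name is assumed, and nested workers depend on browser support):

#if js
// Inside worker code, a nested worker is created the same way as a
// top-level one; WorkOutput would wrap this behind a constrained API.
var child = new js.html.Worker("lime-worker.js"); // script name assumed
child.postMessage({job: "decodeImage"});
#end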

@dimensionscape
Member

dimensionscape commented Feb 7, 2025

but isn't this kind of an arbitrary limitation?

A little bit, but I always figured it wasn't my responsibility. The old version didn't support it either (at most, it pretended to). I mean, look at how much thread-unsafe code there is here:

public function queue(state:Dynamic = null):Void
{
	#if (cpp || neko || webassembly)
	// TODO: Better way to handle this?

	if (Application.current != null && Application.current.window != null && !__synchronous)
	{
		__workIncoming.add(new ThreadPoolMessage(WORK, state));
		__workQueued++; //Race condition

		if (currentThreads < maxThreads && currentThreads < (__workQueued - __workCompleted)) //Race conditions
		{
			currentThreads++; //Race condition
			Thread.create(__doWork);
		}

		if (!Application.current.onUpdate.has(__update))
		{
			Application.current.onUpdate.add(__update); //Who even knows what will happen?
		}
	}
	else
	{
		__synchronous = true;
		runWork(state); //Not the intended use case, but maybe fine if the user knows what they're doing?
	}
	#else
	runWork(state);
	#end
}

The only real issue I see here is the increment, which can be easily solved with AtomicInt: https://api.haxe.org/haxe/atomic/AtomicInt.html
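
For illustration, the counter usage would look something like this (a sketch, assuming Haxe 4.3+):

import haxe.atomic.AtomicInt;

// add() performs the increment atomically and returns the old value.
var workQueued = new AtomicInt(0);
workQueued.add(1);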

Users can use haxe.MainLoop.runInMainThread(), so I'm not too worried.

I guess that's sort of fair, although it still enforces a single-consumer model. I'm not a fan of how it's forced to be coupled with the main thread, although that has always been the case, even previously with regard to the render update, and that's not your fault (or problem).

I have some ideas on improving things (the tight coupling aspect) a bit, but I'll need to give it some more thought.

@player-03
Contributor

The only real issue I see here is the increment which can be easily solved with https://api.haxe.org/haxe/atomic/AtomicInt.html

I would much rather wrap the whole thing in a mutex lock. Even with AtomicInt, if two queue() calls happened at the same time, they could go over maxThreads.

  1. Both check currentThreads < maxThreads at a moment when currentThreads == maxThreads - 1, so both proceed.
  2. Both go to increment currentThreads.
  3. Thanks to AtomicInt, one of them increments at a time, but they both increment eventually.
  4. Both go on to spawn a thread.
  5. currentThreads is now greater than maxThreads, but at least it correctly reports the number of threads, so future threads will wait.

Plus, Event.add() does a bunch of array searching, so that's certainly not safe.

Anyway, I'm fully on board that we can solve the concurrency issues and support threads-making-threads, my point is just that 8.1.3 didn't really support that. I took that to mean it wasn't important to support, but I don't mind being corrected.
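
To illustrate, a minimal sketch of the mutex version (not a patch; the field names are copied from the snippet above):

import sys.thread.Mutex;
import sys.thread.Thread;

// One lock around the whole check-and-spawn sequence, so two queue()
// calls can't both pass the currentThreads < maxThreads check.
var mutex = new Mutex();

function spawnThreadIfNeeded():Void {
	mutex.acquire();
	__workQueued++;
	if (currentThreads < maxThreads && currentThreads < (__workQueued - __workCompleted)) {
		currentThreads++;
		Thread.create(__doWork);
	}
	mutex.release();
}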
