openfl.display.Loader is significantly slower on native targets in Lime 8.2 (compared to Lime 8.1.3) #1895
I found that I can make two changes to help. In the relevant thread pool setup, I set localThreadPool = new ThreadPool(0, 50); and multiThreadPool = new ThreadPool(0, 50);. Additionally, I increased the max number of threads on lime.app.Future: FutureWork.maxThreads = 50;. Increasing the

For comparison, Adobe AIR appears to render the images basically immediately, like Lime 8.1.3. I also forgot to mention that this slower loading behavior was first noticed by a user in one of their apps, and I'm reporting on their behalf.
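For reference, a minimal sketch of how the FutureWork part of this workaround might be applied from application code, assuming FutureWork.maxThreads is reachable from user code as the comment implies (it is declared in the lime.app.Future module). The value 50 simply mirrors the number reported above; this is not a recommended default, and the localThreadPool/multiThreadPool changes are inside Lime itself, so they aren't shown.

import lime.app.Future.FutureWork;
import openfl.display.Sprite;

class Main extends Sprite {
    public function new() {
        super();
        // Raise the Future thread cap early in startup, per the comment above.
        FutureWork.maxThreads = 50;
    }
}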
@joshtynjala @player-03 std::thread::hardware_concurrency could be a good default value for maxThreads. https://en.cppreference.com/w/cpp/thread/thread/hardware_concurrency
I'd guess this is related to the switch away from
I'm almost certain that's more than the number of cores your machine has. Did 50 really work better than 8, 16, or however many cores you have? If so, that suggests something interesting: your machine is really good at swapping between threads. It can't run them all at once, but it knows they exist, and the moment one returns, it can swap in another. This is sort of similar to the benefits of

But I wouldn't recommend it, for two reasons. One, last I tested this, my machine is not that good at swapping, and works fastest when the number of threads equals the number of cores. Two, I don't think we have a way to mark the main thread as higher priority than the background ones. If there are lots of active jobs, they might block UI updates, which is the exact opposite of the point. (Or maybe Haxe already marks
Looks like my old implementation used
I haven't found the part where it calls
With all that out of the way, I have a rough idea for a new approach. If we pass multiple jobs to each thread, we can avoid the delay for the back-and-forth message passing after each one. When a thread finishes a job, it already has the next one and can start at once.

We don't do this normally because we don't know which jobs will finish first, or which ones will take huge amounts of time. Wouldn't want a 1 millisecond job to get stuck behind a 90 second job. In the example above, the user is loading all 40 images from disk and knows they're all going to be done in milliseconds. Same with Baris's example from Discord (maybe microseconds in that case). Either way, the user knows that the jobs are supposed to be quick, and they could pass this information along, probably via a third argument to

As a bonus, this approach ought to work in HTML5, with all the same benefits.
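Purely as an illustration of the batching idea (not the actual ThreadPool API, and not a proposed signature), here is a sketch of handing a whole list of quick jobs to one background thread, so each job starts the moment the previous one finishes, with no round trip through the main thread in between. File names and the job body are placeholders.

import sys.thread.Thread;

class BatchSketch {
    public static function main():Void {
        var paths = [for (i in 0...40) 'assets/image$i.png'];

        // One thread receives the whole batch instead of one message per job.
        Thread.create(function():Void {
            for (path in paths) {
                // Each "quick job" runs back to back on the same thread.
                var bytes = sys.io.File.getBytes(path);
                trace('$path: ${bytes.length} bytes');
            }
        });

        // Keep the process alive long enough for the sketch to finish.
        Sys.sleep(1);
    }
}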
It's generally rare for every core to be used at 100% at the same time. On modern CPUs, context switching between threads is quite trivially fast (I assure you that your CPU is doing this constantly). Using more threads than you have logical cores is common, since each one is typically not always busy. Of course, this availability is going to scale up with your hardware capability, but the point is that it remains viable to use more threads than you have cores.
We should always avoid shifting from an indexed collection (like the array here) when possible. Removing an element from the end of a data structure doesn't require re-indexing, but removing from the beginning does, which is O(n) and carries a lot of overhead, especially with large data structures. The same is true for adding to the beginning or middle of a data set. Double-ended queues (deques) avoid this, providing O(1) time complexity at both ends. We can create a similar data structure with that efficiency that works on HTML5 by using doubly linked lists.
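A minimal sketch of the kind of structure described, assuming a single consumer: O(1) push at the tail, O(1) pop at the head, no re-indexing, and nothing platform-specific, so it also compiles for HTML5. (Haxe's built-in haxe.ds.List behaves similarly; this just spells the idea out.)

class JobQueue<T> {
    var head:JobNode<T>;
    var tail:JobNode<T>;

    public function new() {}

    // Append at the tail: O(1), no re-indexing of existing elements.
    public function push(value:T):Void {
        var node = new JobNode(value);
        if (tail == null) {
            head = tail = node;
        } else {
            tail.next = node;
            node.prev = tail;
            tail = node;
        }
    }

    // Remove from the head: O(1), unlike Array.shift().
    public function shift():Null<T> {
        if (head == null) return null;
        var value = head.value;
        head = head.next;
        if (head == null) {
            tail = null;
        } else {
            head.prev = null;
        }
        return value;
    }
}

private class JobNode<T> {
    public var value:T;
    public var next:JobNode<T>;
    public var prev:JobNode<T>;

    public function new(value:T) {
        this.value = value;
    }
}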
Note for anyone following along: arrays outperform linked lists in most cases, because cache misses are way more important than theoretical time complexity. Always use arrays by default, and only switch to linked lists if you're certain. Chris brought this up because

When talking about

I ran some quick-and-dirty tests, and at 1000 elements it took 0.0014s to shift them all. That's over 10x slower than a list, but an end user wouldn't notice it, especially not compared to everything else involved in processing 1000 jobs. Obviously it got WAY worse as I added zeroes, scaling at O(n²). (Except in JS, which presumably did some optimization under the hood.)

I also tried optimizing the array. Instead of shifting, I set the current index to null and incremented an integer. Once the array was about 2/3 empty, I moved the remaining values to the start. Results: about 2x slower than a list in Neko, 1.5x slower in HL, about equal in JS, ~20% faster in Eval, and over 10x faster in C++. I tested up to a million values, and it appeared to scale at O(n), just like the list. Since realistically n < 100, and since a list would slow down other
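For reference, a rough sketch of the array optimization described above (single consumer assumed; class and field names are placeholders, not the actual ThreadPool code): instead of shifting, leave a null behind, advance a read index, and compact once roughly two thirds of the array is dead space.

class CompactingQueue<T> {
    var items:Array<Null<T>> = [];
    var readIndex:Int = 0;

    public function new() {}

    public function push(item:T):Void {
        items.push(item);
    }

    public function pop():Null<T> {
        if (readIndex >= items.length) return null;

        var item = items[readIndex];
        items[readIndex] = null; // leave a hole instead of shifting
        readIndex++;

        // Once about two thirds of the array is consumed, move the
        // remaining values back to the start in a single pass.
        if (readIndex * 3 >= items.length * 2) {
            items = items.slice(readIndex);
            readIndex = 0;
        }
        return item;
    }
}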
I don't recall. I may have simply chosen a large number. I seem to have 8 cores. I just tried 8, 16, and 32 to see what happens. 8 was still kind of slow. 16 and 32 were both nearly instantaneous, but 50 was still clearly a bit faster, because all of the images rendered immediately on startup.
I don't want to have to expose a new public property on

I'd go further and say that

This all feels kind of messy. I hate to say it, but maybe that new
I didn't say linked lists are inherently faster than an array. I just said shifting should be avoided if possible because of the cost of re-indexing, which is not just theoretically O(n), it is actually O(n). When you insert an element, whether it's in the middle or the beginning, every element after the point of insertion needs to be re-indexed, with the beginning being the worst case. For a queue like this, there shouldn't be any cache locality concerns. I was not aware we were iterating over the job queue in any way, considering we previously used a Deque.

We originally used a Deque for a reason: it provided thread-safe push/pop operations without a global lock. Switching solely to an array-based queue has unintentionally removed that safety (although, to be fair, hxcpp's Deque implementation is based on a Dynamic Haxe array with lock mechanisms).

Thread pool queues handle relatively few elements at a time (n is usually in the hundreds or thousands, not millions). Each job (in a linked list implementation) is just a reference (pointer), so the actual job execution happens elsewhere, and the queue itself doesn't need tight CPU cache optimizations (and we can provide fine-grained locks when considering sys targets). Job queues are accessed intermittently, meaning the cost of pointer chasing (in a linked list) is less significant compared to a tight loop processing contiguous data (like in graphics rendering).

With that said, it was just an observation and suggestion. In any case, I'm not trying to debate, just offering potential solutions. I would like to say that performance enhancements shouldn't only be considered based on whether the user "notices" something or not. Every unnecessary CPU cycle increases power consumption, draining batteries faster in a mobile-first world. At scale, this affects millions of devices, increasing energy waste and environmental impact.

On another note: after reviewing the ThreadPool code, I've also identified another critical issue. Getting rid of the Deque for jobs means adding and removing are no longer thread-safe mutations. Given this and other concerns, I strongly recommend we reconsider bringing back a Deque for sys targets, or another thread-safe queue implementation, to restore proper concurrency guarantees. However, I don't think just swapping out a Deque here is going to solve any performance-related issues; to address those, some architectural changes need to be made.
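For context, a minimal sketch of the Deque pattern being argued for on sys targets: sys.thread.Deque gives thread-safe add/pop without an explicit lock on the caller's side. The integer jobs and the -1 stop signal are just placeholders to keep the example self-contained.

import sys.thread.Deque;
import sys.thread.Thread;

class DequeSketch {
    public static function main():Void {
        var jobs = new Deque<Int>();
        var results = new Deque<Int>();

        // Worker thread: pop(true) blocks until a job arrives; -1 means stop.
        Thread.create(function():Void {
            while (true) {
                var job = jobs.pop(true);
                if (job == -1) break;
                results.add(job * job);
            }
        });

        // Producer: add() is safe to call from any thread.
        for (i in 0...10) jobs.add(i);
        jobs.add(-1);

        // Collect results; the blocking pop also keeps the process alive.
        for (_ in 0...10) trace(results.pop(true));
    }
}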
Given Chris's explanation and the fact that there are only 40 images, this makes sense. Spin up 40 threads, and the CPU can handle swapping. (Also, it turns out making multiple threads is a lot faster than I thought, and I've been measuring it wrong. Nice!)
That had been my idea too. I don't have a good solution for the network drive problem, though if they all take a similar amount of time to load from there, that's also not bad. Mainly, we'd be trying to avoid letting a fast job get stuck behind a slow one while the rest of the threads sit idle.
I'm starting to see it that way. My advice has always been not to use threads in cases like this; the overhead is greater than the time saved. But I previously suggested adding conditional compilation for

Since you brought up removing
Yeah, I probably needed to tone down that first part, sorry. I was worried you were reinforcing a common misconception, not that you personally believed it.
Only the main thread ever accesses these arrays, specifically for thread safety reasons. (And also for web worker compatibility, but web workers were designed to enforce thread safety, so it's the same thing.)
You may not be trying to debate, but you're still scoring points. 👍 I'll keep this one in mind.
Oooh, OK, I see the check now: lime/src/lime/system/ThreadPool.hx, line 341 (commit 69bbcae)
I don't know if we necessarily have to scale back on any intended functionality of the new ThreadPool here. I think there is a way to have our cake and eat it too, given the right compromise on the implementation specifics.

Edit: I've completely overshadowed the obvious point that we can't use a Deque-like mechanism on web, and it doesn't matter what the data structure is. Generally, I think we just need two separate underlying ThreadPool implementations with a single unified API, and it can be done. The implementation itself does not necessarily have to be unified.
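As a sketch of the "single API, two implementations" idea (class and method names are placeholders, not a proposed Lime API), conditional compilation can pick a sys-threaded backend or a deferred single-threaded backend behind one public surface:

class UnifiedPool {
    public function new() {}

    public function run(job:() -> Void):Void {
        #if (target.threaded)
        // Sys backend: hand the job to a real thread.
        sys.thread.Thread.create(job);
        #else
        // Web / single-threaded backend: defer the job so the call site
        // behaves the same either way.
        haxe.Timer.delay(job, 0);
        #end
    }
}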
Seems like it could work. 👍
Passively canceling this way might even be preferable.
Users can use
A little bit, but I always figured it wasn't my responsibility. The old version didn't support it either (at most, it pretended to). I mean, look at how much thread-unsafe code there is here:

public function queue(state:Dynamic = null):Void
{
    #if (cpp || neko || webassembly)
    // TODO: Better way to handle this?
    if (Application.current != null && Application.current.window != null && !__synchronous)
    {
        __workIncoming.add(new ThreadPoolMessage(WORK, state));
        __workQueued++; // Race condition
        if (currentThreads < maxThreads && currentThreads < (__workQueued - __workCompleted)) // Race conditions
        {
            currentThreads++; // Race condition
            Thread.create(__doWork);
        }
        if (!Application.current.onUpdate.has(__update))
        {
            Application.current.onUpdate.add(__update); // Who even knows what will happen?
        }
    }
    else
    {
        __synchronous = true;
        runWork(state); // Not the intended use case, but maybe fine if the user knows what they're doing?
    }
    #else
    runWork(state);
    #end
}
Looks like web workers can create web workers, so we might be able to make a decent workaround. We have
The only real issue I see here is the increment, which can be easily solved with https://api.haxe.org/haxe/atomic/AtomicInt.html
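A minimal sketch of that suggestion, assuming Haxe 4.3+ and a target where haxe.atomic.AtomicInt is supported. The field names mirror the snippet above, but this is not a patch against the actual Lime source.

import haxe.atomic.AtomicInt;

class Counters {
    // Shared counters that several threads may touch at once.
    static var workQueued = new AtomicInt(0);
    static var workCompleted = new AtomicInt(0);

    public static function onQueued():Void {
        workQueued.add(1); // atomic read-modify-write instead of a bare ++
    }

    public static function onCompleted():Void {
        workCompleted.add(1);
    }

    public static function pendingJobs():Int {
        return workQueued.load() - workCompleted.load();
    }
}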
I guess that's sort of fair, although it still enforces a single-consumer model. I'm not a fan of how it is forced to be coupled with the main thread, although that has always been the case, even previously in regard to the render update, and that's not your fault (or problem). I have some ideas on improving things (the aspect of tight coupling) a bit, but I'll need to give it some more thought.
I would much rather wrap the whole thing in a mutex lock. Even with
Plus,

Anyway, I'm fully on board that we can solve the concurrency issues and support threads-making-threads; my point is just that 8.1.3 didn't really support that. I took that to mean it wasn't important to support, but I don't mind being corrected.
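A rough sketch of the mutex approach described a couple of comments up: take one lock around the whole check-and-increment so the counter update and the decision to spawn a thread can't interleave. Field names are illustrative only, and the completed-job counter is omitted for brevity; this is not the real ThreadPool code.

import sys.thread.Mutex;
import sys.thread.Thread;

class GuardedQueue {
    static var mutex = new Mutex();
    static var workQueued = 0;
    static var currentThreads = 0;
    static var maxThreads = 8;

    public static function queue(job:() -> Void):Void {
        mutex.acquire();
        workQueued++;
        var shouldSpawn = currentThreads < maxThreads && currentThreads < workQueued;
        if (shouldSpawn) currentThreads++;
        mutex.release();

        // Spawning outside the lock keeps the critical section short.
        if (shouldSpawn) Thread.create(job);
    }
}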
The following code loads a lot of images simultaneously:
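(The snippet referenced here didn't survive the copy; as a stand-in, here is a hedged reconstruction of the kind of test case described: 40 Loader instances started at once. File names, the count, and the layout are guesses.)

import openfl.display.Loader;
import openfl.display.Sprite;
import openfl.net.URLRequest;

class Main extends Sprite {
    public function new() {
        super();
        // Kick off many image loads at the same time.
        for (i in 0...40) {
            var loader = new Loader();
            loader.x = (i % 8) * 100;
            loader.y = Std.int(i / 8) * 100;
            addChild(loader);
            loader.load(new URLRequest('assets/image$i.png'));
        }
    }
}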
When I run it on a native target (neko, hl, cpp) with Lime 8.1.3, they all render basically instantaneously when the app starts up. With Lime 8.2, they seem to load one by one much more slowly.
I tried PR #1837, and it might help just a tiny bit. However, it is still clearly slower than Lime 8.1.3. Interestingly, in Lime 8.2, the images start rendering from last-to-first. With the PR, they render first-to-last.