use global offset for buffered chunks #8
Conversation
For split matchers that occur rarely in a stream with many chunks, resetting the search offset inside the handler makes for very bad performance (e.g. for a matcher that doesn't occur at all, the runtime is O(N²) where N is the total number of chunks in the stream). Moving the offset outside of the stream-chunk-handler restores linear runtime.
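To illustrate the idea, here is a simplified sketch, not the module's actual implementation (binary-split buffers with bops; `emit`, `onChunkQuadratic`, and `onChunkLinear` below are illustrative names):

```js
var matcher = Buffer.from('\n')   // the split delimiter
var emit = function (piece) {}    // stand-in for pushing a piece downstream
var buf = Buffer.alloc(0)         // accumulated data that has not been emitted yet

// Quadratic variant: the search offset is re-initialized inside the chunk
// handler. If the matcher never occurs, every chunk re-scans the whole
// accumulated buffer from the start, giving O(N²) work over N chunks.
function onChunkQuadratic (chunk) {
  buf = Buffer.concat([buf, chunk])
  var offset = 0
  var idx = buf.indexOf(matcher, offset)
  while (idx !== -1) {
    emit(buf.slice(offset, idx))
    offset = idx + matcher.length
    idx = buf.indexOf(matcher, offset)
  }
  buf = buf.slice(offset)         // keep only the unterminated tail
}

// Linear variant: the offset lives outside the handler and only moves
// forward, so bytes that were already searched are never scanned again.
var offset = 0
function onChunkLinear (chunk) {
  buf = Buffer.concat([buf, chunk])
  var idx = buf.indexOf(matcher, offset)
  while (idx !== -1) {
    emit(buf.slice(0, idx))
    buf = buf.slice(idx + matcher.length)
    offset = 0
    idx = buf.indexOf(matcher, offset)
  }
  // remember how far we have searched; back up by matcher.length - 1 so a
  // matcher straddling the next chunk boundary is still found
  offset = Math.max(buf.length - matcher.length + 1, 0)
  // (end-of-stream flush of the remaining tail is omitted for brevity)
}
```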
```
@@ -37,6 +37,7 @@ function BinarySplit (matcher) {
} else {
  if (offset >= buf.length) {
```
Honestly, I don't see how this clause can ever be evaluated true (it doesn't in the tests and any scenario I can think of). So, the `offset = 0` two lines below is kind of a guess…
Hmm, I'm also not sure whether this block is necessary there. @maxogden is it safe to remove or does it have a special meaning that's not obvious?
Can't remember :P Probably me just being overly defensive
This looks great to me. Unfortunately we don't have any real stress tests in the test suite to find potential regressions in changes like this, but if the change works fine with something like a big Tile-Reduce job, let's merge.
(this one fails before max-mapper#8)
OK, my initial approach turned out not to be good enough, so I've replaced it with something more robust including some test.
```
  }
} else if (idx === 0) {
  buf = bops.subarray(buf, offset + matcher.length)
if (typeof idx === 'number' && idx < buf.length) {
```
`idx` can be either `false` or a number, so `typeof` is not necessary here: just `idx !== false`. Although for consistency with JS `indexOf`, I'd return `-1` instead of `false`.
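A minimal sketch of the suggested convention (the `firstMatch` helper name below is hypothetical; the point is only the `-1` return and the `idx !== -1` check):

```js
// Hypothetical helper: returns the index of the next matcher occurrence,
// or -1 when there is none, mirroring String.prototype.indexOf.
var idx = firstMatch(buf, offset)
if (idx !== -1) {
  // matcher found: emit the piece that ends at idx
} else {
  // no matcher in the buffered data yet: wait for the next chunk
}
```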
updated this in 8735002
(this one fails before max-mapper#8)
Great, looks good now! Maybe let's squash into one commit?
Rebased and merged.
Nope, had to revert: I ran a local benchmark on a big text file with short lines and it's 60% slower. We need to investigate this.
My benchmark script, using Node 5.8.0:

```js
var fs = require('fs');
var split = require('./');

console.time('split');
fs.createReadStream('../mbtiles/latest.planet-z12.txt')
  .pipe(split())
  .on('data', function () {})
  .on('end', function () {
    console.timeEnd('split');
  });
```

The text file is a 28M list of line-delimited tile numbers (e.g. …).
use global offset for buffered chunks
Performance is great now.
Yeah. I didn't think that adding one additional …
@tyrasd it's O(1) but happens a lot with short lines, in a hot loop.
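For a rough sense of scale (assuming tile-number lines of roughly 14 bytes): a 28M file holds on the order of two million lines, so any constant per-line cost in that loop is paid a couple of million times.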