Skip to content

use global offset for buffered chunks #8

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 15, 2016
Merged

Conversation

tyrasd
Copy link
Contributor

@tyrasd tyrasd commented Mar 14, 2016

For split matchers that occur rarely in a stream with many chunks, resetting the search offset inside the handler makes for very bad performance (e.g. for a matcher that doesn't occur at all, the runtime is O(N²) where N is the total number of chunks in the stream). Moving the offset outside of the stream-chunk-handler restores linear runtime.

For split matchers that occur rarely in a stream with many chunks, resetting the search offset inside the handler makes for very bad performance (e.g. for a matcher that doesn't occur at all, the runtime is O(N²) where N is the total number of chunks in the stream). Moving the offset outside of the stream-chunk-handler restores linear runtime.
@@ -37,6 +37,7 @@ function BinarySplit (matcher) {
} else {
if (offset >= buf.length) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't see how this clause can ever be evaluated true (it doesn't in the tests and any scenario I can think of). So, the offset = 0 two lines below is kind of a guess…

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm also not sure whether this block is necessary there. @maxogden is it safe to remove or does it have a special meaning that's not obvious?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't remember :P Probably me just being overly defensive

@mourner
Copy link
Collaborator

mourner commented Mar 14, 2016

This looks great to me. Unfortunately we don't have any real stress tests in the test suite to find potential regressions in changes like this, but if the change works fine with something like a big Tile-Reduce job, let's merge.

tyrasd added a commit to tyrasd/binary-split that referenced this pull request Mar 14, 2016
tyrasd added a commit to tyrasd/binary-split that referenced this pull request Mar 14, 2016
@tyrasd
Copy link
Contributor Author

tyrasd commented Mar 14, 2016

OK, my initial approach turned out not to be good enough, so I've replaced it with something more robust including some test.

}
} else if (idx === 0) {
buf = bops.subarray(buf, offset + matcher.length)
if (typeof idx === 'number' && idx < buf.length) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idx can be either false or number, so typeof is not necessary here — just idx !== false. Although for consistency with JS indexOf, I'd return -1 instead of false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated this in 8735002

@mourner
Copy link
Collaborator

mourner commented Mar 15, 2016

Great, looks good now! Maybe let's squash into one commit?

mourner pushed a commit that referenced this pull request Mar 15, 2016
(this one fails before #8)
@mourner
Copy link
Collaborator

mourner commented Mar 15, 2016

Rebased and merged.

@mourner mourner closed this Mar 15, 2016
@mourner mourner reopened this Mar 15, 2016
@mourner
Copy link
Collaborator

mourner commented Mar 15, 2016

Nope, had to revert — I ran a local benchmark on a big text file with short lines and it's 60% slower. We need to investigate this.

@mourner
Copy link
Collaborator

mourner commented Mar 15, 2016

My benchmark script, using Node 5.8.0:

var fs = require('fs');
var split = require('./');

console.time('split');

fs.createReadStream('../mbtiles/latest.planet-z12.txt')
.pipe(split())
.on('data', function () {})
.on('end', function () {
    console.timeEnd('split');
});

The text file is 28M list of line-delimited tile numbers (e.g. 5, 122, 206). Master takes 1434ms and this branch 2307ms.

mourner added a commit that referenced this pull request Mar 15, 2016
use global offset for buffered chunks
@mourner mourner merged commit d1d8d0b into max-mapper:master Mar 15, 2016
@mourner
Copy link
Collaborator

mourner commented Mar 15, 2016

Performance is great now.

@tyrasd
Copy link
Contributor Author

tyrasd commented Mar 15, 2016

Yeah. I didn't think that adding one additional Buffer.slice (which should basically just be an O(1) operation, right?) would be such a performance overhead…

@mourner
Copy link
Collaborator

mourner commented Mar 15, 2016

@tyrasd it's O(1) but happens a lot with short lines, in a hot loop.

@tyrasd tyrasd deleted the patch-1 branch March 15, 2016 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants