Faster and dependency-free MurmurHash3_32 implementation #5


Closed
wants to merge 1 commit into from

Conversation

alex-schneider

No description provided.

@twmb
Owner

twmb commented Feb 16, 2020

Thanks for the PR, but unfortunately this fails the tests. I'm not sure how much this would buy in general, since it looks like it replaces some stuff that exists for speed ("these 8 bytes are a uint64") with a bunch of slice indexing (which causes mem barriers), and replaces the switch (fast) with a loop (not as fast).
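For illustration, the two reading styles being contrasted can be sketched like this (a hypothetical sketch, not the repository's code): a one-load reinterpretation via `unsafe`, versus assembling the value index by index.

```go
package main

import (
	"fmt"
	"unsafe"
)

// readNative reinterprets four bytes as a uint32 in one load; the
// result depends on the host CPU's byte order.
func readNative(data []byte) uint32 {
	return *(*uint32)(unsafe.Pointer(&data[0]))
}

// readLE assembles the value index by index (each index may cost a
// bounds check), always in little-endian order.
func readLE(data []byte) uint32 {
	return uint32(data[0]) | uint32(data[1])<<8 | uint32(data[2])<<16 | uint32(data[3])<<24
}

func main() {
	data := []byte{0x01, 0x02, 0x03, 0x04}
	fmt.Printf("native=%08x le=%08x\n", readNative(data), readLE(data))
}
```

On a little-endian machine the two agree; on a big-endian machine they differ, which is the crux of the rest of this thread.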

@alex-schneider
Author

alex-schneider commented Feb 16, 2020

Hi twmb,

I'm afraid your tests are also testing the wrong thing here.

Your implementation uses the unsafe package. Line 30 in the murmur32_gen.go file always generates the uint32 value of a big-endian machine (https://en.wikipedia.org/wiki/Endianness), even on my little-endian home PC.

See https://en.wikipedia.org/wiki/MurmurHash

for (size_t i = len >> 2; i; i--) {
    // Here is a source of differing results across endiannesses.
    // A swap here has no effects on hash properties though.
    k = *((uint32_t*)key);
    ...
}

To verify this, you can simply replace line 38 in my PR as follows (different indexing into the data slice):

- k = uint32(data[0])<<24 | uint32(data[1])<<16 | uint32(data[2])<<8 | uint32(data[3])
+ k = uint32(data[3])<<24 | uint32(data[2])<<16 | uint32(data[1])<<8 | uint32(data[0])

... and let the tests run again.

Do you know a way to make k endianness-independent in Go?
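For reference, the standard-library answer to that question is encoding/binary, which fixes the decoding order by the call rather than by the host CPU (a sketch, not the PR's code):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeLE reads four bytes in little-endian order on every platform;
// encoding/binary fixes the byte order in the call, not in the CPU.
func decodeLE(data []byte) uint32 {
	return binary.LittleEndian.Uint32(data)
}

func main() {
	data := []byte{0x01, 0x02, 0x03, 0x04}
	fmt.Printf("LE: %08x\n", decodeLE(data))                // 04030201 on any machine
	fmt.Printf("BE: %08x\n", binary.BigEndian.Uint32(data)) // 01020304 on any machine
}
```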

Finally: Sorry for my bad English :)

@twmb
Owner

twmb commented Feb 16, 2020

I see the endianness point, and now see it in the canonical source as well as in your quote from Wikipedia.

I would say that the tests are based around little endianness, though, since my laptop is little endian and the tests check against the canonical source. The unsafe conversion just says four bytes are now a uint32, and then all the math works on that. Your PR is piecemeal converting four contiguous bytes to a uint32, reading those bytes as if they were encoded in big-endian order.

I think the confusion here really comes from the fact that murmur3 doesn't actually define an endianness for its hashing. Your PR assumes that we have to convert blocks of four bytes from big-endian encoded order to CPU-native order. The murmur3 canonical source doesn't say that such a conversion needs to happen. The murmur2 source explicitly documents that the resulting hashes vary depending on endianness. Even that Wikipedia quote says the results vary across endiannesses. The murmur3 source does have opening documentation that x86 and x64 produce different results, which in a way implies that hashes across platforms are not consistent.

I did some looking and found a conversation about the Dovecot(??) source around this same problem. The path they took to resolve it was to read the input as little endian (see le32_to_cpu). They've since removed the murmur3 hashing since they no longer use it.

As @dignifiedquire points out on this discussion around this same problem with @spaolacci's murmur3, murmur3 hashing is underspecified. His linked Rust murmur3 library also favors reading blocks of four as little endian uint32s.

Other systems also seem to have this same endian bug, with the common report that tests fail on big endian systems: Python cassandra driver, even a bug report on the canonical code.

Lastly, although this is likely a non-issue, changing the reading from platform-dependent to always-big-endian is a behavior change that may break users, especially considering that little endian systems are the most common.

I think a better approach here would be to define new APIs:

NewNeutral
SeedNewNeutral
SeedNeutralStringSum
SeedNeutralSum
NeutralStringSum
NeutralSum

for 32, 64, and where relevant, 128.

I do think it's unfortunate that this would basically double the API surface, which makes go doc more intimidating than it should be, but this would allow people to opt into platform neutral hashing when necessary, while avoiding any hashing breakage for people using the existing hashing with no problems.

Alternatively, since big endian is likely pretty uncommon, we could just define the hashing based off of little endian blocks. But, in short, I favor little endian since it's the most common.

Thoughts?

@alex-schneider
Author

Great respect for your research!!!

I favor little-endian too because, as you wrote, it's "the most common."

Your PR is piecemeal converting four contiguous bytes to a uint32, reading those bytes as if they were encoded big endian order.

The other way around: my PR handles the data as little-endian encoded:

// Example: given a uint32 with value 1
//
// B-E bytes: 00000000 00000000 00000000 00000001
// L-E bytes: 00000001 00000000 00000000 00000000
//
// data: [0]<<24  [1]<<16   [2]<<8   [3]<<0

Thoughts?

What do you think about an endianness-dependent solution? (But then the tests would have to be duplicated for both endiannesses.)

Unfortunately I still haven't found a reliable way in Go to check the system endianness, other than executing native C code like this (simple example):

// int is_bigEndian() {
//    const int i = 1;
//    return (*(char *)&i == 0) ? 1 : 0;
// }
import "C"

const (
	endiannessUnknown = iota
	endiannessLittle
	endiannessBig
)

var endianness = endiannessUnknown

func init() {
	switch int(C.is_bigEndian()) {
	case 0:
		endianness = endiannessLittle
	case 1:
		endianness = endiannessBig
	default:
		endianness = endiannessUnknown
	}
}

I hope you understand what I mean.

@twmb
Owner

twmb commented Feb 17, 2020

I've done this in the past with some unsafe code, e.g. https://play.golang.org/p/wyamz1LFWxL.

What do you mean by an endianness dependent solution?

@alex-schneider
Author

The unsafe.Pointer() construct from your example returns the bytes in reverse order: raw[1 0]

On a little-endian machine I would expect raw[0 1], like in C:

const int i = 1;
char *raw = (char *)&i;

raw[0] == 0
raw[1] == 0
raw[n] == 0
raw[sizeof(int) - 1] == 1

I think that is exactly why the tests in my PR fail.

What do you mean by an endianness dependent solution?

Like the following?

if isBigE() {
  k = uint32(data[0])<<24 | uint32(data[1])<<16 | uint32(data[2])<<8 | uint32(data[3])
} else {
  k = uint32(data[3])<<24 | uint32(data[2])<<16 | uint32(data[1])<<8 | uint32(data[0])
}

@twmb
Owner

twmb commented Feb 18, 2020

You write "On a little-endian machine I would expect raw[0 1]", but that's not what your prior C example shows? In that code, if the first byte of the raw conversion is 1, then we're little endian. That's the same thing the unsafe code is doing. Manually using binary.LittleEndian.PutUint16 agrees: https://play.golang.org/p/hdSVDyrDXYX.

But ultimately, this is a question of how to read four contiguous bytes and parse them into a number. I think keeping the current behavior is best, whereas your proposed solution reverses the order of reading numbers, changing all tests. But it also looks to try to get the same hash across both big- and little-endian platforms, which is what I proposed at the end of my earlier comment.

I think that this may suffice to do the same thing by making big endian platforms conform to little endian:

 
+var isBig = func() bool {
+       one := uint16(1)
+       return (*(*[2]byte)(unsafe.Pointer(&one)))[0] == 0
+}()
+
 // SeedSum32 returns the murmur3 sum of data with the digest initialized to
 // seed.
 func SeedSum32(seed uint32, data []byte) (h1 uint32) {
@@ -28,6 +33,9 @@ func SeedSum32(seed uint32, data []byte) (h1 uint32) {
        nblocks := len(data) / 4
        for i := 0; i < nblocks; i++ {
                k1 := *(*uint32)(unsafe.Pointer(&data[i*4]))
+               if isBig {
+                       k1 = k1>>24 | (k1&0x00ff0000)>>8 | (k1&0x0000ff00)<<8 | k1<<24
+               }

Although, I think the logic in the tail code also may break things. I used to have some test matrix for this code, maybe it'd be worth re-setting that up with github actions and adding big and little endian.

@alex-schneider
Author

You write "On a little-endian machine I would expect raw[0 1]", but...

Right, I made a mistake in my thinking there.

I think that this may suffice to do the same thing by making big endian platforms conform to little endian:...

I think the simplest solution is to use the binary package:

- k1 := *(*uint32)(unsafe.Pointer(&data[i*4]))
+ k1 := binary.LittleEndian.Uint32(data[i*4:])

@twmb
Owner

twmb commented Feb 20, 2020

Unfortunately, just doing that results in a nearly 40% hit to throughput on my laptop, likely due to the compiler being unable to prove that the data[i*4] is in range, and then unable to prove that the fourth slot past that is in range, meaning two bounds checks (I'm just guessing here).

One can be eliminated by doing

b := data[:4]
data = data[4:]

so that does one bounds check (then the four indexes into b are proven in bounds, no further checking), same as we have right now, but it has a membarrier penalty on assign which drops the performance by 10%.
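That slicing pattern looks roughly like the following (a hypothetical helper sketching the shape, not the library's code):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sumBlocks shows the bounds-check-friendly shape: re-slicing a 4-byte
// window once lets the compiler prove the reads are in range, trading
// per-index bounds checks for a slice-header assignment.
func sumBlocks(data []byte) uint32 {
	var h uint32
	for len(data) >= 4 {
		b := data[:4]   // single check; reads from b are proven in bounds
		data = data[4:] // advance the window
		h += binary.LittleEndian.Uint32(b)
	}
	return h
}

func main() {
	fmt.Println(sumBlocks([]byte{1, 0, 0, 0, 2, 0, 0, 0})) // 3
}
```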

Another benefit of my suggested code is that the conditional is likely 100% predicted, meaning it's of no hit to performance (no change on my machine).

Unfortunately, it's a 58% hit if the flip is needed.

I can't come up with a way that keeps the performance while working the same on both endiannesses, which leads me to believe that, if we want such behavior, it's worth adding new APIs and documenting that the existing APIs return hashes that are not portable across architectures.

@twmb
Owner

twmb commented Feb 26, 2020

We're in "luck". Due to failures on go1.14, I've converted everything to read little-endian numbers, making the hashes portable across architectures. The speed is not quite as fast, but we do fix some alignment problems, so, pros and cons. It's close to being as fast.

@twmb twmb closed this Feb 26, 2020