Skip to content

regexp: case-insensitive MatchString performance #13288

Closed
@egonelbre

Description

@egonelbre

I was investigating why https://github.com/dimroc/etl-language-comparison/blob/517507b033dbb938dd1e83401914cbd4dd9a79bc/golang/search.go#L160 ends up significantly slower with regular expression.

strings.Contains(strings.ToLower(message), "knicks") // 7.8s
// versus
rx := regexp.MustCompile("(?i)knicks") // won't be done multiple times
rx.MatchString(message) // 21.7s

Profiling the etl example lead me to:

152.84s of 158.97s total (96.14%)
Dropped 167 nodes (cum <= 0.79s)
      flat  flat%   sum%        cum   cum%
    61.12s 38.45% 38.45%     61.25s 38.53%  unicode.SimpleFold
    22.62s 14.23% 52.68%    121.64s 76.52%  regexp.(*machine).tryBacktrack
    18.76s 11.80% 64.48%     18.76s 11.80%  regexp.(*bitState).push
    10.41s  6.55% 71.03%     71.66s 45.08%  regexp/syntax.(*Inst).MatchRunePos
     8.99s  5.66% 76.68%      9.41s  5.92%  regexp.(*inputString).step
     8.94s  5.62% 82.30%    135.55s 85.27%  regexp.(*machine).backtrack
     5.76s  3.62% 85.93%      5.76s  3.62%  runtime.osyield
     3.48s  2.19% 88.12%     75.14s 47.27%  regexp/syntax.(*Inst).MatchRune

Which lead me to: https://github.com/golang/go/blob/master/src/regexp/syntax/prog.go#L213

I assume that there is a good reason for using unicode.SimpleFold, i.e. some special unicode symbols. When I changed it:

for r1 := unicode.SimpleFold(r0); r1 != r0; r1 = unicode.SimpleFold(r1) {
// into
if r1 := unicode.ToLower(r0); r1 == r || r1 == unicode.ToLower(r) {

It didn't break any tests... also I created easy0i = "(?i)ABCDEFGHIJKLMNOPQRSTUVWXYZ$" in https://github.com/golang/go/blob/master/src/regexp/exec_test.go#L674 to measure the performance change:

// old
BenchmarkMatchEasy0i_32-8                 500000              2860 ns/op          11.19 MB/s
BenchmarkMatchEasy0i_1K-8                  20000             84204 ns/op          12.16 MB/s
BenchmarkMatchEasy0i_32K-8                   500           3346191 ns/op           9.79 MB/s
BenchmarkMatchEasy0i_1M-8                     10         107206130 ns/op           9.78 MB/s
BenchmarkMatchEasy0i_32M-8                     1        3424195900 ns/op           9.80 MB/s
// new
BenchmarkMatchEasy0i_32-8                1000000              1831 ns/op          17.48 MB/s
BenchmarkMatchEasy0i_1K-8                  30000             56369 ns/op          18.17 MB/s
BenchmarkMatchEasy0i_32K-8                  1000           2276130 ns/op          14.40 MB/s
BenchmarkMatchEasy0i_1M-8                     20          73004175 ns/op          14.36 MB/s
BenchmarkMatchEasy0i_32M-8                     1        2334133500 ns/op          14.38 MB/s

So, is there a reason why unicode.SimpleFold is used instead of something else, such as unicode.ToLower? Either way, there might be some significant performance gains here for case-insensitive matches.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions