Commit 92f1422

Add support for additional text encodings.

This includes, but is not limited to, UTF-16, latin-1, GBK, EUC-JP and Shift_JIS. (Courtesy of the `encoding_rs` crate.)

Specifically, this feature enables ripgrep to search files that are encoded in an encoding other than UTF-8. The list of available encodings is tied directly to what the `encoding_rs` crate supports, which is in turn tied to the Encoding Standard. The full list of available encodings can be found here: https://encoding.spec.whatwg.org/#concept-encoding-get

This pull request also introduces the notion that text encodings can be automatically detected on a best effort basis. Currently, the only support for this is checking for a UTF-16 BOM. In all other cases, a text encoding of `auto` (the default) implies a UTF-8 or ASCII compatible source encoding. When a text encoding is otherwise specified, it is unconditionally used for all files searched.

Since ripgrep's regex engine is fundamentally built on top of UTF-8, this feature works by transcoding the files to be searched from their source encoding to UTF-8. This transcoding only happens when:

1. `auto` is specified and a non-UTF-8 encoding is detected.
2. A specific encoding is given by end users (including UTF-8).

When transcoding occurs, errors are handled by automatically inserting the Unicode replacement character. In this case, ripgrep's output is guaranteed to be valid UTF-8 (excluding non-UTF-8 file paths, if they are printed).

In all other cases, the source text is searched directly, on the assumption that it is at least ASCII compatible (UTF-8 being the most useful case). In this scenario, encoding errors are not detected, and ripgrep's output will match the input exactly, byte-for-byte.

This design may not be optimal in all cases, but it has some advantages:

1. The happy path ("UTF-8 everywhere") remains happy. I have not been able to witness any performance regressions.
2. In the non-UTF-8 path, implementation complexity is kept relatively low.

The cost here is transcoding itself. A potentially superior implementation might build decoding of any encoding into the regex engine itself. In particular, the fundamental problem with transcoding everything first is that literal optimizations are nearly negated.

Future work should entail improving the user experience. For example, we might want to auto-detect more text encodings. A more elaborate UX might permit end users to specify multiple text encodings, although this seems hard to pull off in an ergonomic way.

Fixes #1
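
The `auto` behavior described in the message (sniff a UTF-16 BOM, transcode to UTF-8 with replacement characters, otherwise search the bytes untouched) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in this commit: the commit relies on `encoding_rs`, and the names `Utf16Order`, `sniff_utf16_bom`, `utf16_to_utf8_lossy` and `transcode_if_detected` are hypothetical.

```rust
/// Byte order implied by a leading UTF-16 BOM (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Utf16Order {
    Little,
    Big,
}

/// Return the byte order if `bytes` starts with a UTF-16 BOM.
pub fn sniff_utf16_bom(bytes: &[u8]) -> Option<Utf16Order> {
    match bytes {
        [0xFF, 0xFE, ..] => Some(Utf16Order::Little),
        [0xFE, 0xFF, ..] => Some(Utf16Order::Big),
        _ => None,
    }
}

/// Transcode UTF-16 bytes (BOM already stripped) to UTF-8, replacing
/// malformed sequences with U+FFFD instead of failing.
pub fn utf16_to_utf8_lossy(bytes: &[u8], order: Utf16Order) -> String {
    let units = bytes.chunks(2).map(|pair| match (pair, order) {
        (&[a, b], Utf16Order::Little) => u16::from_le_bytes([a, b]),
        (&[a, b], Utf16Order::Big) => u16::from_be_bytes([a, b]),
        // A trailing odd byte cannot form a code unit. Map it to a lone
        // surrogate so that it decodes to U+FFFD below.
        _ => 0xDC00,
    });
    char::decode_utf16(units)
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

/// `auto` mode: transcode only when a UTF-16 BOM is present. `None` means
/// "search the original bytes directly, byte-for-byte".
pub fn transcode_if_detected(bytes: &[u8]) -> Option<String> {
    let order = sniff_utf16_bom(bytes)?;
    Some(utf16_to_utf8_lossy(&bytes[2..], order))
}
```

Returning `None` for input without a BOM keeps the "UTF-8 everywhere" path free of any copying, which mirrors the stated goal of leaving the happy path unchanged. A real implementation would stream the input rather than materialize a whole `String`.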
1 parent 80e91a1 commit 92f1422

10 files changed

+607
-13
lines changed

Cargo.lock

+16

Cargo.toml

+1
@@ -29,6 +29,7 @@ path = "tests/tests.rs"
 atty = "0.2.2"
 bytecount = "0.1.4"
 clap = "2.20.5"
+encoding_rs = "0.5.0"
 env_logger = { version = "0.4", default-features = false }
 grep = { version = "0.1.5", path = "grep" }
 ignore = { version = "0.1.7", path = "ignore" }

README.md

+6-8
@@ -83,6 +83,10 @@ increases the times to `3.081s` for ripgrep and `11.403s` for GNU grep.
   of search results, searching multiple patterns, highlighting matches with
   color and full Unicode support. Unlike GNU grep, `ripgrep` stays fast while
   supporting Unicode (which is always on).
+* `ripgrep` supports searching files in text encodings other than UTF-8, such
+  as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for
+  automatically detecting UTF-16 is provided. Other text encodings must be
+  specifically specified with the `-E/--encoding` flag.)
 
 In other words, use `ripgrep` if you like speed, filtering by default, fewer
 bugs and Unicode support.
@@ -101,18 +105,12 @@ give you a glimpse at some important downsides or missing features of
   support for Unicode categories (e.g., `\p{Sc}` to match currency symbols or
   `\p{Lu}` to match any uppercase letter). (Fancier regexes will never be
   supported.)
-* If you need to search files with text encodings other than UTF-8 (like
-  UTF-16), then `ripgrep` won't work. `ripgrep` will still work on ASCII
-  compatible encodings like latin1 or otherwise partially valid UTF-8.
-  `ripgrep` *can* search for arbitrary bytes though, which might work in
-  a pinch. (Likely to be supported in the future.)
 * `ripgrep` doesn't yet support searching compressed files. (Likely to be
   supported in the future.)
 * `ripgrep` doesn't have multiline search. (Unlikely to ever be supported.)
 
-In other words, if you like fancy regexes, non-UTF-8 character encodings,
-searching compressed files or multiline search, then `ripgrep` may not quite
-meet your needs (yet).
+In other words, if you like fancy regexes, searching compressed files or
+multiline search, then `ripgrep` may not quite meet your needs (yet).
 
 ### Is it really faster than everything else?
 
doc/rg.1.md

+7
@@ -136,6 +136,13 @@ Project home page: https://github.com/BurntSushi/ripgrep
 --debug
 : Show debug messages.
 
+-E, --encoding *ENCODING*
+: Specify the text encoding that ripgrep will use on all files
+  searched. The default value is 'auto', which will cause ripgrep to do
+  a best effort automatic detection of encoding on a per-file basis.
+  Other supported values can be found in the list of labels here:
+  https://encoding.spec.whatwg.org/#concept-encoding-get
+
 -f, --file FILE ...
 : Search for patterns from the given file, with one pattern per line. When this
   flag is used or multiple times or in combination with the -e/--regexp flag,

src/app.rs

+12-2
@@ -96,6 +96,8 @@ fn app<F>(next_line_help: bool, doc: F) -> App<'static, 'static>
              .possible_values(&["never", "auto", "always", "ansi"]))
         .arg(flag("colors").value_name("SPEC")
              .takes_value(true).multiple(true).number_of_values(1))
+        .arg(flag("encoding").short("E").value_name("ENCODING")
+             .takes_value(true).number_of_values(1))
         .arg(flag("fixed-strings").short("F"))
         .arg(flag("glob").short("g")
              .takes_value(true).multiple(true).number_of_values(1)
@@ -251,6 +253,14 @@ lazy_static! {
               change the match color to magenta and the background color for \
               line numbers to yellow:\n\n\
               rg --colors 'match:fg:magenta' --colors 'line:bg:yellow' foo.");
+        doc!(h, "encoding",
+             "Specify the text encoding of files to search.",
+             "Specify the text encoding that ripgrep will use on all files \
+              searched. The default value is 'auto', which will cause ripgrep \
+              to do a best effort automatic detection of encoding on a \
+              per-file basis. Other supported values can be found in the list \
+              of labels here: \
+              https://encoding.spec.whatwg.org/#concept-encoding-get");
         doc!(h, "fixed-strings",
              "Treat the pattern as a literal string.",
              "Treat the pattern as a literal string instead of a regular \
@@ -335,9 +345,9 @@ lazy_static! {
               provided are searched. Empty pattern lines will match all input \
               lines, and the newline is not counted as part of the pattern.");
         doc!(h, "files-with-matches",
-             "Only show the path of each file with at least one match.");
+             "Only show the paths with at least one match.");
         doc!(h, "files-without-match",
-             "Only show the path of each file that contains zero matches.");
+             "Only show the paths that contains zero matches.");
         doc!(h, "with-filename",
              "Show file name for each match.",
              "Prefix each match with the file name that contains it. This is \

src/args.rs

+32
@@ -10,6 +10,7 @@ use std::sync::Arc;
 use std::sync::atomic::{AtomicBool, Ordering};
 
 use clap;
+use encoding_rs::Encoding;
 use env_logger;
 use grep::{Grep, GrepBuilder};
 use log;
@@ -41,6 +42,7 @@ pub struct Args {
     column: bool,
     context_separator: Vec<u8>,
     count: bool,
+    encoding: Option<&'static Encoding>,
     files_with_matches: bool,
     files_without_matches: bool,
     eol: u8,
@@ -224,6 +226,7 @@ impl Args {
             .after_context(self.after_context)
             .before_context(self.before_context)
             .count(self.count)
+            .encoding(self.encoding)
             .files_with_matches(self.files_with_matches)
             .files_without_matches(self.files_without_matches)
             .eol(self.eol)
@@ -330,6 +333,7 @@ impl<'a> ArgMatches<'a> {
             column: self.column(),
             context_separator: self.context_separator(),
             count: self.is_present("count"),
+            encoding: try!(self.encoding()),
             files_with_matches: self.is_present("files-with-matches"),
             files_without_matches: self.is_present("files-without-match"),
             eol: b'\n',
@@ -569,13 +573,18 @@ impl<'a> ArgMatches<'a> {
     /// will need to search.
     fn mmap(&self, paths: &[PathBuf]) -> Result<bool> {
         let (before, after) = try!(self.contexts());
+        let enc = try!(self.encoding());
         Ok(if before > 0 || after > 0 || self.is_present("no-mmap") {
             false
         } else if self.is_present("mmap") {
             true
         } else if cfg!(target_os = "macos") {
             // On Mac, memory maps appear to suck. Neat.
             false
+        } else if enc.is_some() {
+            // There's no practical way to transcode a memory map that isn't
+            // isomorphic to searching over io::Read.
+            false
         } else {
             // If we're only searching a few paths and all of them are
             // files, then memory maps are probably faster.
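
The order of checks in the `mmap` hunk above matters, so here is a minimal standalone sketch of the same decision chain. The `MmapInputs` struct and `should_mmap` function are hypothetical stand-ins for the `clap` lookups in the diff, and `few_regular_files` summarizes the final branch, whose body is cut off in the hunk.

```rust
/// Hypothetical, flattened view of the inputs the `mmap` decision reads.
#[derive(Debug, Default, Clone)]
pub struct MmapInputs {
    pub before_context: usize,
    pub after_context: usize,
    pub no_mmap_flag: bool,
    pub mmap_flag: bool,
    /// True when `-E/--encoding` names a concrete encoding (not `auto`).
    pub explicit_encoding: bool,
    /// Assumed summary of the truncated final branch: only a few paths
    /// are searched and all of them are regular files.
    pub few_regular_files: bool,
}

/// Mirror the branch order of the diff: context lines or `--no-mmap`
/// disable memory maps, `--mmap` forces them, macOS disables them, an
/// explicit encoding disables them, and otherwise a heuristic decides.
pub fn should_mmap(i: &MmapInputs) -> bool {
    if i.before_context > 0 || i.after_context > 0 || i.no_mmap_flag {
        false
    } else if i.mmap_flag {
        true
    } else if cfg!(target_os = "macos") {
        false
    } else if i.explicit_encoding {
        // Transcoding needs a streaming reader, so skip memory maps.
        false
    } else {
        i.few_regular_files
    }
}
```

As the branch order in the hunk suggests, an explicit `--mmap` is tested before the encoding check, so in this sketch `mmap_flag` combined with `explicit_encoding` still yields `true`. Only the automatic choice is affected by `-E`.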
@@ -721,6 +730,29 @@ impl<'a> ArgMatches<'a> {
         Ok(ColorSpecs::new(&specs))
     }
 
+    /// Return the text encoding specified.
+    ///
+    /// If the label given by the caller doesn't correspond to a valid
+    /// supported encoding (and isn't `auto`), then return an error.
+    ///
+    /// A `None` encoding implies that the encoding should be automatically
+    /// detected on a per-file basis.
+    fn encoding(&self) -> Result<Option<&'static Encoding>> {
+        match self.0.value_of_lossy("encoding") {
+            None => Ok(None),
+            Some(label) => {
+                if label == "auto" {
+                    return Ok(None);
+                }
+                match Encoding::for_label(label.as_bytes()) {
+                    Some(enc) => Ok(Some(enc)),
+                    None => Err(From::from(
+                        format!("unsupported encoding: {}", label))),
+                }
+            }
+        }
+    }
+
     /// Returns the approximate number of threads that ripgrep should use.
     fn threads(&self) -> Result<usize> {
         if self.is_present("sort-files") {
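
The three outcomes of the new `encoding` method (no flag or `auto` gives `None`, a known label gives an encoding, anything else is an error) can be exercised without `clap` or `encoding_rs`. The sketch below is an assumption-laden stand-in: `lookup_label` is a hypothetical, deliberately tiny table that imitates `Encoding::for_label` for a handful of labels from the Encoding Standard, and encodings are represented by their names as plain strings.

```rust
/// Hypothetical stand-in for `encoding_rs::Encoding::for_label`. It covers
/// only a few labels and, like the Encoding Standard, ignores ASCII case
/// and surrounding whitespace.
fn lookup_label(label: &str) -> Option<&'static str> {
    let normalized = label.trim().to_ascii_lowercase();
    match normalized.as_str() {
        "utf-8" | "utf8" => Some("UTF-8"),
        "utf-16" | "utf-16le" => Some("UTF-16LE"),
        "utf-16be" => Some("UTF-16BE"),
        "latin1" | "iso-8859-1" => Some("windows-1252"),
        "sjis" | "shift_jis" => Some("Shift_JIS"),
        _ => None,
    }
}

/// Resolve the value of `-E/--encoding`.
///
/// `Ok(None)` means "detect per file"; `Ok(Some(name))` means "use this
/// encoding for every file"; `Err` carries the same message as the diff.
pub fn parse_encoding(arg: Option<&str>) -> Result<Option<&'static str>, String> {
    match arg {
        // Flag absent, or the literal default value.
        None | Some("auto") => Ok(None),
        Some(label) => lookup_label(label)
            .map(Some)
            .ok_or_else(|| format!("unsupported encoding: {}", label)),
    }
}
```

Using `Option` for the result keeps "auto-detect" distinct from "UTF-8 was requested explicitly", which matters because the commit message says an explicit encoding, UTF-8 included, always triggers transcoding.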
