filter files that have non utf-8 characters in their filenames #2626
ASCII-only is way too exclusive; it will cause trouble for many languages. I'll suggest an alternative solution soon.
@@ -85,6 +85,10 @@ def directory?
  def ls
    path.each_child.map do |child_path|
      PosixFile.stat(child_path)
    end.select do |stats|
      ascii_only = stats[:name].to_s.ascii_only?
`ascii_only?` puts a lot of UTF-8 characters in the wrong bin, I'm afraid. I tried this code on our sandbox instance, and for example these strings get ignored in error (i.e., they should be allowed):
App 22009 output: [2023-03-02 12:06:19 +0200 ] WARN "Not showing file 'これは動作しません' because it is not a UTF-8 filename."
App 22009 output: [2023-03-02 12:06:19 +0200 ] WARN "Not showing file 'тест' because it is not a UTF-8 filename."
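To make the mismatch concrete, here is a minimal sketch that compares `ascii_only?` with an encoding-validity check on the two names from the log lines above. The names `report.txt` and `broken_\xFF.txt` and the helper `name_checks` are hypothetical, added only for illustration; `String#valid_encoding?` is used here as one way to ask "is this valid UTF-8?", not necessarily the check this PR ends up with.

```ruby
# Two names from the log lines above, one plain ASCII name, and one
# hypothetical name containing a byte (0xFF) that never occurs in valid UTF-8.
SAMPLE_NAMES = ["これは動作しません", "тест", "report.txt", "broken_\xFF.txt"].freeze

# Hypothetical helper: reports both checks side by side for one name.
def name_checks(name)
  { ascii_only: name.ascii_only?, valid_encoding: name.valid_encoding? }
end

SAMPLE_NAMES.each do |name|
  # inspect escapes the invalid byte so the line is safe to print
  puts "#{name.inspect}: #{name_checks(name)}"
end
```

The Japanese and Russian names are valid UTF-8 but not ASCII-only, so an `ascii_only?` filter drops them along with the genuinely broken name.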
I'm not very knowledgeable in Ruby, but I managed to find a way to filter out only the file names that contain invalid UTF-8 bytes. I tested it with a little example program that lists the files in my test directory:

puts "Listing files in directory"
entries = Dir.entries("/path/to/dir/")
entries.each do |e|
  begin
    puts "Entry: #{e}" if e.unicode_normalized?
  rescue ArgumentError
    puts "Found invalid UTF-8 bytes in: #{e}"
  end
end
puts "Done"

This allows the Japanese and Russian text which I used in my testing, but filters out the example names I presented in #2624.
Unless it is rescued, the `ArgumentError` looks like this:
My Ruby version in this testing was:
Additionally, in the current solution, there is no indication to the user that there are ignored files in the directory they listed. But I suppose that could be fixed later as well. At least the skipped files are logged on the server.
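One possible way to surface the skipped files later, sketched under assumptions: split the listing with `partition` instead of `select`, and build a notice from the number of hidden entries. The helpers `split_by_name_validity` and `hidden_files_notice`, the notice wording, and the use of `valid_encoding?` as the predicate are all hypothetical; the entries are plain hashes standing in for the stat results.

```ruby
# Hypothetical helper: returns [shown, hidden], where hidden entries are
# those whose :name is not valid in its own encoding.
def split_by_name_validity(entries)
  entries.partition { |stats| stats[:name].to_s.valid_encoding? }
end

# Hypothetical helper: builds a user-facing notice, or nil if nothing is hidden.
def hidden_files_notice(hidden_count)
  return nil if hidden_count.zero?

  "#{hidden_count} file(s) are hidden because their names are not valid UTF-8."
end

entries = [
  { name: "тест" },
  { name: "broken_\xFF" } # hypothetical name with an invalid byte
]
shown, hidden = split_by_name_validity(entries)
puts "Showing #{shown.size} entries"
puts hidden_files_notice(hidden.size)
```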
I got a working implementation for the Files app with this:

# ...
end.select do |stats|
  name_is_valid_unicode = false
  begin
    name_is_valid_unicode = stats[:name].to_s.unicode_normalized?
  rescue ArgumentError
    Rails.logger.warn("Not showing file '#{stats[:name]}' because it is not a UTF-8 filename.")
  end
  name_is_valid_unicode
end.sort_by { |p| p[:directory] ? 0 : 1 }
# ...
Thanks for the info. You're right that we should include other languages. I'll work on something else. I'd like to avoid a
Thank you. Perhaps there are better ways, I didn't look into this any further. I can imagine that
Good looking out @CSC-swesters - I think
Great that you found a more performant way to test this! I tested the code on our 2.0.29 OOD version as well by patching /var/www/ood/apps/sys/dashboard/app/models/files.rb accordingly, and I see the same successful result as you.
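For readers following along, a rescue-free variant of the filter could look roughly like the sketch below. It assumes `String#valid_encoding?` as the check, since the comments above don't spell out which method was finally used. `displayable_entries` is a hypothetical name, a standard-library `Logger` stands in for `Rails.logger`, and the entries are plain hashes standing in for what `PosixFile.stat` returns.

```ruby
require "logger"

# Hypothetical stand-in for Rails.logger so the sketch runs outside Rails.
DEFAULT_LOGGER = Logger.new($stderr)

# Keeps entries whose :name is valid in its own encoding (UTF-8 on a typical
# Linux filesystem), logs the rest, and lists directories first.
def displayable_entries(entries, logger: DEFAULT_LOGGER)
  entries.select do |stats|
    name = stats[:name].to_s
    next true if name.valid_encoding?

    # inspect escapes the invalid bytes so the log line itself stays valid
    logger.warn("Not showing file #{name.inspect} because it is not a UTF-8 filename.")
    false
  end.sort_by { |stats| stats[:directory] ? 0 : 1 }
end

sample = [
  { name: "тест", directory: false },
  { name: "broken_\xFF", directory: false }, # hypothetical invalid name
  { name: "docs", directory: true }
]
puts displayable_entries(sample).map { |stats| stats[:name] }.inspect
```

Because `valid_encoding?` simply returns a boolean, no exception handling is needed in the hot path of the directory listing.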
LGTM
Fixes #2624 by hiding non-ASCII filenames.
We may be able to show them at some point, but will never be able to edit, download or update them. We cannot build routes with non-UTF-8 characters. We may be able to map them to, say, question marks instead, but it'd be nearly impossible to know which file to choose if they're all ????.
In any case, this'll work in the interim and I can create a follow-up ticket for CRUDing them.
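The ambiguity of a question-mark mapping can be illustrated with a short sketch. It uses Ruby's `String#scrub` as one possible way to do such a mapping; the two filenames and the helper `display_name` are hypothetical.

```ruby
# Two hypothetical, distinct filenames that each contain one invalid byte.
name_a = "\xE4.txt"
name_b = "\xE5.txt"

# Hypothetical helper: replaces every invalid byte sequence with "?".
def display_name(name)
  name.scrub("?")
end

puts display_name(name_a)
puts display_name(name_b)
puts "Same display name: #{display_name(name_a) == display_name(name_b)}"
```

Both names collapse to the same display string, so neither a user nor a route built from that string could tell the two files apart.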