Description
If an email contains a bodyHTML property (MAPI 0x1013) that is encoded in, for example, UTF-8, the parser ignores the encoding and decodes it as CP1252, so characters like ü are displayed as Ã¼.
The problem is that the correct charset is not known when the String constructor is called. There may be a more efficient way to do this, but this is what we came up with to replace line 259:
    String convertedString = new String((byte[]) value, CharsetHelper.WINDOWS_CHARSET);
    Pattern pattern = Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);
    Matcher m = pattern.matcher(convertedString);
    if (m.find()) {
        try {
            convertedString = new String((byte[]) value, Charset.forName(m.group(2)));
        } catch (Exception e) {
            // ignore and use default charset
        }
    }
    return convertedString;
First step: convert everything as before, using CP1252.
Second step: check the resulting String for a charset declaration. The regex matches the following two patterns and extracts the charset:
<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
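If the resulting String declares a charset, decode the original bytes again with that charset and overwrite the result; otherwise keep the String that was already created. This works for ASCII-compatible encodings such as UTF-8, because the markup containing the declaration survives the initial CP1252 decoding intact. The try/catch block is there for Charset.forName, in case someone messed up the charset in the bodyHTML.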
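
For anyone who wants to try the two-step approach outside the parser, here is a minimal standalone sketch. The class name HtmlBodyDecoder and the FALLBACK constant are hypothetical; FALLBACK stands in for CharsetHelper.WINDOWS_CHARSET, which is assumed here to be windows-1252. The catch is narrowed to IllegalArgumentException, the common superclass of the two exceptions Charset.forName throws for a malformed or unsupported name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the parser.
final class HtmlBodyDecoder {
    // Assumption: stand-in for CharsetHelper.WINDOWS_CHARSET (windows-1252).
    private static final Charset FALLBACK = Charset.forName("windows-1252");

    // Matches charset=utf-8 and charset="utf-8"; the backreference \1
    // requires the closing quote to match the opening one (or both absent).
    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);

    static String decode(byte[] raw) {
        // Step 1: decode with the fallback charset, as before.
        String decoded = new String(raw, FALLBACK);

        // Step 2: look for a declared charset and decode the bytes again with it.
        Matcher m = CHARSET_PATTERN.matcher(decoded);
        if (m.find()) {
            try {
                decoded = new String(raw, Charset.forName(m.group(2)));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: keep the fallback result.
            }
        }
        return decoded;
    }

    public static void main(String[] args) {
        String html = "<meta charset=\"utf-8\" /><p>Gr\u00fc\u00dfe</p>";
        byte[] raw = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(raw).equals(html)); // prints true
    }
}
```

Note that the regex, like the one in the snippet above, does not handle single-quoted declarations such as charset='utf-8'; in that case the fallback decoding is returned unchanged.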