Wrong encoding for bodyHTML #34

Closed
@Faelean

If an email contains a bodyHTML (MAPI 0x1013) that is encoded in, for example, UTF-8, the parser ignores the encoding and uses CP1252, causing characters like ü to be displayed as Ã¼.

private String convertValueToString(final Object value) {
    if (value == null) {
        return null;
    }
    if (value instanceof String) {
        return (String) value;
    } else if (value instanceof byte[]) {
        return new String((byte[]) value, CharsetHelper.WINDOWS_CHARSET);
    } else {
        LOGGER.trace("Unexpected body class: {} (expected String or byte[])", value.getClass().getName());
        return value.toString();
    }
}

The problem is that the correct charset is not known when the String constructor is called. There might be a more efficient way to do this, but this is what we've come up with to replace line 259:

String convertedString = new String((byte[]) value, CharsetHelper.WINDOWS_CHARSET);
Pattern pattern = Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);
Matcher m = pattern.matcher(convertedString);
if(m.find()) {
	try {
		convertedString = new String((byte[]) value, Charset.forName(m.group(2)));
	} catch (Exception e) {
		//ignore and use default charset
	}
}
return convertedString;

First step: convert everything as before.
Second step: check the resulting String for a charset declaration. The regex matches the following two patterns and extracts the charset:

<meta charset="utf-8" /> 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

If there is a charset in the resulting String, the bytes are decoded again using that charset and the new String replaces the first one; otherwise the already created String is used. The try/catch block guards the Charset.forName call in case the charset declared in the bodyHTML is malformed or unsupported.
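
For anyone who wants to try the idea outside the parser, here is a minimal standalone sketch of the two-step decode. The class name HtmlBodyDecoder, the method decode, and the FALLBACK constant are made up for this example; FALLBACK only stands in for CharsetHelper.WINDOWS_CHARSET, assuming that constant is windows-1252. The regex is the one from above. The sketch catches IllegalArgumentException rather than Exception, since both exceptions thrown by Charset.forName for a bad name are subclasses of it. Sniffing from the CP1252-decoded text should work for ASCII-compatible encodings, because the charset declaration itself is plain ASCII; a body encoded in something like UTF-16 would presumably not be detected this way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the library: illustrates the two-step decode only.
final class HtmlBodyDecoder {

    // Assumption: stands in for CharsetHelper.WINDOWS_CHARSET (taken to be windows-1252).
    static final Charset FALLBACK = Charset.forName("windows-1252");

    // Same regex as in the proposal: matches charset=utf-8 and charset="utf-8".
    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);

    static String decode(final byte[] body) {
        if (body == null) {
            return null;
        }
        // Step 1: decode with the fallback charset, as the parser does today.
        String converted = new String(body, FALLBACK);

        // Step 2: look for a declared charset and decode the original bytes again.
        Matcher m = CHARSET_PATTERN.matcher(converted);
        if (m.find()) {
            try {
                converted = new String(body, Charset.forName(m.group(2)));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: keep the fallback result.
            }
        }
        return converted;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"utf-8\" /></head><body>\u00fc</body></html>";
        byte[] utf8 = html.getBytes(StandardCharsets.UTF_8);
        // Compare the plain fallback decode with the sniffing decode.
        System.out.println(new String(utf8, FALLBACK));
        System.out.println(decode(utf8));
    }
}
```

Bodies without a charset declaration, or with one that Java does not know, go through the same path as before, so the change should only affect emails that declare their encoding.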
