Wrong encoding for bodyHTML #34

If an email contains a bodyHTML property (MAPI 0x1013) that is encoded in, for example, UTF-8, the parser ignores that encoding and decodes the bytes as CP1252, so characters like ü are displayed as Ã¼.
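
The effect is easy to reproduce outside the parser. The following is a minimal sketch, not code from the library: the class and method names are made up for illustration, and it assumes the parser's default charset is windows-1252, as described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Hypothetical helper: decodes raw bytes as windows-1252, mirroring the behaviour described above.
    static String decodeAsCp1252(byte[] raw) {
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // "\u00fc" is u-umlaut; UTF-8 encodes it as the two bytes 0xC3 0xBC.
        byte[] utf8 = "\u00fc".getBytes(StandardCharsets.UTF_8);
        // windows-1252 maps 0xC3 to U+00C3 (A with tilde) and 0xBC to U+00BC (one quarter),
        // so one character turns into two wrong ones.
        System.out.println(decodeAsCp1252(utf8));
        // Decoding the same bytes as UTF-8 restores the original character.
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```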

https://github.com/bbottema/outlook-message-parser/blob/5a6b5d248b37e70c8ad4280194ff612a497ad9ff/src/main/java/org/simplejavamail/outlookmessageparser/model/OutlookMessage.java#L252-L264

The problem is that the correct charset is not known when the String constructor is called. There might be a more efficient way to do this, but this is what we've come up with to replace line 259:

```java
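// Needs java.nio.charset.Charset, java.util.regex.Matcher and java.util.regex.Pattern.
// Step 1: decode with the default charset, exactly as before.
String convertedString = new String((byte[]) value, CharsetHelper.WINDOWS_CHARSET);
// Step 2: look for a declared charset, e.g. charset="utf-8" or charset=utf-8.
Pattern pattern = Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);
Matcher m = pattern.matcher(convertedString);
if (m.find()) {
	try {
		// Decode the original bytes again, this time with the declared charset.
		convertedString = new String((byte[]) value, Charset.forName(m.group(2)));
	} catch (IllegalArgumentException e) {
		// Illegal or unsupported charset name: keep the default-charset result.
	}
}
return convertedString;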
```

First step: convert everything as before.
Second step: check the resulting String for a charset. The regex matches the following two patterns and extracts the charset:
```html
<meta charset="utf-8" /> 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
```
If the resulting String declares a charset, the original bytes are decoded again with that charset and the new String replaces the first one; otherwise the String created in the first step is used. The try/catch block covers Charset.forName, which throws if the charset named in the bodyHTML is illegal or unsupported.
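
To try the two steps in isolation, here is a self-contained sketch of the same idea. The class name `HtmlBodyDecoder`, the method `decode` and the `FALLBACK` constant are hypothetical; `FALLBACK` stands in for `CharsetHelper.WINDOWS_CHARSET`, which is assumed here to be windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlBodyDecoder {
    // Assumption: stand-in for CharsetHelper.WINDOWS_CHARSET.
    static final Charset FALLBACK = Charset.forName("windows-1252");

    // Matches charset="utf-8" as well as charset=utf-8; group 2 holds the charset name.
    static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);

    static String decode(byte[] raw) {
        // Step 1: decode with the fallback; an ASCII-only meta tag survives this unchanged.
        String html = new String(raw, FALLBACK);
        // Step 2: if a charset is declared, decode the original bytes again with it.
        Matcher m = CHARSET_PATTERN.matcher(html);
        if (m.find()) {
            try {
                return new String(raw, Charset.forName(m.group(2)));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the fallback result.
            }
        }
        return html;
    }

    public static void main(String[] args) {
        // "\u00fc" is u-umlaut. The body declares UTF-8 and is encoded as UTF-8.
        byte[] declared = "<meta charset=\"utf-8\" /><p>\u00fc</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(declared));
        // No declaration: the fallback charset is used, as before the change.
        byte[] undeclared = "<p>\u00fc</p>".getBytes(FALLBACK);
        System.out.println(decode(undeclared));
    }
}
```

The sketch relies on the declaration itself being plain ASCII, which reads the same in CP1252 and UTF-8; it would not find a declaration in a body encoded as, say, UTF-16.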

For reference, the linked method (lines 252-264) as it currently stands:

	private String convertValueToString(final Object value) {
		if (value == null) {
			return null;
		}
		if (value instanceof String) {
			return (String) value;
		} else if (value instanceof byte[]) {
			return new String((byte[]) value, CharsetHelper.WINDOWS_CHARSET);
		} else {
			LOGGER.trace("Unexpected body class: {} (expected String or byte[])", value.getClass().getName());
			return value.toString();
		}
	}
