Skip to content

Improved signedness data accuracy and consistency (Fixes #327) #336

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 28, 2022

Conversation

mappzor
Copy link
Contributor

@mappzor mappzor commented May 4, 2022

Fixes #327

Full description in zyantific/zydis-db#13

@flobernd flobernd requested review from flobernd and athre0z May 5, 2022 06:17
@flobernd
Copy link
Member

flobernd commented May 5, 2022

Thanks a lot for this huge amount of work @mappzor 🎉

Copy link
Member

@flobernd flobernd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks a lot for the good work 👍

-- --------- ---------- ------ ------------ ---- ----- ------ -------- ---------------------------

== [ ATT ] ============================================================================================
ABSOLUTE: mov $0xA7, %bh
RELATIVE: mov $0xA7, %bh
ABSOLUTE: mov $-0x59, %bh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please check how other disassemblers format this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NASM: mov bh,0xa7
Capstone: mov bh, 0xa7
binutils: mov $0xa7,%bh

In Zydis formatter properties would decide this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So binutils (which seems to be the only one using AT&T syntax) prints the operand as "unsigned". The problem I see here is that changing the setting in the formatter properties will change it for all signed immediate constants.

Currently it is like:

  • imm -> print with sign
  • uimm -> print without sign

I don't mind formatting it as "signed" for mov, just it seems like this is not the standard way of doing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, signedness is so arbitrary after all. For data transfer instructions it should be basically "no sign" and arguably "unsigned" is closer to that. However architecturally those immediates are signed, there are mov variants that do actual sign-extension, so to achieve consistency they must all be signed. That was one of the original arguments behind "signed by default" policy.

If we really want to preserve old behavior I would think about handling it inside formatter for reasons stated above. To be honest I don't think this new behavior is that bad at all. It's different from most, that's a downside for sure but also it's fairly common to see those one-byte regs being used to store signed stuff with -1 being probably the most common value after 0 and 1. I'd say that ignoring sign for 1-byte values is actually a popular quirk but we simply got used to it. I think I would just keep this new behavior as is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! We just have to prepare for some issues about "wrong disassembly" I guess 😛

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about providing the formatter with an option to either print values as unsigned or signed, with AsmJit you can specify this for immediates and displacements, that way everyone would be happy I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 options for this already, one for immediates, second one for displacements. However they affect all immediates/displacements at once, it's a global override. Above discussion was just about default behavior when none of those overrides are active (definition signedness decides in such case).

@flobernd
Copy link
Member

flobernd commented May 7, 2022

@athre0z Are you fine with this as well?

@athre0z
Copy link
Member

athre0z commented May 28, 2022

Thanks for the enormous amount of effort! A lot of the changes are clearly desirable. Like you said, it's rather clear that nobody really seems to have bothered with inspecting the signedness information in the past, since even very common instructions received fixups.

I am, however, a bit reluctant to change the signedness of instructions that don't actually care about the signedness. It makes interpreting the data in our structs quite a bit harder. To stay with the mov example discussed previously, I'd argue that users will typically want to interpret that as an unsigned value. If we fill this in as signed, they'll have to add a switch on the operand width / equivalent bit-magic in order to correctly extract the unsigned value. Your argument about that one sign-extending mov variant is valid, but I can't help but find that rather unintuitive. I don't really have a better idea, though. :/

@athre0z
Copy link
Member

athre0z commented May 28, 2022

In a way, changing the default signedness just exposes that representing the immediate as

union {
  uint64_t s;
  int64_t u;
};

is essentially lying to the user. For an instruction like mov ax, 0xAABB, reporting the value as s64 is simply incorrect. The same is true for representing it as u64 -- the flaw is just less obvious in the unsigned case. If we instead used a union such as

union {
  int8_t   s8;
  int16_t  s16;
  int32_t  s32;
  int64_t  s64;
  uint8_t  u8;
  uint16_t u16;
  uint32_t u32;
  uint64_t u64;
};

it would kind of solve the issue. It'd make it much more transparent to the users that they must be concerned with the operand width. Also, reinterpreting signed values as unsigned would work much easier (since the decoder would fill in s16 instead of doing the full sign extension with s64).

Just thinking out load here, though.

@mappzor
Copy link
Contributor Author

mappzor commented May 28, 2022

I am, however, a bit reluctant to change the signedness of instructions that don't actually care about the signedness. It makes interpreting the data in our structs quite a bit harder. (...) If we fill this in as signed, they'll have to add a switch on the operand width / equivalent bit-magic in order to correctly extract the unsigned value.

Not really, that problem existed already and users that needed this info had to take care about this by themselves (I'm one of them). For displacements this problem was mitigated thanks to ZydisCalcAbsoluteAddress but for "real" immediates I found myself discarding higher bits frequently. Perhaps we can have a similar helper for extracting immediates?

Your suggested solution with unions is good as well. HDE does this and I always thought this is good because as you have said, it forces users to think about width and we don't try to create illusions that you can work on 64 bits all the time and get everything right.

Both options sound fine to me. Separate helper function will hurt performance and force users to do "some extra post-processing" (bit annoying). Union is more lightweight but users need to pick correct field by themselves. That's both a good thing (user might already know and make optimal decision, that's often the case imo) and a bad thing (if user doesn't know but to be honest you always need to be aware of signedness to use value for anything meaningful).

Finally those solutions aren't mutually exclusive, so maybe both?

@athre0z
Copy link
Member

athre0z commented May 28, 2022

Hmm, yes, I feel that union + helper functions for sign extension might be a good path forward. Regardless, I think that we have now determined that this is essentially a separate issue that is simply becoming more obvious now (more signed instructions), so there is no need to further delay this PR. :)

@athre0z athre0z merged commit 73d7dbb into zyantific:master May 28, 2022
@mappzor mappzor deleted the signedness branch July 18, 2022 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fail to assemble mov al, -1
4 participants