virtio-net: network IO emulation performance optimizations #413
Conversation
devices/src/virtio/net.rs
Outdated
struct RxVirtio {
    queue_evt: EventFd,
    rate_limiter: RateLimiter,
    deffered_frame: bool,
I think this should be called `deferred_frame` (one `f`, two `r`s).
done
devices/src/virtio/net.rs
Outdated
read_count = 0;
// Copy buffer from across multiple descriptors.
// TODO(performance): change this to use `writev()` instead of `write()` and
// get rid of the intermediate buffer
nit: full stop at the end of comment?
done
devices/src/virtio/net.rs
Outdated
@@ -380,10 +380,12 @@ impl NetEpollHandler {
    }

    if used_count != 0 {
        // TODO(performance): find a way around RUST mutability enforcements to allow
        // calling queue.add_used() inside the loop. This would lead to better distribution
I do not understand what change you are proposing with this comment. To be able to call `add_used` without any param?
`NetEpollHandler::process_tx()` has a main loop which processes one TX packet from the guest per iteration. At the end of the loop, `tx.queue.add_used()` is called to tell the guest driver about all the used descriptors.

We'd get better concurrent work distribution between the Firecracker IO emulation thread and the guest driver thread if used descriptors were marked as done/used within the loop, right after we're finished with them, as opposed to marking them all as done/used at once at the end of the loop.

The problem (Rust mutability enforcement) is that `tx.queue.iter()` takes a `&mut self` reference to the tx_queue which lives for the full duration of the queue-iteration loop, and that prevents calling `tx.queue.add_used()` within the loop, since `add_used()` also takes a `&mut self` reference to the tx_queue.

With some refactoring, it should be possible to get around this. It is safe to call `add_used()` while iterating since the two functions work on separate members of the tx_queue.

Created #425 to capture this info.
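For context, here is a minimal, self-contained sketch of the borrow conflict described above and of the collect-then-mark workaround; the `Queue` type and method bodies are hypothetical stand-ins, not the actual firecracker API:

```rust
struct Queue {
    next_used: u16,
    // ... avail/used ring state elided ...
}

impl Queue {
    // Hypothetical: yields available descriptor head indices; borrows the
    // whole queue mutably for as long as the iterator lives.
    fn iter(&mut self) -> impl Iterator<Item = u16> + '_ {
        std::iter::empty()
    }

    // Marks `head` as used; only touches the used-ring side of the queue.
    fn add_used(&mut self, head: u16, _len: u32) {
        let _ = head;
        self.next_used = self.next_used.wrapping_add(1);
    }
}

fn process_tx(queue: &mut Queue) {
    // This does NOT compile: `queue` is already mutably borrowed by `iter()`.
    // for head in queue.iter() {
    //     /* copy the frame, write it to the tap ... */
    //     queue.add_used(head, 0); // error[E0499]: second mutable borrow
    // }

    // Workaround: record the heads during iteration and mark them used after
    // the loop. Splitting `Queue` into separate avail/used halves (issue #425)
    // would allow marking each descriptor used inside the loop instead.
    let heads: Vec<u16> = queue.iter().collect();
    for head in heads {
        queue.add_used(head, 0);
    }
}
```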
Added issue number to TODO comment.
devices/src/virtio/net.rs
Outdated
h.handle_event(TX_QUEUE_EVENT, 0);
assert_eq!(h.interrupt_evt.read(), Ok(2));
// make sure the data queue advanced
nit: Full stop and capital letter?
done
devices/src/virtio/net.rs
Outdated
@@ -345,6 +335,29 @@ impl NetEpollHandler {
        break;
    }

    read_count = 0;
    // Copy buffer from across multiple descriptors.
    // TODO(performance): change this to use `writev()` instead of `write()` and
Are you referring to the write from `let write_result = self.tap.write(&self.tx.frame_buf[..read_count as usize]);`? If so, I would move the comment right before that line.
I am referring to the `write()` you mentioned, but the TODO includes the loop where this comment is. Instead of looping through the `tx.iovec` vector and building the packet in the `tx.frame_buf` intermediate buffer, we could rework this loop and the call to `tap.write()` and do as described in #420.
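For illustration, a hedged sketch of what the #420 direction could look like: instead of copying each descriptor into `frame_buf` and then `write()`-ing that buffer, build an `iovec` array over the per-descriptor byte slices and hand it to `writev(2)`. The helper name and signature are hypothetical, and it assumes the `libc` crate plus slices already resolved from guest memory:

```rust
use std::io;
use std::os::unix::io::AsRawFd;

// Hypothetical helper: scatter-gather write of one frame to the tap fd.
// `segments` holds one byte slice per virtqueue descriptor.
fn write_frame_vectored<T: AsRawFd>(tap: &T, segments: &[&[u8]]) -> io::Result<usize> {
    let iov: Vec<libc::iovec> = segments
        .iter()
        .map(|seg| libc::iovec {
            iov_base: seg.as_ptr() as *mut libc::c_void,
            iov_len: seg.len(),
        })
        .collect();

    // SAFETY: each iovec points into a live `&[u8]` borrowed for this call.
    let ret = unsafe {
        libc::writev(tap.as_raw_fd(), iov.as_ptr(), iov.len() as libc::c_int)
    };
    if ret < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(ret as usize)
    }
}
```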
I've added the issue number to the comment.
        }
    }
    self.tx.iovec.push((desc.addr, desc.len as usize));
    read_count += desc.len as usize;
So the `read_count` that the `rate_limiter` will be checking against will be different from the one that the other implementation was computing. Is `desc_len` the maximum amount that can be read for each descriptor?
The `read_count` in both cases should always be the same. `desc.len` is the actual data length, not a maximum. `mem.read_slice_at_addr(desc_len)` should always return `desc_len`.
`read_slice_at_addr` will return how much it read from that address, which could be less than `desc.len` (i.e. the parameter provided). That is why I got confused. Consequently, the `read_count` could declare more than what we actually read. However, since this is only used for imposing rate-limiter amounts, it should not be a problem. I guess we would have had a problem if the declared `read_count` had been smaller than the amount we actually read.
The intermediate 64k buffer used to build the frame being TX'ed by the guest was being initialized to all zeroes for each TX packet, wasting a lot of CPU cycles. Avoid doing that by keeping the buffer on the heap instead of on the stack. Signed-off-by: Adrian Catangiu <[email protected]>
Also rearranged the net metrics layout for the same reason (most likely no real gain here though). Signed-off-by: Adrian Catangiu <[email protected]>
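As a rough illustration of the heap-buffer change in the commit above (names and the exact constant are approximate, not the PR's code): a stack array created inside the TX handler is zeroed on every call, while a heap buffer owned by the device is allocated and zeroed exactly once.

```rust
// Approximate size of one frame buffer; the real constant in net.rs may differ.
const MAX_BUFFER_SIZE: usize = 65536;

struct TxVirtio {
    // Allocated (and zeroed) once when the device is created; reused for every TX frame.
    frame_buf: Vec<u8>,
}

impl TxVirtio {
    fn new() -> Self {
        TxVirtio { frame_buf: vec![0u8; MAX_BUFFER_SIZE] }
    }

    fn process_tx(&mut self) {
        // Old pattern (per packet): `let mut frame = [0u8; MAX_BUFFER_SIZE];`
        // zeroes ~64 KiB on every call. Reusing `self.frame_buf` avoids that.
        let _buf = &mut self.frame_buf[..];
    }
}
```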
Instead of kicking the guest driver after delivering each RX packet, deliver as many packets as are already available on the tap and as there is room for in the virtqueue, and only then kick the guest driver. This results in slightly higher latency but much higher throughput and even less IRQ-handling guest CPU overhead. NOTE: it would be worth experimenting with a timer for kicking the guest, but that would need serious fine-tuning. Signed-off-by: Adrian Catangiu <[email protected]>
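A simplified, self-contained sketch of this batching logic, with hypothetical types standing in for the tap device and the virtqueue (the real handler also deals with rate limiting and deferred frames):

```rust
use std::io::{self, ErrorKind};

struct RxHandler {
    frames_ready_on_tap: Vec<Vec<u8>>, // stand-in for the tap device
    free_descriptors: usize,           // stand-in for virtqueue capacity
    interrupts_sent: usize,
}

impl RxHandler {
    // Pops the next frame off the "tap", or WouldBlock if none are queued.
    fn read_tap(&mut self) -> io::Result<Vec<u8>> {
        self.frames_ready_on_tap
            .pop()
            .ok_or_else(|| io::Error::from(ErrorKind::WouldBlock))
    }

    // Places one frame into the virtqueue; false means no descriptor was free.
    fn rx_single_frame(&mut self, _frame: &[u8]) -> bool {
        if self.free_descriptors == 0 {
            return false;
        }
        self.free_descriptors -= 1;
        true
    }

    fn signal_used_queue(&mut self) {
        self.interrupts_sent += 1;
    }

    // Deliver as many frames as the tap has and the queue can hold,
    // then kick the guest exactly once.
    fn process_rx(&mut self) {
        let mut delivered = false;
        loop {
            match self.read_tap() {
                Ok(frame) => {
                    if !self.rx_single_frame(&frame) {
                        break; // queue full; the real code defers this frame
                    }
                    delivered = true;
                }
                // Tap has no more packets queued right now.
                Err(ref e) if e.kind() == ErrorKind::WouldBlock => break,
                Err(_) => break,
            }
        }
        if delivered {
            // One interrupt for the whole batch instead of one per packet.
            self.signal_used_queue();
        }
    }
}
```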
We're currently using an intermediate buffer to build the TX packet from the virtqueue descriptors. Ideally, we should use scatter-gather and avoid using this intermediate buffer altogether. This commit does part of the setup needed for scatter-gather by recording the addrs and lens of the descriptors and postponing the memcpy to right before writing to the tap. This commit still uses the intermediate buffer, but slightly helps with keeping the cache hot when doing the actual memcpy(). This change also greatly improves cpu usage when a TX rate-limiter is present. Before, the full packet was built and copied over to the intermediate buffer, but if there wasn't enough rate-limiter budget for it, that work would just be wasted. With this patch, the packet length is determined while recording the iovec, and if there is not enough rate-limiter budget, the function returns fast without copying over any memory. Signed-off-by: Adrian Catangiu <[email protected]>
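A condensed sketch of the two-phase TX path this commit describes, with hypothetical types (the real code uses the guest-memory and rate-limiter implementations in the tree): record the descriptor chain as (addr, len) pairs, total the frame length, consult the rate limiter, and only copy bytes if the budget allows.

```rust
struct RateLimiter {
    bytes_budget: usize,
}

impl RateLimiter {
    // Returns true and consumes budget if `bytes` fits, false otherwise.
    fn consume(&mut self, bytes: usize) -> bool {
        if bytes <= self.bytes_budget {
            self.bytes_budget -= bytes;
            true
        } else {
            false
        }
    }
}

// `descriptors` holds the recorded (guest address, length) pairs; `read_guest`
// stands in for copying guest memory (e.g. something like read_slice_at_addr()).
fn process_tx_frame(
    descriptors: &[(u64, usize)],
    read_guest: impl Fn(u64, &mut [u8]),
    rate_limiter: &mut RateLimiter,
    frame_buf: &mut [u8],
) -> Option<usize> {
    // Phase 1: the frame length is known from the iovec alone; nothing is copied yet.
    let read_count: usize = descriptors.iter().map(|&(_, len)| len).sum();

    // Fast path: not enough rate-limiter budget, so no memcpy work is wasted.
    if !rate_limiter.consume(read_count) {
        return None;
    }

    // Phase 2: copy descriptor contents into the intermediate buffer right
    // before the tap write (a later change could use writev() instead, see #420).
    let mut offset = 0;
    for &(addr, len) in descriptors {
        read_guest(addr, &mut frame_buf[offset..offset + len]);
        offset += len;
    }
    Some(read_count)
}
```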
The guest virtio driver reclaims used TX buffers as part of its TX handler. Before adding a packet to the TX virtqueue, it first reclaims any used descriptors in the ring. The guest virtio driver interrupt handler does nothing for used TX buffers. It is therefore useless and detrimental for Firecracker virtio device emulation to 'kick' the guest virtio driver on TX packet delivery. This commit brings sizeable throughput gains since fewer cycles are wasted in the guest servicing useless IRQs. Signed-off-by: Adrian Catangiu <[email protected]>
Signed-off-by: Adrian Catangiu <[email protected]>
devices/src/virtio/net.rs
Outdated
tx_frame: [u8; MAX_BUFFER_SIZE],
tx_used_desc_heads: [u16; QUEUE_SIZE as usize],
Maybe a bit late, but instead of polluting global state with local state, a better option would have been to use a `MaybeUninit` array and get a slice over it after initialization.
MaybeUninit didn't exist at that point in time. So I say that you are a bit early, not a bit late.
Thanks for the `MaybeUninit` array suggestion, it looks interesting.
How is the current version of the code polluting global state? It's the device state, which at any one time may have a buffered/pending frame.
To me it seems that this array is only used during `process_tx`, and the data inside it is only used during a single call of `process_tx`; another call will not reuse this state. I haven't checked all the commits in this series though.
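For reference, a hedged sketch of the `MaybeUninit` idea on current Rust (illustrative only, not the PR's code): keep the scratch array local to the TX path, skip the zero-initialization, and only expose the prefix that was actually written.

```rust
use std::mem::MaybeUninit;

// Approximate frame buffer size; the real constant in net.rs may differ.
const MAX_BUFFER_SIZE: usize = 65536;

// Hypothetical helper: assemble a frame from descriptor slices without
// zero-initializing the whole 64 KiB scratch buffer first.
fn build_frame(parts: &[&[u8]]) -> usize {
    // Uninitialized storage; no 64 KiB memset here.
    let mut buf = [MaybeUninit::<u8>::uninit(); MAX_BUFFER_SIZE];

    let mut len = 0;
    for part in parts {
        for &byte in *part {
            buf[len].write(byte);
            len += 1;
        }
    }

    // SAFETY: buf[..len] was fully initialized by the loop above.
    let frame: &[u8] = unsafe {
        std::slice::from_raw_parts(buf.as_ptr() as *const u8, len)
    };

    // ... the frame would be written to the tap here ...
    frame.len()
}
```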
Changes

Various network IO optimizations:
- avoid `memset()` on intermediate 64k buffer for each TX packet

Testing
Results

Testing was done on i3.metal:
- Iperf testing guest RX
- Iperf testing guest TX
- node.js benchmark