
virtio-net: network IO emulation performance optimizations #413


Merged: 6 commits into firecracker-microvm:master on Jul 31, 2018

Conversation

acatangiu (Contributor)

Changes

Various network IO optimizations:

  • avoid memset() on intermediate 64k buffer for each TX packet
  • rearrange epoll-handler for improved cache performance
  • coalesce RX interrupts delivered to the guest
  • prepare tx scatter-gather IO
  • don't raise interrupts for used descriptors on tx queue

Testing

cargo build
cargo build --release
sudo env "PATH=$PATH" cargo test --all
sudo env "PATH=$PATH" ./testrun.sh

Results

Testing was done on i3.metal:

Iperf testing guest RX

| # of guest vCPUs | Before (Gbits/sec) | After (Gbits/sec) | Gain |
|------------------|--------------------|--------------------|---------|
| 1 | 10.5 | 15.5 | +47.6% |
| 2 | 15.5 | 16.36 | +5.5% |
| 3 | 15.86 | 17.63 | +11.16% |
| 4 | 15.86 | 18.83 | +18.72% |

Iperf testing guest TX

| # of guest vCPUs | Before (Gbits/sec) | After (Gbits/sec) | Gain |
|------------------|--------------------|--------------------|---------|
| 1 | 8.85 | 13.6 | +53.67% |
| 2 | 18.9 | 20.35 | +7.67% |
| 3 | 19.6 | 23.05 | +17.6% |
| 4 | 18.3 | 24.3 | +32.78% |

node.js benchmark

| # of guest vCPUs | Before (Requests/sec) | After (Requests/sec) | Gain |
|------------------|------------------------|-----------------------|---------|
| 1 | 1900 | 2700 | +42.1% |

acatangiu added this to the Customer 2 Production milestone Jul 19, 2018
acatangiu self-assigned this Jul 19, 2018
acatangiu requested a review from a team July 19, 2018 14:18
struct RxVirtio {
queue_evt: EventFd,
rate_limiter: RateLimiter,
deffered_frame: bool,
Contributor:

I think this should be called deferred_frame (one f, two rs).

Contributor Author:

done

read_count = 0;
// Copy buffer from across multiple descriptors.
// TODO(performance): change this to use `writev()` instead of `write()` and
// get rid of the intermediate buffer
Contributor:

nit: full stop at the end of comment?

Contributor Author:

done

@@ -380,10 +380,12 @@ impl NetEpollHandler {
}

if used_count != 0 {
// TODO(performance): find a way around RUST mutability enforcements to allow
// calling queue.add_used() inside the loop. This would lead to better distribution
Contributor:

I do not understand what change you are proposing with this comment. To be able to call add_used without any param?

Contributor Author:

NetEpollHandler::process_tx() has a main loop which processes one guest TX packet per iteration. After the loop, tx.queue.add_used() is called to tell the guest driver about all the used descriptors.

We'd get better concurrent work distribution between the Firecracker IO emulation thread and the guest driver thread if used descriptors were marked as done/used within the loop, right after we're finished with them, instead of marking them all at once at the end of the loop.

The problem (Rust mutability enforcement) is that tx.queue.iter() takes a &mut self reference to the tx queue that lives for the full duration of the iteration loop. This prevents calling tx.queue.add_used() within the loop, since add_used() also takes a &mut self reference to the tx queue.

With some refactoring, it should be possible to get around this. It is safe to call add_used() while iterating, since the two functions work on separate members of the tx_queue.

Created #425 to capture this info.
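For illustration, here is a minimal, self-contained sketch of the kind of refactoring hinted at here (all types are made up, not the actual Firecracker Queue API): once the iteration state and the used-ring state live in separate fields, the borrow checker accepts marking descriptors as used inside the processing loop, because the two borrows are disjoint.

```rust
struct AvailRing {
    pending: Vec<u16>, // descriptor indices the guest made available
}

struct UsedRing {
    completed: Vec<(u16, u32)>, // (descriptor index, bytes processed)
}

struct TxQueue {
    avail: AvailRing,
    used: UsedRing,
}

impl TxQueue {
    fn process(&mut self) {
        for &desc_index in &self.avail.pending {
            let len = emulate_tx(desc_index);
            // Marking the descriptor used inside the loop is accepted here:
            // the iterator borrows only `self.avail`, while this line mutates
            // only `self.used`, so the borrows never overlap.
            self.used.completed.push((desc_index, len));
        }
        self.avail.pending.clear();
    }
}

fn emulate_tx(desc_index: u16) -> u32 {
    // Stand-in for copying the frame out of guest memory and writing it to the tap.
    desc_index as u32 + 1
}

fn main() {
    let mut q = TxQueue {
        avail: AvailRing { pending: vec![0, 1, 2] },
        used: UsedRing { completed: Vec::new() },
    };
    q.process();
    assert_eq!(q.used.completed.len(), 3);
}
```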

Contributor Author:

Added issue number to TODO comment.

h.handle_event(TX_QUEUE_EVENT, 0);
assert_eq!(h.interrupt_evt.read(), Ok(2));
// make sure the data queue advanced
Contributor:

nit: Full stop and capital letter?

Contributor Author:

done

@@ -345,6 +335,29 @@ impl NetEpollHandler {
break;
}

read_count = 0;
// Copy buffer from across multiple descriptors.
// TODO(performance): change this to use `writev()` instead of `write()` and
Contributor:

Are you referring to the write from `let write_result = self.tap.write(&self.tx.frame_buf[..read_count as usize]);`? If so, I would move the comment right before that line.

Contributor Author:

I am referring to the write() you mentioned, but the TODO includes the loop where this comment is. Instead of looping through the tx.iovec vector and building the packet in the tx.frame_buf intermediate buffer, we could rework this loop and the call to tap.write() and do as described in #420.
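For reference, a rough sketch of the writev()-based direction described above (the actual design tracked in #420 may differ). The File handle and byte-slice segments are stand-ins for the real tap device and guest-memory mappings, and the sketch assumes the libc crate is available.

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

/// Write one frame made of several segments to `tap` in a single syscall,
/// instead of memcpy-ing everything into an intermediate buffer first.
fn writev_frame(tap: &File, segments: &[&[u8]]) -> io::Result<usize> {
    let iovecs: Vec<libc::iovec> = segments
        .iter()
        .map(|s| libc::iovec {
            iov_base: s.as_ptr() as *mut libc::c_void,
            iov_len: s.len(),
        })
        .collect();
    // SAFETY: every iovec points into a slice that outlives this call.
    let ret = unsafe {
        libc::writev(tap.as_raw_fd(), iovecs.as_ptr(), iovecs.len() as libc::c_int)
    };
    if ret < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(ret as usize)
    }
}

fn main() -> io::Result<()> {
    // Example: scatter two segments into one write (to /dev/null instead of a tap).
    let sink = File::create("/dev/null")?;
    let written = writev_frame(&sink, &[&b"virtio-net "[..], &b"frame"[..]])?;
    assert_eq!(written, 16);
    Ok(())
}
```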

Contributor Author:

I've added issue number to comment.

}
}
self.tx.iovec.push((desc.addr, desc.len as usize));
read_count += desc.len as usize;
Contributor:

So the read_count that the rate_limiter will be checking against will be different from the one that the previous implementation was computing. Is desc.len the maximum amount that can be read for each descriptor?

Contributor Author:

The read_count in both cases should always be the same. desc.len is the actual data length, not a maximum. mem.read_slice_at_addr(desc.len) should always return desc.len.

Contributor:

read_slice_at_addr will return how much it read from that address, which could be less than desc.len (i.e. the parameter provided). That is why I got confused. Consequently, read_count could declare more than what we actually read. However, since this is only used for imposing rate-limiter amounts, it should not be a problem. I guess we would have had a problem if the declared read_count had been smaller than the amount we actually read.

dianpopa previously approved these changes Jul 26, 2018
alexandruag previously approved these changes Jul 26, 2018
acatangiu dismissed stale reviews from alexandruag and dianpopa via 45756a2 July 31, 2018 11:07
The intermediate 64k buffer used to build the frame being TX'ed by the guest
was being initialized to all zeroes for each TX packet, wasting a lot of
CPU cycles.
Avoid that by keeping the buffer on the heap instead of on the stack.

Signed-off-by: Adrian Catangiu <[email protected]>
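A simplified illustration of this commit's idea (names and the buffer size are placeholders, not the real MAX_BUFFER_SIZE): the frame buffer becomes long-lived, heap-allocated state that is zeroed exactly once, instead of a local array re-initialized on every process_tx() call.

```rust
const MAX_BUFFER_SIZE: usize = 65_536; // placeholder size, roughly the 64k mentioned above

struct TxState {
    // Allocated and zeroed once when the handler is created, then reused for
    // every packet, so there is no per-packet memset of the full buffer.
    frame_buf: Box<[u8]>,
}

impl TxState {
    fn new() -> Self {
        TxState {
            frame_buf: vec![0u8; MAX_BUFFER_SIZE].into_boxed_slice(),
        }
    }

    fn process_tx(&mut self, payload: &[u8]) -> usize {
        // Only the bytes actually belonging to this frame are touched.
        let len = payload.len().min(MAX_BUFFER_SIZE);
        self.frame_buf[..len].copy_from_slice(&payload[..len]);
        len
    }
}

fn main() {
    let mut tx = TxState::new();
    assert_eq!(tx.process_tx(&[1, 2, 3]), 3);
}
```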
Also rearranged the net metrics layout for the same reason (most likely no
real gain here, though).

Signed-off-by: Adrian Catangiu <[email protected]>
Instead of kicking the guest driver after delivering each RX packet,
deliver as many packets as are already available on the tap and as there
is room for in the virtqueue, and only then kick the guest driver.

This results in slightly higher latency but much higher throughput,
and even lower IRQ-handling CPU overhead in the guest.

NOTE: it would be worth experimenting with a timer for kicking the
guest, but that would need serious fine-tuning.

Signed-off-by: Adrian Catangiu <[email protected]>
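A minimal sketch of the coalescing described in this commit message, with stand-in types instead of the real tap and virtqueue: frames are delivered while both sides have capacity, and the guest is notified once per batch rather than once per frame.

```rust
struct RxLoop {
    frames_from_tap: Vec<Vec<u8>>, // stand-in for "frames readable from the tap"
    free_queue_slots: usize,       // stand-in for available RX descriptors
    delivered: usize,
    interrupts_sent: usize,
}

impl RxLoop {
    fn process_rx(&mut self) {
        let mut delivered_any = false;
        // Deliver as many frames as both the tap and the virtqueue allow.
        while !self.frames_from_tap.is_empty() && self.free_queue_slots > 0 {
            let _frame = self.frames_from_tap.remove(0);
            self.free_queue_slots -= 1;
            self.delivered += 1;
            delivered_any = true;
        }
        // Coalesced notification: one interrupt for the whole batch.
        if delivered_any {
            self.interrupts_sent += 1;
        }
    }
}

fn main() {
    let mut rx = RxLoop {
        frames_from_tap: vec![vec![0u8; 64]; 8],
        free_queue_slots: 16,
        delivered: 0,
        interrupts_sent: 0,
    };
    rx.process_rx();
    assert_eq!((rx.delivered, rx.interrupts_sent), (8, 1));
}
```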
We're currently using an intermediate buffer to build the TX packet from
the virtqueue descriptors.
Ideally, we should use scatter-gather and avoid using this intermediate
buffer altogether.

This commit does part of the setup needed for scatter-gather by recording
the addrs and lens of the descriptors and postponing the memcpy to right
before writing to the tap.

This commit still uses the intermediate buffer, but slightly helps with
keeping the cache hot when doing the actual memcpy().

This change also greatly improves CPU usage when a TX rate-limiter is present.
Before, the full packet was built and copied over to the intermediate buffer,
but if there wasn't enough rate-limiter budget for it, that work would just be
wasted.
With this patch, the packet length is determined while recording the iovec,
and if there is not enough rate-limiter budget, the function returns fast
without copying over any memory.

Signed-off-by: Adrian Catangiu <[email protected]>
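A simplified sketch of the two-pass TX path this commit moves toward (hypothetical types; the real code works on guest-memory descriptor chains and Firecracker's RateLimiter): record the iovec and the total frame length first, check the rate-limiter budget against that length, and only then copy into the intermediate buffer.

```rust
struct RateLimiter {
    budget_bytes: usize,
}

impl RateLimiter {
    fn consume(&mut self, bytes: usize) -> bool {
        if bytes <= self.budget_bytes {
            self.budget_bytes -= bytes;
            true
        } else {
            false
        }
    }
}

/// `descriptors` holds (offset, len) pairs into `guest_mem`, standing in for the
/// (addr, len) pairs recorded from the descriptor chain.
fn process_tx_frame(
    guest_mem: &[u8],
    descriptors: &[(usize, usize)],
    limiter: &mut RateLimiter,
    frame_buf: &mut [u8],
) -> Option<usize> {
    // Pass 1: record the iovec and the total frame length without copying anything.
    let mut iovec = Vec::with_capacity(descriptors.len());
    let mut read_count = 0;
    for &(offset, len) in descriptors {
        iovec.push((offset, len));
        read_count += len;
    }

    // The rate-limiter check happens before any memcpy, so a rejected frame is
    // cheap (previously the whole frame was built first and then thrown away).
    if !limiter.consume(read_count) {
        return None;
    }

    // Pass 2: copy right before the tap write, keeping the cache hot.
    let mut copied = 0;
    for (offset, len) in iovec {
        frame_buf[copied..copied + len].copy_from_slice(&guest_mem[offset..offset + len]);
        copied += len;
    }
    Some(copied)
}

fn main() {
    let guest_mem = [7u8; 4096];
    let mut frame_buf = [0u8; 4096];
    let mut limiter = RateLimiter { budget_bytes: 1500 };
    let copied = process_tx_frame(&guest_mem, &[(0, 600), (600, 600)], &mut limiter, &mut frame_buf);
    assert_eq!(copied, Some(1200));
}
```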
The guest virtio driver reclaims used TX buffers as part of its
TX handler. Before adding a packet to the TX virtqueue, it first
reclaims any used descriptors in the ring.
The guest virtio driver interrupt handler does nothing for used
TX buffers. It is therefore useless and detrimental for Firecracker
virtio device emulation to 'kick' the guest virtio driver on TX
packet delivery.

This commit brings sizeable throughput gains since fewer cycles
are wasted in the guest servicing useless IRQs.

Signed-off-by: Adrian Catangiu <[email protected]>
acatangiu merged commit d772877 into firecracker-microvm:master Jul 31, 2018
acatangiu deleted the netopt branch August 27, 2018 09:11
Comment on lines 83 to 84
tx_frame: [u8; MAX_BUFFER_SIZE],
tx_used_desc_heads: [u16; QUEUE_SIZE as usize],
Contributor:

Maybe a bit late, but instead of polluting global state with local state, a better option would have been to use a MaybeUninit array and get a slice over it after initialization.
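For context, a hypothetical sketch of what the MaybeUninit suggestion could look like on a modern toolchain (not code from this PR, and the buffer size is a placeholder): build the frame into an uninitialized array and treat only the written prefix as initialized.

```rust
use std::mem::MaybeUninit;

const MAX_BUFFER_SIZE: usize = 65_536; // placeholder size

/// Build the frame in an uninitialized local array, then hand only the
/// initialized prefix to the (stubbed) tap write.
fn tx_frame(segments: &[&[u8]]) -> usize {
    // An array of MaybeUninit<u8> is allowed to start out uninitialized,
    // so there is no up-front zeroing of the whole buffer.
    let mut buf: [MaybeUninit<u8>; MAX_BUFFER_SIZE] =
        unsafe { MaybeUninit::uninit().assume_init() };

    let mut len = 0;
    for seg in segments {
        for &byte in *seg {
            buf[len] = MaybeUninit::new(byte);
            len += 1;
        }
    }

    // SAFETY: exactly the first `len` elements were written above.
    let frame: &[u8] =
        unsafe { std::slice::from_raw_parts(buf.as_ptr() as *const u8, len) };
    send(frame)
}

fn send(frame: &[u8]) -> usize {
    // Stand-in for self.tap.write(frame).
    frame.len()
}

fn main() {
    assert_eq!(tx_frame(&[&b"abc"[..], &b"de"[..]]), 5);
}
```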

Member:

MaybeUninit didn't exist at that point in time. So I say that you are a bit early, not a bit late.

Contributor Author:

Thanks for the MaybeUninit array suggestion, it looks interesting.

How is the current version of the code polluting global state? It's the device state which at any one time may have a buffered/pending frame.

Contributor:

To me it seems that this array is only used during process_tx, and the data inside the array is only used during a single call of process_tx; another call will not reuse this state. I haven't checked all the commits in this series though.
