
virtio-net: network IO emulation performance optimizations #413


Merged: 6 commits into firecracker-microvm:master on Jul 31, 2018

Conversation

acatangiu (Contributor)

Changes

Various network IO optimizations:

  • avoid memset() on intermediate 64k buffer for each TX packet
  • rearrange epoll-handler for improved cache performance
  • coalesce RX interrupts delivered to the guest
  • prepare tx scatter-gather IO
  • don't raise interrupts for used descriptors on tx queue

Testing

cargo build
cargo build --release
sudo env "PATH=$PATH" cargo test --all
sudo env "PATH=$PATH" ./testrun.sh

Results

Testing was done on i3.metal:

Iperf testing guest RX

| # of guest vCPUs | Before (Gbits/sec) | After (Gbits/sec) | Gain |
|------------------|--------------------|--------------------|---------|
| 1 | 10.5 | 15.5 | +47.6% |
| 2 | 15.5 | 16.36 | +5.5% |
| 3 | 15.86 | 17.63 | +11.16% |
| 4 | 15.86 | 18.83 | +18.72% |

Iperf testing guest TX

| # of guest vCPUs | Before (Gbits/sec) | After (Gbits/sec) | Gain |
|------------------|--------------------|--------------------|---------|
| 1 | 8.85 | 13.6 | +53.67% |
| 2 | 18.9 | 20.35 | +7.67% |
| 3 | 19.6 | 23.05 | +17.6% |
| 4 | 18.3 | 24.3 | +32.78% |

node.js benchmark

| # of guest vCPUs | Before (Requests/sec) | After (Requests/sec) | Gain |
|------------------|------------------------|-----------------------|---------|
| 1 | 1900 | 2700 | +42.1% |

acatangiu added this to the Customer 2 Production milestone Jul 19, 2018
acatangiu self-assigned this Jul 19, 2018
acatangiu requested a review from a team July 19, 2018 14:18
struct RxVirtio {
queue_evt: EventFd,
rate_limiter: RateLimiter,
deffered_frame: bool,
Contributor:

I think this should be called deferred_frame (one f, two rs).

Contributor Author:

done

read_count = 0;
// Copy buffer from across multiple descriptors.
// TODO(performance): change this to use `writev()` instead of `write()` and
// get rid of the intermediate buffer
Contributor:

nit: full stop at the end of comment?

Contributor Author:

done

@@ -380,10 +380,12 @@ impl NetEpollHandler {
}

if used_count != 0 {
// TODO(performance): find a way around RUST mutability enforcements to allow
// calling queue.add_used() inside the loop. This would lead to better distribution
Contributor:

I do not understand what change you are proposing with this comment. To be able to call add_used without any param?

Contributor Author:

NetEpollHandler::process_tx() has a main loop which processes one guest TX packet per iteration. After the loop, tx.queue.add_used() is called to tell the guest driver about all the used descriptors.

We'd get better concurrent work distribution between the Firecracker IO emulation thread and the guest driver thread if used descriptors were marked as done/used within the loop, right after we're finished with them, instead of marking them all at once at the end of the loop.

The problem (Rust mutability enforcement) is that tx.queue.iter() takes a &mut self reference to the tx queue that lives for the full duration of the iteration loop. This prevents calling tx.queue.add_used() within the loop, since add_used() also takes a &mut self reference to the tx queue.

With some refactoring, it should be possible to get around this. It is safe to call add_used() while iterating, since the two functions work on separate members of the tx_queue.

Created #425 to capture this info.
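For illustration, here is a minimal, self-contained sketch of the kind of refactoring hinted at here (all types are made up, not the actual Firecracker Queue API): once the iteration state and the used-ring state live in separate fields, the borrow checker accepts marking descriptors as used inside the processing loop, because the two borrows are disjoint.

```rust
struct AvailRing {
    pending: Vec<u16>, // descriptor indices the guest made available
}

struct UsedRing {
    completed: Vec<(u16, u32)>, // (descriptor index, bytes processed)
}

struct TxQueue {
    avail: AvailRing,
    used: UsedRing,
}

impl TxQueue {
    fn process(&mut self) {
        for &desc_index in &self.avail.pending {
            let len = emulate_tx(desc_index);
            // Marking the descriptor used inside the loop is accepted here:
            // the iterator borrows only `self.avail`, while this line mutates
            // only `self.used`, so the borrows never overlap.
            self.used.completed.push((desc_index, len));
        }
        self.avail.pending.clear();
    }
}

fn emulate_tx(desc_index: u16) -> u32 {
    // Stand-in for copying the frame out of guest memory and writing it to the tap.
    desc_index as u32 + 1
}

fn main() {
    let mut q = TxQueue {
        avail: AvailRing { pending: vec![0, 1, 2] },
        used: UsedRing { completed: Vec::new() },
    };
    q.process();
    assert_eq!(q.used.completed.len(), 3);
}
```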

Contributor Author:

Added issue number to TODO comment.

h.handle_event(TX_QUEUE_EVENT, 0);
assert_eq!(h.interrupt_evt.read(), Ok(2));
// make sure the data queue advanced
Contributor:

nit: Full stop and capital letter?

Contributor Author:

done

@@ -345,6 +335,29 @@ impl NetEpollHandler {
break;
}

read_count = 0;
// Copy buffer from across multiple descriptors.
// TODO(performance): change this to use `writev()` instead of `write()` and
Contributor:

Are you referring to the write from `let write_result = self.tap.write(&self.tx.frame_buf[..read_count as usize]);`? If so, I would move the comment right before that line.

Contributor Author:

I am referring to the write() you mentioned, but the TODO includes the loop where this comment is. Instead of looping through the tx.iovec vector and building the packet in the tx.frame_buf intermediate buffer, we could rework this loop and the call to tap.write() and do as described in #420.
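For reference, a rough sketch of the writev()-based direction described above (the actual design tracked in #420 may differ). The File handle and byte-slice segments are stand-ins for the real tap device and guest-memory mappings, and the sketch assumes the libc crate is available.

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

/// Write one frame made of several segments to `tap` in a single syscall,
/// instead of memcpy-ing everything into an intermediate buffer first.
fn writev_frame(tap: &File, segments: &[&[u8]]) -> io::Result<usize> {
    let iovecs: Vec<libc::iovec> = segments
        .iter()
        .map(|s| libc::iovec {
            iov_base: s.as_ptr() as *mut libc::c_void,
            iov_len: s.len(),
        })
        .collect();
    // SAFETY: every iovec points into a slice that outlives this call.
    let ret = unsafe {
        libc::writev(tap.as_raw_fd(), iovecs.as_ptr(), iovecs.len() as libc::c_int)
    };
    if ret < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(ret as usize)
    }
}

fn main() -> io::Result<()> {
    // Example: scatter two segments into one write (to /dev/null instead of a tap).
    let sink = File::create("/dev/null")?;
    let written = writev_frame(&sink, &[&b"virtio-net "[..], &b"frame"[..]])?;
    assert_eq!(written, 16);
    Ok(())
}
```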

Contributor Author:

I've added issue number to comment.

}
}
self.tx.iovec.push((desc.addr, desc.len as usize));
read_count += desc.len as usize;
Contributor:

So the read_count that the rate_limiter will be checking against will be different from the one that the previous implementation was computing. Is desc.len the maximum amount that can be read for each descriptor?

Contributor Author:

The read_count in both cases should always be the same. desc.len is the actual data length, not a maximum. mem.read_slice_at_addr(desc.len) should always return desc.len.

Contributor:

read_slice_at_addr will return how much it read from that address, which could be less than desc.len (i.e. the parameter provided). That is why I got confused. Consequently, read_count could declare more than what we actually read. However, since this is only used for imposing rate-limiter amounts, it should not be a problem. I guess we would have had a problem if the declared read_count had been smaller than the amount we actually read.

dianpopa previously approved these changes Jul 26, 2018
alexandruag previously approved these changes Jul 26, 2018
acatangiu dismissed stale reviews from alexandruag and dianpopa via 45756a2 July 31, 2018 11:07
The intermediate 64k buffer used to build the frame being TX'ed by the guest
was being initialized to all zeroes for each TX packet, wasting a lot of
CPU cycles.
Avoid that by keeping the buffer on the heap instead of on the stack.

Signed-off-by: Adrian Catangiu <[email protected]>
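A simplified illustration of this commit's idea (names and the buffer size are placeholders, not the real MAX_BUFFER_SIZE): the frame buffer becomes long-lived, heap-allocated state that is zeroed exactly once, instead of a local array re-initialized on every process_tx() call.

```rust
const MAX_BUFFER_SIZE: usize = 65_536; // placeholder size, roughly the 64k mentioned above

struct TxState {
    // Allocated and zeroed once when the handler is created, then reused for
    // every packet, so there is no per-packet memset of the full buffer.
    frame_buf: Box<[u8]>,
}

impl TxState {
    fn new() -> Self {
        TxState {
            frame_buf: vec![0u8; MAX_BUFFER_SIZE].into_boxed_slice(),
        }
    }

    fn process_tx(&mut self, payload: &[u8]) -> usize {
        // Only the bytes actually belonging to this frame are touched.
        let len = payload.len().min(MAX_BUFFER_SIZE);
        self.frame_buf[..len].copy_from_slice(&payload[..len]);
        len
    }
}

fn main() {
    let mut tx = TxState::new();
    assert_eq!(tx.process_tx(&[1, 2, 3]), 3);
}
```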
Also rearranged the net metrics layout for the same reason (most likely no
real gain here, though).

Signed-off-by: Adrian Catangiu <[email protected]>
Instead of kicking the guest driver after delivering each RX packet,
deliver as many packets as are already available on the tap and as there
is room for in the virtqueue, and only then kick the guest driver.

This results in slightly higher latency but much higher throughput,
and even lower IRQ-handling CPU overhead in the guest.

NOTE: it would be worth experimenting with a timer for kicking the
guest, but that would need serious fine-tuning.

Signed-off-by: Adrian Catangiu <[email protected]>
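A minimal sketch of the coalescing described in this commit message, with stand-in types instead of the real tap and virtqueue: frames are delivered while both sides have capacity, and the guest is notified once per batch rather than once per frame.

```rust
struct RxLoop {
    frames_from_tap: Vec<Vec<u8>>, // stand-in for "frames readable from the tap"
    free_queue_slots: usize,       // stand-in for available RX descriptors
    delivered: usize,
    interrupts_sent: usize,
}

impl RxLoop {
    fn process_rx(&mut self) {
        let mut delivered_any = false;
        // Deliver as many frames as both the tap and the virtqueue allow.
        while !self.frames_from_tap.is_empty() && self.free_queue_slots > 0 {
            let _frame = self.frames_from_tap.remove(0);
            self.free_queue_slots -= 1;
            self.delivered += 1;
            delivered_any = true;
        }
        // Coalesced notification: one interrupt for the whole batch.
        if delivered_any {
            self.interrupts_sent += 1;
        }
    }
}

fn main() {
    let mut rx = RxLoop {
        frames_from_tap: vec![vec![0u8; 64]; 8],
        free_queue_slots: 16,
        delivered: 0,
        interrupts_sent: 0,
    };
    rx.process_rx();
    assert_eq!((rx.delivered, rx.interrupts_sent), (8, 1));
}
```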
We're currently using an intermediate buffer to build the TX packet from
the virtqueue descriptors.
Ideally, we should use scatter-gather and avoid using this intermediate
buffer altogether.

This commit does part of the setup needed for scatter-gather by recording
the addrs and lens of the descriptors and postponing the memcpy to right
before writing to the tap.

This commit still uses the intermediate buffer, but slightly helps with
keeping the cache hot when doing the actual memcpy().

This change also greatly improves CPU usage when a TX rate-limiter is present.
Before, the full packet was built and copied over to the intermediate buffer,
but if there wasn't enough rate-limiter budget for it, that work would just be
wasted.
With this patch, the packet length is determined while recording the iovec,
and if there is not enough rate-limiter budget, the function returns fast
without copying over any memory.

Signed-off-by: Adrian Catangiu <[email protected]>
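A simplified sketch of the two-pass TX path this commit moves toward (hypothetical types; the real code works on guest-memory descriptor chains and Firecracker's RateLimiter): record the iovec and the total frame length first, check the rate-limiter budget against that length, and only then copy into the intermediate buffer.

```rust
struct RateLimiter {
    budget_bytes: usize,
}

impl RateLimiter {
    fn consume(&mut self, bytes: usize) -> bool {
        if bytes <= self.budget_bytes {
            self.budget_bytes -= bytes;
            true
        } else {
            false
        }
    }
}

/// `descriptors` holds (offset, len) pairs into `guest_mem`, standing in for the
/// (addr, len) pairs recorded from the descriptor chain.
fn process_tx_frame(
    guest_mem: &[u8],
    descriptors: &[(usize, usize)],
    limiter: &mut RateLimiter,
    frame_buf: &mut [u8],
) -> Option<usize> {
    // Pass 1: record the iovec and the total frame length without copying anything.
    let mut iovec = Vec::with_capacity(descriptors.len());
    let mut read_count = 0;
    for &(offset, len) in descriptors {
        iovec.push((offset, len));
        read_count += len;
    }

    // The rate-limiter check happens before any memcpy, so a rejected frame is
    // cheap (previously the whole frame was built first and then thrown away).
    if !limiter.consume(read_count) {
        return None;
    }

    // Pass 2: copy right before the tap write, keeping the cache hot.
    let mut copied = 0;
    for (offset, len) in iovec {
        frame_buf[copied..copied + len].copy_from_slice(&guest_mem[offset..offset + len]);
        copied += len;
    }
    Some(copied)
}

fn main() {
    let guest_mem = [7u8; 4096];
    let mut frame_buf = [0u8; 4096];
    let mut limiter = RateLimiter { budget_bytes: 1500 };
    let copied = process_tx_frame(&guest_mem, &[(0, 600), (600, 600)], &mut limiter, &mut frame_buf);
    assert_eq!(copied, Some(1200));
}
```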
The guest virtio driver reclaims used TX buffers as part of its
TX handler. Before adding a packet to the TX virtqueue, it first
reclaims any used descriptors in the ring.
The guest virtio driver interrupt handler does nothing for used
TX buffers. It is therefore useless and detrimental for Firecracker
virtio device emulation to 'kick' the guest virtio driver on TX
packet delivery.

This commit brings sizeable throughput gains since fewer cycles
are wasted in the guest servicing useless IRQs.

Signed-off-by: Adrian Catangiu <[email protected]>
acatangiu merged commit d772877 into firecracker-microvm:master Jul 31, 2018
acatangiu deleted the netopt branch August 27, 2018 09:11
Comment on lines 83 to 84
tx_frame: [u8; MAX_BUFFER_SIZE],
tx_used_desc_heads: [u16; QUEUE_SIZE as usize],
Contributor:

Maybe a bit late, but instead of polluting global state with local state, a better option would have been to use a MaybeUninit array and get a slice over it after initialization.
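For context, a hypothetical sketch of what the MaybeUninit suggestion could look like on a modern toolchain (not code from this PR, and the buffer size is a placeholder): build the frame into an uninitialized array and treat only the written prefix as initialized.

```rust
use std::mem::MaybeUninit;

const MAX_BUFFER_SIZE: usize = 65_536; // placeholder size

/// Build the frame in an uninitialized local array, then hand only the
/// initialized prefix to the (stubbed) tap write.
fn tx_frame(segments: &[&[u8]]) -> usize {
    // An array of MaybeUninit<u8> is allowed to start out uninitialized,
    // so there is no up-front zeroing of the whole buffer.
    let mut buf: [MaybeUninit<u8>; MAX_BUFFER_SIZE] =
        unsafe { MaybeUninit::uninit().assume_init() };

    let mut len = 0;
    for seg in segments {
        for &byte in *seg {
            buf[len] = MaybeUninit::new(byte);
            len += 1;
        }
    }

    // SAFETY: exactly the first `len` elements were written above.
    let frame: &[u8] =
        unsafe { std::slice::from_raw_parts(buf.as_ptr() as *const u8, len) };
    send(frame)
}

fn send(frame: &[u8]) -> usize {
    // Stand-in for self.tap.write(frame).
    frame.len()
}

fn main() {
    assert_eq!(tx_frame(&[&b"abc"[..], &b"de"[..]]), 5);
}
```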

Member:

MaybeUninit didn't exist at that point in time. So I say that you are a bit early, not a bit late.

Contributor Author:

Thanks for the MaybeUninit array suggestion, it looks interesting.

How is the current version of the code polluting global state? It's the device state which at any one time may have a buffered/pending frame.

Contributor:

To me it seems that this array is only used during process_tx, and the data inside the array is only used during a single call of process_tx; another call will not reuse this state. I haven't checked all the commits in this series though.
