
Commit a970445

yyxintel-lab-lkp authored and committed
ipvs: avoid drop first packet by reusing conntrack
Since commit f719e37 ("ipvs: drop first packet to redirect conntrack"), a new TCP connection that meets the conditions for rescheduling has its first SYN packet dropped. The client must then retransmit the SYN, which adds one second of latency to the new connection. The problem has been discussed widely. For example:

1) One second connection delay in masque
   https://marc.info/?t=151683118100004&r=1&w=2
2) IPVS low throughput
   kubernetes/kubernetes#70747
3) Apache Bench can fill up ipvs service proxy in seconds
   cloudnativelabs/kube-router#544
4) Additional 1s latency in `host -> service IP -> pod`
   kubernetes/kubernetes#90854
5) kube-proxy ipvs conn_reuse_mode setting causes errors with high load from single client
   kubernetes/kubernetes#81775

The root cause is that when the old session expires, ip_vs_conn_drop_conntrack() drops the conntrack entry related to the session. The code is as follows:

```
static void ip_vs_conn_expire(struct timer_list *t)
{
	...
	if ((cp->flags & IP_VS_CONN_F_NFCT) &&
	    !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) {
		/* Do not access conntracks during subsys cleanup
		 * because nf_conntrack_find_get can not be used after
		 * conntrack cleanup for the net.
		 */
		smp_rmb();
		if (ipvs->enable)
			ip_vs_conn_drop_conntrack(cp);
	}
	...
}
```

As the code shows, ip_vs_conn_drop_conntrack() is called only when (cp->flags & IP_VS_CONN_F_NFCT) is true.

We therefore optimize this with the following steps. Administrators can enable the optimization by setting net.ipv4.vs.conn_reuse_old_conntrack=1.

1) Clear the IP_VS_CONN_F_NFCT flag. This is safe because no further packets will use the old session.
2) Call ip_vs_conn_expire_now() to release the old session. The related conntrack is then not dropped.
3) IPVS no longer needs to drop the first SYN packet. It passes the SYN on to the next stage, which creates a new IPVS session. The new session is associated with the old conntrack, which conntrack reopens as a new entry. From then on, everything proceeds as if the old session had never existed.

This processing works in every case except passive FTP. IPVS can detect passive FTP from the conditions atomic_read(&cp->n_control) and cp->control. For all other cases (i.e. not FTP), IPVS should let users choose: they can select the higher-performance processing by setting net.ipv4.vs.conn_reuse_old_conntrack=1. This matters because most business scenarios (such as Kubernetes) are very sensitive to the latency of short-lived TCP connections.

This patch has been verified on thousands of our Kubernetes node servers at Tencent Inc.

Signed-off-by: YangYuxi <[email protected]>
1 parent c92cbae, commit a970445


4 files changed: 45 additions, 2 deletions


Documentation/networking/ipvs-sysctl.rst

Lines changed: 23 additions & 0 deletions
```diff
@@ -50,6 +50,29 @@ conn_reuse_mode - INTEGER
 	balancer in Direct Routing mode. This bit helps on adding new
 	real servers to a very busy cluster.
 
+conn_reuse_old_conntrack - BOOLEAN
+	- 0 - disabled
+	- not 0 - enabled (default)
+
+	If set, when a new TCP syn packet hit an old ipvs connection
+	table and need reschedule to a new dest: if
+	1) the packet use conntrack
+	2) the old ipvs connection table is not a master control
+	connection (E.g the command connection of passived FTP)
+	3) the old ipvs connection table been not controlled by any
+	connections (E.g the data connection of passived FTP)
+	ipvs Will not release the old conntrack, just let the conntrack
+	reopen the old session as it is a new one. This is an optimization
+	option selectable by the system administrator.
+
+	If not set, when a new TCP syn packet hit an old ipvs connection
+	table and need reschedule to a new dest: if
+	1) the packet use conntrack
+	ipvs just drop this syn packet, expire the old connection by timer.
+	This will cause the client tcp syn to retransmit.
+
+	Only has effect when conn_reuse_mode not 0.
+
 conntrack - BOOLEAN
 	- 0 - disabled (default)
 	- not 0 - enabled
```

include/net/ip_vs.h

Lines changed: 11 additions & 0 deletions
```diff
@@ -928,6 +928,7 @@ struct netns_ipvs {
 	int sysctl_pmtu_disc;
 	int sysctl_backup_only;
 	int sysctl_conn_reuse_mode;
+	int sysctl_conn_reuse_old_conntrack;
 	int sysctl_schedule_icmp;
 	int sysctl_ignore_tunneled;
 
@@ -1049,6 +1050,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_conn_reuse_old_conntrack(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_conn_reuse_old_conntrack;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1136,6 +1142,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline int sysctl_conn_reuse_old_conntrack(struct netns_ipvs *ipvs)
+{
+	return 1;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return 0;
```

net/netfilter/ipvs/ip_vs_core.c

Lines changed: 9 additions & 2 deletions
```diff
@@ -2066,7 +2066,7 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 
 	conn_reuse_mode = sysctl_conn_reuse_mode(ipvs);
 	if (conn_reuse_mode && !iph.fragoffs && is_new_conn(skb, &iph) && cp) {
-		bool uses_ct = false, resched = false;
+		bool uses_ct = false, resched = false, drop = false;
 
 		if (unlikely(sysctl_expire_nodest_conn(ipvs)) && cp->dest &&
 		    unlikely(!atomic_read(&cp->dest->weight))) {
@@ -2086,10 +2086,17 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		}
 
 		if (resched) {
+			if (uses_ct) {
+				if (unlikely(!atomic_read(&cp->n_control) && !cp->control) &&
+				    likely(sysctl_conn_reuse_old_conntrack(ipvs)))
+					cp->flags &= ~IP_VS_CONN_F_NFCT;
+				else
+					drop = true;
+			}
 			if (!atomic_read(&cp->n_control))
 				ip_vs_conn_expire_now(cp);
 			__ip_vs_conn_put(cp);
-			if (uses_ct)
+			if (drop)
 				return NF_DROP;
 			cp = NULL;
 		}
```

net/netfilter/ipvs/ip_vs_ctl.c

Lines changed: 2 additions & 0 deletions
```diff
@@ -4049,7 +4049,9 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	tbl[idx++].data = &ipvs->sysctl_pmtu_disc;
 	tbl[idx++].data = &ipvs->sysctl_backup_only;
 	ipvs->sysctl_conn_reuse_mode = 1;
+	ipvs->sysctl_conn_reuse_old_conntrack = 1;
 	tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
+	tbl[idx++].data = &ipvs->sysctl_conn_reuse_old_conntrack;
 	tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
 	tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
 
```
