ngx.location.capture combined with redis.set_keepalive causes the worker process to exit #108


Closed
xukaifu opened this issue Apr 27, 2012 · 6 comments


xukaifu commented Apr 27, 2012

r1.lua is as follows:

---- r1.lua
local redis = require "resty.redis"
local red = redis:new()
local ok, err = red:connect("192.168.191.69", 6379)
local ok, err = red:brpop("xxx", 10)
red:set_keepalive()
local http_ress = ngx.location.capture("/r2") -- 1
ngx.say("ok")

r2.lua is as follows:

---- r2.lua
local redis = require "resty.redis"
local red = redis:new()
local ok, err = red:connect("192.168.191.69", 6379) --2
local res = ngx.location.capture("/anyurl") --3

nginx.conf is as follows:

worker_processes  1;
error_log  logs/error.log  info;
pid        logs/nginx.pid;
events {
    worker_connections  1024;
}
http {

    lua_package_path '/opt/openresty/nginx/lua2/include/?.lua;;';

    server {
        listen       8888;
        server_name  localhost;

        location /r1{
            default_type text/html;
            lua_code_cache off;
            lua_need_request_body on;
            content_by_lua_file lua2/r1.lua;
        }

        location /r2{
            default_type text/html;
            lua_code_cache off;
            lua_need_request_body on;
            content_by_lua_file lua2/r2.lua;
        }

        location /anyurl{
            internal;
            proxy_pass http://192.168.194.98/404.php;
        }

    }
}

First, push some data into Redis:

lpush xxx v1 
lpush xxx v1 
lpush xxx v1 
...

Then request /r1:

curl localhost:8888/r1

At this point the worker process exits, and the debug log shows the following:
2012/04/27 18:25:24 [notice] 26423#0: signal 17 (SIGCHLD) received
2012/04/27 18:25:24 [alert] 26423#0: worker process 12262 exited on signal 11

This reproduces 100% of the time with Redis 2.2.11 and 2.4.1.

Commenting out any one of the lines marked 1, 2, or 3 above makes the crash go away.


xukaifu commented Apr 27, 2012

To add some detail: if r1.lua and r2.lua connect to different Redis servers, the problem does not occur.

Build options:
./configure --with-luajit --with-debug --prefix=/opt/openresty

Operating system: SUSE Linux Enterprise Server 10.3 (x86_64) / Ubuntu 11.10
Redis version: 2.4.1 / 2.2.11


bigplum commented Apr 28, 2012

In r2.lua, red:connect reuses a keepalive peer but no operation is ever performed on it, so the write_event_handler used by ngx_http_lua_socket_tcp_handler is never initialized.

After ngx.location.capture finishes, an epoll write event comes in, u->write_event_handler is invoked, and the worker core dumps. Adding a red:ping() right after red:connect is a simple workaround (a sketch follows the logs below).

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00000000004e57e7 in ngx_http_lua_socket_tcp_handler (ev=0x6afa68) at ../ngx_lua-0.5.0rc21/src/ngx_http_lua_socket.c:1700
#2  0x0000000000447a0b in ngx_epoll_process_events (cycle=0x677ca0, timer=60000, flags=1) at src/event/modules/ngx_epoll_module.c:708
#3  0x00000000004373cc in ngx_process_events_and_timers (cycle=0x677ca0) at src/event/ngx_event.c:245
#4  0x0000000000443c68 in ngx_worker_process_cycle (cycle=0x677ca0, data=0x0) at src/os/unix/ngx_process_cycle.c:801
#5  0x0000000000440a09 in ngx_spawn_process (cycle=0x677ca0, proc=0x443abc <ngx_worker_process_cycle>, data=0x0, name=0x511d7e "worker process", respawn=0)
    at src/os/unix/ngx_process.c:196
#6  0x0000000000443713 in ngx_reap_children (cycle=0x677ca0) at src/os/unix/ngx_process_cycle.c:617
#7  0x00000000004424ec in ngx_master_process_cycle (cycle=0x677ca0) at src/os/unix/ngx_process_cycle.c:180
#8  0x0000000000414b39 in main (argc=1, argv=0x7fff163ad6a8) at src/core/nginx.c:411
2012/04/28 10:20:02 [debug] 26739#0: *19 http subrequest "/anyurl?"
2012/04/28 10:20:02 [debug] 26739#0: *19 lua resume returned 1
2012/04/28 10:20:02 [debug] 26739#0: *19 lua thread yielded
2012/04/28 10:20:02 [debug] 26739#0: *19 http finalize request: -4, "/r2?" a:0, c:4
2012/04/28 10:20:02 [debug] 26739#0: *19 http request count:4 blk:0 
2012/04/28 10:20:02 [debug] 26739#0: *19 http posted request: "/anyurl?"
2012/04/28 10:20:02 [debug] 26739#0: *19 rewrite phase: 0
2012/04/28 10:20:02 [debug] 26739#0: *19 rewrite phase: 1
2012/04/28 10:20:02 [debug] 26739#0: *19 test location: "/r1"
2012/04/28 10:20:02 [debug] 26739#0: *19 test location: "/anyurl"
2012/04/28 10:20:02 [debug] 26739#0: *19 using configuration "/anyurl"
2012/04/28 10:20:02 [debug] 26739#0: *19 http cl:-1 max:1048576
2012/04/28 10:20:02 [debug] 26739#0: *19 rewrite phase: 3
2012/04/28 10:20:02 [debug] 26739#0: *19 rewrite phase: 4
2012/04/28 10:20:02 [debug] 26739#0: *19 post rewrite phase: 5
2012/04/28 10:20:02 [debug] 26739#0: *19 generic phase: 6
2012/04/28 10:20:02 [debug] 26739#0: *19 generic phase: 7
2012/04/28 10:20:02 [debug] 26739#0: *19 http init upstream, client timer: 0
2012/04/28 10:20:02 [debug] 26739#0: *19 epoll add event: fd:3 op:3 ev:80000005
2012/04/28 10:20:02 [debug] 26739#0: *19 http script copy: "Host: " 
2012/04/28 10:20:02 [debug] 26739#0: *19 http script var: "192.168.194.98"
2012/04/28 10:20:02 [debug] 26739#0: *19 http script copy: "^M
"
2012/04/28 10:20:02 [debug] 26739#0: *19 http script copy: "Connection: close^M
"
2012/04/28 10:20:02 [debug] 26739#0: *19 http proxy header: "User-Agent: curl/7.15.1 (x86_64-suse-linux) libcurl/7.15.1 OpenSSL/0.9.8a zlib/1.2.3 libidn/0.6.0"
2012/04/28 10:20:02 [debug] 26739#0: *19 http proxy header: "Accept: */*"
2012/04/28 10:20:02 [debug] 26739#0: *19 http proxy header: 
"GET /sleep.php HTTP/1.0^M
Host: 192.168.194.98^M
Connection: close^M
User-Agent: curl/7.15.1 (x86_64-suse-linux) libcurl/7.15.1 OpenSSL/0.9.8a zlib/1.2.3 libidn/0.6.0^M
Accept: */*^M
^M
"
2012/04/28 10:20:02 [debug] 26739#0: *19 http cleanup add: 000000000067E808
2012/04/28 10:20:02 [debug] 26739#0: *19 get rr peer, try: 1
2012/04/28 10:20:02 [debug] 26739#0: *19 socket 10
2012/04/28 10:20:02 [debug] 26739#0: *19 epoll add connection: fd:10 ev:80000005
2012/04/28 10:20:02 [debug] 26739#0: *19 connect to 192.168.194.98:80, fd:10 #21
2012/04/28 10:20:02 [debug] 26739#0: *19 http upstream connect: -2
2012/04/28 10:20:02 [debug] 26739#0: *19 event timer add: 10: 60000:1335579662742
2012/04/28 10:20:02 [debug] 26739#0: *19 http finalize request: -4, "/anyurl?" a:1, c:4
2012/04/28 10:20:02 [debug] 26739#0: *19 http request count:4 blk:0 
2012/04/28 10:20:02 [debug] 26739#0: *19 lua socket handler for "/r2?", wev 1
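
For reference, here is a minimal sketch of the workaround suggested above, applied to r2.lua: issue at least one command on the pooled connection (red:ping() here) before calling ngx.location.capture. The Redis address is the one from the original report; the error handling and the final ngx.say are added for illustration and are not part of the original reproduction.

---- r2.lua (with the red:ping() workaround)
local redis = require "resty.redis"

local red = redis:new()

local ok, err = red:connect("192.168.191.69", 6379)
if not ok then
    ngx.log(ngx.ERR, "failed to connect to redis: ", err)
    return ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
end

-- perform at least one operation on the connection taken from the keepalive
-- pool, so the cosocket's event handlers are set up before this request
-- yields into the subrequest
local pong, err = red:ping()
if not pong then
    ngx.log(ngx.ERR, "failed to ping redis: ", err)
    return ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
end

local res = ngx.location.capture("/anyurl")
ngx.say(res.status)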


agentzh commented Apr 28, 2012

@xukaifu I've reproduced this locally. It is indeed a bug in ngx_lua's cosocket implementation; I'm starting on a fix right now :)

@bigplum Thanks for your detailed analysis of this bug :)

agentzh added a commit that referenced this issue Apr 28, 2012
…egfaults if it is not used immediately. thanks xukaifu for reporting this as github issue #108.

agentzh commented Apr 28, 2012

I've pushed a patch to git master HEAD. Could you give it a try? Thanks!


agentzh commented May 4, 2012

Consider it resolved.

agentzh closed this as completed on May 4, 2012

xukaifu commented May 4, 2012

I just tested it; it works now. Thank you very much.

bakins pushed a commit to bakins/lua-nginx-module that referenced this issue Jun 30, 2012
…egfaults if it is not used immediately. thanks xukaifu for reporting this as github issue openresty#108.
thibaultcha added a commit to thibaultcha/lua-nginx-module that referenced this issue Feb 15, 2019
This issue appeared in our EC2 test cluster, which compiles Nginx with
`-DNGX_LUA_USE_ASSERT` and `-DNGX_LUA_ABORT_AT_PANIC`.

The lua-resty-redis test:

    === TEST 1: github issue openresty#108: ngx.locaiton.capture + redis.set_keepalive

in t/bugs.t [1] would produce core dumps in the check leak testing mode.

The backtrace for these core dumps was:

    #0  0x00007fd417bee277 in raise () from /lib64/libc.so.6
    #1  0x00007fd417bef968 in abort () from /lib64/libc.so.6
    #2  0x00007fd417be7096 in __assert_fail_base () from /lib64/libc.so.6
    #3  0x00007fd417be7142 in __assert_fail () from /lib64/libc.so.6
    #4  0x000000000050d227 in ngx_http_lua_socket_tcp_resume_conn_op (spool=c/ngx_http_lua_socket_tcp.c:3963
    #5  0x000000000050e51a in ngx_http_lua_socket_tcp_finalize (r=r@entry=0x5628) at ../../src/ngx_http_lua_socket_tcp.c:4195
    #6  0x000000000050e570 in ngx_http_lua_socket_tcp_cleanup (data=0x7fd419p_lua_socket_tcp.c:3755
    #7  0x0000000000463aa5 in ngx_http_free_request (r=r@entry=0xbfaec0, rc=http_request.c:3508
    ...

Which was caused by the following assertion in ngx_http_lua_socket_tcp.c
with `NGX_DEBUG`:

    #if (NGX_DEBUG)
        ngx_http_lua_assert(spool->connections >= 0);

    #else

Thanks to Mozilla's rr, a recorded session showed that
`spool->connections` was `-1`.

Unfortunately, reproducing this case does not seem possible, since the
failure is due to the request cleanup (`ngx_http_free_request`). Here is
an explanation:

    -- thread 1
    local sock = ngx.socket.tcp()
    sock:connect()
    sock:setkeepalive() -- pool created, connections: 1

        -- thread 2
        local sock = ngx.socket.tcp()
        sock:connect() -- from pool, connections: 1

    -- thread 1
    -- sock from thread 1 idle timeout, closes, and calls
    -- ngx_http_lua_socket_tcp_finalize, connections: 0

        -- thread 2
        sock:setkeepalive() -- connections: -1
        -- ngx_http_lua_socket_tcp_resume_conn_op gets called, assertion fails

In order to avoid this race condition, we must determine whether the
socket pool exists or not, not from the
`ngx_http_lua_socket_tcp_upstream` struct, but from the Lua Registry.
This way, when thread 2's socket enters the keepalive state, it will
respect the previous call to `ngx_http_lua_socket_free_pool` (which
unset the pool from the registry).

[1]: https://github.com/openresty/lua-resty-redis/blob/master/t/bugs.t
thibaultcha added a commit to thibaultcha/lua-nginx-module that referenced this issue Feb 16, 2019
This issue appeared in our EC2 test cluster, which compiles Nginx with
`-DNGX_LUA_USE_ASSERT` and `-DNGX_LUA_ABORT_AT_PANIC`.

The lua-resty-redis test:

    === TEST 1: github issue openresty#108: ngx.locaiton.capture + redis.set_keepalive

in t/bugs.t [1] would produce core dumps in the check leak testing mode.

The backtrace for these core dumps was:

    #0  0x00007fd417bee277 in raise () from /lib64/libc.so.6
    #1  0x00007fd417bef968 in abort () from /lib64/libc.so.6
    #2  0x00007fd417be7096 in __assert_fail_base () from /lib64/libc.so.6
    #3  0x00007fd417be7142 in __assert_fail () from /lib64/libc.so.6
    #4  0x000000000050d227 in ngx_http_lua_socket_tcp_resume_conn_op (spool=c/ngx_http_lua_socket_tcp.c:3963
    #5  0x000000000050e51a in ngx_http_lua_socket_tcp_finalize (r=r@entry=0x5628) at ../../src/ngx_http_lua_socket_tcp.c:4195
    #6  0x000000000050e570 in ngx_http_lua_socket_tcp_cleanup (data=0x7fd419p_lua_socket_tcp.c:3755
    #7  0x0000000000463aa5 in ngx_http_free_request (r=r@entry=0xbfaec0, rc=http_request.c:3508
    ...

Which was caused by the following assertion in ngx_http_lua_socket_tcp.c
with `NGX_DEBUG`:

    #if (NGX_DEBUG)
        ngx_http_lua_assert(spool->connections >= 0);

    #else

Thanks to Mozilla's rr, a recorded session showed that
`spool->connections` was `-1`.

Here is a reproducible test case:

    local sock1 = ngx.socket.tcp()
    local sock2 = ngx.socket.tcp()

    sock1:connect()
    sock2:connect()

    sock1:setkeepalive() -- pool created, connections: 1
    sock2:setkeepalive() -- connections: 1

    sock1:connect() -- connections: 1
    sock2:connect() -- connections: 1

    sock1:close() -- connections: 0
    sock2:close() -- connections: -1
    -- ngx_http_lua_socket_tcp_resume_conn_op gets called, assertion fails

In order to avoid this race condition, we must determine whether the
socket pool exists or not, not from the
`ngx_http_lua_socket_tcp_upstream` struct, but from the Lua Registry.
This way, when thread 2's socket enters the keepalive state, it will
respect the previous call to `ngx_http_lua_socket_free_pool` (which
unset the pool from the registry).

[1]: https://github.com/openresty/lua-resty-redis/blob/master/t/bugs.t