Closed
Description
I've run into an issue when using remotecall to call a function that in turn makes a remote call to another function on a different worker. The problem only occurs when the workers are started with an SSHManager with tunnel=true.
Here is an MWE (you will need somewhere to launch the remote workers on):
using Distributed
manager = Distributed.SSHManager([("myuser@my_ip_address", 2)])
addprocs_opts = Dict(:exename=>"/path/to/julia",
                     :sshflags=>`-i /path/to/my/key.pem`,
                     :dir=>"/home/myuser",
                     :tunnel=>true)
p = addprocs(manager; addprocs_opts...)
@everywhere begin
    function get_remote_ids(w2::Int)
        caller_id = myid()
        called_id = remotecall_fetch(myid, w2)
        return [caller_id, called_id]
    end
end
remotecall_fetch(get_remote_ids, p[1], p[2])
This results in
From worker 6: ┌ Error: Error on 7 while connecting to peer 6, exiting
From worker 6: │ exception =
From worker 6: │ TypeError: in typeassert, expected Tuple{String, Int64}, got a value of type Tuple{SubString{String}, Int64}
From worker 6: │ Stacktrace:
From worker 6: │ [1] connect_w2w(pid::Int64, config::WorkerConfig)
From worker 6: │ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:584
From worker 6: │ [2] connect(manager::Distributed.DefaultClusterManager, pid::Int64, config::WorkerConfig)
From worker 6: │ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:519
From worker 6: │ [3] connect_to_peer(manager::Distributed.DefaultClusterManager, rpid::Int64, wconfig::WorkerConfig)
From worker 6: │ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:357
From worker 6: │ [4] (::Distributed.var"#117#119"{Int64, WorkerConfig})()
From worker 6: │ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:341
From worker 6: │ [5] exec_conn_func(w::Distributed.Worker)
From worker 6: │ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:179
From worker 6: │ [6] exec_conn_func(id::Int64)
From worker 6: │ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:173
From worker 6: │ [7] (::Distributed.var"#106#108"{Distributed.CallMsg{:call_fetch}})()
From worker 6: │ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:278
From worker 6: │ [8] run_work_thunk(thunk::Distributed.var"#106#108"{Distributed.CallMsg{:call_fetch}}, print_error::Bool)
From worker 6: │ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:63
From worker 6: │ [9] macro expansion
From worker 6: │ @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:278 [inlined]
From worker 6: │ [10] (::Distributed.var"#105#107"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
From worker 6: │ @ Distributed ./task.jl:411
From worker 6: └ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:364
Worker 7 terminated.
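The TypeError itself can be reproduced without a cluster. This is only a sketch of what the message seems to describe: `connect_at` below is a hypothetical stand-in for `config.connect_at`, the port is arbitrary, and `split` is just one common way to end up with a `SubString`; I have not traced where the `SubString` actually comes from inside Distributed.

```julia
# Hypothetical stand-in for the value stored in config.connect_at.
# split returns SubString{String} pieces rather than String.
host = split("myuser@my_ip_address", '@')[2]
connect_at = (host, 9009)  # port chosen arbitrarily for illustration

@show typeof(connect_at)

# Same shape of assertion as the one named in the error message.
err = try
    connect_at::Tuple{String, Int}
    nothing
catch e
    e
end

# The assertion throws because SubString{String} is not String.
@show err isa TypeError
```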
I've been able to work around it for now by relaxing the type assertion on line 577 of managers.jl (julia/stdlib/Distributed/src/managers.jl, lines 576 to 582 in 47f9139) to
(rhost, rport) = notnothing(config.connect_at)::Tuple{AbstractString, Int}
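To illustrate why the relaxed assertion is enough, here is a small standalone sketch. The tuple below is a made-up stand-in for `config.connect_at`, not the real value from a worker. The second half shows a possible alternative (converting the host to `String` before asserting); that is my own assumption about another way to handle it, not what Distributed does.

```julia
# Hypothetical stand-in for config.connect_at holding a SubString host.
connect_at = (SubString("my_ip_address", 1), 9009)

# Relaxed assertion: SubString{String} <: AbstractString, so this passes.
(rhost, rport) = connect_at::Tuple{AbstractString, Int}

# Possible alternative (an assumption, not the library's behavior):
# normalize the host to String so the strict assertion also passes.
normalized = (String(connect_at[1]), connect_at[2])
(shost, sport) = normalized::Tuple{String, Int}

@show rhost isa SubString
@show shost isa String
```

The relaxed assertion keeps the value as-is, while the conversion makes a copy of the host string; either way the destructured host and port are the same.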