Issue with Tensorflow horizontal_fl Could not start gRPC server #316
Comments
It seems the connection port is already in use. Kill all the previous ps and worker processes and try again.
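Before restarting, it can help to confirm that the ports are actually free. The following is a minimal sketch, not part of the project: the helper name port_is_free and the port numbers are hypothetical, so substitute whatever ports your ps/worker cluster spec uses. It simply tries to bind a TCP socket, which is the same syscall that fails with errno 98 in the log.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically errno 98 (EADDRINUSE) when another process holds the port.
            return False
    return True


if __name__ == "__main__":
    # Hypothetical ports; replace with the ones from your own cluster spec.
    for port in (60002, 61002):
        state = "free" if port_is_free(port) else "in use"
        print(f"port {port}: {state}")
```

This check does not set SO_REUSEADDR, so it may report a port as in use slightly longer than gRPC itself would; treat "in use" as a hint to look for leftover ps or worker processes rather than as a definitive answer.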
Thank you. I killed all the previous ps and worker processes, and that solved the problem. But the worker node gets this problem.
I recommend you first check that SGX/TDX is enabled correctly and that it can get a quote and pass attestation. Also check whether it runs successfully without the RA-TLS attestation strategy.
1. The SGX check passes.
2. Running the ra-tls-mbedtls example passes: gramine-sgx ./server dcap &
3. But running intelcczoo/horizontal_fl:anolis_sgx_latest with the command ./test-nosgx.sh worker0 in image_classification gives an error.
I have no idea about this issue. It does not seem to be related to the SGX configs. Maybe you can check your environment (proxy, etc.). One time-consuming method is to recompile TensorFlow using the Dockerfile and add some debug messages to see what happens.
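As a starting point for the environment check, here is a small sketch that lists proxy-related variables visible to the training process. The variable list is an assumption on my side (common proxy variables plus grpc_proxy, which gRPC is generally understood to honor); the helper name proxy_settings is hypothetical and not part of TensorFlow or this project.

```python
import os

# Assumed list of proxy-related variable names; extend as needed.
PROXY_VARS = ("http_proxy", "https_proxy", "all_proxy", "no_proxy", "grpc_proxy")


def proxy_settings(environ=None):
    """Return the proxy-related variables that are set to a non-empty value."""
    environ = os.environ if environ is None else environ
    found = {}
    for name in PROXY_VARS:
        # Check both the lowercase and uppercase spelling of each variable.
        for key in (name, name.upper()):
            if environ.get(key):
                found[key] = environ[key]
    return found


if __name__ == "__main__":
    settings = proxy_settings()
    if not settings:
        print("no proxy variables set")
    for key, value in settings.items():
        print(f"{key}={value}")
```

Run it inside the same container and shell that launches ./test-nosgx.sh, since the variables can differ between the host and the container.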
problem:
E0410 01:06:16.743285006 880726 server_chttp2.cc:49] {"created":"@1744247176.743237155","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":873,"referenced_errors":[{"created":"@1744247176.743233104","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":341,"referenced_errors":[{"created":"@1744247176.743227576","description":"Unable to configure socket","fd":5,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1744247176.743225921","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1744247176.743232784","description":"Unable to configure socket","fd":5,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1744247176.743231782","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
2025-04-10 01:06:16.743320: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:557] Unknown: Could not start gRPC server
Traceback (most recent call last):
  File "train.py", line 248, in <module>
    tf.app.run(main=main)
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "train.py", line 245, in main
    train()
  File "train.py", line 171, in train
    task_index=FLAGS.task_index)
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/training/server_lib.py", line 148, in __init__
    self._server = c_api.TF_NewServer(self._server_def.SerializeToString())
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server