Issue with TensorFlow horizontal_fl: Could not start gRPC server #316

lcmfq · 2025-04-10T01:23:10Z

problem:
E0410 01:06:16.743285006 880726 server_chttp2.cc:49] {"created":"@1744247176.743237155","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":873,"referenced_errors":[{"created":"@1744247176.743233104","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":341,"referenced_errors":[{"created":"@1744247176.743227576","description":"Unable to configure socket","fd":5,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1744247176.743225921","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1744247176.743232784","description":"Unable to configure socket","fd":5,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1744247176.743231782","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
2025-04-10 01:06:16.743320: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:557] Unknown: Could not start gRPC server
Traceback (most recent call last):
  File "train.py", line 248, in <module>
    tf.app.run(main=main)
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "train.py", line 245, in main
    train()
  File "train.py", line 171, in train
    task_index=FLAGS.task_index)
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/training/server_lib.py", line 148, in __init__
    self._server = c_api.TF_NewServer(self._server_def.SerializeToString())
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server

Hsy-Intel · 2025-04-10T01:27:18Z

It seems the port is already in use ("Address already in use", errno 98, on bind). Kill all the previous ps and worker processes and try again.
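To confirm the port really is free before restarting, a quick bind test can help. This is only a minimal sketch: the function name `port_is_free` and the port numbers in the usage part are hypothetical, so substitute the ports from your own ps/worker cluster spec.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # On Linux this is typically errno 98 (EADDRINUSE) when another
            # process still holds the port; other bind failures (for example
            # missing permission for ports below 1024) are also reported
            # as "not free".
            return False
    return True


if __name__ == "__main__":
    # Hypothetical ports; replace with the ones used by your ps and workers.
    for port in (60002, 61002):
        state = "free" if port_is_free(port) else "in use"
        print(f"port {port}: {state}")
```

If a port is reported as in use, a leftover ps or worker process is the likely holder, and stopping it should let the gRPC server bind again.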

lcmfq · 2025-04-10T01:39:11Z

Thank you. I killed all the previous ps and worker processes, and that solved the problem. But the worker node now gets this error:
E0410 01:36:58.731299495 890200 ssl_transport_security.cc:1468] Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.

Hsy-Intel · 2025-04-10T01:50:27Z

> Thank you. I killed all the previous ps and worker processes, and that solved the problem. But the worker node now gets this error:
> E0410 01:36:58.731299495 890200 ssl_transport_security.cc:1468] Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.

I recommend first checking that SGX/TDX is enabled correctly, that a quote can be generated, and that attestation passes. Also check whether the workload runs successfully without the RA-TLS attestation strategy.

lcmfq · 2025-04-10T02:13:02Z

1. The SGX check passes.

2. The ra-tls-mbedtls example runs successfully:

gramine-sgx ./server dcap &
RA_TLS_ALLOW_DEBUG_ENCLAVE_INSECURE=1 \
RA_TLS_ALLOW_OUTDATED_TCB_INSECURE=1 \
RA_TLS_MRENCLAVE= \
RA_TLS_MRSIGNER= \
RA_TLS_ISV_PROD_ID=<ISV_PROD_ID of the server enclave> \
RA_TLS_ISV_SVN=<ISV_SVN of the server enclave> \
./client dcap

3. But running intelcczoo/horizontal_fl:anolis_sgx_latest with the command ./test-nosgx.sh worker0 in image_classification still fails with the handshake error from my previous comment.

Hsy-Intel · 2025-04-10T03:02:54Z

I'm not sure what causes this. It does not seem to be related to the SGX configuration. Maybe check your environment (proxy settings, etc.). A more time-consuming option is to rebuild TensorFlow from the Dockerfile with some added debug messages to see what happens.
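For the proxy check, something like the following could be run inside the container on each ps and worker node. It is only a sketch: the helper name `proxy_settings` is hypothetical, and the variable list covers the proxy variables that gRPC and common network tools are generally documented to read, which may not be everything that matters in your setup.

```python
import os

# Proxy-related variables, checked in both lower and upper case.
PROXY_VARS = ("http_proxy", "https_proxy", "grpc_proxy", "no_proxy", "no_grpc_proxy")


def proxy_settings(environ=None):
    """Return the proxy-related variables that are set to a non-empty value."""
    environ = os.environ if environ is None else environ
    found = {}
    for name in PROXY_VARS:
        for key in (name, name.upper()):
            value = environ.get(key)
            if value:
                found[key] = value
    return found


if __name__ == "__main__":
    settings = proxy_settings()
    if not settings:
        print("no proxy variables set")
    for key, value in settings.items():
        print(f"{key}={value}")
```

If a proxy is set and the ps/worker addresses are not excluded through no_proxy, the worker traffic may be routed through the proxy, so unsetting those variables for a test run is one way to rule that out.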
