Issue with Tensorflow horizontal_fl Could not start gRPC server #316

Open
lcmfq opened this issue Apr 10, 2025 · 5 comments

Comments

@lcmfq

lcmfq commented Apr 10, 2025

problem:
E0410 01:06:16.743285006 880726 server_chttp2.cc:49] {"created":"@1744247176.743237155","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":873,"referenced_errors":[{"created":"@1744247176.743233104","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":341,"referenced_errors":[{"created":"@1744247176.743227576","description":"Unable to configure socket","fd":5,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1744247176.743225921","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1744247176.743232784","description":"Unable to configure socket","fd":5,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1744247176.743231782","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
2025-04-10 01:06:16.743320: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:557] Unknown: Could not start gRPC server
Traceback (most recent call last):
  File "train.py", line 248, in <module>
    tf.app.run(main=main)
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "train.py", line 245, in main
    train()
  File "train.py", line 171, in train
    task_index=FLAGS.task_index)
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/training/server_lib.py", line 148, in __init__
    self._server = c_api.TF_NewServer(self._server_def.SerializeToString())
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server

@Hsy-Intel
Contributor

It seems the port is already in use (the log shows errno 98, "Address already in use", on bind). Kill all the previous ps and worker processes and try again.
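
If it helps, here is a minimal sketch for checking whether a port is still held before launching the ps or worker again. It only tries to bind the port the same way the gRPC server would and reports whether the bind fails with "Address already in use". The function name `port_is_free` and the port numbers in the usage lines are placeholders of mine, not values taken from this repo; substitute the ports from your own cluster spec.

```python
import errno
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently be bound to (host, port).

    host="" means all interfaces, which mirrors a wildcard listener.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False  # some other process still owns the port
            raise  # anything else (e.g. permission denied) is a different problem
        return True


if __name__ == "__main__":
    # Hypothetical ports; replace with the ones your ps/worker hosts use.
    for candidate in (2222, 2223):
        state = "free" if port_is_free(candidate) else "still in use"
        print(f"port {candidate}: {state}")
```

If a port is reported as still in use after you have killed the old processes, a leftover process is probably still holding it.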

@lcmfq
Author

lcmfq commented Apr 10, 2025

Thank you. I killed all the previous ps and worker processes, and that solved the problem. But now the worker node gets this error:
E0410 01:36:58.731299495 890200 ssl_transport_security.cc:1468] Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.

@Hsy-Intel
Contributor

I recommend first checking whether SGX/TDX is enabled correctly and whether the platform can generate a quote and pass attestation. Also check whether the workload runs successfully without the RA-TLS attestation strategy.
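
As a side check, WRONG_VERSION_NUMBER is commonly what a TLS client reports when the bytes it receives are not a TLS record, for example when the other side is speaking plaintext. The sketch below is only a generic probe under that assumption, not something from this repo: it opens a TLS handshake against an endpoint and reports whether the peer answers with TLS at all. The function name `probe_tls` and the host/port in the usage line are hypothetical, and certificate verification is deliberately disabled because the probe only cares about the protocol, not about trust.

```python
import socket
import ssl


def probe_tls(host, port, timeout=3.0):
    """Report whether the endpoint at (host, port) answers a TLS handshake."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # diagnostic only: do not validate the peer
    ctx.verify_mode = ssl.CERT_NONE  # (never do this for real traffic)
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return "tls: " + str(tls.version())
    except ssl.SSLError as exc:
        # Must be caught before OSError, since SSLError is a subclass of it.
        return "not-tls: " + (exc.reason or str(exc))
    except OSError as exc:
        return "unreachable: " + str(exc)


if __name__ == "__main__":
    # Hypothetical address; use the ps address from your own setup.
    print(probe_tls("127.0.0.1", 2222))
```

A result starting with "not-tls" would suggest that one side is running with TLS and the other without it, which is worth ruling out before looking deeper into attestation.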

@lcmfq
Author

lcmfq commented Apr 10, 2025

1. The SGX check passes:

Image

2. The ra-tls-mbedtls example also passes:

gramine-sgx ./server dcap &

RA_TLS_ALLOW_DEBUG_ENCLAVE_INSECURE=1 \
RA_TLS_ALLOW_OUTDATED_TCB_INSECURE=1 \
RA_TLS_MRENCLAVE= \
RA_TLS_MRSIGNER= \
RA_TLS_ISV_PROD_ID=<ISV_PROD_ID of the server enclave> \
RA_TLS_ISV_SVN=<ISV_SVN of the server enclave> \
./client dcap

Image

Image

3. But running the intelcczoo/horizontal_fl:anolis_sgx_latest image with the command ./test-nosgx.sh worker0 in image_classification fails with the error from my previous comment.

Image

@Hsy-Intel
Contributor

I have no idea what causes this issue. It does not seem to be related to the SGX configuration. Maybe you can check your environment (proxy settings, etc.). A more time-consuming option is to rebuild TensorFlow from the Dockerfile with some added debug messages to see what happens.
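
For the proxy part, a minimal sketch that lists any proxy-related variables visible to the process. The variable names in `PROXY_VARS` are my assumption of the ones worth inspecting (gRPC documents `grpc_proxy`, `https_proxy`, `http_proxy`, `no_grpc_proxy` and `no_proxy`; the uppercase spellings are included only because other tools use them), and `proxy_settings` is a made-up helper name.

```python
import os

# Assumed list of variables to inspect; extend it if your setup uses others.
PROXY_VARS = ("grpc_proxy", "https_proxy", "http_proxy", "no_grpc_proxy", "no_proxy")


def proxy_settings(environ=None):
    """Return the non-empty proxy variables found in the given environment."""
    env = os.environ if environ is None else environ
    found = {}
    for name in PROXY_VARS:
        for key in (name, name.upper()):
            value = env.get(key)
            if value:
                found[key] = value
    return found


if __name__ == "__main__":
    settings = proxy_settings()
    if not settings:
        print("no proxy variables set")
    for key, value in settings.items():
        print(f"{key}={value}")
```

Run it inside the container, in the same shell that launches the worker, since the variables there may differ from those on the host.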
