Description
Environment Details
- XGBoost Version: tried 2.1.0, 2.1.1, 2.1.3, 2.1.4
- Spark: 3.5.0 (Cluster Mode: YARN)
- Scala: 2.12.18
- Java: OpenJDK 8
- Cluster: YARN/Hadoop 3.2.2
Background
Our pipeline ran successfully with Spark 3.1.1 + XGBoost 1.1.1 in production. After upgrading to Spark 3.5.0, we tested multiple XGBoost versions (2.1.0-2.1.4) and consistently encountered the same Rabit tracker connection error during distributed training.
Error Description
Failure occurs when initializing distributed training:
ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: [12:58:58] /workspace/src/collective/result.cc:78:
- [tracker.cc:286|12:58:58]: Failed to accept connection.
- [socket.h:89|12:58:58]: Invalid polling request.
Full stack trace shows the error originates from RabitTracker.stop() after connection rejection.
Reproduction Steps
- Code:
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9", "f10", "f11", "f12", "f13", "f14", "f15", "f16", "f17", "f18", "f19", "f20", "f21", "f22", "f23"))
  .setOutputCol("features")

val labelIndexer = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("indexedLabel")
  .setHandleInvalid("skip")
  .fit(training)

val booster = new XGBoostClassifier(
  Map(
    "eta" -> 0.1f,
    "max_depth" -> 5,
    "objective" -> "multi:softprob",
    "num_class" -> 2,
    "device" -> "cpu"
  )
).setNumRound(10).setNumWorkers(2)
booster.setFeaturesCol("features")
booster.setLabelCol("indexedLabel")

val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("convertedPrediction")
  .setLabels(labelIndexer.labelsArray(0))

val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, booster, converter))

println("ready to train...")
val model: PipelineModel = pipeline.fit(training) // fails here with the tracker error
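As a minor aside, the 23 hard-coded column names passed to the assembler can be generated instead of listed by hand; a small sketch (`featureCols` is a name introduced here, not part of the original pipeline):

```scala
// Build the feature column names f1..f23 used by the VectorAssembler,
// instead of writing the array out by hand.
val featureCols: Array[String] = (1 to 23).map(i => s"f$i").toArray
// Then: new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
```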
- Submit Command:
spark-submit --master yarn --deploy-mode cluster ...
Attempted Fixes
✅ Verified that Spark 3.5.0 and XGBoost 2.1.x are declared compatible
✅ Tested every patch release in the XGBoost 2.1.x series
❌ Adjusting the tracker port (tracker_conf) had no effect
❌ Increasing the tracker timeout (timeout parameter) had no effect
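Since neither the tracker port nor the timeout changed anything, we also wanted to rule out plain TCP reachability between the YARN containers and the driver. A minimal sketch using only the JDK (`canConnect` and the host/port values are hypothetical diagnostics, not an XGBoost API):

```scala
// Hypothetical diagnostic, not part of the pipeline: probe whether a given
// host/port (e.g. the tracker address printed in the driver log) is reachable
// at the TCP level from wherever this runs.
import java.net.{InetSocketAddress, Socket}

def canConnect(host: String, port: Int, timeoutMs: Int = 3000): Boolean = {
  val socket = new Socket()
  try {
    // Attempt a TCP connect with a bounded timeout.
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: java.io.IOException => false // refused, unreachable, or timed out
  } finally {
    socket.close()
  }
}

// Example: run from each executor against the driver host to rule out
// firewall/routing problems between YARN containers and the driver, e.g.
// sc.parallelize(1 to numWorkers).map(_ => canConnect("driver-host", trackerPort)).collect()
```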
Key Questions
- Is this a known issue with Spark 3.5.0’s network layer and XGBoost 2.1.x?
- Are there specific configurations required for XGBoost 2.1.x + Spark 3.5.0?
- Should we downgrade to Spark 3.1.x or wait for an XGBoost patch?
Attached log:
25/03/31 12:58:58 ERROR RabitTracker: ml.dmlc.xgboost4j.java.XGBoostError: [12:58:58] /workspace/src/collective/result.cc:78:
- [tracker.cc:286|12:58:58]: Failed to accept connection.
- [socket.h:89|12:58:58]: Invalid polling request.
Stack trace:
[bt] (0) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f9111d241ee]
[bt] (1) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(xgboost::collective::SafeColl(xgboost::collective::Result const&)+0x7e) [0x7f9111db9f7e]
[bt] (2) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(+0x2b435c) [0x7f9111d5235c]
[bt] (3) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(XGTrackerWaitFor+0x1ba) [0x7f9111d5384a]
[bt] (4) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_TrackerWaitFor+0x196) [0x7f911244e856]
[bt] (5) [0x7f91450186c7]
25/03/31 12:58:58 ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: [12:58:58] /workspace/src/collective/result.cc:78:
[tracker.cc:286|12:58:58]: Failed to accept connection.
[socket.h:89|12:58:58]: Invalid polling request.
Stack trace:
[bt] (0) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f9111d241ee]
[bt] (1) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(xgboost::collective::SafeColl(xgboost::collective::Result const&)+0x7e) [0x7f9111db9f7e]
[bt] (2) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(+0x2b435c) [0x7f9111d5235c]
[bt] (3) /data1/yarn/local/usercache/longyuan_antispam/appcache/application_1710209318993_110720296/container_e47_1710209318993_110720296_01_000002/tmp/libxgboost4j2624238525807620499.so(XGTrackerFree+0x15d) [0x7f9111d529bd]
[bt] (4) [0x7f91450186c7]
at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
at ml.dmlc.xgboost4j.java.RabitTracker.stop(RabitTracker.java:84)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.withTracker(XGBoost.scala:467)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:501)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:210)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:34)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:78)
at org.apache.spark.ml.Pipeline.$anonfun$fit$5(Pipeline.scala:151)
at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
at org.apache.spark.ml.Pipeline.$anonfun$fit$4(Pipeline.scala:151)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at org.apache.spark.ml.Pipeline.$anonfun$fit$2(Pipeline.scala:147)
at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
at org.apache.spark.ml.Pipeline.$anonfun$fit$1(Pipeline.scala:133)
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133)
at Test$.main(Test.scala:59)
at Test.main(Test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:738)