[SPARK-3162] [MLlib] Add local tree training for decision tree regressors #19433
Conversation
…calTreeDataSuite):
* TrainingInfo: primary local tree training data structure, contains all information required to describe state of algorithm at any point during learning
* FeatureVector: Stores data for an individual feature as an Array[Int]
…oth local & distributed training:
* AggUpdateUtils: Helper methods for updating sufficient stats for a given node
* ImpurityUtils: Helper methods for impurity-related calculations during node split decisions
* SplitUtils: Helper methods for choosing splits given sufficient stats
NOTE: Both ImpurityUtils and SplitUtils primarily contain code taken from RandomForest.scala, with slight modifications. Tests for SplitUtils are contained in the next commit.
* TreeSplitUtilsSuite: Test suite for SplitUtils
* TreeTests: Add utility method (getMetadata) for TreeSplitUtilsSuite
Also add methods used by these tests in LocalDecisionTree.scala, RandomForest.scala.
…lit calculations
@WeichenXu123 would you be able to take an initial look at this?
val numFeatures = rowStore(0).length
require(numFeatures > 0, "Local decision tree training requires numFeatures > 0.")
// Return the transpose of the rowStore matrix
0.until(numFeatures).map { colIdx =>
TODO: replace this with an in-place matrix transpose for memory efficiency.
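For reference, a minimal sketch of the row-to-column transpose that the snippet above performs (illustrative code, not the PR's actual implementation; the rowToColumnStore name and the Array[Int] element type are assumptions). The TODO refers to avoiding the extra per-feature allocation below:

```scala
// Transpose a row-major dataset into one array per feature (column-major layout).
// Allocates a fresh array per feature; an in-place transpose would avoid this extra memory.
def rowToColumnStore(rowStore: Seq[Array[Int]]): Array[Array[Int]] = {
  require(rowStore.nonEmpty, "Local decision tree training requires at least one row.")
  val numFeatures = rowStore.head.length
  require(numFeatures > 0, "Local decision tree training requires numFeatures > 0.")
  val numRows = rowStore.length
  val columns = Array.ofDim[Int](numFeatures, numRows)
  var rowIdx = 0
  rowStore.foreach { row =>
    var colIdx = 0
    while (colIdx < numFeatures) {
      columns(colIdx)(rowIdx) = row(colIdx)
      colIdx += 1
    }
    rowIdx += 1
  }
  columns
}
```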
@smurching Is this still WIP? If done, remove "[WIP]" and I will begin review, thanks!
Thanks! I'll remove the WIP. To clear things up for the future, I'd thought [WIP] was the appropriate tag for a PR that's ready for review but not ready to be merged (based on https://spark.apache.org/contributing.html) -- have we stopped using the WIP tag?
add to whitelist
Test build #82557 has finished for PR 19433 at commit
The failing tests come from this PR handling a) splits that have 0 gain differently from b) splits that fail to achieve the user-specified minimum gain. Previously we'd create a leaf node with valid impurity stats in case a) and invalid impurity stats in case b); this PR creates a leaf node with invalid impurity stats in both cases. As a fix I'd suggest returning valid impurity stats in case a), which will keep the process of determining split validity simple.
…ranspose in LocalDecisionTreeUtils. Changes made to fix tests:
* Return correct impurity stats for splits that achieved a gain of 0 but didn't violate user-specified constraints on min info gain or min instances per node
* Previously, ImpurityStats.impurity was set incorrectly in ImpurityStats.getInvalidImpurityStats(), requiring a correction in LearningNode.toNode. This commit fixes the issue by directly setting impurity = -1 in getInvalidSplits()
Test build #82570 has finished for PR 19433 at commit
The failing SparkR test is due to the following: in this PR we recompute parent node impurity stats when considering each split for a feature, instead of computing parent impurity stats once per feature. Repeatedly computing parent impurity stats results in slightly different impurity values at each iteration due to Double precision limitations. This in turn can cause different splits to be selected (e.g. if two splits have mathematically equal gains, Double precision limitations can cause one split to have a larger/smaller gain than the other, influencing tiebreaking).
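As a small self-contained illustration of the Double precision point above (not code from the PR):

```scala
// Mathematically equal sums computed in different orders can differ in the last bit,
// which is enough to flip a tie-break between two otherwise-equal split gains.
val summedLeftToRight = 0.1 + 0.2 + 0.3   // 0.6000000000000001
val summedRightToLeft = 0.3 + 0.2 + 0.1   // 0.6
println(summedLeftToRight == summedRightToLeft)  // false
println(summedLeftToRight > summedRightToLeft)   // true: a ">" comparison now breaks the "tie"
```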
…ats during best split selection
Test build #82652 has finished for PR 19433 at commit
I made a rough pass. I have only a few issues for now; I haven't gone into code details:
Thanks for the comments!
Sorry, realized I conflated feature subsampling and |
Test build #82717 has finished for PR 19433 at commit
Test build #82721 has finished for PR 19433 at commit
I made a deeper review pass. Later I will put more thoughts on the columnar feature storage design. Thanks!
// gives us the split bit value for each instance based on the instance's index.
// We copy our feature values into @tempVals and @tempIndices either:
// 1) in the [from, numLeftRows) range if the bit is false, or
// 2) in the [numBitsNotSet, to) range if the bit is true.
Although numLeftRows == numBitsNotSet, it is better to keep them the same in the doc.
Will change this, thanks for the catch!
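To make the partitioning described in the quoted comment concrete, here is a minimal self-contained sketch (names and signature are illustrative, not the PR's actual helper):

```scala
// Stably partition values(from until to) by a per-position split decision:
// positions whose bit is false land in the first numBitsNotSet slots of the range,
// positions whose bit is true land in the remaining slots, preserving relative order.
def partitionByBits(values: Array[Int], from: Int, to: Int, bits: Array[Boolean]): Unit = {
  val length = to - from
  val temp = new Array[Int](length)
  val numBitsNotSet = (from until to).count(i => !bits(i))
  var leftPos = 0                // next slot for a "left child" (bit unset) value
  var rightPos = numBitsNotSet   // next slot for a "right child" (bit set) value
  var i = from
  while (i < to) {
    if (!bits(i)) { temp(leftPos) = values(i); leftPos += 1 }
    else { temp(rightPos) = values(i); rightPos += 1 }
    i += 1
  }
  Array.copy(temp, 0, values, from, length)
}
```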
// Filter out leaf nodes from the previous iteration
val activeNonLeafs = activeNodes.zipWithIndex.filterNot(_._1.isLeaf)
// Iterate over the active nodes in the current level.
activeNonLeafs.flatMap { case (node: LearningNode, nodeIndex: Int) =>
The var names activeNodes and activeNonLeafs are not accurate, I think. Here the activeNodes are actually "next level nodes", including "probably splittable nodes (active nodes)" and "leaf nodes".
val activeNodes: Array[LearningNode] =
  computeBestSplits(trainingInfo, labels, metadata, splits)
// Filter active node periphery by impurity.
val estimatedRemainingActive = activeNodes.count(_.stats.impurity > 0.0)
Use activeNodes.count(_.isLeaf) instead, to make the code simpler. And as mentioned above, activeNodes would be better renamed to nextLevelNodes.
Agreed on using isLeaf instead of checking for positive impurity, thanks for the suggestion.
AFAICT at this point in the code activeNodes actually does refer to the nodes in the current level; the children of nodes in activeNodes are the nodes in the next level, and are returned by computeBestSplits. I forgot to include the return type of computeBestSplits in its method signature, which probably made this more confusing -- my mistake.
Yes. Sorry for confusing you. The change I suggested was:

val nextLevelNodes: Array[LearningNode] =
  computeBestSplits(trainingInfo, labels, metadata, splits)

Does it look more reasonable?
And change the member name in trainingInfo: TrainingInfo.activeNodes ==> TrainingInfo.currentLevelNodes
Gotcha, agreed on the naming change; how about currentLevelActiveNodes? Since only the non-leaf nodes from the current level are included.
Wait... I checked the code here: trainingInfo = trainingInfo.update(splits, activeNodes). So it seems you do not filter out the leaf nodes from the activeNodes (which is actually the nextLevelNodes I mentioned above). So I think TrainingInfo.activeNodes may still contain leaf nodes.
Oh true -- I'll reword the doc for currentLevelActiveNodes to say:

* @param currentLevelActiveNodes Nodes which are active (could still be split).
*        Inactive nodes are known to be leaves in the final tree.
*/
private[impl] case class TrainingInfo(
    columns: Array[FeatureVector],
    instanceWeights: Array[Double],
The instanceWeights are never updated between iterations, so why put them in the TrainingInfo structure?
Good call, I'll move instanceWeights outside TrainingInfo.
*/
private[impl] def updateParentImpurity(
    statsAggregator: DTStatsAggregator,
    col: FeatureVector,
Actually, updateParentImpurity has no relation to any particular feature column; you pass in the feature column here only to use its indices array, so passing any feature column would work. But this looks weird; maybe it could be better designed.
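A rough sketch of the kind of refactoring being suggested, passing only the row indices rather than a whole feature column (the trait and method names here are hypothetical, not Spark's actual API):

```scala
// Hypothetical aggregator interface: only the parent-update operation matters for this sketch.
trait ParentStatsAggregator {
  def updateParent(label: Double, instanceWeight: Double): Unit
}

// Update parent-node sufficient stats for rows indices(from until to); no feature column needed.
def updateParentImpurity(
    agg: ParentStatsAggregator,
    rowIndices: Array[Int],
    from: Int,
    to: Int,
    labels: Array[Double],
    instanceWeights: Array[Double]): Unit = {
  var i = from
  while (i < to) {
    val row = rowIndices(i)
    agg.updateParent(labels(row), instanceWeights(row))
    i += 1
  }
}
```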
    label: Double,
    featureIndex: Int,
    featureIndexIdx: Int,
    splits: Array[Array[Split]],
You only need to pass in the featureSplit: Array[Split], don't pass all splits for all features.
Good call, I'll make this change.
    from: Int,
    to: Int,
    split: Split,
    allSplits: Array[Array[Split]]): BitSet = {
Ditto, you only need to pass in the featureSplit: Array[Split], don't pass all splits for all features.
@smurching I found some issues and have some thoughts on the columnar features format:
* Move instanceWeights outside TrainingInfo
* Only pass a single array of splits (instead of an array of arrays of splits) when possible
Test build #83464 has finished for PR 19433 at commit
jenkins retest this please
Test build #83503 has finished for PR 19433 at commit
Test build #83507 has finished for PR 19433 at commit
CC @dbtsai in case you're interested b/c of Sequoia forests
Test build #3983 has finished for PR 19433 at commit
Done with a pass over the parts which refactor elements of RandomForest.scala into utility classes. Will review more after updates!
    agg: DTStatsAggregator,
    featureValue: Int,
    label: Double,
    featureIndex: Int,
featureIndex is not used
private[impl] def getNonConstantFeatures(
    metadata: DecisionTreeMetadata,
    featuresForNode: Option[Array[Int]]): Seq[(Int, Int)] = {
  Range(0, metadata.numFeaturesPerNode).map { featureIndexIdx =>
Was there a reason to remove the use of view and withFilter here? With the output of this method going through further Seq operations, I would expect the previous implementation to be more efficient.
At some point when refactoring I was hitting errors caused by a stateful operation within a map over the output of this method (IIRC the result of the map was accessed repeatedly, causing the stateful operation to inadvertently be run multiple times).
However using withFilter and view now seems to work, I'll change it back :)
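A simplified, self-contained sketch of the lazy approach being discussed (this is not the PR's exact getNonConstantFeatures; the numSplits parameter stands in for DecisionTreeMetadata, and plain filter is used where the real code can use withFilter to fuse with the downstream map):

```scala
// Lazily enumerate (featureIndexIdx, featureIndex) pairs, skipping constant features
// (those with zero candidate splits). Using a view avoids materializing an intermediate
// collection; downstream operations pull elements through the map/filter on demand.
def nonConstantFeatures(
    numFeaturesPerNode: Int,
    featuresForNode: Option[Array[Int]],
    numSplits: Int => Int): Iterable[(Int, Int)] = {
  Range(0, numFeaturesPerNode).view
    .map { featureIndexIdx =>
      // Translate the per-node feature slot to a global feature index
      // (identity when feature subsampling is not used).
      val featureIndex = featuresForNode.map(_(featureIndexIdx)).getOrElse(featureIndexIdx)
      (featureIndexIdx, featureIndex)
    }
    .filter { case (_, featureIndex) => numSplits(featureIndex) != 0 }
}
```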
// Cumulative sum (scanLeft) of bin statistics.
// Afterwards, binAggregates for a bin is the sum of aggregates for
// that bin + all preceding bins.
assert(!binAggregates.metadata.isUnordered(featureIndex))
Remove this (If there's any chance of this, then we should find ways to test it.)
val featureValue = categoriesSortedByCentroid(splitIndex)
val leftChildStats =
  binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
val rightChildStats =
This line can be moved outside of the map. Actually, this is the parentCalc, right? So if it's not available, parentCalc can be computed beforehand outside of the map.
Exactly, it's the parentCalc minus the left child stats. Since ImpurityCalculator.subtract() updates the impurity calculator in place, we call binAggregates.getParentImpurityCalculator() to get a copy of the parent impurity calculator, then subtract the left child stats.
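A toy, self-contained illustration of that copy-then-subtract pattern (StatsVector is a stand-in defined here, not Spark's ImpurityCalculator):

```scala
// Stand-in for an impurity calculator whose subtract() mutates the receiver in place.
class StatsVector(val stats: Array[Double]) {
  def copy: StatsVector = new StatsVector(stats.clone())
  def subtract(other: StatsVector): StatsVector = {
    var i = 0
    while (i < stats.length) { stats(i) -= other.stats(i); i += 1 }
    this
  }
}

val parentStats = new StatsVector(Array(10.0, 4.0))    // e.g. label counts at the parent node
val leftChildStats = new StatsVector(Array(6.0, 1.0))
// Copy the parent stats first (subtract mutates the receiver), then remove the left child's
// contribution to obtain the right child's stats: Array(4.0, 3.0).
val rightChildStats = parentStats.copy.subtract(leftChildStats)
```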
// Unordered categorical feature
val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
val numSplits = binAggregates.metadata.numSplits(featureIndex)
var parentCalc = parentCalculator
It'd be nice to calculate the parentCalc right away here, if needed. That seems possible just by taking the first candidate split. Then we could simplify calculateImpurityStats by not passing in parentCalc as an option.
val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, categoryStats)
(featureValue, centroid)
}
// TODO(smurching): How to handle logging statements like these?
What's the issue? You should be able to call logDebug if this object inherits from org.apache.spark.internal.Logging
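For context, a minimal sketch of what that could look like (assuming the object lives inside the org.apache.spark namespace, since Logging is package-private to Spark; the object and method names here are hypothetical):

```scala
package org.apache.spark.ml.tree.impl

import org.apache.spark.internal.Logging

// Mixing in Spark's internal Logging trait provides logDebug / logInfo / logWarning, etc.
private[impl] object LocalTreeLoggingSketch extends Logging {
  def reportSkippedCategories(numSkipped: Int): Unit = {
    logDebug(s"Skipped $numSkipped categories during centroid computation.")
  }
}
```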
    node: LearningNode): (Split, ImpurityStats) = {
  val validFeatureSplits = getNonConstantFeatures(binAggregates.metadata, featuresForNode)
  // For each (feature, split), calculate the gain, and select the best (feature, split).
  val parentImpurityCalc = if (node.stats == null) None else Some(node.stats.impurityCalculator)
Note to check: Will node.stats == null for the top level for sure?
I believe so, the nodes at the top level are created (RandomForest.scala:178) with LearningNode.emptyNode, which sets node.stats = null.
I could change this to check node depth (via node index), but if we're planning on deprecating node indices in the future it might be best not to.
@@ -112,7 +113,7 @@ private[spark] object ImpurityStats {
   * minimum number of instances per node.
   */
  def getInvalidImpurityStats(impurityCalculator: ImpurityCalculator): ImpurityStats = {
-   new ImpurityStats(Double.MinValue, impurityCalculator.calculate(),
+   new ImpurityStats(Double.MinValue, impurity = -1,
Q: Why -1 here?
I changed this to be -1 here since node impurity would eventually get set to -1 anyways when LearningNodes with invalid ImpurityStats were converted into decision tree leaf nodes (see LearningNode.toNode).
…ately reflect what the method actually does). Switch back to view, withFilter in getNonConstantFeatures
…e the map call in chooseUnorderedCategoricalSplit, orderedSplitHelper
Test build #83873 has finished for PR 19433 at commit
Test build #83874 has finished for PR 19433 at commit
Test build #97977 has finished for PR 19433 at commit
Test build #101549 has finished for PR 19433 at commit
Test build #101588 has finished for PR 19433 at commit
Is this still a thing you are actively working on?
Thank you for your contribution! We've used this code extensively as a basis for our @cisco/oraf library, which incorporates local training into the existing decision tree and random forest APIs, and managed to significantly speed up the training process.
That's cool @rstarosta. Does having it in a library meet the needs of folks so that we can close this PR?
Test build #110569 has finished for PR 19433 at commit
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Overview
This PR adds local tree training for decision tree regressors as a first step for addressing SPARK-3162: train decision trees locally when possible.
See this design doc (in particular the local tree training section) for detailed discussion of the proposed changes.
Distributed training logic has been refactored but only minimally modified; the local tree training implementation leverages existing distributed training logic for computing impurities and splits. This shared logic has been refactored into ...Utils objects (e.g. SplitUtils.scala, ImpurityUtils.scala).
How to Review
Each commit in this PR adds non-overlapping functionality, so the PR can be reviewed commit-by-commit.
Changes introduced by each commit:
1. Local tree training data structures (FeatureVector, TrainingInfo)
2. Utility methods for impurity and split calculations shared by local & distributed training (SplitUtils, ImpurityUtils, AggUpdateUtils), largely copied from existing distributed training code in RandomForest.scala
3. Tests for the split utility methods (TreeSplitUtilsSuite)
4. Update RandomForest.scala to depend on the utility methods introduced in 2.
5. Local decision tree training logic (LocalDecisionTree)
6. Local tree training test suites (LocalTreeUnitSuite, LocalTreeIntegrationSuite)
How was this patch tested?
No existing tests were modified. The following new tests were added (also described above):
* Tests for local tree training data structures and utilities (LocalTreeDataSuite, LocalTreeUtilsSuite)
* Tests for split utility methods (TreeSplitUtilsSuite)
* Unit tests for local tree training (LocalTreeUnitSuite)
* Integration tests for local tree training (LocalTreeIntegrationSuite)