@@ -22,6 +22,15 @@ GPU-based training algorithm. We will introduce them in the following sections.
The feature is still experimental as of 2.0. The performance is not well optimized.
+ The external memory support has gone through multiple iterations and is still under heavy
+ development. Like the :py:class:`~xgboost.QuantileDMatrix` with
+ :py:class:`~xgboost.DataIter`, XGBoost loads data batch-by-batch using a custom iterator
+ supplied by the user. However, unlike the :py:class:`~xgboost.QuantileDMatrix`, external
+ memory will not concatenate the batches unless the GPU is used (it uses a hybrid approach;
+ more details follow). Instead, it will cache all batches in external memory and fetch them
+ on demand. Go to the end of the document to see a comparison between ``QuantileDMatrix``
+ and external memory.
+
*************
Data Iterator
*************
@@ -113,10 +122,11 @@ External memory is supported by GPU algorithms (i.e. when ``tree_method`` is set
``gpu_hist``). However, the algorithm used for GPU is different from the one used for
CPU. When training on a CPU, the tree method iterates through all batches from external
memory for each step of the tree construction algorithm. On the other hand, the GPU
- algorithm concatenates all batches into one and stores it in GPU memory. To reduce overall
- memory usage, users can utilize subsampling. The good news is that the GPU hist tree
- method supports gradient-based sampling, enabling users to set a low sampling rate without
- compromising accuracy.
+ algorithm uses a hybrid approach. At the beginning of each training iteration, it iterates
+ through the batches and concatenates them into a single block in GPU memory. To reduce
+ overall memory usage, users can utilize subsampling. The GPU hist tree method supports
+ `gradient-based sampling`, enabling users to set a low sampling rate without compromising
+ accuracy.

.. code-block:: python
@@ -134,6 +144,8 @@ see `this paper <https://arxiv.org/abs/2005.09148>`_.
When the GPU is running out of memory during iteration on external memory, the user might
receive a segfault instead of an OOM exception.
+ .. _ext_remarks:
+
*******
Remarks
*******
@@ -142,17 +154,64 @@ When using external memory with XGBoost, data is divided into smaller chunks so
a fraction of it needs to be stored in memory at any given time. It's important to note
that this method only applies to the predictor data (``X``), while other data, like labels
and internal runtime structures are concatenated. This means that memory reduction is most
- effective when dealing with wide datasets where ``X`` is larger compared to other data
- like ``y``, while it has little impact on slim datasets.
+ effective when dealing with wide datasets where ``X`` is significantly larger in size
+ compared to other data like ``y``, while it has little impact on slim datasets.
+
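+ As a hypothetical illustration of the wide-dataset case (the shapes below are made up for
+ the example), the predictor matrix dwarfs the labels, which is where external memory helps:
+
+ .. code-block:: python
+
+     import numpy as np
+
+     # A made-up "wide" dataset: 1M rows, 500 float32 features.
+     n_samples, n_features = 1_000_000, 500
+     x_bytes = n_samples * n_features * np.dtype(np.float32).itemsize
+     y_bytes = n_samples * np.dtype(np.float32).itemsize
+     print(f"X: {x_bytes / 1024**3:.2f} GiB, y: {y_bytes / 1024**2:.2f} MiB")
+     # X: 1.86 GiB, y: 3.81 MiB -- caching X externally accounts for nearly all the savings.
+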
+ As one might expect, fetching data on demand puts significant pressure on the storage
+ device. Today's computing devices can process far more data than a storage device can read
+ in a single unit of time; the gap is measured in orders of magnitude. A GPU is capable of
+ processing hundreds of gigabytes of floating-point data in a split second. On the other
+ hand, a four-lane NVMe device connected to a PCIe-4 slot usually has about 6GB/s of data
+ transfer rate. As a result, training is likely to be severely bounded by your storage
+ device. Before adopting the external memory solution, some back-of-envelope calculations
+ might help you see whether it's viable. For instance, suppose your NVMe drive can transfer
+ 4GB of data per second (a fairly practical number) and you have 100GB of data in the
+ compressed XGBoost cache (which corresponds to a dense float32 numpy array with a size of
+ 200GB, give or take). A tree with depth 8 needs at least 16 iterations through the data
+ when the parameters are right. You need about 14 minutes to train a single tree, without
+ accounting for other overheads and assuming the computation overlaps with the IO. If your
+ dataset happens to be TB-scale, then you might need thousands of trees to get a
+ generalized model. These calculations can help you get an estimate of the expected
+ training time.
+
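+ In other words, the back-of-envelope estimate boils down to a simple ratio, ignoring other
+ overheads and assuming the computation overlaps with the IO:
+
+ .. math::
+
+    T_{\text{tree}} \approx \frac{\text{cache size} \times \text{data passes per tree}}{\text{storage bandwidth}}
+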
+ However, we can sometimes ameliorate this limitation. One should also consider that the OS
+ (mostly the Linux kernel) can usually cache the data in host memory. It only evicts pages
+ when new data comes in and there's no room left. In practice, at least some portion of the
+ data can persist in host memory throughout the entire training session. We take this cache
+ into account when optimizing the external memory fetcher. The compressed cache is usually
+ smaller than the raw input data, especially when the input is dense without any missing
+ values. If the host memory can fit a significant portion of this compressed cache, then
+ the performance should be decent after initialization. Our development so far focuses on
+ two fronts of optimization for external memory:
+
+ - Avoid iterating through the data whenever appropriate.
+ - If the OS can cache the data, the performance should be close to in-core training.

Starting with XGBoost 2.0, the implementation of external memory uses ``mmap``. It is not
- yet tested against system errors like disconnected network devices (`SIGBUS`). Also, it's
- worth noting that most tests have been conducted on Linux distributions.
+ tested against system errors like disconnected network devices (`SIGBUS`). In the face of
+ a bus error, you will see a hard crash and need to clean up the cache files. If the
+ training session is likely to take a long time and you are using solutions like NVMe-oF,
+ we recommend checkpointing your model periodically. Also, it's worth noting that most
+ tests have been conducted on Linux distributions.
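+
+ One simple checkpointing scheme is to train in small chunks of boosting rounds and save
+ the booster between chunks. The snippet below is only a sketch of that idea: the ``Xy``
+ placeholder, the chunk size, and the file names are illustrative assumptions, not a
+ dedicated XGBoost checkpointing API.
+
+ .. code-block:: python
+
+     import xgboost as xgb
+
+     # Assume `Xy` is the external-memory DMatrix built from your custom
+     # xgboost.DataIter, as shown earlier in this document.
+     params = {"tree_method": "hist", "objective": "binary:logistic"}
+     rounds_per_checkpoint, total_rounds = 10, 100
+     booster = None
+     for done in range(0, total_rounds, rounds_per_checkpoint):
+         # Continue from the previous booster, then persist the model so that a
+         # crash loses at most `rounds_per_checkpoint` rounds of work.
+         booster = xgb.train(
+             params, Xy, num_boost_round=rounds_per_checkpoint, xgb_model=booster
+         )
+         booster.save_model(f"checkpoint-{done + rounds_per_checkpoint}.json")
+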
- Another important point to keep in mind is that creating the initial cache for XGBoost may
- take some time. The interface to external memory is through custom iterators, which may or
- may not be thread-safe. Therefore, initialization is performed sequentially.
+ Another important point to keep in mind is that creating the initial cache for XGBoost may
+ take some time. The interface to external memory is through custom iterators, which we
+ cannot assume to be thread-safe. Therefore, initialization is performed sequentially. Using
+ :py:func:`~xgboost.config_context` with `verbosity=2` can give you some information on what
+ XGBoost is doing during the wait if you don't mind the extra output.
+
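+ For example, a minimal sketch (assuming ``it`` is the custom :py:class:`~xgboost.DataIter`
+ instance from the Data Iterator section above):
+
+ .. code-block:: python
+
+     import xgboost as xgb
+
+     # `it` is assumed to be your custom xgboost.DataIter instance.
+     with xgb.config_context(verbosity=2):
+         # Cache construction happens here; the higher verbosity prints what
+         # XGBoost is doing while it iterates through the batches sequentially.
+         Xy = xgb.DMatrix(it)
+
+     booster = xgb.train({"tree_method": "hist"}, Xy)
+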
+ *******************************
+ Compared to the QuantileDMatrix
+ *******************************
+
+ Passing an iterator to the :py:class:`~xgboost.QuantileDMatrix` enables direct
+ construction of the ``QuantileDMatrix`` from data chunks. On the other hand, if it's
+ passed to the :py:class:`~xgboost.DMatrix`, it instead enables the external memory
+ feature. The :py:class:`~xgboost.QuantileDMatrix` concatenates the data in memory after
+ compression and doesn't fetch data during training. On the other hand, the external memory
+ ``DMatrix`` fetches data batches from external memory on demand. Use the
+ ``QuantileDMatrix`` (with an iterator if necessary) when you can fit most of your data in
+ memory. The training would be an order of magnitude faster than using external memory.
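+
+ A minimal sketch of the two construction paths (again assuming ``it`` is your custom
+ :py:class:`~xgboost.DataIter` instance):
+
+ .. code-block:: python
+
+     import xgboost as xgb
+
+     # `it` is assumed to be your custom xgboost.DataIter instance.
+
+     # In-core: batches are quantized, compressed, and concatenated in memory;
+     # nothing is fetched from disk during training.
+     Xy = xgb.QuantileDMatrix(it)
+
+     # External memory: batches are cached on disk and fetched on demand during
+     # training, trading speed for a much smaller memory footprint.
+     # Xy = xgb.DMatrix(it)
+
+     booster = xgb.train({"tree_method": "hist"}, Xy, num_boost_round=10)
+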
****************
Text File Inputs