[RFC] Possible DMatrix refactor #4354

Description

@RAMitchell

As we add different types of algorithms to xgboost, and as the performance of these algorithms improves, the DMatrix class may also need to evolve in order to improve performance and memory usage for these particular cases.

Current DMatrix
The current DMatrix class has two subclasses. The first is the primary in-memory representation, where the entire dataset is ingested and converted into CSR format with values stored as 32-bit floats. The entire dataset may additionally be transposed in memory into a column format for exact-style algorithms.
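
For reference, a minimal sketch of this kind of CSR layout (field names are illustrative, not the actual xgboost definitions):

```cpp
#include <cstddef>
#include <vector>

// Illustrative CSR container, roughly what the in-memory DMatrix holds.
struct CsrMatrix {
  std::vector<std::size_t> row_ptr;  // row i spans row_ptr[i]..row_ptr[i+1]
  std::vector<unsigned> col_idx;     // column index of each stored entry
  std::vector<float> values;         // 32-bit float value of each entry

  std::size_t NumRows() const { return row_ptr.size() - 1; }
};
```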

The second is the external-memory representation, which constructs a number of binary page files (32MB in size by default) that are streamed from disk. If a column format is requested, new transposed pages are generated and these are streamed from disk as well.

The end user does not directly choose the type of DMatrix, and these subclasses are not exposed via external APIs. The underlying DMatrix type is selected automatically according to whether the user supplies a cache prefix in the constructor.

Currently a batch of data is stored inside the DMatrix as a 'SparsePage' class. This uses HostDeviceVector in order to expose data to both the CPU and GPU.
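
In simplified form (the real classes differ in detail; HostDeviceVector lazily mirrors its contents between host and device memory):

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for xgboost's HostDeviceVector; only the host side
// is sketched here. The real class also manages a device copy.
template <typename T>
class HostDeviceVector {
 public:
  std::vector<T>& HostVector() { return host_; }
 private:
  std::vector<T> host_;
};

// One stored feature entry: feature index plus 32-bit float value.
struct Entry {
  unsigned index;
  float fvalue;
};

// A batch of rows in CSR form, visible to both CPU and GPU code.
struct SparsePage {
  HostDeviceVector<std::size_t> offset;  // CSR row offsets
  HostDeviceVector<Entry> data;          // entries for all rows in the batch
};
```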

Current fast histogram algorithms
The current fast histogram algorithms (hist, gpu_hist) convert the DMatrix object on the fly inside their respective TreeUpdater implementations, not as part of the DMatrix implementation, although I believe this was eventually intended to be integrated with DMatrix (#1673). 'hist' converts the data into a CSR matrix with integers instead of floats. 'gpu_hist' converts the data into an ELLPACK format with integers instead of floats and additionally applies bitwise compression at run time to both the matrix elements and the indices, commonly resulting in 4-5x compression over the standard CSR matrix. In gpu_hist we avoid ever copying the training DMatrix to the GPU if prediction caching is used.
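
To make the compression step concrete, the sketch below shows the basic idea: each feature value is replaced by the index of the quantile bin it falls into, and those small integers can then be bit-packed (names are illustrative, not the actual gpu_hist code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Map a raw feature value to its quantile-bin index. 'cuts' holds the
// sorted bin boundaries for one feature, produced by the sketching step.
std::uint32_t BinIndex(float value, const std::vector<float>& cuts) {
  return static_cast<std::uint32_t>(
      std::upper_bound(cuts.begin(), cuts.end(), value) - cuts.begin());
}

// With 256 bins a bin index fits in 8 bits, so four entries pack into the
// space of one 32-bit float -- the source of the ~4x memory saving.
std::uint32_t PackFour(std::uint8_t a, std::uint8_t b,
                       std::uint8_t c, std::uint8_t d) {
  return std::uint32_t{a} | (std::uint32_t{b} << 8) |
         (std::uint32_t{c} << 16) | (std::uint32_t{d} << 24);
}
```

The actual compressed buffer packs symbols at arbitrary bit widths rather than whole bytes, which is how the 4-5x figure is reached in practice.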

Some desirable features for DMatrix
Here I will list some features that we would like to have. Not all of these will be practical under the same design.

  • The ability to train entirely on a compressed, quantised DMatrix. We could save roughly 4x memory and be much more competitive with LightGBM on memory usage.
  • Support for dense matrices. Where appropriate, a dense format would directly halve memory usage, since CSR stores a column index alongside every value.
  • DMatrix as a thin wrapper over an already allocated data matrix, e.g. a numpy ndarray. Given a dense numpy input, we currently copy it into DMatrix format, resulting in 3n memory usage; simply using the underlying numpy array as our storage would use 1n memory (a sketch of such an adapter follows this list).
  • Support for ingest of data from GPU memory (see [WIP] cuDF integration into XGBoost #3997)
  • Support for external memory for all of the above
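
As a sketch of the thin-wrapper idea above, an adapter could simply view caller-owned dense data rather than copying it (names are hypothetical):

```cpp
#include <cstddef>

// Hypothetical zero-copy adapter: views the buffer behind an existing
// dense matrix (e.g. a numpy ndarray) without taking ownership.
class DenseAdapter {
 public:
  DenseAdapter(const float* data, std::size_t rows, std::size_t cols)
      : data_(data), rows_(rows), cols_(cols) {}

  // Row-major indexing, matching numpy's default layout.
  float GetElement(std::size_t row, std::size_t col) const {
    return data_[row * cols_ + col];
  }
  std::size_t Rows() const { return rows_; }
  std::size_t Cols() const { return cols_; }

 private:
  const float* data_;  // not owned; lifetime is managed by the caller
  std::size_t rows_, cols_;
};
```

The same adapter shape would extend naturally to device pointers for the GPU-memory ingest case.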

Notes on possible implementation

  • The current iterators used to access DMatrix could be made into virtual functions, abstracting away the underlying memory format (a sketch follows this list). There could be a significant performance penalty here from virtual function lookup.
  • The question then is how the histogram learning algorithms access the underlying integer representation, if it exists. Do we provide another set of iterators specifically for quantised data? What happens if these iterators are called but the underlying data is stored as floats? We can either raise an error telling the user to construct a different type of DMatrix for this particular learning algorithm, or generate the required representation on the fly at some silent performance cost.
  • Another possible solution is to make the DMatrix a completely 'lazy' object that holds a pointer to an external data source and defers constructing an internal representation until it knows the optimal format, based on whichever learning algorithm is chosen. This solution has the downside of introducing complicated state to the DMatrix object.
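
A minimal sketch of the virtual-iterator idea from the first bullet (all names hypothetical):

```cpp
#include <memory>

struct SparsePage;  // batch type, as sketched earlier

// Batch access goes through a virtual interface so callers never see the
// underlying storage format (CSR, ELLPACK, quantised, external memory...).
class BatchIterator {
 public:
  virtual ~BatchIterator() = default;
  virtual bool Next() = 0;                        // advance; false at the end
  virtual const SparsePage& Current() const = 0;  // view of the current batch
};

class DMatrixInterface {
 public:
  virtual ~DMatrixInterface() = default;
  // Each storage backend returns its own iterator implementation.
  virtual std::unique_ptr<BatchIterator> RowBatches() = 0;
};
```

Because the virtual dispatch happens once per batch rather than once per element, the lookup overhead is amortised over whole pages, which may keep the performance penalty acceptable.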

I realise the above is somewhat rambling and does not propose a concrete solution, but I hope to start discussion on these ideas.

@tqchen @hcho3 @trivialfis @CodingCat
