PDEP-18: Nullable Object Dtype #61599
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,175 @@ | ||
# PDEP-18: Nullable Object Dtype for Pandas | ||
|
||
- Created: 07 June 2025 | ||
- Status: Draft | ||
- Discussion: [#32931](https://github.com/pandas-dev/pandas/issues/32931) | ||
- Author: [Simon Hawkins](https://github.com/simonjayhawkins) | ||
- Revision: 1 | ||
|
||
## Abstract | ||
|
||
This proposal outlines the introduction of a nullable object | ||
dtype to the pandas library. The goal is to provide a | ||
dedicated dtype for handling arbitrary Python objects with | ||
consistent missing value semantics using `pd.NA`. Unlike the | ||
traditional `object` dtype, which lacks robust missing data | ||
handling, this new nullable dtype will add clarity and | ||
consistency in representing missing or undefined values | ||
within object arrays. | ||
|
||
## Motivation | ||
|
||
Currently, the `object` dtype in pandas is a catch-all for | ||
heterogeneous Python objects, but it does not enforce any | ||
particular missing-value semantics. As pandas has evolved to | ||
include extension types (like `string[python]`, `Int64`, or | ||
`boolean`), there is a clear benefit in extending these | ||
improvements to the object datatype. A nullable object dtype | ||
would help: | ||
- **Consistency**: Enforce a uniform approach to managing | ||
missing values with `pd.NA` across all dtypes. | ||
- **Interoperability**: Enable cleaner and more predictable | ||
behavior when performing operations on data previously | ||
stored as generic objects. | ||
- **Clarity**: Help users distinguish between truly “object” | ||
data and data that is better represented by a nullable | ||
container supporting missing values. | ||
|
||
This proposal is driven by frequent community discussions | ||
and development efforts that aim to unify missing value | ||
handling across pandas data types. | ||
|
||
## Detailed Proposal | ||
|
||
### Definition | ||
|
||
The proposal introduces a new extension type, tentatively | ||
named `"object_nullable"`, that stores an underlying array | ||
of Python objects alongside a boolean mask that indicates | ||
missing (i.e., `pd.NA`) values. The API should mimic that of | ||
existing extension arrays, ensuring that missing value | ||
propagation, casting, and arithmetic comparisons (where | ||
applicable) behave consistently with other nullable types. | ||
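
The values-plus-mask layout described above can be sketched roughly as follows. This is a minimal illustration only, not the proposed implementation: the class name `MaskedObjectSketch` and its methods are hypothetical, and the rule for what counts as missing is an assumption modeled on the constructors of the existing nullable dtypes.

```python
import numpy as np
import pandas as pd


class MaskedObjectSketch:
    """Hypothetical illustration of values-plus-mask storage; not pandas API."""

    def __init__(self, values):
        values = list(values)
        # Object-dtype buffer, filled element by element so that nested
        # containers (lists, tuples, dicts) are stored as single items.
        self._data = np.empty(len(values), dtype=object)
        self._mask = np.zeros(len(values), dtype=bool)
        for i, value in enumerate(values):
            if self._is_missing(value):
                self._mask[i] = True  # the data slot is left as None
            else:
                self._data[i] = value

    @staticmethod
    def _is_missing(value):
        # Assumption: None, pd.NA, pd.NaT and float NaN all count as missing,
        # mirroring the constructors of the existing nullable dtypes.
        if value is None or value is pd.NA or value is pd.NaT:
            return True
        return isinstance(value, float) and value != value

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        # Scalar access only; masked positions surface as pd.NA.
        return pd.NA if self._mask[index] else self._data[index]

    def isna(self):
        return self._mask.copy()


arr = MaskedObjectSketch([{"a": 1}, None, float("nan"), (1, 2)])
```

Because the mask alone decides what is missing, whatever sits in a masked data slot is never exposed to the user. A real extension array would additionally need the full `ExtensionArray` interface (slicing, `take`, concatenation, and so on).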
|
||
### Key Features | ||
1. **Consistent Missing Value Semantics**: | ||
- Missing entries will be represented by `pd.NA`, | ||
ensuring compatibility with pandas nullable dtypes that | ||
use `pd.NA` as the missing value indicator as well as | ||
the experimental `ArrowDtype`. | ||
- Operations that encounter missing values will handle | ||
`pd.NA` consistently with the other pandas nullable | ||
dtypes that use `pd.NA` as the missing value indicator. | ||
2. **Underlying Data Storage**: | ||
- The core data structure will consist of a NumPy array | ||
of Python objects and an associated boolean mask. This is | ||
similar to the current `object`-backed nullable string | ||
array variant that uses `pd.NA` as the missing value. | ||
- Consideration should be given to performance, ensuring | ||
that operations remain as vectorized as possible despite | ||
the inherent overhead of handling Python objects. | ||
3. **API Integration**: | ||
- The new dtype will implement the ExtensionArray | ||
interface. | ||
- Methods such as `astype`, `isna`, `fillna`, and | ||
element-wise operations are already defined to respect | ||
missing values in the other pandas nullable dtypes. | ||
- All operations on a nullable object array will return | ||
a pandas nullable array unless another dtype is explicitly | ||
requested, such as with `astype`. Methods like `fillna` | ||
would still return a nullable object array even when no | ||
missing values remain, to avoid introducing | ||
mixed-propagation behavior. | ||
- Ensure compatibility with pandas functions, like | ||
groupby, concatenation, and merging, where the semantics | ||
of missing values are critical. | ||
4. **Transition and Interoperability**: | ||
- Users should be able to convert from the legacy object | ||
dtype to `object_nullable` through the existing API, | ||
using a constructor or an explicit method (e.g., | ||
`pd.array(old_array, dtype="object_nullable")`). | ||
Comment on lines +87 to +90
Could one use
||
- Operations on existing pandas nullable dtypes that | ||
examples? the only one that comes to mind is concat.
the str.split issue came to mind when I wrote this.
That's #61303. It appears to be fixed on main but not bisected to find where this was changed.
||
would normally produce an object dtype should be updated | ||
(or made configurable as a transition path) to yield | ||
"object_nullable" in all cases even when missing values | ||
are not present to avoid introducing mixed-propagation | ||
behavior. | ||
- `ArrowDtype` does not offer an `object` dtype for | ||
heterogeneous Python objects and therefore a user | ||
requesting arrow dtypes could be given "object_nullable" | ||
arrays where appropriate to avoid mixed `pd.NA`/`np.nan` | ||
semantics when using `dtype_backend="pyarrow"`. | ||
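
The "mixed-propagation behavior" mentioned above can be illustrated with dtypes that already exist. In this small sketch, `Int64` stands in for the proposed dtype (an assumption, since `"object_nullable"` is not implemented) and the variable names are illustrative only.

```python
import pandas as pd

legacy = pd.Series([1, None], dtype=object)
nullable = pd.Series([1, None], dtype="Int64")

# Legacy object dtype: the comparison yields a plain NumPy bool result,
# so the missing entry silently compares as False.
legacy_eq = legacy == 1

# Nullable dtype: the comparison yields the nullable "boolean" dtype and
# the missing entry propagates as pd.NA.
nullable_eq = nullable == 1

# Filling every missing value keeps the nullable dtype rather than
# falling back to a NumPy dtype.
filled = nullable.fillna(0)

print(legacy_eq.dtype, nullable_eq.dtype, filled.dtype)
```

A pipeline that switches between the two kinds of result depending on whether missing values happen to be present is what the proposal aims to avoid by always returning `"object_nullable"`.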
|
||
|
||
### Implementation Considerations | ||
1. **Performance**: | ||
- Handling arbitrary Python objects is inherently slower | ||
than operations on native numerical types. | ||
- Expanding the EA interface to 2D is outside the scope | ||
of this PDEP. | ||
|
||
2. **Backward Compatibility**: | ||
- Existing code that uses the traditional object dtype | ||
should not break. (Making the pandas nullable object | ||
dtype the default is not part of this proposal and would | ||
be discussed in conjunction with moving the other pandas | ||
nullable dtypes to be default.) | ||
- Existing code that uses the pandas nullable dtypes | ||
should not break without warnings, even though they are | ||
considered experimental, as these dtypes have been | ||
available to users for a long time. The new dtype can be | ||
offered as an opt-in feature initially. | ||
3. **Testing and Documentation**: | ||
- Extensive tests will be required to validate behavior | ||
against edge cases. | ||
- Updated documentation should explain differences | ||
between the legacy object dtype and object_nullable, | ||
including examples and migration tips. | ||
4. **Community Feedback**: | ||
- Continuous discussions on GitHub, mailing lists, and | ||
related channels will inform refinements. The nullable | ||
object dtype should be available as opt-in for at least | ||
two minor versions to allow sufficient time for feedback | ||
before the return types of the existing pandas nullable | ||
dtypes are changed. | ||
|
||
## Alternatives Considered | ||
- Continuing with the Legacy Object Dtype: | ||
- Retaining the ambiguous missing value semantics of the | ||
legacy object dtype does not provide a robust and | ||
consistent solution that aligns with the design of other | ||
extension arrays. | ||
- Not having a nullable object dtype could potentially | ||
am i right in thinking this is the underlying motivation?
I tried not to "sell" the concept too much until I get a temperature check from the other core devs. I'm happy to remove this. polars has a nullable object type and some contributors seem to see polars as a threat or maybe just prefer us to match polars behavior. That could be a motivation if one is needed. Ideally I would want to avoid object dtype where possible, PDEP-10 does that for many new datatypes. I opened this as it was looking like we will be rejecting PDEP-10 and if we have object dtype I think we should have a nullable one.
||
be a blocker for a potential future nullable by default | ||
policy. | ||
|
||
## Drawbacks and Future Directions | ||
1. **Overhead Cost**: | ||
The additional memory required for a boolean mask and | ||
possible performance penalties in highly heterogeneous | ||
arrays are acknowledged trade-offs. | ||
2. **Integration Complexity**: | ||
Ensuring seamless integration with the full suite of pandas | ||
functionality may reveal edge cases that require careful | ||
handling. | ||
3. **Incompatibility**: | ||
The existing object array can hold any Python object, even | ||
`pd.NA` itself. The proposed nullable object array will be | ||
unable to hold `np.nan`, `None` or `pd.NaT` as these will be | ||
considered missing in the constructors and other conversions | ||
when following the existing API for the other nullable | ||
types. Users will not be able to round-trip between the | ||
legacy and nullable object dtypes. | ||
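
The round-trip limitation can be seen with today's dtypes. The following is a sketch under the assumption that the proposed constructor would normalise missing-like values the same way the existing nullable constructors do; `Int64` is used as a stand-in because `"object_nullable"` does not exist yet.

```python
import numpy as np
import pandas as pd

# Legacy object dtype: each missing-like value is stored as the object it is,
# so pd.NA, None and np.nan remain distinguishable.
legacy = pd.Series([pd.NA, None, np.nan], dtype=object)
stored_types = [type(value).__name__ for value in legacy]

# Existing nullable dtype (stand-in for the proposal): the constructor treats
# None and np.nan as missing, so the original objects cannot be recovered.
converted = pd.array([1, None, np.nan], dtype="Int64")
missing_flags = converted.isna()

print(stored_types)
print(converted)
```

Once the values have passed through a nullable constructor, there is no record of whether a missing entry started out as `None`, `np.nan`, or `pd.NaT`, which is why converting back to the legacy dtype cannot restore the original array.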
|
||
## Conclusion | ||
Introducing a nullable object dtype in pandas will offer | ||
clearer semantics for missing values and align the behavior | ||
of object arrays with other nullable types. This proposal is | ||
aimed at fostering discussion and soliciting community | ||
feedback to refine the design and implementation roadmap. | ||
|
||
|
||
|
||
## PDEP-18 History | ||
|
||
- 07 June 2025: Initial version. |
are there links for any of these? off the top of my head i dont recall any users asking for this in particular
I plan to list the issues in the OP before officially opening the discussion period, but yes this does need references. Thanks for highlighting this.
for the following part of the sentence
i could potentially reference PDEP-16?