[R] R arrow cannot handle labelled data in arrow tables #45601

Open
EinMaulwurf opened this issue Feb 21, 2025 · 2 comments
Labels
Component: R Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug

Comments

@EinMaulwurf

Describe the bug, including details regarding any error messages, version, and platform.

Arrow seems to have issues with labelled data, such as data imported from Stata datasets.

Filtering labelled data in an Arrow table does not work and throws an error. So far, so good, I can deal with that.

library(haven)
library(arrow)
library(tibble)
library(dplyr)

d <- tibble(
  a = labelled(x = 1:5, label = "example variable a"),
  b = labelled(x = 11:15, label = "example variable b")
)

d
#> # A tibble: 5 × 2
#>   a         b        
#>   <int+lbl> <int+lbl>
#> 1 1         11       
#> 2 2         12       
#> 3 3         13       
#> 4 4         14       
#> 5 5         15

d %>%
  as_arrow_table() %>%
  filter(a > 3) %>%
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! NotImplemented: Function 'greater' has no kernel matching input types (<labelled<integer>[0]>: example variable a, <labelled<integer>[0]>: example variable a)

But when the final collect() is left out, so that the query is only printed rather than executed, the R session crashes completely:

d %>%
  as_arrow_table() %>%
  filter(a > 5)
# R crashes....

Component(s)

R

@amoeba amoeba changed the title R arrow cannot handle labelled data in arrow tables [R] R arrow cannot handle labelled data in arrow tables Feb 21, 2025
@amoeba amoeba added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Feb 21, 2025
@amoeba
Member

amoeba commented Feb 21, 2025

Thanks for the report. This needs to be fixed. The crash happens when we just try to print the query:

    --->8--- 
    frame #6: 0x000000011b7ea404 arrow.so`arrow::internal::InvalidValueOrDie(arrow::Status const&) + 244
    frame #7: 0x000000011b7ef470 arrow.so`arrow::Scalar::ToString() const + 1008
    frame #8: 0x000000011ba76668 arrow.so`arrow::compute::Expression::ToString() const::$_4::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) const + 260
    frame #9: 0x000000011ba75a38 arrow.so`arrow::compute::Expression::ToString() const + 1008
    frame #10: 0x000000011b267b6c arrow.so`_arrow_compute___expr__ToString + 108
    --->8--- 

and with extra error context turned on I see:

ValueOrDie called on an error: NotImplemented: construction from scalar of type <labelled[0]>: example variable a

We could probably also make arrow work seamlessly with labelled data frames. Is that what you were expecting when you ran into this? We could drop the labels when converting to an Arrow table, which would be functionally equivalent to this:

> d %>%
+   zap_labels() %>%
+   as_arrow_table() %>%
+   filter(a > 3) %>%
+   collect()
# A tibble: 2 × 2
      a     b
* <int> <int>
1     4    14
2     5    15

@EinMaulwurf
Author

Yes, exactly. I don't care about the labels, but I could not use open_dataset() to query my data because of them. I had to load the data into RAM and remove the labels as you showed.
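
For anyone hitting the same problem, one possible way to keep using open_dataset() is to strip the labels once and store the result as Parquet, so later sessions can query it lazily. This is only a sketch under assumptions: the tibble stands in for data that would normally come from a Stata import, and the directory and file names are hypothetical. The one-time conversion still has to load the source data into memory.

```r
library(haven)
library(arrow)
library(dplyr)
library(tibble)

# Stand-in for labelled data that would normally come from a Stata import
d <- tibble(
  a = labelled(x = 1:5, label = "example variable a"),
  b = labelled(x = 11:15, label = "example variable b")
)

# Hypothetical output location for the label-free copy
parquet_dir <- file.path(tempdir(), "unlabelled_dataset")
dir.create(parquet_dir, showWarnings = FALSE)

# One-time conversion: drop value labels and variable labels, then write Parquet
d %>%
  zap_labels() %>%
  zap_label() %>%
  write_parquet(file.path(parquet_dir, "part-0.parquet"))

# Later queries run against the Parquet files without loading them fully into RAM
result <- open_dataset(parquet_dir) %>%
  filter(a > 3) %>%
  collect()
```

Here zap_label() is applied in addition to zap_labels() because zap_labels() leaves the variable label attribute in place.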

Having open_dataset() or as_arrow_table() drop the labels would be great. Maybe just add a message that labels were dropped, so users know what's going on.

Oh, and there are also different kinds of labels (see here), so it would probably be best to deal with them all at once.
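
Until something like this exists in arrow, the requested behaviour can be approximated on the user side. The following is a minimal sketch, not arrow's or haven's API: as_arrow_table_unlabelled() is a hypothetical helper name, and it assumes that removing value labels, variable labels, format attributes, and display widths with haven's zap_*() functions is enough to leave plain columns. Other kinds of haven metadata may need extra handling.

```r
library(haven)
library(arrow)
library(dplyr)
library(tibble)

# Hypothetical helper (not part of arrow or haven): strips haven metadata,
# tells the user which columns were affected, then converts to an Arrow table.
as_arrow_table_unlabelled <- function(df) {
  labelled_cols <- names(df)[vapply(df, is.labelled, logical(1))]
  if (length(labelled_cols) > 0) {
    message(
      "Dropping haven labels from: ",
      paste(labelled_cols, collapse = ", ")
    )
  }
  df %>%
    zap_labels() %>%   # value labels (and the labelled class itself)
    zap_label() %>%    # variable labels
    zap_formats() %>%  # Stata/SPSS/SAS format attributes
    zap_widths() %>%   # display width attributes
    as_arrow_table()
}

# Example with both value labels and a variable label on column a
d <- tibble(
  a = labelled(
    x = 1:5,
    labels = c(low = 1L, high = 5L),
    label = "example variable a"
  ),
  b = labelled(x = 11:15, label = "example variable b")
)

result <- as_arrow_table_unlabelled(d) %>%
  filter(a > 3) %>%
  collect()
```

The message is emitted only when at least one column is labelled, so data frames without labels pass through silently.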
