GH-45619: [Python] Use f-string instead of string.format #45629

Open. Wants to merge 7 commits into main.


chilin0525 commented Feb 25, 2025

Rationale for this change

See #45619.

What changes are included in this PR?

Refactor to use f-strings instead of string.format. F-strings are not used in the following case, though: string.format lets a template be defined once and filled in with different arguments later, which keeps it reusable.

_read_table_docstring = """
{0}

Parameters
----------
source : str, pyarrow.NativeFile, or file-like object
    If a string passed, can be a single file name or directory name. For
    file-like objects, only read a single file. Use pyarrow.BufferReader to
    read a file contained in a bytes or buffer-like object.
columns : list
    If not None, only these columns will be read from the file. A column
    name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
    'a.c', and 'a.d.e'. If empty, no columns will be read. Note
    that the table will still have the correct num_rows set despite having
    no columns.
use_threads : bool, default True
    Perform multi-threaded column reads.
schema : Schema, optional
    Optionally provide the Schema for the parquet dataset, in which case it
    will not be inferred from the source.
{1}
filesystem : FileSystem, default None
    If nothing passed, will be inferred based on path.
    Path will try to be found in the local on-disk filesystem otherwise
    it will be parsed as an URI to determine the filesystem.
filters : pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]], default None
    Rows which do not match the filter predicate will be removed from scanned
    data. Partition keys embedded in a nested directory structure will be
    exploited to avoid loading files at all if they contain no matching rows.
    Within-file level filtering and different partitioning schemes are supported.
{3}
use_legacy_dataset : bool, optional
    Deprecated and has no effect from PyArrow version 15.0.0.
ignore_prefixes : list, optional
    Files matching any of these prefixes will be ignored by the
    discovery process.
    This is matched to the basename of a path.
    By default this is ['.', '_'].
    Note that discovery happens only if a directory is passed as source.
pre_buffer : bool, default True
    Coalesce and issue file reads in parallel to improve performance on
    high-latency filesystems (e.g. S3). If True, Arrow will use a
    background I/O thread pool. If using a filesystem layer that itself
    performs readahead (e.g. fsspec's S3FS), disable readahead for best
    results.
coerce_int96_timestamp_unit : str, default None
    Cast timestamps that are stored in INT96 format to a particular
    resolution (e.g. 'ms'). Setting to None is equivalent to 'ns'
    and therefore INT96 timestamps will be inferred as timestamps
    in nanoseconds.
decryption_properties : FileDecryptionProperties or None
    File-level decryption properties.
    The decryption properties can be created using
    ``CryptoFactory.file_decryption_properties()``.
thrift_string_size_limit : int, default None
    If not None, override the maximum total string size allocated
    when decoding Thrift structures. The default limit should be
    sufficient for most Parquet files.
thrift_container_size_limit : int, default None
    If not None, override the maximum total size of containers allocated
    when decoding Thrift structures. The default limit should be
    sufficient for most Parquet files.
page_checksum_verification : bool, default False
    If True, verify the checksum for each page read from the file.

Returns
-------
{2}
{4}
"""

Are these changes tested?

Via CI.

Are there any user-facing changes?

No.

kou (Member) commented Feb 26, 2025

We have many string.format calls under https://github.com/apache/arrow/tree/main/python/pyarrow . If you want to work on this step by step (for example, one PR per file), could you open a sub-issue for each PR instead of associating all PRs with GH-45619? (GitHub recently added sub-issue features.)

chilin0525 marked this pull request as draft February 26, 2025 00:51

chilin0525 (Author) commented

@kou Thank you for the reminder🙏. I personally prefer to implement all changes within a single PR, so I am converting the PR to draft status.

chilin0525 marked this pull request as ready for review March 1, 2025 13:58

chilin0525 (Author) commented

I have now updated all the files under the pyarrow folder. As discussed in #45619, templates that are reused across multiple methods via string.format are left as they are and are not converted to f-strings.
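
For the call sites that were converted, the change follows the usual pattern sketched below. The function names and messages are hypothetical examples, not lines taken from this PR. Both versions produce the same string when the values are already available at the point of formatting.

```python
def describe_with_format(name, num_rows):
    # Before: positional placeholders filled in by str.format
    return "Table {0} has {1} rows".format(name, num_rows)


def describe_with_fstring(name, num_rows):
    # After: the same message written as an f-string
    return f"Table {name} has {num_rows} rows"


def show_repr_with_format(value):
    # Before: the !r conversion applied through str.format
    return "Got unexpected value {!r}".format(value)


def show_repr_with_fstring(value):
    # After: the !r conversion works the same way inside an f-string
    return f"Got unexpected value {value!r}"
```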
