GH-45619: [Python] Use f-string instead of string.format #45629

chilin0525 · 2025-02-25T17:31:17Z

Rationale for this change

What changes are included in this PR?

Refactor to use f-strings instead of string.format. The following case is left unchanged, because string.format allows the parameters to be passed in later, which keeps the template reusable:

arrow/python/pyarrow/parquet/core.py

Lines 1624 to 1695 in 0fbf982

    _read_table_docstring = """
    {0}

    Parameters
    ----------
    source : str, pyarrow.NativeFile, or file-like object
        If a string passed, can be a single file name or directory name. For
        file-like objects, only read a single file. Use pyarrow.BufferReader to
        read a file contained in a bytes or buffer-like object.
    columns : list
        If not None, only these columns will be read from the file. A column
        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
        that the table will still have the correct num_rows set despite having
        no columns.
    use_threads : bool, default True
        Perform multi-threaded column reads.
    schema : Schema, optional
        Optionally provide the Schema for the parquet dataset, in which case it
        will not be inferred from the source.
    {1}
    filesystem : FileSystem, default None
        If nothing passed, will be inferred based on path.
        Path will try to be found in the local on-disk filesystem otherwise
        it will be parsed as an URI to determine the filesystem.
    filters : pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]], default None
        Rows which do not match the filter predicate will be removed from scanned
        data. Partition keys embedded in a nested directory structure will be
        exploited to avoid loading files at all if they contain no matching rows.
        Within-file level filtering and different partitioning schemes are supported.

        {3}
    use_legacy_dataset : bool, optional
        Deprecated and has no effect from PyArrow version 15.0.0.
    ignore_prefixes : list, optional
        Files matching any of these prefixes will be ignored by the
        discovery process.
        This is matched to the basename of a path.
        By default this is ['.', '_'].
        Note that discovery happens only if a directory is passed as source.
    pre_buffer : bool, default True
        Coalesce and issue file reads in parallel to improve performance on
        high-latency filesystems (e.g. S3). If True, Arrow will use a
        background I/O thread pool. If using a filesystem layer that itself
        performs readahead (e.g. fsspec's S3FS), disable readahead for best
        results.
    coerce_int96_timestamp_unit : str, default None
        Cast timestamps that are stored in INT96 format to a particular
        resolution (e.g. 'ms'). Setting to None is equivalent to 'ns'
        and therefore INT96 timestamps will be inferred as timestamps
        in nanoseconds.
    decryption_properties : FileDecryptionProperties or None
        File-level decryption properties.
        The decryption properties can be created using
        ``CryptoFactory.file_decryption_properties()``.
    thrift_string_size_limit : int, default None
        If not None, override the maximum total string size allocated
        when decoding Thrift structures. The default limit should be
        sufficient for most Parquet files.
    thrift_container_size_limit : int, default None
        If not None, override the maximum total size of containers allocated
        when decoding Thrift structures. The default limit should be
        sufficient for most Parquet files.
    page_checksum_verification : bool, default False
        If True, verify the checksum for each page read from the file.

    Returns
    -------
    {2}

    {4}
    """
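To illustrate why a template like this stays on string.format, here is a minimal sketch. It assumes a shortened, hypothetical template and made-up summary strings, not the real pyarrow text. A template with positional placeholders can be filled in several times with different arguments, whereas an f-string is evaluated once, at the point where it is written.

```python
# Hypothetical, shortened stand-in for a shared docstring template.
# The names and text below are illustrative only, not taken from pyarrow.
_doc_template = """
{0}

Parameters
----------
source : str
    Path to read.
{1}
Returns
-------
{2}
"""

# The same template is filled in twice with different arguments.
read_table_doc = _doc_template.format(
    "Read a Table from Parquet format.",
    "extra : bool\n    Hypothetical extra parameter.\n",
    "table : Table",
)
read_pandas_doc = _doc_template.format(
    "Read a DataFrame from Parquet format.",
    "",
    "frame : DataFrame",
)
```

Writing the template as an f-string would require every placeholder value to exist where the template is defined, so it could not be shared between the two callers.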

Are these changes tested?

Via CI.

Are there any user-facing changes?

No.

GitHub Issue: [Python] Use f-string instead of string.format #45619

kou · 2025-02-26T00:00:46Z

We have many uses of string.format in https://github.com/apache/arrow/tree/main/python/pyarrow . If you want to work on this step by step (for example, 1 PR per file), could you open a sub-issue for each PR instead of associating all PRs with GH-45619? (GitHub recently added sub-issue features.)

chilin0525 · 2025-02-26T01:52:43Z

@kou Thank you for the reminder🙏. I personally prefer to implement all changes within a single PR, so I am converting the PR to draft status.

chilin0525 · 2025-03-01T14:01:37Z

I have already changed all the files under the pyarrow folder. As discussed in #45619, certain scenarios where the template is reused across multiple methods using string.format will not be refactored to f-strings.
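For reference, the mechanical part of the refactor looks roughly like the sketch below. The variable names and messages are hypothetical and are not taken from the diff.

```python
# Hypothetical values, used only for illustration.
name = "column_a"
size = 3

# Before: str.format with positional arguments.
old_message = "Column {} has {} rows".format(name, size)
old_repr = "Unknown field {0!r}".format(name)

# After: the equivalent f-strings, evaluated immediately in place.
new_message = f"Column {name} has {size} rows"
new_repr = f"Unknown field {name!r}"

# Both spellings produce identical strings.
assert old_message == new_message
assert old_repr == new_repr
```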

Refactor using f-string instead of string.format

66b3a5a

github-actions bot added Component: Python and awaiting review labels Feb 25, 2025

chilin0525 marked this pull request as draft February 26, 2025 00:51

Refactor using f-string instead of string.format

f5b0933

github-actions bot added Component: FlightRPC and Component: Gandiva labels Feb 28, 2025

chilin0525 added 5 commits March 1, 2025 02:55

Merge branch 'main' into using-f-string-instead-string-format

a5c1112

Rollback _filesystem_uri to test fail testcase on CI

4b46338

Refactor using f-string instead of string.format

02cbd03

Refactor using f-string instead of string.format

e63da8b

Refactor using f-string instead of string.format

edab1db

chilin0525 marked this pull request as ready for review March 1, 2025 13:58

chilin0525 requested review from wjones127 and lidavidm as code owners March 1, 2025 13:58


