Description
Problem
The function wr.athena.read_sql_query
returns a dataframe with zero columns whenever the SQL query returns zero rows. In other words, the column schema gets dropped for empty query results.
Note: This only happens when ctas_approach=True
. If ctas_approach=False
, all columns (and their corresponding data types) are preserved in the outputted dataframe, which is the desired behavior.
Use Case
In the real world, empty queries happen. An empty query isn't necessarily a bug in my application. It may simply represent an absence of data. For example, suppose I'm querying sales data. Suppose I drill down into a particular item. If this item was unpopular and never sold, then the query may (correctly) return zero rows. Ideally, Wrangler shouldn't wipe out the column schema in this scenario.
In a larger pipeline, the absence of columns can cause downstream transformations to fail, e.g. selects, joins, etc.
Proposed Solution
When ctas_approach
is enabled, how is the dataframe's column schema determined? I've only skimmed the code, but I suspect it's determined from the underlying parquet's schema. This is a reasonable approach. However, if the parquet doesn't exist, e.g. Athena CTAS results in zero rows, then our dataframe ends up with an empty schema.
I was playing around in AWS Console. I notice that if an Athena CTAS yields zero rows, the resultant (empty) Glue table still has accurate column metadata. So, perhaps a back-up plan could be implemented for this edge case: If no parquet exists, use the Glue schema (rather than parquet schema) to determine the dataframe's column structure.
Replication Steps
import awswrangler as wr
import boto3
import sys
print(f'python version = {sys.version}')
print(f'boto3 version = {boto3.__version__}')
print(f'awswrangler version = {wr.__version__}')
sql = '''
with sample_data as
(
select 'A' as col1, 101 as col2, 1.01 as col3 union
select 'B' as col1, 102 as col2, 1.02 as col3 union
select 'C' as col1, 103 as col2, 1.03 as col3
)
select * from sample_data where 0 = 1
'''
df = wr.athena.read_sql_query(
sql=sql,
database='MY_DATABASE',
ctas_approach=True,
)
print(f'shape = {df.shape}')
This code produces the following output on my machine:
python version = 3.7.7 (tags/v3.7.7:d7c567b08f, Mar 10 2020, 10:41:24) [MSC v.1900 64 bit (AMD64)]
boto3 version = 1.17.55
awswrangler version = 2.7.0
shape = (0, 0)
Notice the shape is (0, 0)
instead of (0, 3)
.