BigQuery billing to be scrutinised by the team #10

Open
siyara-m-yral opened this issue Feb 17, 2025 · 1 comment
Comments

@siyara-m-yral

  • We reviewed all the heavy queries; the user-facing query is a vector search
  • Almost all of the cost came from a single day
  • Query frequency was driving the increase in cost
  • Hence, the high cost is attributed to backfilling
@jay-dhanwant-yral
Contributor

Most of the cost came from a one-day backfill, as shown in the image below.
[Image: BigQuery billing breakdown showing most of the cost concentrated in a single day]

However, while investigating this issue we found another potential risk, caused by the query below:

        SELECT base.uri, base.post_id, base.canister_id, base.timestamp, distance FROM
        VECTOR_SEARCH(
            (
            SELECT * FROM `hot-or-not-feed-intelligence.yral_ds.video_index` 
            WHERE uri NOT IN <watch_history>
            AND is_nsfw = False AND nsfw_ec = 'neutral'
            AND post_id is not null 
            AND canister_id is not null 
            AND TIMESTAMP_TRUNC(TIMESTAMP(SUBSTR(timestamp, 1, 26)), MICROSECOND) > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
            ),
            'embedding',
            (
            SELECT embedding
            FROM `hot-or-not-feed-intelligence.yral_ds.video_index`
            WHERE uri IN <watch_history>
            AND is_nsfw = False AND nsfw_ec = 'neutral'
            AND post_id is not null
            AND canister_id is not null
            ),
            top_k => 12,
            options => '{"fraction_lists_to_search":0.6}' -- CAUTION: This is high at the moment owing to the sparsity of the data; as and when we have a good number of recent uploads, this must be lowered!

        )
        ORDER BY distance 

This query scans ~7.3 GB of data. The cause is the parameter fraction_lists_to_search being set to 0.6.

fraction_lists_to_search: This is a number that specifies the percentage of lists to search. For example, options => '{"fraction_lists_to_search":0.15}'. The fraction_lists_to_search value must be in the range 0.0 to 1.0, exclusive.
Specifying a higher percentage leads to higher recall and slower performance, and the converse is true when specifying a lower percentage.
Since the diversity of our existing data is already low, a lower fraction_lists_to_search would further reduce recall. That is why fraction_lists_to_search is currently high.
Once we have more data to index, we can reduce the cost by safely reducing this parameter.
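As a rough sanity check for that plan, here is a back-of-envelope sketch of how bytes scanned (and on-demand cost) could shrink as we lower the parameter. It assumes scanned bytes scale roughly linearly with the fraction of index lists searched, anchors on the observed ~7.3 GB at 0.6, and uses an assumed on-demand rate of $6.25/TiB; all of these are estimates, not figures from BigQuery itself.

```python
# Back-of-envelope model: assumes bytes scanned by VECTOR_SEARCH scale
# roughly linearly with fraction_lists_to_search (an approximation), and
# an assumed on-demand price of $6.25 per TiB (verify against current
# BigQuery pricing before relying on it).

OBSERVED_BYTES = 7.3 * 1024**3  # ~7.3 GB observed at fraction 0.6
OBSERVED_FRACTION = 0.6
PRICE_PER_TIB_USD = 6.25        # assumed on-demand rate

def estimated_scan(fraction: float) -> tuple[float, float]:
    """Return (estimated GB scanned, estimated USD per query)."""
    if not 0.0 < fraction < 1.0:
        # BigQuery docs: value must be in (0.0, 1.0), exclusive.
        raise ValueError("fraction_lists_to_search must be in (0.0, 1.0)")
    bytes_scanned = OBSERVED_BYTES * fraction / OBSERVED_FRACTION
    cost_usd = bytes_scanned / 1024**4 * PRICE_PER_TIB_USD
    return bytes_scanned / 1024**3, cost_usd

for f in (0.6, 0.3, 0.15):
    gb, usd = estimated_scan(f)
    print(f"fraction={f:<5} ~{gb:.2f} GB scanned, ~${usd:.4f}/query")
```

Under this linear assumption, dropping the fraction from 0.6 to 0.15 would cut the per-query scan to roughly a quarter; the real saving should be confirmed with a `--dry_run` before changing the parameter in production.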
