BigQuery billing to be scrutinised by the team #10

Open
siyara-m-yral opened this issue Feb 17, 2025 · 1 comment
Comments

@siyara-m-yral

  • We reviewed all the heavy queries; the user-facing query is a vector search
  • Almost all of the cost came from a single day
  • Query frequency was driving the increase in cost
  • Hence, the high cost is attributed to backfilling
@jay-dhanwant-yral
Contributor

Most of the cost came from a one-day backfill, as shown in the image below.
[Image: BigQuery billing breakdown showing most of the cost concentrated in a single day]

However, while investigating this issue we found another potential risk, caused by the query below:

        SELECT base.uri, base.post_id, base.canister_id, base.timestamp, distance FROM
        VECTOR_SEARCH(
            (
            SELECT * FROM `hot-or-not-feed-intelligence.yral_ds.video_index` 
            WHERE uri NOT IN <watch_history>
            AND is_nsfw = False AND nsfw_ec = 'neutral'
            AND post_id is not null 
            AND canister_id is not null 
            AND TIMESTAMP_TRUNC(TIMESTAMP(SUBSTR(timestamp, 1, 26)), MICROSECOND) > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
            ),
            'embedding',
            (
            SELECT embedding
            FROM `hot-or-not-feed-intelligence.yral_ds.video_index`
            WHERE uri IN <watch_history>
            AND is_nsfw = False AND nsfw_ec = 'neutral'
            AND post_id is not null
            AND canister_id is not null
            ),
            top_k => 12,
            options => '{"fraction_lists_to_search":0.6}' -- CAUTION: This is high at the moment owing to the sparsity of the data; as and when we have a good number of recent uploads, this must be lowered!

        )
        ORDER BY distance 

This query scans ~7.3 GB of data. The cause is the parameter fraction_lists_to_search being set to 0.6.

fraction_lists_to_search: This is a number that specifies the percentage of lists to search. For example, options => '{"fraction_lists_to_search":0.15}'. The fraction_lists_to_search value must be in the range 0.0 to 1.0, exclusive.
Specifying a higher percentage leads to higher recall and slower performance, and the converse is true when specifying a lower percentage.
Since the diversity of our existing data is already low, a lower fraction_lists_to_search would further reduce recall. That is why fraction_lists_to_search is currently high.
Once we have more data to index, we can reduce the cost by safely reducing this parameter.
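As a rough sanity check for that plan, here is a back-of-envelope sketch of how bytes scanned (and on-demand cost) could shrink as we lower the parameter. It assumes scanned bytes scale roughly linearly with the fraction of index lists searched, anchors on the observed ~7.3 GB at 0.6, and uses an assumed on-demand rate of $6.25/TiB; all of these are estimates, not figures from BigQuery itself.

```python
# Back-of-envelope model: assumes bytes scanned by VECTOR_SEARCH scale
# roughly linearly with fraction_lists_to_search (an approximation), and
# an assumed on-demand price of $6.25 per TiB (verify against current
# BigQuery pricing before relying on it).

OBSERVED_BYTES = 7.3 * 1024**3  # ~7.3 GB observed at fraction 0.6
OBSERVED_FRACTION = 0.6
PRICE_PER_TIB_USD = 6.25        # assumed on-demand rate

def estimated_scan(fraction: float) -> tuple[float, float]:
    """Return (estimated GB scanned, estimated USD per query)."""
    if not 0.0 < fraction < 1.0:
        # BigQuery docs: value must be in (0.0, 1.0), exclusive.
        raise ValueError("fraction_lists_to_search must be in (0.0, 1.0)")
    bytes_scanned = OBSERVED_BYTES * fraction / OBSERVED_FRACTION
    cost_usd = bytes_scanned / 1024**4 * PRICE_PER_TIB_USD
    return bytes_scanned / 1024**3, cost_usd

for f in (0.6, 0.3, 0.15):
    gb, usd = estimated_scan(f)
    print(f"fraction={f:<5} ~{gb:.2f} GB scanned, ~${usd:.4f}/query")
```

Under this linear assumption, dropping the fraction from 0.6 to 0.15 would cut the per-query scan to roughly a quarter; the real saving should be confirmed with a `--dry_run` before changing the parameter in production.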
