Test importing the parquet export #2038

Open
manuelwedler opened this issue Mar 25, 2025 · 5 comments
@manuelwedler
Contributor

It would be nice to try importing the parquet export and see whether we run into any issues. We may learn something from it and can improve the documentation accordingly.

Another thing I'm thinking of here is that someone may want to run Sourcify based on the export. However, a database imported from the parquet files wouldn't include the functions and constraints. Some documentation on how to add these as well would be good.

@halcyonet

halcyonet commented Apr 2, 2025

Hello @manuelwedler, I would like to contribute to this issue. Could you please assign it to me?

@manuelwedler
Contributor Author

Hello @halcyonet, sure, you can take this on if you like. You would need to do the following:

During that process it would be good to document the commands you used for the import and any problems you faced. Especially, we need to:

Let me know if that works for you. We can help you in the process.

@manuelwedler manuelwedler changed the title from "Test parquet export" to "Test importing the parquet export" Apr 30, 2025
@kuzdogan
Member

Hi @halcyonet, are you still working on this?

@kuzdogan kuzdogan marked this as a duplicate of #2126 Apr 30, 2025
@marcocastignoli
Member

I remember @clonker already did something for this in Python. We probably want a script in TS/JS, but we can take inspiration from his code :)

@clonker
Member

clonker commented Apr 30, 2025

This is what I used. It creates an SQLite DB, though. It can probably be improved somewhat: for example, it will store the entire DB twice in the file, which is unfortunate but, I imagine, easy to fix.

Code
import sqlite3

import pandas as pd
import requests


class ANSI:
    """ANSI escape codes for colored terminal output."""
    red = "\033[0;31m"
    green = "\033[0;32m"
    reset = "\033[0m"
    gray = "\033[1;30m"
    cyan = "\033[0;36m"


# The manifest maps each table name ("kind") to the list of its parquet files.
manifest = requests.get('https://export.sourcify.dev/manifest.json').json()

db = sqlite3.connect(database='sourcify.sqlite')
cursor = db.cursor()
# Bookkeeping table: records imported files so that a re-run can skip them.
cursor.execute('create table if not exists manifest (kind text, file text)')

failed = []
for kind, files in manifest['files'].items():
    for file in files:
        # Bind the values as parameters instead of interpolating them into the SQL.
        cursor.execute('select rowid from manifest where kind = ? and file = ?', (kind, file))
        print(f"Fetching {ANSI.cyan}{file}{ANSI.reset}: ", end='', flush=True)
        if cursor.fetchone():
            print(f"{ANSI.gray}SKIP{ANSI.reset}")
            continue
        try:
            df = pd.read_parquet(f"https://export.sourcify.dev/{file}", storage_options={"User-Agent": "pandas"})
            # Append the rows to the table named after the kind.
            df.to_sql(kind, db, index=False, if_exists='append')
            cursor.execute('insert into manifest values(?,?)', (kind, file))
            db.commit()
            print(f"{ANSI.green}DONE{ANSI.reset}")
        except Exception as exc:
            # Keep going, but show why the file failed (Ctrl-C still interrupts).
            print(f"{ANSI.red}FAIL{ANSI.reset} ({exc})")
            failed.append(file)

if failed:
    print(f"{ANSI.red}Failed: {ANSI.cyan}{','.join(failed)}{ANSI.reset}")

db.close()
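Regarding the missing constraints mentioned at the top of the thread: as far as I know, `DataFrame.to_sql` creates plain columns without primary keys, unique constraints, or indexes, so these have to be added after the import. Below is a minimal sketch for SQLite, assuming the table and column names used in the queries in this thread (`compiled_contracts.id`, `compiled_contracts_sources.compilation_id`, `sources.source_hash`). The index names and the `add_lookup_indexes` helper are hypothetical, and the in-memory tables only stand in for the imported file.

```python
import sqlite3

# Hypothetical stand-in for the imported file. Table and column names are
# assumptions taken from the queries in this thread, not the full schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table compiled_contracts (id text, compiler text, version text);
    create table compiled_contracts_sources (compilation_id text, source_hash text, path text);
    create table sources (source_hash text, content text);
""")


def add_lookup_indexes(conn):
    """Create the indexes a plain import leaves out. Safe to call repeatedly."""
    conn.executescript("""
        create unique index if not exists idx_compiled_contracts_id
            on compiled_contracts (id);
        create index if not exists idx_ccs_compilation_id
            on compiled_contracts_sources (compilation_id);
        create unique index if not exists idx_sources_hash
            on sources (source_hash);
    """)


add_lookup_indexes(conn)

# List the index names that now exist on the sources table.
index_names = sorted(row[1] for row in conn.execute("pragma index_list(sources)"))
print(index_names)
```

SQLite refuses to create a unique index while duplicate values exist, so running this against a real import also acts as a duplicate check. Use a plain index instead if duplicates are expected.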

A little extra on how to interact with it:

Code
import sqlite3


class SourcifyDB:
    """Thin query helper around the SQLite file created by the import script."""

    def __init__(self, filename):
        self._db = sqlite3.connect(database=filename)
        self.cursor = self._db.cursor()

    def contract_ids_by_compiler_and_version(self, compiler, version):
        # `version` is matched with LIKE, so it may contain % and _ wildcards.
        rows = self.cursor.execute(
            'select id from compiled_contracts where compiler = ? and version like ?;',
            (compiler, version),
        ).fetchall()
        return [row[0] for row in rows]

    def source_hash_and_path_from_contract_id(self, contract_id):
        rows = self.cursor.execute(
            'select source_hash, path from compiled_contracts_sources where compilation_id = ?;',
            (contract_id,),
        ).fetchall()
        source_hashes = [row[0] for row in rows]
        paths = [row[1] for row in rows]
        return source_hashes, paths

    def contract_content(self, source_hash):
        row = self.cursor.execute(
            'select content from sources where source_hash = ?;', (source_hash,)
        ).fetchone()
        # Return None instead of raising if no source has this hash.
        return row[0] if row else None

    def __enter__(self):
        return self

    def __exit__(self, *args, **kw):
        # Close the connection when leaving the `with` block.
        self._db.close()
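The three lookups above can also be combined into a single join, which saves a round trip per contract. This is only a sketch: the `sources_for_compiler` function is hypothetical, the table and column names are the ones used in the queries above, and the fixture rows are made up purely to make the example runnable.

```python
import sqlite3


def sources_for_compiler(conn, compiler, version_pattern):
    """Return (contract_id, path, content) rows for all matching compilations."""
    query = """
        select cc.id, ccs.path, s.content
        from compiled_contracts as cc
        join compiled_contracts_sources as ccs on ccs.compilation_id = cc.id
        join sources as s on s.source_hash = ccs.source_hash
        where cc.compiler = ? and cc.version like ?
        order by cc.id, ccs.path
    """
    return conn.execute(query, (compiler, version_pattern)).fetchall()


# Tiny in-memory fixture with made-up rows (not real Sourcify data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table compiled_contracts (id text, compiler text, version text);
    create table compiled_contracts_sources (compilation_id text, source_hash text, path text);
    create table sources (source_hash text, content text);
""")
conn.executemany(
    "insert into compiled_contracts values (?, ?, ?)",
    [("c1", "solc", "0.8.19"), ("c2", "vyper", "0.3.10")],
)
conn.executemany(
    "insert into compiled_contracts_sources values (?, ?, ?)",
    [("c1", "h1", "contracts/A.sol"), ("c1", "h2", "contracts/B.sol"), ("c2", "h3", "C.vy")],
)
conn.executemany(
    "insert into sources values (?, ?)",
    [("h1", "contract A {}"), ("h2", "contract B {}"), ("h3", "# vyper source")],
)

# All source files of contracts compiled with any solc 0.8.x release.
rows = sources_for_compiler(conn, "solc", "0.8.%")
for contract_id, path, content in rows:
    print(contract_id, path, content)
```

Against the real file you would pass `sqlite3.connect('sourcify.sqlite')` instead of the in-memory connection.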

Status: Backlog