get_structured_schema function misses some labels/types on large databases #350

Open
@androna-xm

Description

In the current implementation of the get_structured_schema function, the calls to apoc.meta.data do not specify sample: -1, so APOC samples the graph with its default skip rate (see the apoc.meta.data documentation). On large databases, this sampling means that some labels and relationship types are never discovered and are therefore missing from the resulting schema.

Steps to Reproduce

To reproduce the error, connect to the offshoreleaks database on the Neo4j Dataset Demo server and call get_structured_schema.

Suggested Fix

In schema.py, the calls to apoc.meta.data in NODE_PROPERTIES_QUERY, REL_PROPERTIES_QUERY, and REL_QUERY could be updated to include the parameter:
CALL apoc.meta.data({sample: -1})
This change would ensure that all nodes and relationships are scanned, so labels and properties are no longer missed, especially in large databases.

However, using sample: -1 forces a full scan of the database, which can significantly impact performance on large datasets. To provide flexibility, this could be exposed as an optional function parameter (e.g., skip_sampling=True) in get_structured_schema, so users can control the behavior based on their needs.
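
To make the proposal concrete, here is a minimal sketch of how such a flag could select the CALL clause. The helper names (meta_data_call, build_query), the {meta_call} placeholder, and EXAMPLE_TEMPLATE are hypothetical and are not taken from schema.py; only apoc.meta.data and its sample option come from the discussion above.

```python
# Hypothetical sketch: these names are illustrative, not the actual contents of schema.py.

def meta_data_call(skip_sampling: bool = False) -> str:
    """Build the apoc.meta.data CALL clause used by the schema queries."""
    if skip_sampling:
        # sample: -1 asks APOC to scan everything instead of sampling.
        return "CALL apoc.meta.data({sample: -1})"
    # No config map: APOC falls back to its default sampling behavior.
    return "CALL apoc.meta.data()"


def build_query(template: str, skip_sampling: bool = False) -> str:
    """Substitute the CALL clause into a template containing {meta_call}."""
    return template.replace("{meta_call}", meta_data_call(skip_sampling))


# Illustrative template only; the real queries in schema.py are longer.
EXAMPLE_TEMPLATE = (
    "{meta_call} YIELD label, property, type "
    "RETURN label, property, type"
)

if __name__ == "__main__":
    print(build_query(EXAMPLE_TEMPLATE))                      # sampled (current behavior)
    print(build_query(EXAMPLE_TEMPLATE, skip_sampling=True))  # full scan
```

Defaulting skip_sampling to False in this sketch keeps the current behavior unless the caller opts in to the full scan, which seems the safer default given the performance cost.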

Would you agree with this approach? If so, I’d be happy to open a PR implementing the change.
