Description
In the current implementation of the get_structured_schema function, the calls to apoc.meta.data do not specify sample: -1, which causes sampling to occur with the default skip rate (see the apoc.meta.data documentation). On large databases, this sampling means some labels and relationship types are not discovered or included in the resulting schema.
Steps to Reproduce
To reproduce the error, access the offshoreleaks database on the Neo4j Dataset Demo server:
- server_url = https://demo.neo4jlabs.com:7473/
- username = offshoreleaks
- password = offshoreleaks
- database_name = offshoreleaks
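One way to confirm the gap is to compare the labels the database reports with the labels the sampled apoc.meta.data call returns. The snippet below is only a rough sketch: run_query and missing_labels are hypothetical names, not part of the library, and the two queries are simplified stand-ins rather than the ones in schema.py. It also assumes db.labels() reflects the labels actually in use, which may vary by Neo4j version.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical helper type: takes a Cypher string, returns rows as dicts.
RunQuery = Callable[[str], Iterable[Dict[str, str]]]

# Labels as reported by the database itself (no APOC sampling involved).
ALL_LABELS_QUERY = "CALL db.labels() YIELD label RETURN label"

# Node labels seen by apoc.meta.data with its default sampling.
SAMPLED_LABELS_QUERY = (
    "CALL apoc.meta.data() YIELD label, elementType "
    "WHERE elementType = 'node' RETURN DISTINCT label"
)


def missing_labels(run_query: RunQuery) -> Set[str]:
    """Return labels the database reports that the sampled call did not return."""
    all_labels = {row["label"] for row in run_query(ALL_LABELS_QUERY)}
    sampled_labels = {row["label"] for row in run_query(SAMPLED_LABELS_QUERY)}
    return all_labels - sampled_labels
```

With the official neo4j Python driver, run_query could be something like lambda q: [record.data() for record in session.run(q)] on a session opened against the offshoreleaks database. A non-empty result would point to labels dropped by sampling.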
Suggested Fix
In the schema.py file, the calls to apoc.meta.data in NODE_PROPERTIES_QUERY, REL_PROPERTIES_QUERY, and REL_QUERY could be updated to include the parameter:
CALL apoc.meta.data({sample: -1})
This change would ensure that all nodes and relationships are scanned, avoiding missed labels and properties, especially in large databases.
However, using sample: -1 forces a full scan of the database, which can significantly impact performance on large datasets. To provide flexibility, this could be exposed as an optional function parameter (e.g., skip_sampling=True) in get_structured_schema, so users can control the behavior based on their needs.
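As a rough illustration of the optional parameter, the call could be built depending on the flag. This is a sketch only: build_meta_data_call and build_node_properties_query are hypothetical helpers, the query body is a simplified placeholder rather than the real NODE_PROPERTIES_QUERY, and defaulting skip_sampling to False (to keep the current behavior) is my assumption.

```python
def build_meta_data_call(skip_sampling: bool = False) -> str:
    """Return the apoc.meta.data call, optionally disabling sampling.

    Hypothetical helper; skip_sampling defaults to False so existing
    behavior would stay unchanged unless the caller opts in.
    """
    if skip_sampling:
        # sample: -1 requests a full scan instead of sampling.
        return "CALL apoc.meta.data({sample: -1})"
    return "CALL apoc.meta.data()"


def build_node_properties_query(skip_sampling: bool = False) -> str:
    """Build a simplified node-properties query (placeholder, not the real one)."""
    return (
        f"{build_meta_data_call(skip_sampling)} "
        "YIELD label, property, type, elementType "
        "WHERE elementType = 'node' "
        "RETURN label, collect({property: property, type: type}) AS properties"
    )
```

get_structured_schema could then accept skip_sampling and pass it through to each query builder, so the full scan is only paid for when the caller asks for it.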
Would you agree with this approach? If so, I’d be happy to open a PR implementing the change.