Document node not created in lexical graph when running SimpleKGPipeline with text input (i.e. not pdf) #353

Open
@briardoty

There does not appear to be a way to pass document_info to the LexicalGraphBuilder when running the SimpleKGPipeline with text input.

For example, when I run:

import asyncio
import os

from langchain_neo4j import Neo4jGraph
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import (
    FixedSizeSplitter,
)
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.llm import OpenAILLM

OPENAI_MODEL = "gpt-4o-mini"
NODE_TYPES = [
    "Person",
    "Organization",
    "Location",
    "Event",
    "Legislation",
    "Claim",
    "Topic",
]
PROMPT_TEMPLATE = """
You are a research assistant tasked with extracting information from news articles 
to assemble a knowledge graph that can be used to help readers make better sense of
the content, context, and implications of news stories via Q&A.

Extract nodes and relationships from the following input text.

Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "Person", "properties": {{"name": "John"}} }}],
"relationships": [{{"type": "KNOWS", "start_node_id": "0", "end_node_id": "1", "properties": {{"since": "2024-08-01"}} }}] }}

- Assign a unique ID (string) to each node, and reuse it to define relationships
- Respect the source and target node types for relationships and their directions
- Use only information from the input text to create graph components and properties
- Create as many nodes and relationships as needed to sufficiently characterize the input text
- Your output should only contain the JSON object, nothing else

Use only the following nodes and relationships (if provided):
{schema}

Input text:
{text}
"""


async def main():
    graph = Neo4jGraph(
        url=os.getenv("NEO4J_URI"),
        username=os.getenv("NEO4J_USERNAME"),
        password=os.getenv("NEO4J_PASSWORD"),
        refresh_schema=False,
        database=os.getenv("NEO4J_DATABASE"),
    )
    llm = OpenAILLM(
        model_name=OPENAI_MODEL,
        model_params={"response_format": {"type": "json_object"}, "temperature": 0},
    )
    kg_pipeline = SimpleKGPipeline(
        llm=llm,
        driver=graph._driver,
        text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=100),
        embedder=OpenAIEmbeddings(),
        entities=NODE_TYPES,
        prompt_template=PROMPT_TEMPLATE,
        from_pdf=False,
        perform_entity_resolution=True,
    )
    pipeline_result = await kg_pipeline.run_async(text="Some long article text here...")


if __name__ == "__main__":
    asyncio.run(main())

I see in the logs:

neo4j_graphrag.experimental.components.lexical_graph - INFO - Document node not created in the lexical graph because no document metadata is provided

Is there some way around this?
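
One possible stopgap, sketched here under stated assumptions rather than as a confirmed library feature, is to create the Document node after the pipeline run and link the chunks to it with plain Cypher. The helper name `build_document_link_query` and the `path` property are hypothetical. The labels `Document` and `Chunk` and the relationship type `FROM_DOCUMENT` are assumed to match the lexical graph defaults and should be checked against the actual graph. Note that the query attaches every chunk that has no document yet, so it is only safe when one text is ingested at a time.

```python
from typing import Any, Dict, Optional, Tuple

# Assumed lexical graph defaults; adjust if your configuration differs.
DOCUMENT_LABEL = "Document"
CHUNK_LABEL = "Chunk"
CHUNK_TO_DOCUMENT_REL = "FROM_DOCUMENT"


def build_document_link_query(
    path: str, metadata: Optional[Dict[str, Any]] = None
) -> Tuple[str, Dict[str, Any]]:
    """Build Cypher and parameters that create one Document node and
    attach every Chunk that is not yet linked to any Document.

    Hypothetical helper for illustration, not part of neo4j_graphrag.
    """
    query = (
        f"MERGE (d:{DOCUMENT_LABEL} {{path: $path}}) "
        "SET d += $metadata "
        "WITH d "
        f"MATCH (c:{CHUNK_LABEL}) "
        f"WHERE NOT (c)-[:{CHUNK_TO_DOCUMENT_REL}]->(:{DOCUMENT_LABEL}) "
        f"MERGE (c)-[:{CHUNK_TO_DOCUMENT_REL}]->(d)"
    )
    # Copy the metadata so the caller's dict is never mutated.
    params = {"path": path, "metadata": dict(metadata or {})}
    return query, params


query, params = build_document_link_query(
    "article-001", {"source": "news", "title": "Some long article"}
)
```

After `run_async` completes, the statement could be executed with the same driver the pipeline uses, for example `driver.execute_query(query, parameters_=params)` on a recent Neo4j Python driver.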