Deploy Neptune and Rhizomer
In this section, we walk you through deploying Neptune and Rhizomer in order to interactively explore knowledge graphs.
For instructions on deploying a Neptune database, see the Neptune User Guide. You can start with the smallest instance type available. Later, we demonstrate how to load large amounts of data, which requires a bigger instance.
Rhizomer is based on a server-side API and a client-side web application. Both are available as open-source projects from GitHub:
- RhizomerEye is the front end. It's developed with Angular and consumes RhizomerAPI.
- RhizomerAPI is the back end. It's developed using Spring and provides the API consumed by RhizomerEye.
To facilitate their deployment, they're also available as Docker images that you can launch using Amazon Elastic Container Service (Amazon ECS). Amazon ECS makes it easy to deploy, manage, and scale Docker containers from its command-line tool, which supports docker-compose configuration files. For more information, see Tutorial: Creating a Cluster with an EC2 Task Using the Amazon ECS CLI, which details how to install ecs-cli and configure it.
We also demonstrate how to use the AWS Command Line Interface (AWS CLI). For more information, see the AWS CLI User Guide.
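If the ECS CLI isn't installed yet, the following is a minimal sketch for Linux, assuming the download URL documented in the Amazon ECS CLI guide (check that guide for the current URL and for macOS or Windows):
sudo curl -Lo /usr/local/bin/ecs-cli https://amazon-ecs-cli.s3.amazonaws.com/ecs-cli-linux-amd64-latest
sudo chmod +x /usr/local/bin/ecs-cli
ecs-cli --version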
- When the ECS CLI tool is ready, we first create a cluster configuration:
ecs-cli configure --cluster rhizomer --region us-east-1 --default-launch-type EC2 --config-name rhizomer-config
- Then we configure a profile using your access key and secret key as detailed in configure ecs-cli:
ecs-cli configure profile --access-key $AWS_ACCESS_KEY_ID --secret-key $AWS_SECRET_ACCESS_KEY --profile-name rhizomer-profile
Now you can create the cluster where the containers are launched.
- First, we need a security group for the cluster that opens port 80 for the client and 8080 for the API. We can do so from the command line:
aws ec2 create-security-group --group-name rhizomer-security-group --description "Rhizomer security group" --vpc-id vpc-1234567a
- We open input traffic to ports 80 and 8080 from anywhere:
aws ec2 authorize-security-group-ingress --group-name rhizomer-security-group --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name rhizomer-security-group --protocol tcp --port 8080 --cidr 0.0.0.0/0
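Before moving on, you can check that both ingress rules are in place. The following is just a convenience query; the group name is the one created in the previous step:
aws ec2 describe-security-groups --filters Name=group-name,Values=rhizomer-security-group --query "SecurityGroups[0].IpPermissions"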
- Now we can finally create the cluster, associated with the previous security group through its returned identifier. Additionally, we configure the identifier of the VPC where the Neptune cluster is (and the subnets corresponding to this VPC) plus the instance type, ECS configuration, and the profile to use:
ecs-cli up --security-group sg-0123456789101112 --capability-iam --vpc vpc-1234567a --subnets subnet-1abcd234 --instance-type t2.micro --size 1 --cluster rhizomer --cluster-config rhizomer-config --ecs-profile rhizomer-profile --force
- We use a docker-compose file to define and configure the Docker images to load into our cluster. The following content should be available in a file called docker-compose.yml:
version: '3'
services:
  rhizomer-api:
    image: rhizomik/rhizomer-api
    ports:
      - "8080:8080"
    environment:
      - ALLOWED_ORIGINS=http://${HOSTNAME}
      - RHIZOMER_DEFAULT_PASSWORD=password
  rhizomer:
    image: rhizomik/rhizomer-eye
    ports:
      - "80:80"
    environment:
      - API_URL=http://${HOSTNAME}:8080
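Before deploying to Amazon ECS, you can optionally validate the file locally with Docker Compose. This sketch only checks the syntax and variable substitution, using a placeholder HOSTNAME:
HOSTNAME=localhost docker-compose config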
In addition to the details provided by the docker-compose file, Amazon ECS requires some additional details about memory usage limits per container. We have roughly 0.90 GB available in a t2.micro instance, which we share among the containers as detailed in a file called ecs-params.yml in the same folder, whose content should be as follows:
version: 1
task_definition:
  services:
    rhizomer:
      mem_limit: 0.20GB
    rhizomer-api:
      mem_limit: 0.70GB
Now we can launch the docker-compose file through the ECS CLI. However, we first need to set the HOSTNAME variable used in docker-compose.yml to the public DNS name of the EC2 instance in our cluster.
- On the Amazon ECS console, set the environment variable HOSTNAME. Alternatively, you can enter the following command to set the environment variable from the command line:
export HOSTNAME=$(aws ecs list-container-instances --cluster rhizomer --query "containerInstanceArns" --output text | xargs aws ecs describe-container-instances --cluster rhizomer --container-instances --query "containerInstances[].ec2InstanceId" --output text | xargs aws ec2 describe-instances --instance-ids --query 'Reservations[].Instances[].PublicDnsName' --output text)
- We start the containers from the same folder where the docker-compose.yml and ecs-params.yml files are located with the following command:
ecs-cli compose up --cluster-config rhizomer-config --ecs-profile rhizomer-profile --force-update
- After the command is complete, we can list the running containers:
ecs-cli ps --cluster rhizomer
This command outputs something similar to the following code, which shows containers as RUNNING, one corresponding to the Rhizomer client attached to port 80 and one for the API at port 8080:
Name State Ports TaskDefinition Health
847bf8a8-954e-4ce6-957a-9e26e23d2425/rhizomer-api RUNNING 54.87.29.177:8080->8080/tcp roberto:8 UNKNOWN
847bf8a8-954e-4ce6-957a-9e26e23d2425/rhizomer RUNNING 54.87.29.177:80->80/tcp roberto:8 UNKNOWN
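As an extra sanity check, you can request both ports with curl, assuming HOSTNAME is still set to the instance's public DNS name (the exact HTTP status code returned by the API root may vary):
curl -I http://$HOSTNAME
curl -I http://$HOSTNAME:8080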
Now we can start interacting with Neptune through the newly deployed Rhizomer. When we're finished, we can easily remove all the involved resources (such as the cluster and instances) using the following command:
ecs-cli down --cluster-config rhizomer-config --ecs-profile rhizomer-profile
To enhance the security of your Rhizomer deployment, we recommend activating encryption of the EBS volumes of the instances running the front end and the back end. The easiest way to do so is by activating EBS encryption by default, as detailed in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSEncryption.html#encryption-by-default.
Additionally, we recommend using HTTPS when deploying Rhizomer. The easiest way to accomplish this is to place the deployment behind a load balancer that secures the connection using a certificate provided by AWS Certificate Manager (ACM). To do so, follow the instructions at https://docs.aws.amazon.com/es_es/elasticloadbalancing/latest/classic/elb-create-https-ssl-load-balancer.html.
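The following sketch shows what both recommendations could look like from the AWS CLI; the certificate ARN is a placeholder, and the subnet and security group are the ones used earlier, so adjust them to your environment:
# Turn on EBS encryption by default for new volumes in this Region
aws ec2 enable-ebs-encryption-by-default
# Create a Classic Load Balancer with an HTTPS listener backed by an ACM certificate,
# forwarding to the Rhizomer front end on port 80
aws elb create-load-balancer --load-balancer-name rhizomer-lb \
 --listeners "Protocol=HTTPS,LoadBalancerPort=443,InstanceProtocol=HTTP,InstancePort=80,SSLCertificateId=arn:aws:acm:us-east-1:1234567890:certificate/example" \
 --subnets subnet-1abcd234 --security-groups sg-0123456789101112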
Now that we have deployed Rhizomer through Amazon ECS, we can start using its web user interface. It's available from the public DNS name we previously stored in the HOSTNAME environment variable, which we can retrieve with the following code:
echo $HOSTNAME
Enter the DNS name in your preferred browser. You should see Rhizomer's About page.

To manage users and registered datasets, sign in with the username admin and the default password provided in the docker-compose.yml during deployment (in this example, password).
We can now define a new dataset to explore.
- In the Rhizomer web interface, choose Datasets.
- Choose New dataset.
- For name, enter a name for your dataset.
- For Query Type, choose your type of query (for this post, we choose Detailed).

- For SPARQL Server type, choose Amazon Neptune. This allows you to generate SPARQL queries optimized for Neptune.
- Provide the details of the SPARQL endpoint. You don't need to define a separate SPARQL update endpoint because Neptune uses the same endpoint for both querying and updating.
- Optionally, you can make the endpoint writable or password protected. You can retrieve your Neptune endpoint on the Instances page of the Neptune console.
- In the Dataset Graphs section, we load data into the new graph. We can use a URI (either a URL or a URN) to identify the new graph (for this post, we enter urn:game_of_thrones).
- Choose Add Graph. We can now see our graph in the list of dataset graphs.
- Choose Load data to load data into the new graph. For this post, we load semantic data about the Game of Thrones characters from the file got.ttl.
- Choose Submit to load the data into Neptune.
After we load the data, we can explore the graph urn:game_of_thrones.

When we choose Explore on the dataset detail page, we can start inspecting the data we loaded into Neptune. The first thing Rhizomer does when interacting with a dataset is present an overview of the data:
- A word cloud generated from the classes in the dataset, if the dataset Query Type was set to Optimized. Each word in the cloud corresponds to a class, and its size is relative to the number of instances of the class in the dataset.
- A network overview of the main classes and relationships among them, if the dataset Query Type was set to Detailed.
These visualizations are generated automatically by sending SPARQL queries to the endpoint associated with the explored dataset. The queries include in the FROM clause the graph or graphs selected for exploration. For instance, to retrieve the classes:
SELECT ?class (COUNT(?instance) AS ?n)
FROM <urn:game_of_thrones>
WHERE
{ ?instance a ?class
FILTER ( ! isBlank(?class) )
}
GROUP BY ?class
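If you want to run this query yourself, you can send it to the Neptune SPARQL endpoint with curl from a machine inside the same VPC. The following is a sketch; the endpoint below is the placeholder used later in this post, so replace it with your own, and --data-urlencode takes care of encoding the query string:
curl --data-urlencode 'query=SELECT ?class (COUNT(?instance) AS ?n) FROM <urn:game_of_thrones> WHERE { ?instance a ?class FILTER ( ! isBlank(?class) ) } GROUP BY ?class' https://neptune.abcdefghi1cba.us-east-1.neptune.amazonaws.com:8182/sparql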
As shown in the following network overview, we have four classes: FictionalCharacter, Noble, Book, and Organisation. All the characters in the dataset are instances of FictionalCharacter, but some of them are also Noble. They appear in Books and have allegiance to houses, which are Organisations.

We can choose a class to explore it further. For example, if we choose Noble, we see the following faceted view of all Game of Thrones characters classified as nobles.

This visualization shows the number of instances for the selected class, initially unconstrained so all 430 of 430 nobles are listed. Rhizomer uses the following query to retrieve the count from Neptune:
SELECT (COUNT(?instance) AS ?n)
FROM <urn:game_of_thrones>
WHERE
{ ?instance a <http://dbpedia.org/ontology/Noble> }
The instances are displayed using pagination, which is implemented in the underlying SPARQL query using OFFSET and LIMIT on an embedded SELECT query that retrieves the instances to display. To retrieve all the triples describing the resources included in the current page, a DESCRIBE SPARQL query is used:
DESCRIBE ?instance
FROM <urn:game_of_thrones>
WHERE
{ { SELECT DISTINCT ?instance
WHERE
{ ?instance a <http://dbpedia.org/ontology/Noble> }
OFFSET 0
LIMIT 10
}
}
The visualization lists the facets for the class. Each facet corresponds to a property used to describe instances, and you can see how many times the corresponding property is used. You can also see how many different values are used and whether all are literals or not. This view is generated automatically by Rhizomer using the following SPARQL query:
PREFIX hint: <http://aws.amazon.com/neptune/vocab/v01/QueryHints#>
SELECT ?property (COUNT(?instance) AS ?uses) (COUNT(DISTINCT ?object) AS ?values) (MIN(?isLiteral) AS ?allLiteral)
FROM <urn:game_of_thrones>
WHERE
{
hint:Query hint:joinOrder "Ordered"
{ SELECT ?instance
WHERE
{ ?instance a <http://dbpedia.org/ontology/Noble> }
}
?instance ?property ?object
BIND(isLiteral(?object) AS ?isLiteral)
}
GROUP BY ?property
You can expand each facet to show the 10 most common values for the selected instances using the following query, which also retrieves more readable labels (preferably in English, if available):
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX hint: <http://aws.amazon.com/neptune/vocab/v01/QueryHints#>
SELECT ?value ?label (COUNT(?value) AS ?count)
FROM <urn:game_of_thrones>
WHERE
{
hint:Query hint:joinOrder "Ordered"
{ SELECT DISTINCT ?instance
WHERE
{ ?instance a <http://dbpedia.org/ontology/Noble> }
}
?instance rdfs:comment ?resource
OPTIONAL
{ ?resource rdfs:label ?label
FILTER langMatches(lang(?label), "en")
}
OPTIONAL
{ ?resource rdfs:label ?label }
BIND(str(?resource) AS ?value)
}
GROUP BY ?value ?label
ORDER BY DESC(?count)
LIMIT 10
From this visualization, you can further filter the available instances by one or more specific facet values. There is also an input form for each facet that allows filtering by any of its values.
As we mentioned earlier, Rhizomer also works as a linked data browser, so if you choose any resource in the instance descriptions, its description is retrieved and presented if available locally or remotely, from the resource URL. You don't need any prior knowledge about the dataset to explore it, because the overview and faceted views inform you of what classes are present in the dataset and how they're described using properties and values. Rhizomer does all the hard work through SPARQL queries, so you don't need to worry about it.
The mechanism provided by Rhizomer to load data into Neptune is only suitable for small data files. For bigger data files, in the order of millions of triples, we recommend using the bulk loader provided by Neptune.
The data to upload should be available in an Amazon Simple Storage Service (Amazon S3) bucket, which Neptune needs read and list access to. To prepare the S3 bucket with DBpedia data, follow the instructions in Prepare DBpedia Dataset.
DBpedia is a large dataset; it's a semantic version of Wikipedia featuring millions of triples. Therefore, we use a bigger instance for Neptune.
- We start a new instance with instance class db.r5.2xlarge, which has 8 vCPUs and 64 GiB RAM.
Next, Neptune requires permission to access the S3 storage bucket. This is granted through an AWS Identity and Access Management (IAM) role that has access to the bucket.
- Add the role to your Neptune cluster.
- Create a VPC endpoint for Amazon S3 (for the Neptune loader to use); see the sketch after this list.
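A minimal sketch of these two steps from the AWS CLI; the cluster identifier, role ARN, and route table ID are placeholders to replace with your own values:
# Associate the IAM role that can read the S3 bucket with the Neptune cluster
aws neptune add-role-to-db-cluster --db-cluster-identifier your-neptune-cluster --role-arn arn:aws:iam::1234567890:role/NeptuneLoadFromS3
# Create a gateway VPC endpoint for Amazon S3 in the VPC where Neptune runs
aws ec2 create-vpc-endpoint --vpc-id vpc-1234567a --service-name com.amazonaws.us-east-1.s3 --route-table-ids rtb-0a1b2c3d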
Now the Neptune loader has access to the S3 bucket with the DBpedia data. Next, we instruct the loader to get the data from Amazon S3. The loader provides a web API that, for security reasons, is only accessible from machines connected to the VPC where our Neptune instance is.
- The easiest way to get such a machine is to start a new t2.micro instance and, during the networking part of the configuration process, specify that it's connected to the same VPC as Neptune. In our case, this vpc-id is vpc-1234567a, the same one we used when launching Rhizomer's containers.
- From the command line of the new instance, we can use curl to interact with the loader web API. In the following code, the URL should point to your own Neptune instance and the iamRoleArn to the new IAM role:
curl -X POST \
-H 'Content-Type: application/json' \
https://neptune.abcdefghi1cba.us-east-1.neptune.amazonaws.com:8182/loader -d '
{
"source" : "s3://your-bucket/dbpedia/",
"format" : "turtle",
"iamRoleArn" : "arn:aws:iam::1234567890:role/NeptuneLoadFromS3",
"region" : "us-east-1",
"failOnError" : "FALSE",
"parserConfiguration" : { "namedGraphUri": "http://aws.amazon.com/neptune/vocab/v01/DefaultNamedGraph" }
}'
The response from the loader indicates the identifier for the load process just triggered:
{
  "status" : "200 OK",
  "payload" : { "loadId" : "d4f889ca-f5b3-47b7-a873-2d4343896a77" }
}
- Use the loadId to check the status of the loading process:
curl -G 'https://neptune.abcdefghi1cba.us-east-1.neptune.amazonaws.com:8182/loader/d4f889ca-f5b3-47b7-a873-2d4343896a77?details=true'
For more information about the Neptune loader, see Neptune Loader Reference. The status, when the load is complete, should look like the following code:
{
  "status" : "200 OK",
  "payload" : {
    "feedCount" : [ { "LOAD_COMPLETED" : 18 } ],
    "overallStatus" : {
      "fullUri" : "s3://your-bucket/dbpedia/",
      "runNumber" : 1,
      "retryNumber" : 0,
      "status" : "LOAD_COMPLETED",
      "totalTimeSpent" : 3622,
      "totalRecords" : 131861846,
      "totalDuplicates" : 21012430,
      "parsingErrors" : 0,
      "datatypeMismatchErrors" : 0,
      "insertErrors" : 0
    }
  }
}
The status contains the number of files loaded from the S3 bucket and the total number of triples finally stored. For the subset of DBpedia contained in the bucket, 131,861,846 triples are loaded, but 21,012,430 of them are duplicates (the same triple is present in more than one file). This means that we actually have 110,849,416 triples in Neptune. The load took 3,622 seconds, slightly more than 1 hour.
- We can check this by sending a SPARQL query to Neptune that gets the total number of triples stored:
curl -X POST --data-binary 'query=SELECT (COUNT(?s) AS ?n) WHERE { ?s ?p ?o }' https://neptune.abcdefghi1cba.us-east-1.neptune.amazonaws.com:8182/sparql
The following code is the result:
{
  "head" : { "vars" : [ "n" ] },
  "results" : {
    "bindings" : [ {
      "n" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#integer",
        "type" : "literal",
        "value" : "110849416"
      }
    } ]
  }
}
However, it's inconvenient to explore the loaded data through manually crafted SPARQL queries sent from the command line. It is better to use Rhizomer to interactively explore all the data we loaded: https://neptune-rhizomer.rhizomik.net/datasets/dbpedia
In an EC2 instance, we use wget to get the following files from the DBpedia 2016-10 dump. These are just a subset of DBpedia (about 130 million triples), but the most useful ones for data exploration.
File | Dataset | Description | Size (triples) |
---|---|---|---|
article_categories_en.ttl.bz2 | Article Categories | Links from concepts to categories using the SKOS vocabulary. | 23,990,514 |
category_labels_en.ttl.bz2 | Category Labels | Labels for categories. | 1,475,015 |
disambiguations_en.ttl.bz2 | Disambiguations | Links extracted from Wikipedia disambiguation pages. Because Wikipedia has no syntax to distinguish disambiguation links from ordinary links, DBpedia uses heuristics. | 1,537,180 |
geo_coordinates_en.ttl.bz2 | Geo Coordinates | Geographic coordinates extracted from Wikipedia. | 2,323,568 |
geo_coordinates_mappingbased_en.ttl.bz2 | Geo Coordinates Mappingbased | Geographic coordinates extracted from Wikipedia originating from mapped infoboxes in the mappings wiki. | 2,450,527 |
geonames_links_en.ttl.bz2 | Geonames Links | This file contains the back-links (owl:sameAs) to the Geonames dataset. | 535,380 |
homepages_en.ttl.bz2 | Homepages | Links to homepages of persons, organizations, etc. | 688,563 |
images_en.ttl.bz2 | Images | Main image and corresponding thumbnail from the Wikipedia article. | 11,869,354 |
instance_types_en.ttl.bz2 | Instance Types | Contains triples of the form $object rdf:type $class from the mapping-based extraction. | 5,150,432 |
labels_en.ttl.bz2 | Labels | Titles of all Wikipedia articles in the corresponding language. In Wikidata, it contains all the languages available in the mappings wiki; labels_nmw contains the rest. | 12,845,252 |
long_abstracts_en.ttl.bz2 | Long Abstracts | Long abstracts (full abstracts) of Wikipedia articles, usually the first section. | 4,935,279 |
mappingbased_literals_en.ttl.bz2 | Mappingbased Literals | High-quality data extracted from infoboxes using the mapping-based extraction (literal properties only). The predicates in this dataset are in the ontology namespace. This data is of much higher quality than the raw infobox properties in the property namespace. | 14,388,537 |
mappingbased_objects_en.ttl.bz2 | Mappingbased Objects | High-quality data extracted from infoboxes using the mapping-based extraction (object properties only). The predicates in this dataset are in the ontology namespace. This data is of much higher quality than the raw infobox properties in the property namespace. | 18,746,174 |
mappingbased_objects_uncleaned_en.ttl.bz2 | Mappingbased objects uncleaned | The DBpedia dataset mappingbased_objects_uncleaned | 18,806,500 |
short_abstracts_en.ttl.bz2 | Short Abstracts | Short abstracts (about 600 characters long) of Wikipedia articles. | 4,935,279 |
skos_categories_en.ttl.bz2 | SKOS Categories | Information of which concept is a category and how categories are related using the SKOS vocabulary. | 6,083,029 |
specific_mappingbased_properties_en.ttl.bz2 | Specific Mappingbased Properties | Infobox data from the mapping-based extraction, using units of measurement more convenient for the resource type, such as square kilometers instead of square meters for the area of a city. | 915,714 |
topical_concepts_en.ttl.bz2 | Topical Concepts | Resources that describe a category. | 186,680 |
Total | | | 131,862,977 |
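For reference, the following sketch fetches two of these files with wget; the URLs assume the layout of the DBpedia 2016-10 dump under downloads.dbpedia.org, so adjust them if the dump has moved:
wget http://downloads.dbpedia.org/2016-10/core-i18n/en/labels_en.ttl.bz2
wget http://downloads.dbpedia.org/2016-10/core-i18n/en/instance_types_en.ttl.bz2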
Neptune is only able to load the input files from Amazon S3. Therefore, the DBpedia files should be uploaded to S3 first:
aws s3 cp . s3://your-bucket/ --recursive --exclude "*" --include "*.bz2"
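After the copy finishes, you can confirm what ended up in the bucket, and its total size, before triggering the loader:
aws s3 ls s3://your-bucket/ --recursive --human-readable --summarize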
Finally, we're ready to load DBpedia into Neptune and explore our knowledge graph following the instructions in Bulk load DBpedia in Neptune.
The DBpedia files can also be downloaded from DBpedia Databus using the Databus Client.
First, download the client as a JAR file from its GitHub releases:
wget https://github.com/dbpedia/databus-client/releases/download/v0.3.1/databus-client-1.0-SNAPSHOT.jar
Then, execute the client (Java required) to retrieve the "Rhizomer Dump 2021.09.01-en" collection, which selects the subset of DBpedia files that makes DBpedia browsable using Rhizomer:
java -jar databus-client-1.0-SNAPSHOT.jar -f ttl -c bz2 -t ./databus-download/ -s "https://databus.dbpedia.org/rogargon/collections/browsable_core"
This will download the following 12 files in the collection (about 2 GB of data) into the databus-download folder:
Dataset | Downloads | Variant | Format |
---|---|---|---|
Cleaned object properties extracted with mappings (2021.09.01) | 175.5 MB | en | ttl |
383 KB | disjointDomain, en | ttl | |
685 KB | disjointRange, en | ttl | |
Numeric Literals converted to designated units with class-specific property mappings (2021.09.01) | 8.3 MB | en | ttl |
Extracted facts from Wikipedia Infoboxes (2021.09.01) | 820 MB | en | ttl |
DBpedia Ontology instance types (2021.09.01) | 42.4 MB | en, specific | ttl |
Geo-coordinates extracted with mappings (2021.09.01) | 17.1 MB | en | ttl |
geo-coordinates dataset (2021.09.01) | 32.7 MB | en | ttl |
images dataset (2021.09.01) | 604.9 MB | en | ttl |
Wikipedia page title as rdfs:label (2021.09.01) | 153.3 MB | en | ttl |
homepages dataset (2021.09.01) | 11.5 MB | en | ttl |
Literals extracted with mappings (2021.09.01) | 139.1 MB | en | ttl |
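As with the files obtained directly from the DBpedia dump, these downloads need to be uploaded to Amazon S3 before the Neptune bulk loader can use them. The following is a sketch, assuming the Databus client left the .ttl.bz2 files under the databus-download folder:
aws s3 cp ./databus-download/ s3://your-bucket/dbpedia/ --recursive --exclude "*" --include "*.bz2"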