[docs] Guide - GPU Indexing + Reindexing #1

Open
wants to merge 3 commits into
base: master
12 changes: 2 additions & 10 deletions docs/src/concepts/indexing.md
@@ -109,22 +109,14 @@ Then the greedy search routine operates as follows:

Embeddings for a given dataset are made searchable via an **index**. The index is built from data structures that store the embeddings so that scans and lookups on them are very efficient. A key distinguishing feature of LanceDB is that it uses a disk-based index: IVF-PQ, a variant of the Inverted File Index (IVF) that uses Product Quantization (PQ) to compress the embeddings.

### Reindexing Process
## Reindexing and Incremental Indexing

Reindexing is the process of updating the index to account for new data, which keeps queries fast. This applies to both full-text search (FTS) indexes and vector indexes. For ANN search, new data is always included in query results, but queries on tables with unindexed data fall back to slower search methods for the new parts of the table. Reindexing is another important operation to run periodically as your data grows, especially if you're appending large amounts of data to an existing dataset.

!!! tip
When you add new data to a dataset that has an existing index (either FTS or vector), LanceDB doesn't update the index immediately; the new data is only indexed once a reindex operation completes.

Both LanceDB OSS and Cloud support reindexing, but the process (at least for now) is different for each, depending on the type of index.
> Both LanceDB OSS and Cloud support reindexing, but the process (at least for now) is different for each, depending on the type of index.

When a reindex job is triggered in the background, the entire dataset is reindexed. In the interim, as new queries come in, LanceDB combines results from the existing index with an exhaustive kNN search on the new data. This ensures that you're still searching all of your data, but it comes at a performance cost: the more data you add without reindexing, the more noticeable the latency impact of the exhaustive search becomes.
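
The idea of combining indexed results with an exhaustive search over unindexed rows can be illustrated with a minimal pure-Python sketch. It is not LanceDB's actual implementation: the helper names (`brute_force_search`, `merge_results`), the L2 metric, and the sample data are all assumptions made for illustration.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(query, rows, k):
    """Exhaustive kNN over (row_id, vector) pairs; returns (distance, row_id) tuples."""
    scored = ((l2_distance(query, vec), row_id) for row_id, vec in rows)
    return heapq.nsmallest(k, scored)


def merge_results(indexed_hits, unindexed_hits, k):
    """Merge two lists of (distance, row_id) hits and keep the k closest."""
    return heapq.nsmallest(k, list(indexed_hits) + list(unindexed_hits))


# Hypothetical hits returned by the existing (stale) index.
indexed_hits = [(0.10, "a"), (0.40, "b"), (0.90, "c")]

# Hypothetical rows appended after the index was built, so not yet indexed.
new_rows = [("d", [0.0, 0.2]), ("e", [3.0, 4.0])]

query = [0.0, 0.0]
unindexed_hits = brute_force_search(query, new_rows, k=2)
top = merge_results(indexed_hits, unindexed_hits, k=3)
top_ids = [row_id for _, row_id in top]
print(top_ids)
```

The cost of `brute_force_search` grows linearly with the number of unindexed rows, which is why latency degrades as more data accumulates without a reindex.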

#### Vector Index Reindexing

* LanceDB Cloud supports incremental reindexing, where a background process will trigger a new index build for you automatically when new data is added to a dataset
* LanceDB OSS requires you to manually trigger a reindex operation -- we are working on adding incremental reindexing to LanceDB OSS as well

#### FTS Index Reindexing

FTS reindexing is supported in both LanceDB OSS and Cloud, but requires that it's manually rebuilt once you have a significant enough amount of new data added that needs to be reindexed. We [updated](https://github.com/lancedb/lancedb/pull/762) Tantivy's default heap size from 128MB to 1GB in LanceDB to make it much faster to reindex, by up to 10x from the default settings.
5 changes: 0 additions & 5 deletions docs/src/enterprise/enterprise-faq.md
@@ -169,11 +169,6 @@ keeping data reasonably fresh for most applications.

## Indexing

### Can I use GPU for indexing?
Yes! Please contact the LanceDB team to enable GPU-based indexing for your deployment.
Then you just need to call `create_index`, and the backend will use GPU for indexing.
LanceDB is able to index a few billion vectors under 4 hours.

## Cluster Configuration

### What are the parameters that can be configured for my LanceDB cluster?
60 changes: 56 additions & 4 deletions docs/src/guides/indexing/gpu-indexing.md
@@ -1,11 +1,63 @@
---
title: "GPU-Powered Vector Indexing | LanceDB Cloud"
title: "GPU-Powered Vector Indexing in LanceDB"
description: "Learn about LanceDB's high-performance GPU-based vector indexing capabilities. Scale your vector search to billions of rows with accelerated indexing performance."
keywords: "LanceDB GPU indexing, vector database acceleration, enterprise vector search, GPU-powered indexing, large-scale vector search, enterprise features"
---

# GPU-Powered Vector Indexing in LanceDB Cloud
# GPU-Powered Vector Indexing

> This feature is currently only available in LanceDB Enterprise. Please [contact us](mailto:[email protected]) to enable GPU indexing for your deployment.
With LanceDB's GPU-powered indexing you can create vector indexes for billions of rows in just a few hours. This can significantly accelerate your vector search operations.

With GPU-powered indexing, LanceDB can create vector indexes with billions of rows in a few hours.
> In our tests, LanceDB's GPU-powered indexing can process billions of vectors in under four hours, providing significant performance improvements over CPU-based indexing.

## Automatic GPU Indexing in LanceDB Enterprise

!!! info "LanceDB Enterprise Only"
Automatic GPU Indexing is currently only available in LanceDB Enterprise. Please [contact us](mailto:[email protected]) to enable this feature for your deployment.

GPU indexing is automatic in LanceDB Enterprise and it is used to build either the IVF or HNSW indexes.

> Whenever you `create_index`, the backend will use GPU resources to build either the IVF or HNSW indexes. The system automatically selects the optimal GPU configuration based on your data size and available hardware.

This process is also asynchronous by default, but you can use `wait_for_index` to convert it into a synchronous process by waiting until the index is built.

## Manual GPU Indexing in LanceDB OSS

You can use the Python SDK to manually create the IVF_PQ index. You will need [PyTorch>2.0](https://pytorch.org/). Keep in mind that GPU-based indexing is currently only supported by the synchronous SDK.

You can specify the GPU device used to train IVF partitions via the `accelerator` parameter. Set it to `cuda` (Nvidia GPUs) or `mps` (Apple Silicon) to enable GPU training.

=== "Linux"

<!-- skip-test -->
``` { .python .copy }
# Create index using CUDA on Nvidia GPUs.
tbl.create_index(
    num_partitions=256,
    num_sub_vectors=96,
    accelerator="cuda",
)
```

=== "MacOS"

<!-- skip-test -->
```python
# Create index using MPS on Apple Silicon.
tbl.create_index(
    num_partitions=256,
    num_sub_vectors=96,
    accelerator="mps",
)
```

## Performance Considerations

- GPU memory usage scales with `num_partitions` and vector dimensions
- For optimal performance, ensure GPU memory exceeds dataset size
- Batch size is automatically tuned based on available GPU memory
- Indexing speed improves with larger batch sizes

## Troubleshooting

If you encounter the error `AssertionError: Torch not compiled with CUDA enabled`, you need to [install PyTorch with CUDA support](https://pytorch.org/get-started/locally/).
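
One way to avoid this error is to check which backend PyTorch reports before choosing an `accelerator` value. The sketch below is a suggestion, not part of the LanceDB API: `pick_accelerator` is a hypothetical helper, and it assumes that leaving out `accelerator` results in a CPU build.

```python
def pick_accelerator():
    """Return "cuda", "mps", or None depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no GPU training is possible.
        return None
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return None


accelerator = pick_accelerator()

# Same parameters as the examples above; add the accelerator only if one was found.
index_kwargs = {"num_partitions": 256, "num_sub_vectors": 96}
if accelerator is not None:
    index_kwargs["accelerator"] = accelerator

print(index_kwargs)
# tbl.create_index(**index_kwargs)  # `tbl` is an already-opened table, not defined here
```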
33 changes: 19 additions & 14 deletions docs/src/guides/indexing/reindexing.md
@@ -1,52 +1,57 @@
---
title: "Updating Indexes in LanceDB | Index Updates Guide"
title: "Reindexing and Incremental Indexing in LanceDB"
description: "Learn how to efficiently update and manage indexes in LanceDB using incremental indexing. Includes best practices for adding new records without full reindexing."
keywords: "LanceDB incremental indexing, index updates, database optimization, vector search indexing, index management"
---

# Updating Indexes with New Data in LanceDB Cloud
# Reindexing and Incremental Indexing

When new data is added to a table, LanceDB Cloud automatically updates indices in the background.
## Incremental Indexing in LanceDB Cloud

To check index status, use `index_stats()` to view the number of unindexed rows. This will be zero when indices are fully up-to-date.
**LanceDB Cloud & Enterprise** support incremental reindexing through an automated background process. When new data is added to a table, the system automatically triggers a new index build. As the dataset grows, indexes are continuously updated in the background.

While indices are being updated, queries use brute force methods for unindexed rows, which may temporarily increase latency. To avoid this, set `fast_search=True` to search only indexed data.
> While indexes are being rebuilt, queries use brute force methods on unindexed rows, which may temporarily increase latency. To avoid this, set `fast_search=True` to search only indexed data.

OSS______________
!!! note "Checking Index Status"
Use `index_stats()` to view the number of unindexed rows. This will be zero when indexes are fully up-to-date.
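
If you need to block until the index has caught up, a simple polling loop on the unindexed-row count is one option. This is a minimal sketch: `wait_until_indexed` is a hypothetical helper, and the `FakeIndex` class only stands in for a real table so the example runs on its own. With a real table you would pass a callable that reads the count from `index_stats()`; the exact field name (assumed here to be `num_unindexed_rows`) should be checked against your SDK version.

```python
import time


def wait_until_indexed(get_unindexed_rows, timeout_s=300.0, poll_interval_s=1.0):
    """Poll until the unindexed-row count reaches zero; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_unindexed_rows() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


class FakeIndex:
    """Stand-in for a table whose background job indexes 500 rows per poll."""

    def __init__(self, unindexed):
        self.unindexed = unindexed

    def poll(self):
        current = self.unindexed
        self.unindexed = max(0, self.unindexed - 500)
        return current


fake = FakeIndex(1200)
done = wait_until_indexed(fake.poll, timeout_s=5.0, poll_interval_s=0.01)
print(done)

# With a real table, the callable might look like this (assumed field name):
# wait_until_indexed(lambda: tbl.index_stats("my_index").num_unindexed_rows)
```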

## Incremental Indexing in LanceDB OSS

LanceDB supports incremental indexing, which means you can add new records to the table without reindexing the entire table.
**LanceDB OSS** supports incremental indexing, which means you can add new records to the table without reindexing the entire table.

This can make queries more efficient, especially when the table is large and the number of new records is relatively small.

=== "Python"

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_search.py:fts_incremental_index"
```
=== "Async API"

```python
--8<-- "python/python/tests/docs/test_search.py:fts_incremental_index_async"
```

=== "TypeScript"

```typescript
await tbl.add([{ vector: [3.1, 4.1], text: "Frodo was a happy puppy" }]);
await tbl.optimize();
```

=== "Rust"

```rust
let more_data: Box<dyn RecordBatchReader + Send> = create_some_records()?;
tbl.add(more_data).execute().await?;
tbl.optimize(OptimizeAction::All).execute().await?;
```
!!! note

New data added after creating the FTS index will appear in search results while incremental index is still progress, but with increased latency due to a flat search on the unindexed portion. LanceDB Cloud automates this merging process, minimizing the impact on search speed.
!!! note "Performance Considerations"
New data added after creating the FTS index will appear in search results while the incremental index is still in progress, but with increased latency due to a flat search on the unindexed portion. LanceDB Cloud & Enterprise automate this merging process, minimizing the impact on search speed.

## FTS Index Reindexing

FTS Reindexing is **supported in LanceDB OSS, Cloud & Enterprise**. However, it requires manual rebuilding when a significant amount of new data needs to be reindexed.

We [updated](https://github.com/lancedb/lancedb/pull/762) Tantivy's default heap size from 128MB to 1GB in LanceDB, making reindexing up to 10x faster than with default settings.