DocArray
DocArray is a versatile, open-source tool for managing your multi-modal data. It lets you shape your data however you want, and offers the flexibility to store and search it using various document index backends. You can also use your DocArray document index to create a DocArrayRetriever and build LangChain apps.
This notebook is split into two sections. The first section introduces all five supported document index backends. It shows how to set up and index each backend, and how to build a DocArrayRetriever for finding relevant documents.
In the second section, we'll select one of these backends and illustrate how to use it through a basic example.
Document Index Backends
import random
from docarray import BaseDoc
from docarray.typing import NdArray
from langchain_community.embeddings import FakeEmbeddings
from langchain_community.retrievers import DocArrayRetriever
embeddings = FakeEmbeddings(size=32)
Before you start building the index, it's important to define your document schema. This determines what fields your documents will have and what type of data each field will hold.
For this demonstration, we'll create a somewhat arbitrary schema containing 'title' (str), 'title_embedding' (numpy array), 'year' (int), and 'color' (str).
class MyDoc(BaseDoc):
    title: str
    title_embedding: NdArray[32]
    year: int
    color: str
InMemoryExactNNIndex
InMemoryExactNNIndex stores all Documents in memory. It is a great starting point for small datasets, where you may not want to launch a database server.
Learn more here: https://docs.docarray.org/user_guide/storing/index_in_memory/
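To make "exact" nearest-neighbor search concrete, here is a minimal pure-Python sketch of the brute-force idea: score every stored vector against the query and keep the best matches. It is a conceptual illustration only, not DocArray's implementation; the helper names `cosine_similarity` and `exact_nearest` are hypothetical, and cosine similarity is assumed as the metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_nearest(query, vectors, top_k=1):
    """Score every stored vector against the query; return the best indices."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:top_k]]


# Three toy 2-dimensional "embeddings" (hypothetical data)
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(exact_nearest([0.9, 0.1], vectors, top_k=2))  # [0, 2]
```

Because every vector is compared with the query, the result is exact, but the cost grows linearly with the number of documents, which is why this approach suits small datasets.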
from docarray.index import InMemoryExactNNIndex
# initialize the index
db = InMemoryExactNNIndex[MyDoc]()
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"year": {"$lte": 90}}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.invoke("some query")
print(doc)
[Document(page_content='My document 56', metadata={'id': '1f33e58b6468ab722f3786b96b20afe6', 'year': 56, 'color': 'red'})]
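The filter `{"year": {"$lte": 90}}` used above restricts results to documents whose `year` is at most 90. As a rough illustration of what such a filter means, the following sketch evaluates it over plain dictionaries. The `matches` helper and the `OPERATORS` table are hypothetical and cover only a handful of comparison operators; this is not how DocArray evaluates filters internally.

```python
import operator

# Hypothetical subset of comparison operators, for illustration only.
OPERATORS = {
    "$eq": operator.eq,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
}


def matches(doc, filter_query):
    """Return True if every field condition in filter_query holds for doc."""
    for field, conditions in filter_query.items():
        for op_name, expected in conditions.items():
            if not OPERATORS[op_name](doc[field], expected):
                return False
    return True


# Plain-dict stand-ins for the 100 indexed documents
docs = [{"title": f"My document {i}", "year": i} for i in range(100)]
filter_query = {"year": {"$lte": 90}}
kept = [d for d in docs if matches(d, filter_query)]
print(len(kept))  # 91, i.e. years 0 through 90
```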
HnswDocumentIndex
HnswDocumentIndex is a lightweight Document Index implementation that runs fully locally and is best suited for small- to medium-sized datasets. It stores vectors on disk in hnswlib, and stores all other data in SQLite.
Learn more here: https://docs.docarray.org/user_guide/storing/index_hnswlib/
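The split between a vector index and a relational store can be pictured as a two-step lookup: the vector search returns ranked document ids, and the remaining fields are then fetched by id. The sketch below shows only the second step with Python's built-in `sqlite3` module. The table layout, the integer ids, and the `fetch_payloads` helper are assumptions made for illustration; they do not reflect the schema DocArray actually uses.

```python
import sqlite3

# In-memory database standing in for the on-disk SQLite file (assumed layout)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, year INTEGER, color TEXT)"
)
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?, ?)",
    [(i, f"My document {i}", i, "red") for i in range(100)],
)


def fetch_payloads(conn, ids):
    """Fetch non-vector fields for the given ids, preserving ranking order."""
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title, year, color FROM docs WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids]


# Pretend the vector search ranked document 28 first and document 3 second
print(fetch_payloads(conn, [28, 3]))
```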
from docarray.index import HnswDocumentIndex
# initialize the index
db = HnswDocumentIndex[MyDoc](work_dir="hnsw_index")
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"year": {"$lte": 90}}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.invoke("some query")
print(doc)
[Document(page_content='My document 28', metadata={'id': 'ca9f3f4268eec7c97a7d6e77f541cb82', 'year': 28, 'color': 'red'})]
WeaviateDocumentIndex
WeaviateDocumentIndex is a document index built on the Weaviate vector database.
Learn more here: https://docs.docarray.org/user_guide/storing/index_weaviate/
# There's a small difference with the Weaviate backend compared to the others.
# Here, you need to 'mark' the field used for vector search with 'is_embedding=True'.
# So, let's create a new schema for Weaviate that takes care of this requirement.
from pydantic import Field
class WeaviateDoc(BaseDoc):
    title: str
    title_embedding: NdArray[32] = Field(is_embedding=True)
    year: int
    color: str
from docarray.index import WeaviateDocumentIndex
# initialize the index
dbconfig = WeaviateDocumentIndex.DBConfig(host="http://localhost:8080")
db = WeaviateDocumentIndex[WeaviateDoc](db_config=dbconfig)
# index data
db.index(
    [
        # use the Weaviate-specific schema the index was created with
        WeaviateDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"path": ["year"], "operator": "LessThanEqual", "valueInt": 90}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.invoke("some query")
print(doc)
[Document(page_content='My document 17', metadata={'id': '3a5b76e85f0d0a01785dc8f9d965ce40', 'year': 17, 'color': 'red'})]
ElasticDocIndex
ElasticDocIndex is a document index built on Elasticsearch.
Learn more in the DocArray documentation.
from docarray.index import ElasticDocIndex
# initialize the index
db = ElasticDocIndex[MyDoc](
    hosts="http://localhost:9200", index_name="docarray_retriever"
)
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"range": {"year": {"lte": 90}}}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.invoke("some query")
print(doc)
[Document(page_content='My document 46', metadata={'id': 'edbc721bac1c2ad323414ad1301528a4', 'year': 46, 'color': 'green'})]
QdrantDocumentIndex
QdrantDocumentIndex is a document index built on the Qdrant vector database.
Learn more in the DocArray documentation.
from docarray.index import QdrantDocumentIndex
from qdrant_client.http import models as rest
# initialize the index
qdrant_config = QdrantDocumentIndex.DBConfig(path=":memory:")
db = QdrantDocumentIndex[MyDoc](qdrant_config)
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = rest.Filter(
    must=[
        rest.FieldCondition(
            key="year",
            range=rest.Range(
                gte=10,
                lt=90,
            ),
        )
    ]
)
WARNING:root:Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.invoke("some query")
print(doc)
[Document(page_content='My document 80', metadata={'id': '97465f98d0810f1f330e4ecc29b13d20', 'year': 80, 'color': 'blue'})]
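Each backend expects its own filter syntax, even when the condition is the same. The sketch below collects the three dictionary-based forms of "year is at most N" shown in this notebook into one hypothetical helper, `year_lte_filter`, so the differences are easy to compare. The backend labels are made-up names for this example, and the Weaviate value is written as an integer here. Qdrant is left out because its filters are built from `qdrant_client` model objects rather than plain dictionaries.

```python
def year_lte_filter(backend, max_year):
    """Build a 'year <= max_year' filter in the syntax of the given backend.

    The backend labels are hypothetical names used only in this sketch.
    """
    if backend in ("in_memory", "hnsw"):
        return {"year": {"$lte": max_year}}
    if backend == "weaviate":
        return {"path": ["year"], "operator": "LessThanEqual", "valueInt": max_year}
    if backend == "elastic":
        return {"range": {"year": {"lte": max_year}}}
    raise ValueError(f"unsupported backend: {backend}")


for name in ("in_memory", "hnsw", "weaviate", "elastic"):
    print(name, year_lte_filter(name, 90))
```

The returned dictionary can then be passed as the `filters` argument of the retriever for the matching backend.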