Architecture
This page describes the low-level architecture of Wagtail Vector Index. It is intended for those who wish to contribute to the package, or to customise its behaviour at the lowest levels.
The APIs described here should not be considered 'public' or final unless they are documented elsewhere.
The main goal of wagtail-vector-index is to allow a developer to generate and store vector embeddings of 'things' (mainly Wagtail Pages). To do this, and to allow some flexibility in how it is done, the process is broken down into a few key components:
VectorIndex - storage/base.py
A VectorIndex is the type most developers will interact with. It represents a set of Documents that can be queried.
A simple implementation of VectorIndex implements the three public API methods: query, find_similar and search.
The rebuild_index method on a VectorIndex takes the index's documents and stores them somewhere.
Where they are stored depends on the implementation of the VectorIndex. The package comes with a set of pre-existing StorageProviders. These represent some system where vectors can be stored, ranging from NumpyStorageProvider, where everything is managed in-memory, through PgvectorStorageProvider, which uses your existing PostgreSQL database, to WeaviateStorageProvider, which enables support for specific SaaS/self-hosted databases.
Each of these storage providers comes with a mixin class that provides the provider-specific methods for inserting and managing entries in the index.
For example, the PgvectorIndexMixin can be mixed into a VectorIndex to store documents in pgvector.
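The split between an index and a provider mixin can be sketched in plain Python. This is a minimal illustration, not the package's code: the classes below are stand-ins, and InMemoryIndexMixin, get_documents and similar_documents are hypothetical names chosen for the example. Only rebuild_index is named on this page.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Document:
    """Stand-in for the package's Document dataclass."""
    embedding_pk: int
    vector: list[float]
    metadata: dict = field(default_factory=dict)


class VectorIndex:
    """Stand-in base class: knows which documents belong in the index."""

    def get_documents(self):  # hypothetical hook name
        raise NotImplementedError

    def rebuild_index(self):
        raise NotImplementedError


class InMemoryIndexMixin:
    """Hypothetical provider mixin: supplies storage-specific behaviour."""

    def rebuild_index(self):
        # Replace whatever was stored with the index's current documents.
        self._stored = list(self.get_documents())

    def similar_documents(self, vector, limit=5):
        # Rank stored documents by cosine similarity to the query vector.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(
            self._stored, key=lambda d: cosine(vector, d.vector), reverse=True
        )
        return ranked[:limit]


class ExampleIndex(InMemoryIndexMixin, VectorIndex):
    """The mixin comes first so its rebuild_index overrides the base class."""

    def get_documents(self):
        return [
            Document(1, [1.0, 0.0], {"title": "A"}),
            Document(2, [0.0, 1.0], {"title": "B"}),
        ]


index = ExampleIndex()
index.rebuild_index()
closest = index.similar_documents([0.9, 0.1], limit=1)
```

Listing the mixin before the base class matters here: Python's method resolution order then picks the mixin's storage methods, while the index class only decides what gets indexed.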
Document - storage/base.py
A Document is a dataclass representing something that is stored in a VectorIndex. It has a reference to an Embedding database object, a vector (for the embedding) and an unstructured metadata dict. This class allows us to store anything in a VectorIndex without needing to build indexes that hold specific types of object.
The embedding_pk field on a Document is a reference to an Embedding model instance, which stores the embedding in the application database. This makes it quick to repopulate vector backends, and allows some performance optimisations, as we can use generic foreign keys to return our related model instances.
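As a rough picture of the shape described above, a Document might look like the dataclass below. This is a sketch under assumptions: embedding_pk is named on this page, while the exact field names vector and metadata, their types, and the metadata keys used in the example are illustrative guesses rather than the package's definition.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Illustrative stand-in, not the class from storage/base.py."""

    embedding_pk: int  # primary key of the Embedding row in the application database
    vector: list[float]  # the embedding itself
    metadata: dict = field(default_factory=dict)  # unstructured data about the source object


# The metadata keys here are hypothetical; any JSON-like data could be stored.
doc = Document(
    embedding_pk=1,
    vector=[0.1, 0.2, 0.3],
    metadata={"object_id": "42", "content_type": "blog.BlogPage"},
)
```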
Whenever we work with a VectorIndex, we are working with Document objects, but as users these Documents aren't usually what we want to deal with. We would prefer to deal with our models and Pages, and let the package transparently handle converting them back and forth to Documents.
This is where DocumentConverters come in.
DocumentConverter - storage/base.py
DocumentConverter is a protocol that defines how to convert an object to Documents and Documents back to an object.
To go from an object to a Document is usually a case of:
- Determining a representation of the object that should be embedded
- Splitting that representation up into chunks to fit within the embedding model's limit
- Generating embeddings for each chunk
- Returning one or more Document objects containing the embedding and some metadata about the original object
To go from a Document back to an object we have to rely on the Document metadata. This could be something like a primary key or UUID which will enable us to retrieve the original object from a database/filesystem, or it could be more complex metadata allowing us to reconstruct the object.
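Both directions can be sketched with a toy converter. Everything here is an assumption made for illustration: DictConverter, fake_embed, split_into_chunks and the method names to_documents and from_document are hypothetical, the embedding function is a deterministic placeholder rather than a real model, and embedding_pk is filled with a placeholder value instead of the key of a stored Embedding instance.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class Document:
    """Stand-in for the package's Document dataclass."""
    embedding_pk: int
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def fake_embed(text, dimensions=4):
    """Deterministic placeholder for a call to a real embedding model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dimensions]]


def split_into_chunks(text, chunk_size):
    """Naive fixed-width splitter, used only to keep the example short."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


class DictConverter:
    """Hypothetical converter for objects held in a dict keyed by id."""

    def __init__(self, store, chunk_size=20):
        self.store = store
        self.chunk_size = chunk_size

    def to_documents(self, object_id):
        # 1. Determine the representation to embed.
        text = self.store[object_id]
        # 2. Split it into chunks that fit the (pretend) model limit.
        chunks = split_into_chunks(text, self.chunk_size)
        for position, chunk in enumerate(chunks):
            # 3. Embed each chunk, and 4. wrap it in a Document with metadata.
            yield Document(
                embedding_pk=position,  # placeholder for a real Embedding pk
                vector=fake_embed(chunk),
                metadata={"object_id": object_id, "chunk": chunk},
            )

    def from_document(self, document):
        # Rely on the metadata alone to find the original object again.
        return self.store[document.metadata["object_id"]]


store = {"page-1": "Vector indexes store embeddings of page content."}
converter = DictConverter(store)
documents = list(converter.to_documents("page-1"))
original = converter.from_document(documents[0])
```

One object can produce several Documents, so the reverse lookup may return the same object for more than one Document.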
A Converter is also responsible for the creation of Embedding model instances.
Model-specific implementations - storage/models.py
While all of the above are intended to be generic and usable for any object type, the main use-case for wagtail-vector-index is to index Django models or Wagtail Pages.
For this, we implement specialised versions of these classes/protocols and some utilities around them that are more likely to be consumed by developers.
- EmbeddableFieldsMixin is a way to let developers specify what fields of their model they want to index by adding the mixin and adding embedding_fields to a model. This doesn't do anything interesting by itself.
- EmbeddableFieldsDocumentConverter knows how to convert anything with the EmbeddableFieldsMixin to a document, and when instantiated with a base_model, knows how to convert Documents back to that base_model.
- EmbeddableFieldsVectorIndexMixin can be subclassed with a list of QuerySets of models with EmbeddableFieldsMixin and manages the index for them. It uses EmbeddableFieldsDocumentConverter to shepherd documents back and forth.
- GeneratedIndexMixin is a convenience mixin which allows a developer to access vector_index on their model to return an automatically generated EmbeddableFieldsVectorIndex.
- VectorIndexedMixin combines GeneratedIndexMixin and EmbeddableFieldsMixin to create a single mixin that developers can use to easily implement wagtail-vector-index features without needing to know the underlying mixins.
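The idea behind embedding_fields can be illustrated without Django. The sketch below is a plain-Python stand-in, not the package's mixin: it assumes embedding_fields is a list of attribute names (the real package may use a richer field type), and BlogPost and text_to_embed are hypothetical names.

```python
class EmbeddableFieldsMixin:
    """Stand-in: only declares which fields should be embedded."""

    embedding_fields = []


class BlogPost(EmbeddableFieldsMixin):
    """Plain class standing in for a Django model or Wagtail Page."""

    embedding_fields = ["title", "body"]  # author is deliberately left out

    def __init__(self, pk, title, body, author):
        self.pk = pk
        self.title = title
        self.body = body
        self.author = author


def text_to_embed(instance):
    """Join the declared fields into the representation a converter would embed."""
    return "\n".join(
        str(getattr(instance, name)) for name in instance.embedding_fields
    )


post = BlogPost(pk=1, title="Hello", body="World", author="sam")
representation = text_to_embed(post)
```

As the list above notes, the mixin does nothing by itself. A converter is what reads the declaration and turns it into Documents.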
In Summary
- A VectorIndex is responsible for fetching all the documents to be indexed, providing the interfaces for searching those documents, and storing those documents in some StorageProvider. It has a get_converter method which returns an instance of DocumentConverter to use for shepherding Documents.
- A DocumentConverter converts Documents to and from the type the user is dealing with. It might need to be specific to a certain model, or it could be written in a more generic way to convert based on metadata in the Document.