feat(search): add Weaviate vector search engine

Add semantic search capabilities using the Weaviate vector database.

Features:
- Hybrid search (BM25 + semantic vectors)
- Multiple search modes: hybrid, bm25, nearText
- Incremental rebuild with hash-based change detection
- Orphan page cleanup after rebuild
- Rate limit handling with exponential backoff
- Configurable batch indexing (size, delay, max bytes)
- Result caching with cluster-wide invalidation
- Search analytics (top searches, zero-result tracking)
- Health monitoring and metrics

Requirements:
- Weaviate 1.32+
- Pre-configured Weaviate class with vectorizer

Closes #XXXX
pull/7861/head
YouriM 4 days ago
parent 407aacfa19
commit 8bf25e7103

@@ -0,0 +1,176 @@
# Weaviate Search Module
Semantic search engine for Wiki.js using the Weaviate vector database.
## Features
- **Hybrid Search**: Combined BM25 keyword + semantic vector search
- **Multiple Search Modes**: hybrid, bm25, nearText
- **Incremental Rebuild**: Only reindex changed pages (hash-based detection)
- **Orphan Cleanup**: Automatically removes deleted pages from index
- **Rate Limit Handling**: Exponential backoff with configurable batch delays
- **Result Caching**: Configurable TTL with cluster-wide invalidation
- **Result Highlighting**: Query terms highlighted with `<mark>` tags
- **Search Analytics**: Track top searches and zero-result queries
- **Health Monitoring**: Periodic health checks with metrics
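
As an illustration of the result highlighting feature, the sketch below wraps query terms in `<mark>` tags while escaping everything else as HTML. The helper names (`escapeHtml`, `escapeRegExp`, `highlight`) are hypothetical and are not taken from the module source, whose engine code is not shown here.

```javascript
// Hypothetical helpers, written for illustration only.

// Escape characters that are special in HTML so page text cannot inject markup.
function escapeHtml (text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

// Escape characters that are special inside a regular expression.
function escapeRegExp (text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Wrap every case-insensitive occurrence of a query term in <mark> tags.
function highlight (text, query) {
  const terms = query.split(/\s+/).filter(Boolean).map(escapeRegExp)
  if (terms.length === 0) return escapeHtml(text)
  const pattern = new RegExp(`(${terms.join('|')})`, 'gi')
  // split() with one capturing group keeps the matches at odd indexes.
  return text
    .split(pattern)
    .map((part, i) => (i % 2 === 1 ? `<mark>${escapeHtml(part)}</mark>` : escapeHtml(part)))
    .join('')
}
```

Matching against the raw text and escaping each piece afterwards avoids highlighting inside HTML entities such as `&amp;`.
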
## Requirements
- Wiki.js 2.5+
- Weaviate 1.32+
- Node.js 18+
- Weaviate class must be pre-created with vectorizer configured
## Installation
Copy the module to your Wiki.js installation:
```bash
cp -r server/modules/search/weaviate/ /path/to/wikijs/server/modules/search/
```
Restart Wiki.js. The module then appears under Administration > Search.
## Configuration
### Connection
| Setting | Description | Default |
|---------|-------------|---------|
| host | Weaviate hostname (without protocol) | localhost |
| httpPort | HTTP/REST API port | 18080 |
| httpSecure | Use HTTPS | true |
| grpcPort | gRPC port | 50051 |
| grpcSecure | Use TLS for gRPC | true |
| skipTLSVerify | Skip TLS certificate verification (see warning below) | false |
| apiKey | Authentication key | - |
| timeout | Connection timeout (ms) | 10000 |
> **Warning**: Because of weaviate-client limitations, `skipTLSVerify` sets `NODE_TLS_REJECT_UNAUTHORIZED=0` globally. This disables certificate verification for all HTTPS connections made by the process, not only those to Weaviate.
### Schema
| Setting | Description | Default |
|---------|-------------|---------|
| className | Weaviate collection name | WikiPage |
### Search
| Setting | Description | Default |
|---------|-------------|---------|
| searchType | hybrid / bm25 / nearText | hybrid |
| alpha | Hybrid balance: 0=keyword, 1=semantic | 0.5 |
| searchLimit | Max results per query | 50 |
| cacheTtl | Cache duration (seconds) | 300 |
| boostTitle | Title field boost | 3 |
| boostDescription | Description field boost | 2 |
| boostTags | Tags field boost | 1 |
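
To make the `cacheTtl` setting concrete, here is a minimal in-memory TTL cache sketch. The `TtlCache` class is hypothetical, and it only covers a single process: the cluster-wide invalidation the module advertises is not modelled here.

```javascript
// Hypothetical single-process cache; entries expire ttlSeconds after being stored.
class TtlCache {
  constructor (ttlSeconds = 300, now = Date.now) {
    this.ttlMs = ttlSeconds * 1000
    this.now = now // injectable clock, which makes expiry easy to test
    this.entries = new Map()
  }

  // Return the cached value, or undefined when missing or expired.
  get (key) {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key)
      return undefined
    }
    return entry.value
  }

  set (key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }

  // Drop everything, for example after a page is created, updated or deleted.
  clear () {
    this.entries.clear()
  }
}
```
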
### Indexing
| Setting | Description | Default |
|---------|-------------|---------|
| batchSize | Documents per batch | 100 |
| batchDelayMs | Delay between batches (ms) | 1000 |
| maxBatchBytes | Max batch size (bytes) | 10485760 (10 MB) |
| forceFullRebuild | Delete all before rebuild | false |
| debugSql | Log sync table SQL queries | false |
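
The interaction of `batchSize`, `maxBatchBytes` and `batchDelayMs` can be sketched as follows. This is an assumption about how such limits are typically combined, not the module's actual code; `makeBatches`, `indexAll` and the `sendBatch` callback are hypothetical names.

```javascript
// Hypothetical sketch: close a batch when either the document count or the
// serialized size limit would be exceeded.
function * makeBatches (docs, { batchSize = 100, maxBatchBytes = 10485760 } = {}) {
  let batch = []
  let bytes = 0
  for (const doc of docs) {
    const size = Buffer.byteLength(JSON.stringify(doc), 'utf8')
    const full = batch.length >= batchSize || bytes + size > maxBatchBytes
    if (batch.length > 0 && full) {
      yield batch
      batch = []
      bytes = 0
    }
    // A single oversized document still gets a batch of its own.
    batch.push(doc)
    bytes += size
  }
  if (batch.length > 0) yield batch
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Send batches one at a time, pausing batchDelayMs between them.
async function indexAll (docs, sendBatch, { batchDelayMs = 1000, ...limits } = {}) {
  let first = true
  for (const batch of makeBatches(docs, limits)) {
    if (!first) await sleep(batchDelayMs)
    first = false
    await sendBatch(batch)
  }
}
```
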
## Weaviate Schema
Create this class in Weaviate **before** enabling the module:
```json
{
  "class": "WikiPage",
  "vectorizer": "text2vec-transformers",
  "properties": [
    { "name": "pageId", "dataType": ["int"], "indexFilterable": true },
    { "name": "title", "dataType": ["text"], "indexSearchable": true },
    { "name": "description", "dataType": ["text"], "indexSearchable": true },
    { "name": "content", "dataType": ["text"], "indexSearchable": true },
    { "name": "path", "dataType": ["text"], "indexFilterable": true },
    { "name": "locale", "dataType": ["text"], "indexFilterable": true },
    { "name": "tags", "dataType": ["text[]"], "indexSearchable": true }
  ]
}
```
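
One way to create the class is to send the definition above to Weaviate's REST schema endpoint (`POST /v1/schema`). The sketch below assumes the connection settings from the Configuration section and API key authentication through a Bearer header; `buildCreateClassRequest` and `createClass` are hypothetical helper names, not part of the module.

```javascript
// Build the HTTP request that would create the class. Kept separate from the
// network call so the request can be inspected without a running Weaviate.
function buildCreateClassRequest (config, classDefinition) {
  const protocol = config.httpSecure ? 'https' : 'http'
  const headers = { 'Content-Type': 'application/json' }
  if (config.apiKey) headers.Authorization = `Bearer ${config.apiKey}`
  return {
    url: `${protocol}://${config.host}:${config.httpPort}/v1/schema`,
    options: { method: 'POST', headers, body: JSON.stringify(classDefinition) }
  }
}

// Send the request using the global fetch available in Node.js 18+.
async function createClass (config, classDefinition) {
  const { url, options } = buildCreateClassRequest(config, classDefinition)
  const res = await fetch(url, options)
  if (!res.ok) {
    throw new Error(`Schema creation failed: ${res.status} ${await res.text()}`)
  }
  return res.json()
}
```

The chosen vectorizer module (here `text2vec-transformers`) must be enabled on the Weaviate instance for the request to succeed.
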
## Search Types
### Hybrid (recommended)
Combined keyword + semantic search. Adjust `alpha`:
| Alpha | Behavior |
|-------|----------|
| 0.0 | Pure BM25 keyword |
| 0.5 | Balanced (default) |
| 1.0 | Pure semantic |
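
To show what `alpha` controls, the sketch below blends a keyword score and a vector score linearly. It is a simplified illustration that assumes both scores are already normalized to the range 0 to 1. Weaviate's own fusion algorithms differ in detail, and `blendScores` is a hypothetical name.

```javascript
// Simplified illustration: alpha = 0 keeps only the keyword score,
// alpha = 1 keeps only the vector score.
function blendScores (keywordScore, vectorScore, alpha = 0.5) {
  if (alpha < 0 || alpha > 1) throw new RangeError('alpha must be between 0 and 1')
  return (1 - alpha) * keywordScore + alpha * vectorScore
}
```
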
### BM25
Pure keyword search with configurable field boosting.
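
Weaviate's BM25 search accepts per-property weights in `property^weight` notation. The sketch below shows how the boost settings could be turned into such a list. How the module actually maps them is not shown in this diff, so `buildQueryProperties` and the unboosted `content` entry are assumptions.

```javascript
// Hypothetical mapping from boost settings to Weaviate's property^weight notation.
function buildQueryProperties ({ boostTitle = 3, boostDescription = 2, boostTags = 1 } = {}) {
  const boosts = { title: boostTitle, description: boostDescription, tags: boostTags }
  const props = Object.entries(boosts).map(([name, weight]) =>
    weight === 1 ? name : `${name}^${weight}`
  )
  props.push('content') // assumption: body text is searched without a boost
  return props
}
```

With the default settings this yields `['title^3', 'description^2', 'tags', 'content']`.
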
### NearText
Pure semantic search based on embedding similarity.
## Rebuild Modes
### Incremental (default)
- Compares content hash to detect changes
- Only reindexes modified pages
- Tracks sync status in `weaviate_sync_status` table
- Cleans orphan pages after rebuild
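
The incremental steps above can be sketched as a comparison between the current pages and the hashes recorded at the last sync. The hash algorithm (SHA-256), the hashed fields and the names `contentHash` and `planRebuild` are assumptions; the real module reads the stored hashes from the `weaviate_sync_status` table.

```javascript
const crypto = require('node:crypto')

// Hash the fields that end up in the index (assumed field set).
function contentHash (page) {
  return crypto
    .createHash('sha256')
    .update(JSON.stringify([page.title, page.description, page.content, page.tags]))
    .digest('hex')
}

// pages: current wiki pages.
// syncedHashes: Map of pageId -> hash recorded when the page was last indexed.
function planRebuild (pages, syncedHashes) {
  const toIndex = []
  const seen = new Set()
  for (const page of pages) {
    seen.add(page.id)
    // New pages have no stored hash, changed pages have a different one.
    if (syncedHashes.get(page.id) !== contentHash(page)) toIndex.push(page.id)
  }
  // Orphans are indexed entries whose page no longer exists in the wiki.
  const orphans = [...syncedHashes.keys()].filter(id => !seen.has(id))
  return { toIndex, orphans }
}
```
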
### Full (`forceFullRebuild: true`)
- Deletes all documents from Weaviate
- Reindexes all pages from scratch
- Use when schema changes or index is corrupted
Both modes stream pages instead of loading them all into memory, which avoids memory problems on large wikis.
## API Reference
| Method | Description |
|--------|-------------|
| `query(q, opts)` | Search for pages |
| `created(page)` | Index a new page |
| `updated(page)` | Update indexed page |
| `deleted(pageId)` | Remove from index |
| `renamed(page)` | Handle path change |
| `rebuild()` | Rebuild index |
| `isHealthy()` | Health check |
| `getStats()` | Index statistics |
| `getMetrics()` | Module metrics |
| `getSearchAnalytics(opts)` | Search analytics |
| `suggest(prefix, opts)` | Auto-complete |
## Troubleshooting
### Class not found
The module does not create the Weaviate class. Create it manually, with your vectorizer configured, before enabling the module (see Weaviate Schema).
### Connection timeout
1. Verify both HTTP and gRPC ports are accessible
2. Check TLS certificates
3. Check firewall rules
### Empty results after rebuild
1. Check logs for indexing errors
2. Verify vectorizer is configured
3. Try full rebuild (`forceFullRebuild: true`)
### Rate limiting during rebuild
Increase `batchDelayMs` (e.g., 2000ms) or decrease `batchSize`.
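
For reference, exponential backoff on rate-limit errors can be sketched as below. The retry count, base delay, cap, and the `err.status === 429` check are assumptions for illustration; the module's real parameters are not shown in this diff, and `backoffDelay` and `withRetry` are hypothetical names.

```javascript
const defaultSleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Delay doubles with each attempt (attempt 0 -> baseMs) up to a fixed cap.
function backoffDelay (attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

// Retry fn while it fails with a rate-limit error (assumed to carry status 429).
async function withRetry (fn, { retries = 5, baseMs = 1000, maxMs = 30000, sleep = defaultSleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      const rateLimited = Boolean(err) && err.status === 429
      // Give up on other errors, or once the retry budget is spent.
      if (!rateLimited || attempt >= retries) throw err
      await sleep(backoffDelay(attempt, baseMs, maxMs))
    }
  }
}
```
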
## License
AGPL-3.0 (same as Wiki.js)

@@ -0,0 +1,147 @@
key: weaviate
title: Weaviate
description: Weaviate vector search engine with semantic search capabilities.
author: YM
logo: https://weaviate.io/img/site/weaviate-logo-light.png
website: https://weaviate.io
isAvailable: true
props:
  # === Connection ===
  host:
    type: String
    title: Host
    hint: 'Weaviate hostname without protocol (e.g., weaviate.example.com)'
    default: 'localhost'
    order: 1
  httpPort:
    type: Number
    title: HTTP Port
    hint: 'Weaviate HTTP port'
    default: 18080
    order: 2
  httpSecure:
    type: Boolean
    title: HTTP Secure (HTTPS)
    hint: 'Use HTTPS for HTTP connections'
    default: true
    order: 3
  grpcPort:
    type: Number
    title: gRPC Port
    hint: 'Weaviate gRPC port'
    default: 50051
    order: 4
  grpcSecure:
    type: Boolean
    title: gRPC Secure (TLS)
    hint: 'Use TLS for gRPC connections'
    default: true
    order: 5
  skipTLSVerify:
    type: Boolean
    title: Skip TLS Verification
    hint: 'Skip TLS certificate verification (for self-signed certs)'
    default: false
    order: 6
  apiKey:
    type: String
    title: API Key
    hint: 'API key for authentication'
    sensitive: true
    order: 7
  timeout:
    type: Number
    title: Timeout (ms)
    hint: 'Connection timeout in milliseconds (default: 10000)'
    default: 10000
    order: 8
  # === Schema ===
  className:
    type: String
    title: Class Name
    hint: 'Weaviate class name (must exist with vectorizer pre-configured)'
    default: 'WikiPage'
    order: 9
  # === Search Settings ===
  searchType:
    type: String
    title: Search Type
    hint: 'Type of search to perform'
    default: 'hybrid'
    enum:
      - 'nearText'
      - 'bm25'
      - 'hybrid'
    order: 10
  alpha:
    type: Number
    title: Hybrid Alpha
    hint: '0 = keyword only, 1 = semantic only'
    default: 0.5
    order: 11
  searchLimit:
    type: Number
    title: Search Results Limit
    hint: 'Maximum number of search results to return (default: 50)'
    default: 50
    order: 12
  cacheTtl:
    type: Number
    title: Cache TTL (seconds)
    hint: 'Time to live for search cache in seconds (default: 300 = 5 minutes)'
    default: 300
    order: 13
  # === Boost Settings ===
  boostTitle:
    type: Number
    title: Title Boost
    hint: 'Boost factor for title field (default: 3)'
    default: 3
    order: 14
  boostDescription:
    type: Number
    title: Description Boost
    hint: 'Boost factor for description field (default: 2)'
    default: 2
    order: 15
  boostTags:
    type: Number
    title: Tags Boost
    hint: 'Boost factor for tags field (default: 1)'
    default: 1
    order: 16
  # === Indexing Settings ===
  batchSize:
    type: Number
    title: Batch Size
    hint: 'Number of documents per batch during indexing (default: 100)'
    default: 100
    order: 17
  batchDelayMs:
    type: Number
    title: Batch Delay (ms)
    hint: 'Delay between batches to avoid rate limiting (default: 1000)'
    default: 1000
    order: 18
  maxBatchBytes:
    type: Number
    title: Max Batch Size (bytes)
    hint: 'Maximum batch size in bytes (default: 10485760 = 10MB)'
    default: 10485760
    order: 19
  forceFullRebuild:
    type: Boolean
    title: Force Full Rebuild
    hint: 'If enabled, rebuild will delete all and reindex. Otherwise, incremental rebuild (only changed pages).'
    default: false
    order: 20
  debugSql:
    type: Boolean
    title: Debug SQL Queries
    hint: 'Log all SQL queries to the sync status table (for debugging)'
    default: false
    order: 21

File diff suppressed because it is too large