feat(search): add Weaviate vector search engine

Add semantic search capabilities using Weaviate vector database.

Features:
- Hybrid search (BM25 + semantic vectors)
- Multiple search modes: hybrid, bm25, nearText
- Incremental rebuild with hash-based change detection
- Orphan page cleanup after rebuild
- Rate limit handling with exponential backoff
- Configurable batch indexing (size, delay, max bytes)
- Result caching with cluster-wide invalidation
- Search analytics (top searches, zero-result tracking)
- Health monitoring and metrics

Requirements:
- Weaviate 1.32+
- Pre-configured Weaviate class with vectorizer

Closes #XXXX
4 days ago · 8bf25e7103
parent 407aacfa19
commit 8bf25e7103
3 changed files with 2280 additions and 0 deletions
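
The commit message lists rate limit handling with exponential backoff, but `engine.js` is not shown in this excerpt. The following is only a minimal sketch of that pattern, not the module's actual code. The names `backoffDelayMs` and `withBackoff`, the retry count, the 30-second cap, and the `isRateLimited` predicate (HTTP 429) are all assumptions made for illustration.

```javascript
// Hypothetical helper: delay before retry number `attempt` (0-based).
// The delay doubles on every attempt and is capped at maxMs.
// No jitter is applied, which keeps the sketch deterministic.
function backoffDelayMs (attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Runs `fn` and retries it while it fails with a rate-limit error.
// `isRateLimited` is an assumed predicate; a real client may signal
// rate limiting differently (status code, error message, ...).
async function withBackoff (fn, options = {}) {
  const {
    retries = 5,
    baseMs = 1000,
    maxMs = 30000,
    isRateLimited = err => Boolean(err) && err.status === 429
  } = options

  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      // Give up on non-rate-limit errors or once retries are exhausted.
      if (attempt >= retries || !isRateLimited(err)) throw err
      await sleep(backoffDelayMs(attempt, baseMs, maxMs))
    }
  }
}
```

A caller would wrap each batch request, for example `await withBackoff(() => sendBatch(batch))`, where `sendBatch` is a placeholder for whatever function performs the insert.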
--- a/server/modules/search/weaviate/README.md
+++ b/server/modules/search/weaviate/README.md
@@ -0,0 +1,176 @@
+# Weaviate Search Module
+
+Semantic search engine for Wiki.js using the Weaviate vector database.
+
+## Features
+
+- **Hybrid Search**: Combined BM25 keyword + semantic vector search
+- **Multiple Search Modes**: hybrid, bm25, nearText
+- **Incremental Rebuild**: Only reindex changed pages (hash-based detection)
+- **Orphan Cleanup**: Automatically removes deleted pages from index
+- **Rate Limit Handling**: Exponential backoff with configurable batch delays
+- **Result Caching**: Configurable TTL with cluster-wide invalidation
+- **Result Highlighting**: Query terms highlighted with `<mark>` tags
+- **Search Analytics**: Track top searches and zero-result queries
+- **Health Monitoring**: Periodic health checks with metrics
+
+## Requirements
+
+- Wiki.js 2.5+
+- Weaviate 1.32+
+- Node.js 18+
+- Weaviate class must be pre-created with vectorizer configured
+
+## Installation
+
+Copy the module to your Wiki.js installation:
+
+```bash
+cp -r server/modules/search/weaviate/ /path/to/wikijs/server/modules/search/
+```
+
+Restart Wiki.js. The module then appears under Administration > Search.
+
+## Configuration
+
+### Connection
+
+| Setting | Description | Default |
+|---------|-------------|---------|
+| host | Weaviate hostname (without protocol) | localhost |
+| httpPort | HTTP/REST API port | 18080 |
+| httpSecure | Use HTTPS | true |
+| grpcPort | gRPC port | 50051 |
+| grpcSecure | Use TLS for gRPC | true |
+| skipTLSVerify | Skip TLS certificate verification (see warning below) | false |
+| apiKey | Authentication key | - |
+| timeout | Connection timeout (ms) | 10000 |
+
+> **Warning**: `skipTLSVerify` sets `NODE_TLS_REJECT_UNAUTHORIZED=0` globally due to weaviate-client limitations. This disables certificate verification for every outgoing TLS connection made by the Wiki.js process, not only those to Weaviate.
+
+### Schema
+
+| Setting | Description | Default |
+|---------|-------------|---------|
+| className | Weaviate collection name | WikiPage |
+
+### Search
+
+| Setting | Description | Default |
+|---------|-------------|---------|
+| searchType | hybrid / bm25 / nearText | hybrid |
+| alpha | Hybrid balance: 0=keyword, 1=semantic | 0.5 |
+| searchLimit | Max results per query | 50 |
+| cacheTtl | Cache duration (seconds) | 300 |
+| boostTitle | Title field boost | 3 |
+| boostDescription | Description field boost | 2 |
+| boostTags | Tags field boost | 1 |
+
+### Indexing
+
+| Setting | Description | Default |
+|---------|-------------|---------|
+| batchSize | Documents per batch | 100 |
+| batchDelayMs | Delay between batches (ms) | 1000 |
+| maxBatchBytes | Max batch size (bytes) | 10485760 (10 MB) |
+| forceFullRebuild | Delete all before rebuild | false |
+| debugSql | Log sync table SQL queries | false |
+
+## Weaviate Schema
+
+Create this class in Weaviate **before** enabling the module:
+
+```json
+{
+  "class": "WikiPage",
+  "vectorizer": "text2vec-transformers",
+  "properties": [
+    { "name": "pageId", "dataType": ["int"], "indexFilterable": true },
+    { "name": "title", "dataType": ["text"], "indexSearchable": true },
+    { "name": "description", "dataType": ["text"], "indexSearchable": true },
+    { "name": "content", "dataType": ["text"], "indexSearchable": true },
+    { "name": "path", "dataType": ["text"], "indexFilterable": true },
+    { "name": "locale", "dataType": ["text"], "indexFilterable": true },
+    { "name": "tags", "dataType": ["text[]"], "indexSearchable": true }
+  ]
+}
+```
+
+## Search Types
+
+### Hybrid (recommended)
+
+Combined keyword + semantic search. Adjust `alpha`:
+
+| Alpha | Behavior |
+|-------|----------|
+| 0.0 | Pure BM25 keyword |
+| 0.5 | Balanced (default) |
+| 1.0 | Pure semantic |
+
+### BM25
+
+Pure keyword search with configurable field boosting.
+
+### NearText
+
+Pure semantic search based on embedding similarity.
+
+## Rebuild Modes
+
+### Incremental (default)
+
+- Compares content hash to detect changes
+- Only reindexes modified pages
+- Tracks sync status in `weaviate_sync_status` table
+- Cleans orphan pages after rebuild
+
+### Full (`forceFullRebuild: true`)
+
+- Deletes all documents from Weaviate
+- Reindexes all pages from scratch
+- Use when the schema changes or the index is corrupted
+
+Both modes use streaming to avoid memory issues with large wikis.
+
+## API Reference
+
+| Method | Description |
+|--------|-------------|
+| `query(q, opts)` | Search for pages |
+| `created(page)` | Index a new page |
+| `updated(page)` | Update indexed page |
+| `deleted(pageId)` | Remove from index |
+| `renamed(page)` | Handle path change |
+| `rebuild()` | Rebuild index |
+| `isHealthy()` | Health check |
+| `getStats()` | Index statistics |
+| `getMetrics()` | Module metrics |
+| `getSearchAnalytics(opts)` | Search analytics |
+| `suggest(prefix, opts)` | Auto-complete |
+
+## Troubleshooting
+
+### Class not found
+
+The module does not create the class. Create it manually, with a vectorizer configured, before enabling the module (see "Weaviate Schema" above).
+
+### Connection timeout
+
+1. Verify both HTTP and gRPC ports are accessible
+2. Check TLS certificates
+3. Check firewall rules
+
+### Empty results after rebuild
+
+1. Check logs for indexing errors
+2. Verify vectorizer is configured
+3. Try full rebuild (`forceFullRebuild: true`)
+
+### Rate limiting during rebuild
+
+Increase `batchDelayMs` (e.g., 2000ms) or decrease `batchSize`.
+
+## License
+
+AGPL-3.0 (same as Wiki.js)
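
The README's "Rebuild Modes" section describes incremental rebuilds that compare content hashes and then clean up orphan pages. A minimal sketch of that idea follows. It is not taken from `engine.js`: the function names, the choice of SHA-256, the set of hashed fields, and the use of a `Map` in place of the `weaviate_sync_status` table are assumptions.

```javascript
const crypto = require('node:crypto')

// Hash the fields that end up in the index entry.
// Which fields the real module hashes is an assumption here.
function contentHash (page) {
  const fields = [page.title, page.description, page.content, page.path, page.locale, page.tags]
  return crypto.createHash('sha256').update(JSON.stringify(fields)).digest('hex')
}

// pages:      iterable of current wiki pages ({ id, title, content, ... })
// syncStatus: Map of pageId -> hash recorded at the last successful index
// Returns the page ids that need (re)indexing and the ids that no longer
// exist in the wiki and should be removed from the index.
function planIncrementalRebuild (pages, syncStatus) {
  const toIndex = []
  const seen = new Set()

  for (const page of pages) {
    seen.add(page.id)
    // New pages have no stored hash; changed pages have a different one.
    if (syncStatus.get(page.id) !== contentHash(page)) toIndex.push(page.id)
  }

  // Anything tracked from a previous run but absent now is an orphan.
  const orphans = [...syncStatus.keys()].filter(id => !seen.has(id))
  return { toIndex, orphans }
}
```

Because `pages` is consumed in a single pass, the same function works with a streamed iterable, which fits the README's note that both modes use streaming for large wikis.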
--- a/server/modules/search/weaviate/definition.yml
+++ b/server/modules/search/weaviate/definition.yml
@@ -0,0 +1,147 @@
+key: weaviate
+title: Weaviate
+description: Weaviate vector search engine with semantic search capabilities.
+author: YM
+logo: https://weaviate.io/img/site/weaviate-logo-light.png
+website: https://weaviate.io
+isAvailable: true
+props:
+  # === Connection ===
+  host:
+    type: String
+    title: Host
+    hint: 'Weaviate hostname without protocol (e.g., weaviate.example.com)'
+    default: 'localhost'
+    order: 1
+  httpPort:
+    type: Number
+    title: HTTP Port
+    hint: 'Weaviate HTTP port'
+    default: 18080
+    order: 2
+  httpSecure:
+    type: Boolean
+    title: HTTP Secure (HTTPS)
+    hint: 'Use HTTPS for HTTP connections'
+    default: true
+    order: 3
+  grpcPort:
+    type: Number
+    title: gRPC Port
+    hint: 'Weaviate gRPC port'
+    default: 50051
+    order: 4
+  grpcSecure:
+    type: Boolean
+    title: gRPC Secure (TLS)
+    hint: 'Use TLS for gRPC connections'
+    default: true
+    order: 5
+  skipTLSVerify:
+    type: Boolean
+    title: Skip TLS Verification
+    hint: 'Skip TLS certificate verification (for self-signed certs)'
+    default: false
+    order: 6
+  apiKey:
+    type: String
+    title: API Key
+    hint: 'API key for authentication'
+    sensitive: true
+    order: 7
+  timeout:
+    type: Number
+    title: Timeout (ms)
+    hint: 'Connection timeout in milliseconds (default: 10000)'
+    default: 10000
+    order: 8
+
+  # === Schema ===
+  className:
+    type: String
+    title: Class Name
+    hint: 'Weaviate class name (must exist with vectorizer pre-configured)'
+    default: 'WikiPage'
+    order: 9
+
+  # === Search Settings ===
+  searchType:
+    type: String
+    title: Search Type
+    hint: 'Type of search to perform'
+    default: 'hybrid'
+    enum:
+      - 'nearText'
+      - 'bm25'
+      - 'hybrid'
+    order: 10
+  alpha:
+    type: Number
+    title: Hybrid Alpha
+    hint: '0 = keyword only, 1 = semantic only'
+    default: 0.5
+    order: 11
+  searchLimit:
+    type: Number
+    title: Search Results Limit
+    hint: 'Maximum number of search results to return (default: 50)'
+    default: 50
+    order: 12
+  cacheTtl:
+    type: Number
+    title: Cache TTL (seconds)
+    hint: 'Time to live for search cache in seconds (default: 300 = 5 minutes)'
+    default: 300
+    order: 13
+
+  # === Boost Settings ===
+  boostTitle:
+    type: Number
+    title: Title Boost
+    hint: 'Boost factor for title field (default: 3)'
+    default: 3
+    order: 14
+  boostDescription:
+    type: Number
+    title: Description Boost
+    hint: 'Boost factor for description field (default: 2)'
+    default: 2
+    order: 15
+  boostTags:
+    type: Number
+    title: Tags Boost
+    hint: 'Boost factor for tags field (default: 1)'
+    default: 1
+    order: 16
+
+  # === Indexing Settings ===
+  batchSize:
+    type: Number
+    title: Batch Size
+    hint: 'Number of documents per batch during indexing (default: 100)'
+    default: 100
+    order: 17
+  batchDelayMs:
+    type: Number
+    title: Batch Delay (ms)
+    hint: 'Delay between batches to avoid rate limiting (default: 1000)'
+    default: 1000
+    order: 18
+  maxBatchBytes:
+    type: Number
+    title: Max Batch Size (bytes)
+    hint: 'Maximum batch size in bytes (default: 10485760 = 10MB)'
+    default: 10485760
+    order: 19
+  forceFullRebuild:
+    type: Boolean
+    title: Force Full Rebuild
+    hint: 'If enabled, rebuild will delete all and reindex. Otherwise, incremental rebuild (only changed pages).'
+    default: false
+    order: 20
+  debugSql:
+    type: Boolean
+    title: Debug SQL Queries
+    hint: 'Log all SQL queries to the sync status table (for debugging)'
+    default: false
+    order: 21
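
The indexing settings above (`batchSize`, `maxBatchBytes`, `batchDelayMs`) suggest batches bounded by both document count and payload size. One way those limits could interact is sketched below. The generator name `chunkDocuments` and the use of the JSON-serialized length as the size estimate are assumptions, not the module's confirmed behavior.

```javascript
// Split documents into batches that respect both a maximum document
// count and a maximum serialized size. A single document larger than
// maxBatchBytes is still emitted, alone in its own batch.
function * chunkDocuments (docs, { batchSize = 100, maxBatchBytes = 10485760 } = {}) {
  let batch = []
  let bytes = 0

  for (const doc of docs) {
    // Size estimate: UTF-8 length of the JSON form (an assumption).
    const size = Buffer.byteLength(JSON.stringify(doc), 'utf8')
    const full = batch.length >= batchSize || bytes + size > maxBatchBytes

    // Flush the current batch before it would exceed either limit.
    if (batch.length > 0 && full) {
      yield batch
      batch = []
      bytes = 0
    }
    batch.push(doc)
    bytes += size
  }

  // Emit the final, possibly partial, batch.
  if (batch.length > 0) yield batch
}
```

A caller would iterate over `chunkDocuments(docs, config)`, send each batch, and wait `batchDelayMs` milliseconds between batches to stay under rate limits.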
--- a/server/modules/search/weaviate/engine.js
+++ b/server/modules/search/weaviate/engine.js