mirror of https://github.com/requarks/wiki
Add semantic search capabilities using Weaviate vector database.

Features:
- Hybrid search (BM25 + semantic vectors)
- Multiple search modes: hybrid, bm25, nearText
- Incremental rebuild with hash-based change detection
- Orphan page cleanup after rebuild
- Rate limit handling with exponential backoff
- Configurable batch indexing (size, delay, max bytes)
- Result caching with cluster-wide invalidation
- Search analytics (top searches, zero-result tracking)
- Health monitoring and metrics

Requirements:
- Weaviate 1.32+
- Pre-configured Weaviate class with vectorizer

Closes #XXXX

pull/7861/head
parent
407aacfa19
commit
8bf25e7103
@ -0,0 +1,176 @@
# Weaviate Search Module

Semantic search engine for Wiki.js using the Weaviate vector database.

## Features

- **Hybrid Search**: Combined BM25 keyword + semantic vector search
- **Multiple Search Modes**: hybrid, bm25, nearText
- **Incremental Rebuild**: Only reindexes changed pages (hash-based detection)
- **Orphan Cleanup**: Automatically removes deleted pages from the index
- **Rate Limit Handling**: Exponential backoff with configurable batch delays
- **Result Caching**: Configurable TTL with cluster-wide invalidation
- **Result Highlighting**: Query terms highlighted with `<mark>` tags
- **Search Analytics**: Tracks top searches and zero-result queries
- **Health Monitoring**: Periodic health checks with metrics

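As an illustration of the result highlighting listed above, the following is a minimal sketch of wrapping query terms in `<mark>` tags. The helper names (`escapeHtml`, `escapeRegExp`, `highlight`) are hypothetical and are not taken from the module; the sketch assumes plain-text input and escapes it so page content cannot inject markup.

```javascript
// Hypothetical helpers for illustration; not the module's actual code.

// Escape HTML so page text cannot inject markup into a result snippet.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Escape regex metacharacters so a term such as "c++" is matched literally.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Wrap every case-insensitive occurrence of a query term in <mark> tags.
function highlight(text, query) {
  const terms = query.split(/\s+/).filter(Boolean).map(escapeRegExp);
  if (terms.length === 0) return escapeHtml(text);
  const pattern = new RegExp(`(${terms.join('|')})`, 'gi');
  // With one capture group, split() alternates non-match, match, non-match...
  return text
    .split(pattern)
    .map((part, i) => (i % 2 === 1 ? `<mark>${escapeHtml(part)}</mark>` : escapeHtml(part)))
    .join('');
}
```

Splitting the raw text before escaping keeps a term like `amp` from matching inside an escaped entity such as `&amp;`.
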
## Requirements

- Wiki.js 2.5+
- Weaviate 1.32+
- Node.js 18+
- The Weaviate class must be created in advance with a vectorizer configured

## Installation

Copy the module to your Wiki.js installation:

```bash
cp -r server/modules/search/weaviate/ /path/to/wikijs/server/modules/search/
```

Restart Wiki.js. The module then appears under Administration > Search.

## Configuration

### Connection

| Setting | Description | Default |
|---------|-------------|---------|
| host | Weaviate hostname (without protocol) | localhost |
| httpPort | HTTP/REST API port | 18080 |
| httpSecure | Use HTTPS | true |
| grpcPort | gRPC port | 50051 |
| grpcSecure | Use TLS for gRPC | true |
| skipTLSVerify | Skip TLS certificate verification (see warning below) | false |
| apiKey | Authentication key | - |
| timeout | Connection timeout (ms) | 10000 |

> **Warning**: `skipTLSVerify` sets `NODE_TLS_REJECT_UNAUTHORIZED=0` globally due to weaviate-client limitations. This affects all HTTPS connections in the process.

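To make the connection settings concrete, here is a small sketch of how they could be combined into REST and gRPC endpoints. `buildEndpoints` is a hypothetical helper, not part of the module or of weaviate-client; the defaults are copied from the table above.

```javascript
// Hypothetical helper: derive endpoints from the connection settings.
function buildEndpoints(config = {}) {
  const {
    host = 'localhost',
    httpPort = 18080,
    httpSecure = true,
    grpcPort = 50051,
    grpcSecure = true
  } = config;

  // The host setting is documented as "without protocol", so reject "https://...".
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(host)) {
    throw new Error('host must not include a protocol');
  }

  return {
    restUrl: `${httpSecure ? 'https' : 'http'}://${host}:${httpPort}`,
    grpcAddress: `${host}:${grpcPort}`,
    grpcTls: grpcSecure
  };
}
```

With the defaults this yields `https://localhost:18080` for REST and `localhost:50051` for gRPC, so a local Weaviate without TLS needs both `httpSecure` and `grpcSecure` set to `false`.
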
### Schema

| Setting | Description | Default |
|---------|-------------|---------|
| className | Weaviate collection name | WikiPage |

### Search

| Setting | Description | Default |
|---------|-------------|---------|
| searchType | hybrid / bm25 / nearText | hybrid |
| alpha | Hybrid balance: 0=keyword, 1=semantic | 0.5 |
| searchLimit | Max results per query | 50 |
| cacheTtl | Cache duration (seconds) | 300 |
| boostTitle | Title field boost | 3 |
| boostDescription | Description field boost | 2 |
| boostTags | Tags field boost | 1 |

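The `cacheTtl` setting implies a time-bounded result cache. The sketch below shows one way such a cache could work; `TtlCache` and `cacheKey` are hypothetical names, and the way the module propagates invalidation across a cluster is not shown in this document, so `clear()` only stands in for whatever each node does when it is told to invalidate.

```javascript
// Hypothetical TTL cache for search results; not the module's actual code.
class TtlCache {
  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlSeconds, now = () => Date.now()) {
    this.ttlMs = ttlSeconds * 1000;
    this.now = now;
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Called when pages change, so stale results are never served.
  clear() {
    this.entries.clear();
  }
}

// Build a key from everything that changes the result set (assumed fields).
function cacheKey(query, { searchType = 'hybrid', locale = '' } = {}) {
  return JSON.stringify([query.trim().toLowerCase(), searchType, locale]);
}
```

Including the search type and locale in the key keeps a `bm25` result from being served for a `hybrid` query with the same text.
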
### Indexing

| Setting | Description | Default |
|---------|-------------|---------|
| batchSize | Documents per batch | 100 |
| batchDelayMs | Delay between batches (ms) | 1000 |
| maxBatchBytes | Max batch size (bytes) | 10485760 (10 MB) |
| forceFullRebuild | Delete all before rebuild | false |
| debugSql | Log sync table SQL queries | false |

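The interaction between `batchSize` and `maxBatchBytes` can be sketched as follows. This is an assumption about how the two limits combine (a batch closes when either limit would be exceeded); `makeBatches` is a hypothetical helper and the byte size is approximated by the JSON-serialized length of each document.

```javascript
// Hypothetical batching helper; not the module's actual code.
// Yields arrays of documents that respect both a count and a byte limit.
function* makeBatches(docs, { batchSize = 100, maxBatchBytes = 10485760 } = {}) {
  let batch = [];
  let bytes = 0;
  for (const doc of docs) {
    const size = Buffer.byteLength(JSON.stringify(doc), 'utf8');
    // Close the current batch if adding this document would break a limit.
    if (batch.length > 0 && (batch.length >= batchSize || bytes + size > maxBatchBytes)) {
      yield batch;
      batch = [];
      bytes = 0;
    }
    // A single oversized document still gets a batch of its own.
    batch.push(doc);
    bytes += size;
  }
  if (batch.length > 0) yield batch;
}
```

Because it is a generator, batches are produced lazily, so a caller can wait `batchDelayMs` between them without holding every batch in memory.
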
## Weaviate Schema

Create this class in Weaviate **before** enabling the module:

```json
{
  "class": "WikiPage",
  "vectorizer": "text2vec-transformers",
  "properties": [
    { "name": "pageId", "dataType": ["int"], "indexFilterable": true },
    { "name": "title", "dataType": ["text"], "indexSearchable": true },
    { "name": "description", "dataType": ["text"], "indexSearchable": true },
    { "name": "content", "dataType": ["text"], "indexSearchable": true },
    { "name": "path", "dataType": ["text"], "indexFilterable": true },
    { "name": "locale", "dataType": ["text"], "indexFilterable": true },
    { "name": "tags", "dataType": ["text[]"], "indexSearchable": true }
  ]
}
```

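One way to create the class is to POST the definition above to Weaviate's REST schema endpoint (`/v1/schema`). The sketch below uses the `fetch` built into Node.js 18+; the function names are hypothetical, and the bearer-token header assumes API key authentication is enabled on your instance.

```javascript
// Hypothetical helpers for illustration; not part of the module.

// Build the request separately so it can be inspected without a network call.
function buildCreateClassRequest(baseUrl, classDefinition, apiKey) {
  const headers = { 'Content-Type': 'application/json' };
  if (apiKey) headers.Authorization = `Bearer ${apiKey}`;
  return {
    url: `${baseUrl.replace(/\/+$/, '')}/v1/schema`,
    init: { method: 'POST', headers, body: JSON.stringify(classDefinition) }
  };
}

// Send the request and fail loudly if Weaviate rejects the class.
async function createClass(baseUrl, classDefinition, apiKey) {
  const { url, init } = buildCreateClassRequest(baseUrl, classDefinition, apiKey);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Schema creation failed: ${response.status} ${await response.text()}`);
  }
  return response.json();
}
```

A call such as `createClass('http://localhost:18080', definition)` would then be run once, with `definition` holding the JSON shown above and the vectorizer replaced by the one your instance provides.
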
## Search Types

### Hybrid (recommended)

Combined keyword + semantic search. Adjust `alpha`:

| Alpha | Behavior |
|-------|----------|
| 0.0 | Pure BM25 keyword |
| 0.5 | Balanced (default) |
| 1.0 | Pure semantic |

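To build intuition for `alpha`, the sketch below blends keyword and vector scores after min-max normalization. Weaviate performs the actual fusion server-side, so this is a simplified illustration rather than its exact algorithm; `normalize` and `fuse` are hypothetical names.

```javascript
// Simplified illustration of alpha-weighted score fusion (not Weaviate's code).

// Rescale scores to [0, 1] so keyword and vector scores are comparable.
function normalize(scores) {
  const values = Object.values(scores);
  if (values.length === 0) return {};
  const min = Math.min(...values);
  const max = Math.max(...values);
  const out = {};
  for (const [id, score] of Object.entries(scores)) {
    out[id] = max === min ? 1 : (score - min) / (max - min);
  }
  return out;
}

// alpha = 0 uses only keyword scores, alpha = 1 only vector scores.
function fuse(keywordScores, vectorScores, alpha = 0.5) {
  const kw = normalize(keywordScores);
  const vec = normalize(vectorScores);
  const ids = new Set([...Object.keys(kw), ...Object.keys(vec)]);
  return [...ids]
    .map(id => ({ id, score: (1 - alpha) * (kw[id] || 0) + alpha * (vec[id] || 0) }))
    .sort((a, b) => b.score - a.score);
}
```

For a page that matches the query words exactly but is semantically distant, lowering `alpha` moves it up the ranking; raising `alpha` favors pages that are close in meaning even without shared words.
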
### BM25

Pure keyword search with configurable field boosting (`boostTitle`, `boostDescription`, `boostTags`).

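Weaviate's BM25 queries accept per-property weights written as `property^weight`. The sketch below shows how the boost settings could be turned into such a property list; whether the module maps them exactly this way, and whether it also searches `content`, is an assumption, and `buildBm25Properties` is a hypothetical name.

```javascript
// Hypothetical mapping from boost settings to BM25 query properties.
function buildBm25Properties({ boostTitle = 3, boostDescription = 2, boostTags = 1 } = {}) {
  // A boost of 1 is the neutral weight, so the suffix is omitted.
  const weighted = (name, boost) => (boost === 1 ? name : `${name}^${boost}`);
  return [
    weighted('title', boostTitle),
    weighted('description', boostDescription),
    weighted('tags', boostTags),
    'content' // assumed to be searched without a boost
  ];
}
```

With the default settings, a match in the title counts three times as much as the same match in an unboosted field.
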
### NearText

Pure semantic search based on embedding similarity.

## Rebuild Modes

### Incremental (default)

- Compares content hashes to detect changes
- Only reindexes modified pages
- Tracks sync status in the `weaviate_sync_status` table
- Removes orphaned pages after the rebuild

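The incremental steps above can be sketched as a planning function that compares each page's hash with the one recorded at the last sync. The hashed fields, the use of SHA-256, and the names `contentHash` and `planIncrementalRebuild` are assumptions for illustration; the module's actual hash input and table layout are not shown in this document.

```javascript
const { createHash } = require('node:crypto');

// Hash the fields that end up in the index (assumed field set).
function contentHash(page) {
  const fields = [page.title, page.description, page.content, page.path, page.locale, page.tags];
  return createHash('sha256').update(JSON.stringify(fields)).digest('hex');
}

// pages: iterable of current pages; syncedHashes: Map of pageId -> last indexed hash.
function planIncrementalRebuild(pages, syncedHashes) {
  const toIndex = [];
  const seen = new Set();
  for (const page of pages) {
    seen.add(page.id);
    const hash = contentHash(page);
    // New pages have no stored hash; changed pages have a different one.
    if (syncedHashes.get(page.id) !== hash) toIndex.push({ page, hash });
  }
  // Orphans are indexed pages that no longer exist in the wiki.
  const orphans = [...syncedHashes.keys()].filter(id => !seen.has(id));
  return { toIndex, orphans };
}
```

Unchanged pages are skipped entirely, which is what keeps an incremental rebuild cheap when the vectorizer is slow or rate limited.
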
### Full (`forceFullRebuild: true`)

- Deletes all documents from Weaviate
- Reindexes all pages from scratch
- Use when the schema changes or the index is corrupted

Both modes stream pages to avoid memory issues with large wikis.

## API Reference

| Method | Description |
|--------|-------------|
| `query(q, opts)` | Search for pages |
| `created(page)` | Index a new page |
| `updated(page)` | Update indexed page |
| `deleted(pageId)` | Remove from index |
| `renamed(page)` | Handle path change |
| `rebuild()` | Rebuild index |
| `isHealthy()` | Health check |
| `getStats()` | Index statistics |
| `getMetrics()` | Module metrics |
| `getSearchAnalytics(opts)` | Search analytics |
| `suggest(prefix, opts)` | Auto-complete |

## Troubleshooting

### Class not found

The module does not create the class. Create it manually with your vectorizer.

### Connection timeout

1. Verify both HTTP and gRPC ports are accessible
2. Check TLS certificates
3. Check firewall rules

### Empty results after rebuild

1. Check logs for indexing errors
2. Verify vectorizer is configured
3. Try full rebuild (`forceFullRebuild: true`)

### Rate limiting during rebuild

Increase `batchDelayMs` (e.g., 2000ms) or decrease `batchSize`.

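For context on how rate limits are typically absorbed, here is a minimal sketch of retrying with exponential backoff. The retry count, the delay cap, and the check for an HTTP 429 status on the error object are assumptions; `backoffDelay` and `withBackoff` are hypothetical names rather than the module's functions.

```javascript
// Delay doubles with each attempt and is capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Retry `task` while it fails with a rate-limit error (assumed: err.status === 429).
async function withBackoff(task, { retries = 5, baseMs = 1000, maxMs = 30000, sleep } = {}) {
  const wait = sleep || (ms => new Promise(resolve => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      const rateLimited = Boolean(err) && err.status === 429;
      // Other errors, or too many attempts, are passed to the caller.
      if (!rateLimited || attempt >= retries) throw err;
      await wait(backoffDelay(attempt, baseMs, maxMs));
    }
  }
}
```

Backoff handles short bursts, while a larger `batchDelayMs` or smaller `batchSize` lowers the steady request rate, so the two measures complement each other.
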
## License

AGPL-3.0 (same as Wiki.js)
@ -0,0 +1,147 @@
key: weaviate
title: Weaviate
description: Weaviate vector search engine with semantic search capabilities.
author: YM
logo: https://weaviate.io/img/site/weaviate-logo-light.png
website: https://weaviate.io
isAvailable: true
props:
  # === Connection ===
  host:
    type: String
    title: Host
    hint: 'Weaviate hostname without protocol (e.g., weaviate.example.com)'
    default: 'localhost'
    order: 1
  httpPort:
    type: Number
    title: HTTP Port
    hint: 'Weaviate HTTP port'
    default: 18080
    order: 2
  httpSecure:
    type: Boolean
    title: HTTP Secure (HTTPS)
    hint: 'Use HTTPS for HTTP connections'
    default: true
    order: 3
  grpcPort:
    type: Number
    title: gRPC Port
    hint: 'Weaviate gRPC port'
    default: 50051
    order: 4
  grpcSecure:
    type: Boolean
    title: gRPC Secure (TLS)
    hint: 'Use TLS for gRPC connections'
    default: true
    order: 5
  skipTLSVerify:
    type: Boolean
    title: Skip TLS Verification
    hint: 'Skip TLS certificate verification (for self-signed certs)'
    default: false
    order: 6
  apiKey:
    type: String
    title: API Key
    hint: 'API key for authentication'
    sensitive: true
    order: 7
  timeout:
    type: Number
    title: Timeout (ms)
    hint: 'Connection timeout in milliseconds (default: 10000)'
    default: 10000
    order: 8

  # === Schema ===
  className:
    type: String
    title: Class Name
    hint: 'Weaviate class name (must exist with vectorizer pre-configured)'
    default: 'WikiPage'
    order: 9

  # === Search Settings ===
  searchType:
    type: String
    title: Search Type
    hint: 'Type of search to perform'
    default: 'hybrid'
    enum:
      - 'nearText'
      - 'bm25'
      - 'hybrid'
    order: 10
  alpha:
    type: Number
    title: Hybrid Alpha
    hint: '0 = keyword only, 1 = semantic only'
    default: 0.5
    order: 11
  searchLimit:
    type: Number
    title: Search Results Limit
    hint: 'Maximum number of search results to return (default: 50)'
    default: 50
    order: 12
  cacheTtl:
    type: Number
    title: Cache TTL (seconds)
    hint: 'Time to live for search cache in seconds (default: 300 = 5 minutes)'
    default: 300
    order: 13

  # === Boost Settings ===
  boostTitle:
    type: Number
    title: Title Boost
    hint: 'Boost factor for title field (default: 3)'
    default: 3
    order: 14
  boostDescription:
    type: Number
    title: Description Boost
    hint: 'Boost factor for description field (default: 2)'
    default: 2
    order: 15
  boostTags:
    type: Number
    title: Tags Boost
    hint: 'Boost factor for tags field (default: 1)'
    default: 1
    order: 16

  # === Indexing Settings ===
  batchSize:
    type: Number
    title: Batch Size
    hint: 'Number of documents per batch during indexing (default: 100)'
    default: 100
    order: 17
  batchDelayMs:
    type: Number
    title: Batch Delay (ms)
    hint: 'Delay between batches to avoid rate limiting (default: 1000)'
    default: 1000
    order: 18
  maxBatchBytes:
    type: Number
    title: Max Batch Size (bytes)
    hint: 'Maximum batch size in bytes (default: 10485760 = 10MB)'
    default: 10485760
    order: 19
  forceFullRebuild:
    type: Boolean
    title: Force Full Rebuild
    hint: 'If enabled, rebuild will delete all and reindex. Otherwise, incremental rebuild (only changed pages).'
    default: false
    order: 20
  debugSql:
    type: Boolean
    title: Debug SQL Queries
    hint: 'Log all SQL queries to the sync status table (for debugging)'
    default: false
    order: 21