HetaDB
HetaDB is Heta's document knowledge-base layer. It ingests multimodal files, extracts a knowledge graph, stores vector embeddings, and answers natural-language questions with LLM-synthesised responses and inline citations.
Supported Formats
| Format | Extensions |
|---|---|
| Documents | .pdf, .docx, .doc |
| Presentations | .pptx, .ppt |
| Spreadsheets | .xlsx, .xls, .csv, .ods |
| Images | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff |
| Plain text / markup | .txt, .text, .md, .markdown, .html |
| Archives | .zip, .7z, .rar, .tar, .tar.gz, .tar.xz, .tar.bz2 |
Six-Stage Processing Pipeline
Once you trigger a parse job, HetaDB runs each file through six sequential stages:
| # | Stage | What happens |
|---|---|---|
| 1 | File parsing | Four parser types run concurrently: doc_parser for PDF/DOC/DOCX/PPT/PPTX (extracts text, tables, and embedded images); html_parser for HTML; text_parser for TXT/MD; sheet_parser for CSV/XLS/XLSX (LLM generates table schema and descriptions, piped into the text stream). Second phase (serial): image_parser uses VLM to generate text descriptions for standalone images and images embedded in documents. Archives (ZIP/7Z/RAR/TAR) are recursively extracted before parsing |
| 2 | Text chunking | Split text into overlapping token-based chunks; LLM-assisted merge of semantically similar chunks (intermediate chunk vectors stored in Milvus). Rechunk: post-merge chunks are grouped by source document, each document's chunk tokens are concatenated and re-split uniformly; every new chunk records a source_chunk provenance field. Final rechunked chunks written to PostgreSQL |
| 3 | Graph extraction | LLM concurrently extracts entities and relations from the rechunked chunks produced in stage 2; raw output written as JSONL to kg_file/rechunked/ (no merging at this stage) |
| 4 | Node processing | LLM dedup → embedding → vector-similarity cluster merge → Milvus semantic dedup; final nodes stored in Milvus (embeddings) and PostgreSQL (metadata) |
| 5 | Relation processing | Node ID mapping applied; LLM dedup → embedding → cluster merge → Milvus semantic dedup; final relations stored in Milvus (embeddings) and PostgreSQL (metadata and relation-chunk provenance) |
| 6 | Table embedding | CSV files processed by LLM to generate table schema; raw data loaded into PostgreSQL; table node embeddings written to Milvus entity collection; enables natural-language-to-SQL queries |
The job is fully asynchronous. Poll GET /api/v1/hetadb/files/processing/tasks/{task_id}
(using the task ID returned by the parse call) until status is "completed" before issuing chat queries.
Query Modes
query_mode |
Strategy | Best for |
|---|---|---|
naive |
Parallel vector + KG retrieval, weighted scoring | Fast general queries; good default |
rerank |
BM25 + vector RRF fusion → cross-encoder rerank | Highest precision; factual questions |
rewriter |
LLM generates 3 query variants, aggregates results | Ambiguous or under-specified queries |
multihop |
ReAct reasoning loop (max 3 rounds) | Multi-step / chain-of-thought questions |
direct |
LLM answers from parametric knowledge only | Quick LLM opinions; no retrieval needed |
See Query Modes for per-mode examples and guidance.
Sub-pages
- Ingesting Documents — create a knowledge base, upload files, trigger parsing, check status
- Query Modes — detailed guide with curl examples for each retrieval strategy