tywb — Tiny Wayback
A resource-efficient fulltext search engine and Wayback-compatible replay server for WARC files stored on S3-compatible object storage. Written in pure Rust. Designed to run on a 1–2 GB VPS or headless macOS machine.
tywb [--config <path>] <COMMAND>
Commands:
  index   Ingest WARC files from S3 into the fulltext and CDX indexes
  server  Run the HTTP search and replay server
  stats   Print a summary of the current index state
Features
- Fulltext search: Tantivy-powered, ~30 MB idle RAM regardless of index size
- Wayback replay: `GET /web/{timestamp}/{url}` fetches only the relevant bytes via S3 Range GET; a 10 GB WARC costs one small range request per replay
- Wayback toolbar: sticky archive bar injected into replayed HTML pages showing the capture date, original URL, and a link to other captures; uses wombat.js for client-side URL rewriting
- CDX API: Wayback-compatible `/cdx` endpoint with exact and prefix URL lookup
- CDX timemap: `GET /web/timemap/cdx`, compatible with Zeno and gowarc deduplication
- Domain browser: hierarchical TLD → domain → captures navigation at `/ui/browse`
- URL captures page: `/ui/url?url=<url>` lists all captures for a specific URL
- S3-compatible storage: works with AWS S3, MinIO, Cloudflare R2, Backblaze B2
- Incremental indexing: ETag-based state file skips unchanged objects on re-runs
- SQLite CDX index: WAL-mode, concurrent reads, no daemon overhead
- Compressed WARC replay: per-gzip-member offsets stored in CDX so `.warc.gz` replay is a targeted range GET, not a full decompression
- Domain blacklist: exclude domains (and their subdomains) from indexing; existing entries are purged automatically on next run
- warcinfo storage: the `warcinfo` record from each WARC file is stored in SQLite for audit and provenance tracking
Architecture
tywb/
├── crates/
│ ├── warc/ # Streaming WARC parser (sync, zero-copy, no deps)
│ ├── config/ # YAML + env-var config loading
│ ├── cdx/ # CDX record types, SURT canonicalization, SQLite store
│ ├── s3_store/ # S3 client, paginated listing, streaming GET, Range GET
│ ├── search/ # Tantivy fulltext index wrapper
│ └── tywb/ # bin: tywb — single binary with `index` and `server` subcommands
Quick start
1. Configure
cp config.yaml config.local.yaml
$EDITOR config.local.yaml
Minimal config:
s3:
bucket: my-warc-bucket
region: us-east-1
storage:
index_path: ./var/index
cdx_db_path: ./var/cdx.db
For MinIO or another S3-compatible service, add:
s3:
endpoint_url: "https://minio.example.com"
force_path_style: true
Credentials are loaded from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables (recommended), from config.yaml, or from the standard AWS SDK chain (~/.aws/credentials, instance metadata, etc.).
2. Index
AWS_ACCESS_KEY_ID=... \
AWS_SECRET_ACCESS_KEY=... \
cargo run --release -p tywb -- --config config.local.yaml index
Streams each WARC file from S3, parses it record-by-record, writes CDX entries to SQLite, and adds extracted text to the Tantivy index. Saves progress after each file, so a crash on a large bucket is safe to resume.
After each WARC file is indexed, a CDX sidecar file is written to the same S3 bucket alongside the WARC (see CDX sidecar files below). If the sidecar already exists it is left untouched.
Indexer options:
| Flag | Description |
|---|---|
| `--file <KEY>` | Index only this S3 key (bypasses listing) |
| `--max-files <N>` | Stop after processing N WARC files |
| `--max-urls <N>` | Stop after writing N new CDX entries |
| `--force` | Re-process all WARC files even if their ETag matches the saved state (use to repair existing index data) |
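The interaction between the saved ETag state and `--force` can be pictured with a small sketch. This is not tywb's actual code: the function name `needs_indexing` and the in-memory `HashMap` form of the state file are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Decide whether a WARC object needs (re-)indexing.
///
/// `state` maps S3 key -> ETag recorded by a previous run (a hypothetical
/// in-memory view of the state file). `force` mirrors the `--force` flag.
fn needs_indexing(
    state: &HashMap<String, String>,
    key: &str,
    etag: &str,
    force: bool,
) -> bool {
    if force {
        // --force re-processes everything, regardless of saved state.
        return true;
    }
    match state.get(key) {
        // Same ETag as last time: the object is unchanged, so skip it.
        Some(saved) => saved.as_str() != etag,
        // Key never seen before: index it.
        None => true,
    }
}
```

With a state entry of `a.warc.gz -> etag-1`, the function returns `false` for the same ETag, and `true` for a changed ETag, an unknown key, or whenever `force` is set.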
Live status: press Ctrl+T (macOS/BSD) or send SIGUSR1 (Linux) to print the current file, URL, and throughput (rec/s, MB/s) to stderr without interrupting indexing.
Domain blacklist
Domains (and all their subdomains) can be excluded from indexing by listing them under indexer.blacklisted_domains in config.yaml:
indexer:
blacklisted_domains:
- spam-site.example.com
- ads.example.net
Two things happen automatically each time tywb index runs:
- Skip during ingest: any WARC record whose `WARC-Target-URI` belongs to a blacklisted domain (or one of its subdomains) is silently skipped; nothing is written to the CDX or fulltext index for that record.
- Purge existing data: before processing new files, tywb removes all CDX records and fulltext index entries for every blacklisted domain from the current index. Adding a domain to the blacklist and re-running `tywb index` is therefore enough to clean up previously indexed data, with no manual database surgery required.
Subdomain matching is automatic: example.com in the blacklist covers www.example.com, cdn.example.com, deep.sub.example.com, etc.
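A minimal sketch of subdomain-aware matching, assuming a plain suffix check on label boundaries. The function name `is_blacklisted` is hypothetical and tywb's real matcher may differ (for example, it might operate on SURT keys instead of host names).

```rust
/// Returns true if `host` equals a blacklisted domain or is a subdomain of one.
fn is_blacklisted(host: &str, blacklist: &[&str]) -> bool {
    // Normalize case and a trailing root dot before comparing.
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    blacklist.iter().any(|entry| {
        let entry = entry.trim_end_matches('.').to_ascii_lowercase();
        if entry.is_empty() {
            return false;
        }
        // Exact match, or `host` ends with ".<entry>". Requiring the dot
        // keeps "notexample.com" from matching "example.com".
        host == entry
            || host
                .strip_suffix(entry.as_str())
                .map_or(false, |prefix| prefix.ends_with('.'))
    })
}
```

Under these assumptions, `is_blacklisted("deep.sub.example.com", &["example.com"])` is `true`, while `is_blacklisted("notexample.com", &["example.com"])` is `false`.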
3. Serve
cargo run --release -p tywb -- --config config.local.yaml server
The server binds to server.bind (default 0.0.0.0:8080).
API
Fulltext search
GET /search?q=<query>[&from=<timestamp>][&to=<timestamp>][&limit=<n>]
| Parameter | Description |
|---|---|
| `q` | Query string (required). Supports Tantivy query syntax: rust AND programming, "exact phrase", title:rust. |
| `from` | Lower bound timestamp, 14 digits: 20240101000000 |
| `to` | Upper bound timestamp, 14 digits: 20241231235959 |
| `limit` | Max results (default: server.max_results, capped at 500) |
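Because the timestamps are fixed-width digit strings, a client can validate and compare them without a date library. The sketch below is illustrative only: the helper names are hypothetical, and treating `from` and `to` as inclusive bounds is an assumption, not something this README states.

```rust
/// Returns true if `ts` is a syntactically valid 14-digit timestamp
/// (YYYYMMDDHHmmss). Days are only range-checked against 1..=31.
fn is_valid_timestamp(ts: &str) -> bool {
    if ts.len() != 14 || !ts.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    // Safe to slice by byte index: the string is all ASCII digits.
    let num = |range: std::ops::Range<usize>| ts[range].parse::<u32>().unwrap_or(0);
    let (month, day) = (num(4..6), num(6..8));
    let (hour, minute, second) = (num(8..10), num(10..12), num(12..14));
    (1..=12).contains(&month)
        && (1..=31).contains(&day)
        && hour < 24
        && minute < 60
        && second < 60
}

/// Range check with optional bounds (assumed inclusive). Fixed-width digit
/// strings sort lexicographically in chronological order.
fn in_range(ts: &str, from: Option<&str>, to: Option<&str>) -> bool {
    from.map_or(true, |f| ts >= f) && to.map_or(true, |t| ts <= t)
}
```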
Response — JSON array of search hits:
[
{
"url": "https://example.com/page",
"timestamp": "20240315120000",
"title": "Example Page",
"mime": "text/html",
"s3_key": "crawls/2024/archive.warc.gz",
"offset": 1048576,
"length": 8192,
"score": 1.42
}
]
curl 'http://localhost:8080/search?q=rust+programming&from=20240101000000&limit=10'
Wayback replay
GET /web/<timestamp>/<url>
Replays an archived page. The server looks up the closest CDX record, fetches only that WARC record's bytes from S3 via a Range GET, and serves the original HTTP response body with its original status and Content-Type.
curl 'http://localhost:8080/web/20240315120000/https://example.com/'
This is compatible with standard Wayback Machine client tooling and browser extensions.
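The two steps of a replay (pick the closest capture, then request only its bytes) can be sketched as follows. Both helpers are hypothetical. Comparing 14-digit timestamps as integers is only a rough stand-in for real time distance, and the exact byte span tywb requests for a record is not specified here, so `length` is simply taken as the number of bytes to fetch.

```rust
/// Pick the capture whose timestamp is numerically closest to `target`.
/// On a tie, the first candidate in the slice wins.
fn closest_capture<'a>(captures: &[&'a str], target: &str) -> Option<&'a str> {
    let target: u64 = target.parse().ok()?;
    captures
        .iter()
        .copied()
        // Skip anything that does not parse as a number.
        .filter_map(|ts| ts.parse::<u64>().ok().map(|n| (n.abs_diff(target), ts)))
        .min_by_key(|(dist, _)| *dist)
        .map(|(_, ts)| ts)
}

/// Build an HTTP Range header value covering `length` bytes from `offset`.
/// Range end positions are inclusive, hence the `- 1`.
fn range_header(offset: u64, length: u64) -> Option<String> {
    if length == 0 {
        return None;
    }
    let end = offset.checked_add(length - 1)?;
    Some(format!("bytes={offset}-{end}"))
}
```

For the search hit shown earlier (`offset` 1048576, `length` 8192), `range_header` yields `bytes=1048576-1056767`.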
Wayback toolbar: for text/html responses, tywb automatically injects:
- A sticky dark toolbar at the top of the page showing the archive date, the original URL, and a link to all captures of that URL (`/ui/url?url=…`).
- wombat.js, Webrecorder's client-side URL rewriting library, loaded from `/_wb/wombat.js` and initialized so that links and resources resolve correctly within the archive context.
No configuration is required; the toolbar is always active for HTML replay responses.
CDX API
GET /cdx?url=<url>[&from=<ts>][&to=<ts>][&limit=<n>][&matchType=prefix]
Returns index entries as a JSON array of arrays (CDX-API format). Append * to url for a prefix search over a whole domain.
# Exact URL lookup
curl 'http://localhost:8080/cdx?url=https://example.com/'
# All pages under a domain
curl 'http://localhost:8080/cdx?url=example.com/*&limit=100'
Response:
[
["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
["com,example)/", "20240315120000", "https://example.com/", "text/html", "200", "sha1:…", "8192"]
]
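The `urlkey` column is a SURT: the host labels are reversed and comma-joined so that captures from one domain sort together. A deliberately simplified sketch is shown below; `surt_key` is a hypothetical helper, and it ignores ports, userinfo, query normalization, and any `www.` stripping that the real canonicalizer in the `cdx` crate may perform.

```rust
/// Simplified SURT key: strip the scheme, lowercase the host, reverse its
/// labels, join them with commas, then append ")" and the path.
fn surt_key(url: &str) -> String {
    // Drop "scheme://" only when it appears before any '/'.
    let rest = match url.split_once("://") {
        Some((scheme, r)) if !scheme.contains('/') => r,
        _ => url,
    };
    // Split host from path at the first '/'; default to the root path.
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    let reversed: Vec<String> = host
        .to_ascii_lowercase()
        .split('.')
        .rev()
        .map(str::to_owned)
        .collect();
    format!("{}){}", reversed.join(","), path)
}
```

With this sketch, `surt_key("https://example.com/")` produces `com,example)/`, matching the `urlkey` in the response above.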
CDX timemap (Zeno / gowarc deduplication)
GET /web/timemap/cdx?url=<url>[&limit=<n>]
Returns captures for a URL as space-separated plain text, one record per line — the format used by Zeno and gowarc for CDX-based WARC deduplication.
Each line has seven fields:
{urlkey} {timestamp} {original} {mime} {status} {digest} {length}
The digest is returned without the hash-algorithm prefix (sha1:, sha256:, etc.) to match what gowarc expects.
limit follows the pywb convention: positive values return the N oldest captures; negative values return the N most-recent captures (e.g. limit=-1 returns the single latest capture). gowarc always uses limit=-1.
# Most recent capture (gowarc dedup query)
curl 'http://localhost:8080/web/timemap/cdx?url=https://example.com/&limit=-1'
# → com,example)/ 20240315120000 https://example.com/ text/html 200 ABCDEF1234 8192
# Five oldest captures
curl 'http://localhost:8080/web/timemap/cdx?url=https://example.com/&limit=5'
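The `limit` convention and the digest handling described above can be sketched in a few lines. Both helpers are hypothetical. The sketch assumes the captures are already sorted oldest-first and that `limit=0` means "no limit"; neither detail is confirmed by this README.

```rust
/// Apply a pywb-style `limit` to captures sorted oldest-first.
/// Positive: the N oldest. Negative: the N most recent.
/// Zero: everything (an assumption for this sketch).
fn apply_limit<T>(captures: &[T], limit: i64) -> &[T] {
    // Clamp |limit| to the number of available captures.
    let n = usize::try_from(limit.unsigned_abs())
        .unwrap_or(usize::MAX)
        .min(captures.len());
    if limit > 0 {
        &captures[..n]
    } else if limit < 0 {
        &captures[captures.len() - n..]
    } else {
        captures
    }
}

/// Drop a leading "algo:" prefix such as "sha1:" from a digest.
fn strip_digest_prefix(digest: &str) -> &str {
    digest.split_once(':').map_or(digest, |(_, hash)| hash)
}
```

Given five captures, `apply_limit(&caps, -1)` returns only the last (most recent) one, which is the query shape gowarc uses.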
To use tywb as a deduplication server with Zeno, pass:
--warc-cdx-dedupe-server http://<tywb-host>:8080
Web UI
| Path | Description |
|---|---|
| `/` | Homepage with index statistics (record count, date range, MIME breakdown) |
| `/ui/search` | HTML fulltext search form |
| `/ui/browse` | Domain browser: TLD → domain → captures hierarchy |
| `/ui/url?url=<url>` | All archived captures for a specific URL, sorted by date |
| `/ui/files` | List of indexed WARC files with per-file statistics |
Domain browser
/ui/browse provides a three-level hierarchy:
- `/ui/browse`: TLDs sorted by capture count
- `/ui/browse?tld=de`: domains under a TLD
- `/ui/browse?domain=example.de`: all captures for a domain, deduplicated by URL (each URL appears once, with a count of how many captures exist; clicking a URL opens `/ui/url`)
URL captures page
/ui/url?url=<url> shows every archived capture of the given URL in chronological order. Each row links directly to the Wayback replay at /web/<timestamp>/<url>. Accepts optional from and to parameters (14-digit timestamps) to filter by date range.
Health check
GET /healthz → 200 OK
SQLite schema
Four tables are maintained in cdx.db. Three of them are documented here:
cdx — one row per indexed WARC record:
| Column | Description |
|---|---|
| `surt_url` | SURT-canonicalized URL (primary key component) |
| `timestamp` | 14-digit capture timestamp (primary key component) |
| `original` | Original URL |
| `mime` | HTTP Content-Type of the response body (extracted from HTTP headers, not the WARC envelope) |
| `status` | HTTP status code |
| `digest` | WARC-Block-Digest (e.g. sha1:ABC…) |
| `s3_key` | S3 object key of the WARC file |
| `offset` | Byte offset of the record in the uncompressed stream |
| `length` | Content-Length of the record block |
| `c_offset` | Compressed byte offset of the gzip member (.warc.gz only; NULL for plain .warc) |
warc_files — one row per indexed WARC file:
| Column | Description |
|---|---|
| `s3_key` | S3 object key (primary key) |
| `etag` | S3 ETag, used for incremental skip logic |
| `bucket` | S3 bucket name |
| `first_seen` / `last_indexed` | ISO-8601 UTC timestamps |
| `warc_records` | Total WARC records parsed |
| `cdx_new` / `cdx_known` | New vs. updated CDX entries written |
| `fulltext_indexed` | Documents added to the Tantivy index |
| `skipped` / `errors` | Non-indexed records and parse errors |
| `duration_secs` / `bytes_per_sec` / `records_per_sec` | Throughput metrics |
| `warc_date_min` / `warc_date_max` | Earliest and latest WARC-Date values seen |
| `mime_summary` | JSON object mapping MIME type → record count |
warcinfo — the warcinfo WARC record from the start of each file:
| Column | Description |
|---|---|
| `s3_key` | S3 key of the source WARC (primary key) |
| `bucket` | S3 bucket name |
| `warc_date` | WARC-Date from the warcinfo record |
| `warc_filename` | WARC-Filename header value |
| `record_id` | WARC-Record-ID header value |
| `headers_json` | All WARC headers serialized as a JSON array of [name, value] pairs |
| `block_text` | Raw text content of the warcinfo block (crawler metadata, operator info, etc.) |
Useful for auditing crawler software versions and operator metadata across a large archive.
CDX sidecar files
After each WARC file is successfully indexed, tywb writes a CDX sidecar file into the same S3 bucket alongside the WARC. The sidecar key is the WARC key with .cdx appended:
crawls/2024/archive.warc.gz → crawls/2024/archive.warc.gz.cdx
crawls/2024/archive.warc → crawls/2024/archive.warc.cdx
If a sidecar already exists (detected via a HEAD request) it is left untouched. The write runs as a background task so it never slows down the main indexing loop.
Format
Sidecar files use the standard CDX-11 plain-text format (Content-Type: text/plain):
CDX N b a m s k r M S V g
com,example)/ 20240315120000 https://example.com/ text/html 200 sha1:ABC… - - 8192 0 archive.warc.gz
com,example)/page 20240315120001 https://example.com/page text/html 200 sha1:DEF… - - 4096 8192 archive.warc.gz
| Field | Header char | Content |
|---|---|---|
| SURT URL | `N` | Canonicalized, sort-friendly URL |
| Timestamp | `b` | 14-digit YYYYMMDDHHmmss |
| Original URL | `a` | Verbatim WARC-Target-URI |
| MIME type | `m` | HTTP Content-Type of the response body |
| HTTP status | `s` | e.g. 200, 301 |
| Digest | `k` | WARC-Block-Digest, e.g. sha1:ABC… |
| Redirect | `r` | Always - (not captured) |
| Meta | `M` | Always - |
| Record length | `S` | Content-Length of the WARC block in bytes |
| Byte offset | `V` | For .warc.gz: compressed gzip-member offset (c_offset). For .warc: uncompressed stream offset. |
| Filename | `g` | Basename of the WARC file |
The byte offset field (V) matches what is stored in the CDX SQLite database and is suitable for S3 Range GET requests for replay.
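As an illustration of the sidecar naming rule and the field order in the table, here is a small sketch. `CdxRecord`, `sidecar_key`, and `cdx11_line` are hypothetical names, not tywb's actual types; the struct fields loosely follow the columns of the `cdx` table.

```rust
/// Minimal stand-in for a CDX record (hypothetical struct).
struct CdxRecord<'a> {
    surt_url: &'a str,
    timestamp: &'a str,
    original: &'a str,
    mime: &'a str,
    status: u16,
    digest: &'a str,
    length: u64,
    /// Offset to write into the V field (compressed offset for .warc.gz).
    offset: u64,
}

/// The sidecar key is the WARC key with ".cdx" appended.
fn sidecar_key(warc_key: &str) -> String {
    format!("{warc_key}.cdx")
}

/// Format one CDX-11 line in the order N b a m s k r M S V g.
/// Redirect (r) and meta (M) are always "-".
fn cdx11_line(rec: &CdxRecord<'_>, warc_key: &str) -> String {
    // The filename field holds only the basename of the S3 key.
    let filename = warc_key.rsplit('/').next().unwrap_or(warc_key);
    format!(
        "{} {} {} {} {} {} - - {} {} {}",
        rec.surt_url,
        rec.timestamp,
        rec.original,
        rec.mime,
        rec.status,
        rec.digest,
        rec.length,
        rec.offset,
        filename
    )
}
```

For the key `crawls/2024/archive.warc.gz`, `sidecar_key` returns `crawls/2024/archive.warc.gz.cdx`, and the filename field of each line is `archive.warc.gz`.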
Why
CDX sidecar files make the archived content independently usable without tywb's SQLite database:
- Standard CDX consumers (pywb, OpenWayback, CDX server tools) can read them directly
- Provides a backup index that lives with the data in S3
- Enables other tools to locate and replay individual WARC records without running tywb
Index statistics
tywb --config config.yaml stats
Prints a human-readable summary of the current index state:
CDX index (./var/cdx.db)
Records: 1,234,567
Unique URLs: 890,123
WARC files: 42
Date range: 2020-01-01 00:00:00 → 2024-12-31 23:59:59
MIME types:
text/html 987,654
application/pdf 42,000
...
HTTP status:
200 1,100,000
301 80,000
...
Fulltext index (./var/index)
Documents: 987,654
Ingest state (./var/list_state.json)
Files seen: 42
Per-file metadata (MIME histogram, date range, throughput, error counts, bucket name) is recorded in the warc_files table of cdx.db after each successful index run.
Configuration reference
All values can be overridden by environment variables. When a value is set in both places, the environment variable wins.
| Environment variable | Config field | Default |
|---|---|---|
| `AWS_ACCESS_KEY_ID` | `s3.access_key_id` | (none) |
| `AWS_SECRET_ACCESS_KEY` | `s3.secret_access_key` | (none) |
| `AWS_DEFAULT_REGION` | `s3.region` | us-east-1 |
| `AWS_ENDPOINT_URL` | `s3.endpoint_url` | (none) |
| `WARC_S3_BUCKET` | `s3.bucket` | (none) |
| `WARC_S3_PREFIX` | `s3.prefix` | (none) |
| `WARC_S3_CONCURRENCY` | `s3.concurrency` | 4 |
| `WARC_INDEX_PATH` | `storage.index_path` | /var/lib/warc-search/index |
| `WARC_CDX_DB_PATH` | `storage.cdx_db_path` | /var/lib/warc-search/cdx.db |
| `WARC_SERVER_BIND` | `server.bind` | 0.0.0.0:8080 |
| `RUST_LOG` | `log.level` | info |
Full annotated config: see config.yaml.
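The precedence rule (environment variable, then config file, then default) can be sketched as a single function. This is not the loader from the `config` crate; `resolve` is a hypothetical helper, and it takes the environment lookup as a closure so the logic can be exercised without touching the real process environment.

```rust
/// Resolve one setting: environment variable first, then the value from
/// the config file, then the built-in default.
fn resolve(
    env: impl Fn(&str) -> Option<String>,
    env_name: &str,
    config_value: Option<&str>,
    default: Option<&str>,
) -> Option<String> {
    env(env_name)
        // Fall back to the config file only when the variable is unset.
        .or_else(|| config_value.map(str::to_owned))
        // Finally fall back to the default, if the setting has one.
        .or_else(|| default.map(str::to_owned))
}
```

In a real program the lookup could be `|name| std::env::var(name).ok()`, for example `resolve(|name| std::env::var(name).ok(), "WARC_SERVER_BIND", None, Some("0.0.0.0:8080"))`.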
Building
# Debug build
cargo build
# Release build (use for deployment / benchmarking)
cargo build --release -p tywb
Cross-compile for Linux from macOS
cargo install cross
cross build --release --target x86_64-unknown-linux-musl -p tywb
Tests
# All crates
cargo test
# Config tests must be single-threaded (they mutate env vars)
cargo test -p warc-search-config -- --test-threads=1
# Specific crate
cargo test -p warc
cargo test -p warc-search-search
# With output
cargo test -p warc -- --nocapture
Resource usage
| Resource | Idle | During ingest |
|---|---|---|
| RAM (server) | ~60 MB | — |
| RAM (indexer) | — | ~100–200 MB (controlled by indexer.batch_size) |
| SQLite cache | 8 MiB (default) | configurable |
| Tantivy index | OS page cache | mmap-based, OS manages eviction |
Designed to fit comfortably on a 1 GB VPS. The indexer and server can run simultaneously — the server opens the Tantivy index read-only and picks up new segments as the indexer commits.
tywb index and tywb server: two processes, one machine
tywb index and tywb server are subcommands of the same binary but are designed to run as separate processes on the same machine, sharing the same data files:
| | `tywb index` | `tywb server` |
|---|---|---|
| Process lifetime | One-shot batch job | Long-running daemon |
| Typical schedule | Nightly (cron / systemd timer) | Always running |
| CDX database | Writes new records | Reads only |
| Tantivy index | Writes new segments, commits | Reads only (no file lock held) |
Because the server opens the Tantivy index in read-only mode and SQLite runs in WAL mode, the two processes can run at the same time without conflict. When the indexer commits a batch, the server picks up the new segments automatically on the next query — no restart required.
This split keeps the server's RAM footprint small and predictable (~60 MB idle). The indexer's higher peak usage (~100–200 MB during ingest) is transient and does not affect the running server.
Deployment notes
Run the indexer periodically (e.g. nightly via cron or systemd timer) and keep the server running continuously:
# /etc/systemd/system/tywb-server.service
[Unit]
Description=tywb HTTP server
After=network.target
[Service]
ExecStart=/usr/local/bin/tywb --config /etc/tywb/config.yaml server
Environment=AWS_ACCESS_KEY_ID=...
Environment=AWS_SECRET_ACCESS_KEY=...
Restart=always
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/tywb-index.service
[Unit]
Description=tywb indexer (one-shot)
After=network.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/tywb --config /etc/tywb/config.yaml index
Environment=AWS_ACCESS_KEY_ID=...
Environment=AWS_SECRET_ACCESS_KEY=...
# /etc/systemd/system/tywb-index.timer
[Unit]
Description=Run tywb indexer nightly
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
Linting
cargo fmt --all
cargo clippy --all-targets --all-features -- -D warnings
License
Licensed under either of MIT or Apache 2.0 at your option.