warc-archiver
A Go tool that archives web pages from multiple sources simultaneously and writes them to a WARC file.
Sources
| Source | What it does |
|---|---|
| commoncrawl | Downloads WARC blocks directly from Common Crawl via Range requests (most efficient) |
| memento | Uses the Memento Protocol (RFC 7089) via TimeMap + TimeGate to fetch from multiple archives |
| ia | Queries the Internet Archive CDX API directly and downloads raw snapshots |
| live | Crawls the live web as a fallback |
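The commoncrawl row depends on HTTP Range requests: a Common Crawl index hit names the WARC file plus the byte offset and length of one record, so only that slice has to be downloaded. The following is a minimal sketch of building such a request, not the tool's actual code. The `indexEntry` struct, the `rangeRequest` helper, and the example filename are hypothetical; the base URL `https://data.commoncrawl.org/` is assumed to be where the crawl files are served.

```go
package main

import (
	"fmt"
	"net/http"
)

// indexEntry is a hypothetical holder for the fields of one index hit.
type indexEntry struct {
	Filename string // path of the WARC file relative to the base URL
	Offset   int64  // byte offset of the record inside that file
	Length   int64  // length of the record in bytes
}

// rangeRequest builds a GET request that asks for exactly one record.
func rangeRequest(base string, e indexEntry) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, base+e.Filename, nil)
	if err != nil {
		return nil, err
	}
	// HTTP byte ranges are inclusive on both ends, hence the -1.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", e.Offset, e.Offset+e.Length-1))
	return req, nil
}

func main() {
	// Placeholder values for illustration only.
	e := indexEntry{Filename: "crawl-data/example.warc.gz", Offset: 1000, Length: 500}
	req, err := rangeRequest("https://data.commoncrawl.org/", e)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL, req.Header.Get("Range"))
}
```

A server that honors the header answers with `206 Partial Content` and only the requested bytes.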
All sources run in parallel. Results are deduplicated by (URL + content digest) before writing.
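The fan-in and deduplication step can be sketched roughly as follows. This is an illustration under assumptions, not the code in archiver.go or dedup.go: the `record` type, `dedupSet`, and `collect` are hypothetical names, and each source is modeled as a plain slice of records instead of a real fetcher.

```go
package main

import (
	"fmt"
	"sync"
)

// record is a hypothetical, simplified stand-in for the tool's record type.
type record struct {
	URL    string
	Digest string
	Source string
}

// dedupSet remembers which (URL, digest) pairs have already been seen.
type dedupSet struct {
	mu   sync.Mutex
	seen map[[2]string]struct{}
}

func newDedupSet() *dedupSet {
	return &dedupSet{seen: make(map[[2]string]struct{})}
}

// add reports true only the first time a (URL, digest) pair is offered.
func (d *dedupSet) add(r record) bool {
	key := [2]string{r.URL, r.Digest}
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, dup := d.seen[key]; dup {
		return false
	}
	d.seen[key] = struct{}{}
	return true
}

// collect runs every source in its own goroutine and returns the unique records.
func collect(sources [][]record) []record {
	out := make(chan record)
	var wg sync.WaitGroup
	for _, src := range sources {
		wg.Add(1)
		go func(recs []record) {
			defer wg.Done()
			for _, r := range recs {
				out <- r
			}
		}(src)
	}
	// Close the channel once every source has finished sending.
	go func() {
		wg.Wait()
		close(out)
	}()

	set := newDedupSet()
	var unique []record
	for r := range out {
		if set.add(r) {
			unique = append(unique, r)
		}
	}
	return unique
}

func main() {
	unique := collect([][]record{
		{{"https://example.com/", "sha1:AAA", "commoncrawl"}},
		{{"https://example.com/", "sha1:AAA", "ia"}, {"https://example.com/", "sha1:BBB", "ia"}},
	})
	fmt.Println(len(unique)) // prints 2: the identical capture from "ia" is dropped
}
```

Keying on the pair rather than on the URL alone keeps distinct captures of the same page while dropping identical copies found in more than one source. The mutex is not strictly needed with a single consumer as shown here, but it lets the set be shared if several goroutines write to it.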
Key design principle
All records are written with the original URL as WARC-Target-URI, never the archive URL. This makes the WARC semantically correct and independent of the source.
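One way to honor this principle when only the fetched URL is at hand is to strip the archive prefix before writing the record. The sketch below handles Wayback-style replay URLs only and is an assumption about how this could be done. The tool itself may simply carry the original URL alongside each capture, and `targetURI` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// waybackRE matches Wayback-style replay URLs of the form
// https://web.archive.org/web/<timestamp><optional modifier>/<original URL>.
var waybackRE = regexp.MustCompile(`^https?://web\.archive\.org/web/\d{1,14}[a-z_]*/(.+)$`)

// targetURI returns the original URL to store in WARC-Target-URI.
// URLs that are not Wayback-style replay URLs are returned unchanged.
func targetURI(fetched string) string {
	if m := waybackRE.FindStringSubmatch(fetched); m != nil {
		return m[1]
	}
	return fetched
}

func main() {
	fmt.Println(targetURI("https://web.archive.org/web/20200101000000id_/https://example.com/page"))
	fmt.Println(targetURI("https://example.com/page"))
}
```

Both calls print `https://example.com/page`. Other Memento-compliant archives use different URL layouts, so a real implementation would need one rule per archive or would avoid parsing altogether.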
Installation
git clone https://github.com/user/warc-archiver
cd warc-archiver
go mod tidy
go build -o warc-archiver .
Usage
Archive a single URL
./warc-archiver -url https://example.com -output example.warc.gz
Archive a single URL from specific sources only
./warc-archiver -url https://example.com -sources commoncrawl,memento -output example.warc.gz
Archive an entire domain
./warc-archiver -domain example.com -output example-domain.warc.gz
Archive with a time range filter
./warc-archiver \
-url https://example.com \
-from 2020-01-01T00:00:00Z \
-to 2023-12-31T23:59:59Z \
-output example-2020-2023.warc.gz
All options
-url string Single URL to archive
-domain string Domain to archive (e.g. example.com)
-output string Output WARC file path (default: archive.warc.gz)
-workers int Parallel workers per source (default: 5)
-from string From datetime (RFC3339)
-to string To datetime (RFC3339)
-sources string Comma-separated: commoncrawl,memento,ia,live or "all" (default: all)
-verbose Verbose logging
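The -sources value is a comma-separated list or the keyword "all". A sketch of how such a value could be validated is shown below. The source names come from the table above; the `parseSources` helper and its error behavior are assumptions and not necessarily what main.go does.

```go
package main

import (
	"fmt"
	"strings"
)

// allSources lists the source names documented in this README.
var allSources = []string{"commoncrawl", "memento", "ia", "live"}

// parseSources expands "all" (or an empty value) to every source and
// rejects names that are not known. Duplicates are dropped.
func parseSources(v string) ([]string, error) {
	if v == "" || v == "all" {
		return append([]string(nil), allSources...), nil
	}
	known := make(map[string]bool, len(allSources))
	for _, s := range allSources {
		known[s] = true
	}
	seen := make(map[string]bool)
	var out []string
	for _, name := range strings.Split(v, ",") {
		name = strings.TrimSpace(name)
		if !known[name] {
			return nil, fmt.Errorf("unknown source %q", name)
		}
		if !seen[name] {
			seen[name] = true
			out = append(out, name)
		}
	}
	return out, nil
}

func main() {
	srcs, err := parseSources("commoncrawl,memento")
	fmt.Println(srcs, err)
	_, err = parseSources("commoncrawl,bogus")
	fmt.Println(err)
}
```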
Architecture
main.go
└── archiver/
    ├── archiver.go            # Orchestrator: starts all sources, consumes records
    ├── dedup.go               # Thread-safe deduplication set
    ├── warc/
    │   ├── writer.go          # WARC file writer (wraps gowarc)
    │   └── uuid.go            # UUID v4 generator
    └── sources/
        ├── source.go          # Shared Record type + Source interface
        ├── httpclient.go      # Rate-limited HTTP client with retry
        ├── commoncrawl.go     # Common Crawl CDX + Range download
        ├── memento.go         # Memento Protocol (TimeMap + TimeGate)
        ├── internetarchive.go # IA CDX API + Wayback Machine
        └── livecrawler.go     # Live web crawler with link extraction
WARC output format
Each record contains:
- WARC-Target-URI: the original URL
- WARC-Date: the capture datetime from the archive
- WARC-Source: which source produced this record (e.g. commoncrawl:CC-MAIN-2024-10)
- WARC-Payload-Digest: SHA-1 or SHA-256 of the content (used for deduplication)
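A payload digest of this kind can be computed as in the sketch below. It assumes the `sha1:` label followed by a Base32-encoded hash, which is the form commonly seen in WARC files; the exact label and encoding this tool writes are not stated here, and `payloadDigest` is a hypothetical helper name.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"fmt"
)

// payloadDigest returns the SHA-1 of body as "sha1:<Base32>".
// A SHA-1 sum is 20 bytes, which encodes to 32 Base32 characters without padding.
func payloadDigest(body []byte) string {
	sum := sha1.Sum(body)
	return "sha1:" + base32.StdEncoding.EncodeToString(sum[:])
}

func main() {
	fmt.Println(payloadDigest([]byte("<html>hello</html>")))
}
```

Switching to SHA-256 would mean using `crypto/sha256` and a `sha256:` label instead. Whichever hash is chosen, every source has to use the same one, otherwise identical payloads will not deduplicate.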
Extending with more crawl indices
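// Use multiple Common Crawl indices (more historical coverage):
src := sources.NewCommonCrawl(opts)
indexes, err := src.FetchAvailableIndexes(ctx) // fetch all available CC crawl IDs
if err != nil {
	log.Fatalf("fetching Common Crawl indexes: %v", err) // needs the "log" import
}
src.WithIndexes(indexes[:min(10, len(indexes))]) // use latest 10, or fewer if fewer exist (min needs Go 1.21+)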
Dependencies
- github.com/nlnwa/gowarc – WARC writing
- golang.org/x/time/rate – rate limiting
- golang.org/x/net/html – HTML parsing for link extraction