warc-archiver

A Go tool that archives web pages from multiple sources simultaneously and writes them to a WARC file.

Sources

Source        What it does
commoncrawl   Downloads WARC blocks directly from Common Crawl via Range requests (most efficient)
memento       Uses the Memento Protocol (RFC 7089) via TimeMap + TimeGate to fetch from multiple archives
ia            Queries the Internet Archive CDX API directly and downloads raw snapshots
live          Crawls the live web as a fallback
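
To illustrate the Range-request approach used by the commoncrawl source: an index lookup typically yields a WARC file path, a byte offset, and a length, and only that slice of the file is fetched. The following is a minimal sketch, not the code in commoncrawl.go; the helper names `rangeHeader` and `newRangeRequest`, the file path, and the offset and length values are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"net/http"
)

// rangeHeader builds the value of an HTTP Range header covering
// `length` bytes starting at `offset` (the end position is inclusive).
func rangeHeader(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// newRangeRequest prepares a GET for one slice of a larger WARC file.
// The caller is expected to check for a 206 Partial Content response.
func newRangeRequest(fileURL string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, fileURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rangeHeader(offset, length))
	return req, nil
}

func main() {
	// Hypothetical path, offset, and length, standing in for an index result.
	req, err := newRangeRequest("https://data.commoncrawl.org/crawl-data/example.warc.gz", 1024, 512)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Range")) // bytes=1024-1535
}
```

Because the end of a byte range is inclusive, the last byte is `offset+length-1`; an off-by-one here would pull in the first byte of the next record.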

All selected sources run in parallel. Records are deduplicated on the pair (URL, content digest) before they are written.
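
A thread-safe set keyed on URL plus digest is one way to get this behaviour. The sketch below is illustrative only and does not reproduce dedup.go; the type name `DedupSet` and its `Add` method are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// DedupSet remembers which (URL, digest) pairs were already seen.
type DedupSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// NewDedupSet returns an empty set ready for concurrent use.
func NewDedupSet() *DedupSet {
	return &DedupSet{seen: make(map[string]struct{})}
}

// Add reports true the first time a (url, digest) pair is offered
// and false for every later duplicate.
func (d *DedupSet) Add(url, digest string) bool {
	key := url + "\x00" + digest // NUL separator keeps the two fields unambiguous
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, dup := d.seen[key]; dup {
		return false
	}
	d.seen[key] = struct{}{}
	return true
}

func main() {
	set := NewDedupSet()
	fmt.Println(set.Add("https://example.com/", "sha1:AAAA")) // true
	fmt.Println(set.Add("https://example.com/", "sha1:AAAA")) // false
	fmt.Println(set.Add("https://example.com/", "sha1:BBBB")) // true
}
```

Keying on both fields means the same URL captured with different content is kept, while the same capture found through two sources is written once.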

Key design principle

All records are written with the original URL as WARC-Target-URI, never the archive URL. This makes the WARC semantically correct and independent of the source.
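
As an example of what this principle implies, a snapshot fetched from the Wayback Machine has an archive URL of the form `https://web.archive.org/web/<timestamp><modifier>/<original URL>`, and the original URL has to be recovered before the record is written. The function below is a rough sketch under that assumption; `originalFromWayback` is a hypothetical name, and other archives use other URL layouts that it does not handle.

```go
package main

import (
	"fmt"
	"strings"
)

// originalFromWayback extracts the original URL from a Wayback Machine
// archive URL. It returns false if the input does not have that shape.
func originalFromWayback(archiveURL string) (string, bool) {
	const prefix = "https://web.archive.org/web/"
	if !strings.HasPrefix(archiveURL, prefix) {
		return "", false
	}
	rest := archiveURL[len(prefix):] // "<timestamp><modifier>/<original URL>"
	slash := strings.Index(rest, "/")
	if slash < 0 || slash+1 >= len(rest) {
		return "", false
	}
	return rest[slash+1:], true
}

func main() {
	orig, ok := originalFromWayback("https://web.archive.org/web/20200101000000id_/https://example.com/page?x=1")
	fmt.Println(orig, ok) // https://example.com/page?x=1 true
}
```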

Installation

git clone https://github.com/user/warc-archiver
cd warc-archiver
go mod tidy
go build -o warc-archiver .

Usage

Archive a single URL

./warc-archiver -url https://example.com -output example.warc.gz

Archive a single URL from specific sources only

./warc-archiver -url https://example.com -sources commoncrawl,memento -output example.warc.gz

Archive an entire domain

./warc-archiver -domain example.com -output example-domain.warc.gz

Archive with a time range filter

./warc-archiver \
  -url https://example.com \
  -from 2020-01-01T00:00:00Z \
  -to   2023-12-31T23:59:59Z \
  -output example-2020-2023.warc.gz
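
The `-from` and `-to` values are RFC3339 datetimes. A filter along these lines could be applied to each capture; this is a sketch under the assumption that both bounds are inclusive and that an omitted bound leaves that side open, and the names `parseBound` and `inRange` are made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// parseBound parses an RFC3339 datetime. An empty string yields the
// zero time, which inRange treats as "no bound on this side".
func parseBound(s string) (time.Time, error) {
	if s == "" {
		return time.Time{}, nil
	}
	return time.Parse(time.RFC3339, s)
}

// inRange reports whether capture lies within [from, to], bounds inclusive.
func inRange(capture, from, to time.Time) bool {
	if !from.IsZero() && capture.Before(from) {
		return false
	}
	if !to.IsZero() && capture.After(to) {
		return false
	}
	return true
}

func main() {
	from, err := parseBound("2020-01-01T00:00:00Z")
	if err != nil {
		panic(err)
	}
	to, err := parseBound("2023-12-31T23:59:59Z")
	if err != nil {
		panic(err)
	}
	capture := time.Date(2021, time.June, 15, 12, 0, 0, 0, time.UTC)
	fmt.Println(inRange(capture, from, to)) // true
}
```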

All options

-url string       Single URL to archive
-domain string    Domain to archive (e.g. example.com)
-output string    Output WARC file path (default: archive.warc.gz)
-workers int      Parallel workers per source (default: 5)
-from string      From datetime (RFC3339)
-to string        To datetime (RFC3339)
-sources string   Comma-separated: commoncrawl,memento,ia,live or "all" (default: all)
-verbose          Verbose logging

Architecture

main.go
└── archiver/
    ├── archiver.go       # Orchestrator: starts all sources, consumes records
    ├── dedup.go          # Thread-safe deduplication set
    ├── warc/
    │   ├── writer.go     # WARC file writer (wraps gowarc)
    │   └── uuid.go       # UUID v4 generator
    └── sources/
        ├── source.go        # Shared Record type + Source interface
        ├── httpclient.go    # Rate-limited HTTP client with retry
        ├── commoncrawl.go   # Common Crawl CDX + Range download
        ├── memento.go       # Memento Protocol (TimeMap + TimeGate)
        ├── internetarchive.go # IA CDX API + Wayback Machine
        └── livecrawler.go   # Live web crawler with link extraction
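
The orchestration described for archiver.go (start all sources, consume their records) is a fan-in pattern. The sketch below shows one way to do it. The `Record` fields, the `Source` method set, and the `staticSource` fake are assumptions for illustration and may not match source.go.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Record is a hypothetical stand-in for the shared record type.
type Record struct {
	TargetURI string
	Source    string
}

// Source is a hypothetical interface: each source streams records into out.
type Source interface {
	Fetch(ctx context.Context, target string, out chan<- Record) error
}

// staticSource is a fake source that emits one record, so the sketch runs offline.
type staticSource struct{ name string }

func (s staticSource) Fetch(ctx context.Context, target string, out chan<- Record) error {
	select {
	case out <- Record{TargetURI: target, Source: s.name}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// collect runs every source in its own goroutine and gathers all records.
func collect(ctx context.Context, target string, srcs []Source) []Record {
	out := make(chan Record)
	var wg sync.WaitGroup
	for _, s := range srcs {
		wg.Add(1)
		go func(s Source) {
			defer wg.Done()
			_ = s.Fetch(ctx, target, out) // a real orchestrator would report this error
		}(s)
	}
	// Close the channel once every source has finished, ending the loop below.
	go func() { wg.Wait(); close(out) }()

	var records []Record
	for r := range out {
		records = append(records, r)
	}
	return records
}

// demoRecords wires three fake sources together.
func demoRecords() []Record {
	srcs := []Source{staticSource{"commoncrawl"}, staticSource{"ia"}, staticSource{"live"}}
	return collect(context.Background(), "https://example.com/", srcs)
}

func main() {
	fmt.Println(len(demoRecords())) // 3
}
```

A single consumer loop is a natural place for deduplication and WARC writing, since neither then needs to coordinate with other writers.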

WARC output format

Each record contains:

  • WARC-Target-URI: the original URL
  • WARC-Date: the capture datetime from the archive
  • WARC-Source: which source produced this record (e.g. commoncrawl:CC-MAIN-2024-10); this is a custom field, not one defined by the WARC standard
  • WARC-Payload-Digest: SHA-1 or SHA-256 of the content (used for deduplication)
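
For reference, payload digests in WARC files are commonly written as an algorithm label followed by a Base32-encoded hash. The sketch below computes the SHA-1 variant in that form; whether this tool emits exactly this encoding is an assumption, and `payloadDigest` is a hypothetical helper name.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"fmt"
)

// payloadDigest returns "sha1:" followed by the Base32-encoded SHA-1
// of the payload bytes (the HTTP body, without headers).
func payloadDigest(payload []byte) string {
	sum := sha1.Sum(payload)
	return "sha1:" + base32.StdEncoding.EncodeToString(sum[:])
}

func main() {
	// Digest of an empty payload.
	fmt.Println(payloadDigest(nil)) // sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
}
```

Two captures of the same URL with identical bodies produce the same digest string, which is what makes it usable as half of the deduplication key.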

Extending with more crawl indices

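// Use multiple Common Crawl indices (more historical coverage):
src := sources.NewCommonCrawl(opts)
indexes, err := src.FetchAvailableIndexes(ctx) // fetch all available CC crawl IDs
if err != nil {
	return err // do not continue with an empty or partial index list
}
src.WithIndexes(indexes[:min(10, len(indexes))]) // use latest 10 (min needs Go 1.21+)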

Dependencies