Maximillian Dornseif 85be893c8d initial import		2026-05-04 16:34:38 +02:00
archiver	initial import	2026-05-04 16:34:38 +02:00
.gitignore	initial import	2026-05-04 16:34:38 +02:00
archive-domains.sh	initial import	2026-05-04 16:34:38 +02:00
CLAUDE.md	initial import	2026-04-16 21:02:36 +02:00
config.example.yaml	initial import	2026-05-04 16:34:38 +02:00
go.mod	initial import	2026-04-16 21:02:36 +02:00
go.sum	initial import	2026-04-16 21:02:36 +02:00
main.go	initial import	2026-05-04 16:34:38 +02:00
README.md	initial import	2026-04-16 21:02:36 +02:00
TODO.md	initial import	2026-05-04 16:34:38 +02:00

README.md

warc-archiver

A Go tool that archives web pages from multiple sources simultaneously and writes them to a WARC file.

Sources

Source	What it does
`commoncrawl`	Downloads WARC blocks directly from Common Crawl via Range requests (most efficient)
`memento`	Uses the Memento Protocol (RFC 7089) via TimeMap + TimeGate to fetch from multiple archives
`ia`	Queries the Internet Archive CDX API directly and downloads raw snapshots
`live`	Crawls the live web as a fallback
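
The Range-based download used by the `commoncrawl` source can be sketched as follows. A Common Crawl index entry carries a `filename`, an `offset` and a `length`, and a single record can be fetched by requesting exactly that byte span. The function names and the example file path below are hypothetical and are not taken from `commoncrawl.go`. The request is built but not sent.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// rangeHeader returns the HTTP Range value that selects exactly `length`
// bytes starting at `offset`. Byte ranges are inclusive on both ends.
func rangeHeader(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// newRecordRequest builds (but does not send) a GET request for one record.
// baseURL and filename are illustrative; the real tool may build URLs differently.
func newRecordRequest(ctx context.Context, baseURL, filename string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/"+filename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rangeHeader(offset, length))
	return req, nil
}

func main() {
	req, err := newRecordRequest(context.Background(),
		"https://data.commoncrawl.org", "crawl-data/example.warc.gz", 1000, 500)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.URL, req.Header.Get("Range"))
}
```

A server that honours the header answers with `206 Partial Content` and only the requested bytes, which is why this source avoids downloading whole crawl files.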

All sources run in parallel. Before writing, records are deduplicated on the pair (URL, content digest), so an identical capture returned by several sources is stored only once.
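
A minimal sketch of such a deduplication set, assuming the key is simply the URL joined with the digest. The type and method names are hypothetical and may differ from what `dedup.go` actually contains.

```go
package main

import (
	"fmt"
	"sync"
)

// dedupSet is a hypothetical thread-safe set keyed on URL plus content digest.
type dedupSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newDedupSet() *dedupSet {
	return &dedupSet{seen: make(map[string]struct{})}
}

// Add reports true the first time a (url, digest) pair is seen and false
// on every later call with the same pair.
func (d *dedupSet) Add(url, digest string) bool {
	key := url + "\x00" + digest // NUL separator avoids ambiguous concatenations
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.seen[key]; ok {
		return false
	}
	d.seen[key] = struct{}{}
	return true
}

func main() {
	d := newDedupSet()
	fmt.Println(d.Add("https://example.com/", "sha1:AAA")) // first sighting
	fmt.Println(d.Add("https://example.com/", "sha1:AAA")) // duplicate
	fmt.Println(d.Add("https://example.com/", "sha1:BBB")) // same URL, new content
}
```

Keying on both values means the same URL captured with different content is kept, while the same content reported by two archives is dropped.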

Key design principle

Every record is written with the original URL as its WARC-Target-URI, never the URL of the archive it was fetched from. The resulting WARC therefore describes the original resource and does not depend on which source supplied it.
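
As an illustration of this principle, the sketch below recovers the original URL from a Wayback Machine style address of the form `https://web.archive.org/web/<timestamp>/<original URL>`. The helper name `originalURL` is hypothetical, and other archives use other URL layouts, so this is an example for one layout only and not the tool's actual logic.

```go
package main

import (
	"fmt"
	"strings"
)

// originalURL extracts the archived resource's own URL from a Wayback-style
// address. It returns false if the address does not match that layout.
func originalURL(archiveURL string) (string, bool) {
	const marker = "/web/"
	i := strings.Index(archiveURL, marker)
	if i < 0 {
		return "", false
	}
	// rest looks like "20200101000000id_/https://example.com/"
	rest := archiveURL[i+len(marker):]
	j := strings.Index(rest, "/")
	if j < 0 || j+1 >= len(rest) {
		return "", false
	}
	return rest[j+1:], true
}

func main() {
	u, ok := originalURL("https://web.archive.org/web/20200101000000id_/https://example.com/page?x=1")
	fmt.Println(u, ok)
}
```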

Installation

git clone https://github.com/user/warc-archiver
cd warc-archiver
go mod tidy
go build -o warc-archiver .

Usage

Archive a single URL

./warc-archiver -url https://example.com -output example.warc.gz

Archive a single URL from specific sources only

./warc-archiver -url https://example.com -sources commoncrawl,memento -output example.warc.gz

Archive an entire domain

./warc-archiver -domain example.com -output example-domain.warc.gz

Archive with a time range filter

./warc-archiver \
  -url https://example.com \
  -from 2020-01-01T00:00:00Z \
  -to   2023-12-31T23:59:59Z \
  -output example-2020-2023.warc.gz
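
CDX-style indexes typically expect timestamps as 14 digits (`YYYYMMDDhhmmss`) rather than RFC 3339. A minimal sketch of the conversion, assuming the tool normalises `-from` and `-to` to UTC first; the helper name `cdxTimestamp` is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// cdxTimestamp converts an RFC 3339 datetime into the 14-digit UTC form
// (YYYYMMDDhhmmss) used by CDX-style indexes.
func cdxTimestamp(rfc3339 string) (string, error) {
	t, err := time.Parse(time.RFC3339, rfc3339)
	if err != nil {
		return "", err
	}
	return t.UTC().Format("20060102150405"), nil
}

func main() {
	for _, in := range []string{"2020-01-01T00:00:00Z", "2023-12-31T23:59:59Z"} {
		ts, err := cdxTimestamp(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(in, "->", ts)
	}
}
```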

All options

-url string       Single URL to archive
-domain string    Domain to archive (e.g. example.com)
-output string    Output WARC file path (default: archive.warc.gz)
-workers int      Parallel workers per source (default: 5)
-from string      From datetime (RFC3339)
-to string        To datetime (RFC3339)
-sources string   Comma-separated: commoncrawl,memento,ia,live or "all" (default: all)
-verbose          Verbose logging
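
The options above map naturally onto Go's standard `flag` package. The sketch below shows one way to parse them with the documented defaults. The `options` struct and the rule that at least one of `-url` or `-domain` must be given are assumptions, not taken from `main.go`.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"strings"
)

// options mirrors the documented flags; the struct itself is hypothetical.
type options struct {
	URL, Domain, Output, From, To string
	Workers                       int
	Sources                       []string
	Verbose                       bool
}

// parseOptions parses command-line arguments into an options value.
func parseOptions(args []string) (options, error) {
	var o options
	var sources string
	fs := flag.NewFlagSet("warc-archiver", flag.ContinueOnError)
	fs.StringVar(&o.URL, "url", "", "Single URL to archive")
	fs.StringVar(&o.Domain, "domain", "", "Domain to archive (e.g. example.com)")
	fs.StringVar(&o.Output, "output", "archive.warc.gz", "Output WARC file path")
	fs.IntVar(&o.Workers, "workers", 5, "Parallel workers per source")
	fs.StringVar(&o.From, "from", "", "From datetime (RFC3339)")
	fs.StringVar(&o.To, "to", "", "To datetime (RFC3339)")
	fs.StringVar(&sources, "sources", "all", "Comma-separated source list or \"all\"")
	fs.BoolVar(&o.Verbose, "verbose", false, "Verbose logging")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	// Assumption: at least one of -url or -domain is required.
	if o.URL == "" && o.Domain == "" {
		return o, errors.New("one of -url or -domain is required")
	}
	if sources == "all" {
		o.Sources = []string{"commoncrawl", "memento", "ia", "live"}
	} else {
		o.Sources = strings.Split(sources, ",")
	}
	return o, nil
}

func main() {
	o, err := parseOptions([]string{"-url", "https://example.com"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```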

Architecture

main.go
└── archiver/
    ├── archiver.go       # Orchestrator: starts all sources, consumes records
    ├── dedup.go          # Thread-safe deduplication set
    ├── warc/
    │   ├── writer.go     # WARC file writer (wraps gowarc)
    │   └── uuid.go       # UUID v4 generator
    └── sources/
        ├── source.go        # Shared Record type + Source interface
        ├── httpclient.go    # Rate-limited HTTP client with retry
        ├── commoncrawl.go   # Common Crawl CDX + Range download
        ├── memento.go       # Memento Protocol (TimeMap + TimeGate)
        ├── internetarchive.go # IA CDX API + Wayback Machine
        └── livecrawler.go   # Live web crawler with link extraction

WARC output format

Each record contains:

WARC-Target-URI: the original URL
WARC-Date: the capture datetime from the archive
WARC-Source: which source produced this record (e.g. commoncrawl:CC-MAIN-2024-10)
WARC-Payload-Digest: SHA-1 or SHA-256 of the content (used for deduplication)
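
A digest value for WARC-Payload-Digest is conventionally written as the algorithm name, a colon, and the encoded hash, often SHA-1 in Base32. The sketch below assumes that convention; the helper name `payloadDigest` is hypothetical, and the encoding the tool actually uses may differ.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"fmt"
)

// payloadDigest returns a labelled digest such as "sha1:<Base32>" for the
// given payload bytes. A SHA-1 sum is 20 bytes, which encodes to exactly
// 32 Base32 characters without padding.
func payloadDigest(payload []byte) string {
	sum := sha1.Sum(payload)
	return "sha1:" + base32.StdEncoding.EncodeToString(sum[:])
}

func main() {
	fmt.Println(payloadDigest([]byte("<html>hello</html>")))
}
```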

Extending with more crawl indices

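// Use multiple Common Crawl indices for more historical coverage.
src := sources.NewCommonCrawl(opts)
indexes, err := src.FetchAvailableIndexes(ctx) // fetch all available CC crawl IDs
if err != nil {
	return err // assumes the enclosing function returns an error
}
src.WithIndexes(indexes[:min(10, len(indexes))]) // latest 10, or fewer if fewer exist (min needs Go 1.21+)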

Dependencies

github.com/nlnwa/gowarc – WARC writing
golang.org/x/time/rate – rate limiting
golang.org/x/net/html – HTML parsing for link extraction
