Internals (deep dive)
This page explains how each module works and how they interact.
wayparam.cli
Responsibilities:
- define the public CLI contract (argparse)
- validate option interactions (--no-files requires --stdout)
- configure concurrency (asyncio semaphore)
- coordinate per-domain processing and aggregate results
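The option check and the concurrency cap listed above can be sketched roughly as follows. This is a minimal illustration, not the actual wayparam.cli code: only --no-files and --stdout come from this page, while the --concurrency flag and the helper names parse_args and process_all are assumptions made for the example.

```python
import argparse
import asyncio


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical subset of the CLI; --concurrency is an assumed flag name.
    parser = argparse.ArgumentParser(prog="wayparam")
    parser.add_argument("domains", nargs="*")
    parser.add_argument("--stdout", action="store_true")
    parser.add_argument("--no-files", action="store_true")
    parser.add_argument("--concurrency", type=int, default=4)
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Option interaction: with no files and no stdout there would be no output.
    if args.no_files and not args.stdout:
        parser.error("--no-files requires --stdout")  # exits with status 2
    return args


async def process_all(domains, concurrency, worker):
    # The semaphore caps how many domains are processed at the same time.
    sem = asyncio.Semaphore(concurrency)

    async def guarded(domain):
        async with sem:
            return await worker(domain)

    # gather() returns results in the same order as the input domains.
    return await asyncio.gather(*(guarded(d) for d in domains))
```

Creating the semaphore inside the coroutine keeps it tied to the running event loop, which avoids surprises on older Python versions.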
Per-domain processing
For each raw URL returned by CDX:
1) filter boring URLs (filters.is_boring)
2) canonicalize (normalize.canonicalize_url)
3) filter again (canonicalized URL may reveal a static extension)
4) deduplicate and emit output (output.write_record / output.print_record_stdout)
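The four steps above might be wired together as in the sketch below. The function names is_boring and canonicalize_url mirror the ones mentioned in this page, but their bodies here are simplified stand-ins, and process_urls is a hypothetical helper name.

```python
from urllib.parse import urlsplit

STATIC_EXTS = (".png", ".css", ".js")  # assumed sample of static extensions


def is_boring(url: str) -> bool:
    # Stand-in for filters.is_boring: extension check on the path only.
    return urlsplit(url).path.lower().endswith(STATIC_EXTS)


def canonicalize_url(url: str):
    # Stand-in for normalize.canonicalize_url: require an absolute URL
    # and drop the fragment. Returns None when the URL is rejected.
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return None
    return parts._replace(fragment="").geturl()


def process_urls(raw_urls):
    seen = set()
    for raw in raw_urls:
        if is_boring(raw):                         # step 1: cheap pre-filter
            continue
        canon = canonicalize_url(raw)              # step 2: canonicalize
        if canon is None or is_boring(canon):      # step 3: filter again
            continue
        if canon in seen:                          # step 4: deduplicate
            continue
        seen.add(canon)
        yield canon                                # caller emits the record
```

Filtering before canonicalization skips the more expensive parsing for URLs that are obviously static assets.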
wayparam.wayback
Endpoint
The CDX endpoint used is:
https://web.archive.org/cdx/search/cdx
Paging / resumeKey
CDX may return a resume key at the end of the response. wayparam:
- reads the last line of the response
- detects both explicit resumeKey: forms and heuristic forms
- loops until no resumeKey is returned (or until a key repeats, as a safety break)
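A paging loop along these lines could look like the sketch below. The heuristic used here (a lone final line preceded by a blank line is treated as a key) is an assumption for illustration, as are the helper names split_resume_key and fetch_all and the injected fetch_page callable.

```python
def split_resume_key(body: str):
    """Return (url_lines, resume_key_or_None) for one response body."""
    lines = body.splitlines()
    while lines and not lines[-1].strip():
        lines.pop()                       # ignore trailing blank lines
    if not lines:
        return [], None
    last = lines[-1].strip()
    key = None
    if last.lower().startswith("resumekey:"):
        # Explicit form: "resumeKey: <value>"
        key = last.split(":", 1)[1].strip() or None
        lines.pop()
    elif len(lines) >= 2 and not lines[-2].strip():
        # Assumed heuristic form: a lone token after a blank separator line.
        key = last
        lines.pop()
    return [ln for ln in lines if ln.strip()], key


def fetch_all(fetch_page):
    """fetch_page(resume_key_or_None) -> response body text (hypothetical hook)."""
    urls, key, seen_keys = [], None, set()
    while True:
        page_urls, next_key = split_resume_key(fetch_page(key))
        urls.extend(page_urls)
        # Stop when there is no key, or when a key repeats (safety break).
        if not next_key or next_key in seen_keys:
            break
        seen_keys.add(next_key)
        key = next_key
    return urls
```

Tracking the keys already seen guards against an endless loop if the server keeps returning the same key.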
wayparam.http
Resilience
get_text() performs:
- retries on transient errors and HTTP status errors
- exponential backoff
- special handling for 429/503
- includes final status=... or no-status in the raised error message
Including the final status in the error message makes failures easier to diagnose in real-world environments (VPNs, flaky networks).
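The retry behavior described above can be approximated as follows. This is a sketch under stated assumptions rather than the real get_text(): the send callable, the FetchError class, the retries and base_delay parameters, and the choice to honor a server-provided wait on 429/503 are all hypothetical.

```python
import asyncio


class FetchError(Exception):
    """Raised when every attempt has failed."""


async def get_text(send, url, retries=3, base_delay=0.5, sleep=asyncio.sleep):
    """Fetch url via send(url) -> (status, text, retry_after_or_None).

    send is a hypothetical transport hook; it may raise OSError on
    transient network failures.
    """
    final_status = None
    for attempt in range(retries + 1):
        retry_after = None
        try:
            status, text, retry_after = await send(url)
        except OSError:
            final_status = None           # transport failure: no HTTP status
        else:
            if 200 <= status < 300:
                return text
            final_status = status
        if attempt == retries:
            break                         # out of attempts, do not sleep again
        delay = base_delay * (2 ** attempt)   # exponential backoff
        if final_status in (429, 503) and retry_after is not None:
            # Assumed special case: wait at least as long as the server asks.
            delay = max(delay, retry_after)
        await sleep(delay)
    label = f"status={final_status}" if final_status is not None else "no-status"
    raise FetchError(f"GET {url} failed after {retries + 1} attempts: {label}")
```

Passing sleep as a parameter makes the backoff easy to stub out in tests.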
wayparam.normalize
Canonicalization steps:
- require absolute URLs with scheme + netloc
- drop fragments
- normalize host casing and default ports
- parse query string, optionally drop tracking params
- optionally replace values with placeholder
- sort params for stable output
- optionally drop URLs with no query params (default behavior)
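As a rough illustration of these steps, the sketch below uses only urllib.parse. The parameter names, the tracking-parameter set, and the decision to return None for rejected URLs are assumptions; the real normalize.canonicalize_url may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}  # assumed set
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url, drop_tracking=True, placeholder=None, require_params=True):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Require an absolute URL with both scheme and host.
    if not scheme or not parts.hostname:
        return None
    host = parts.hostname                  # urlsplit lowercases the hostname
    try:
        port = parts.port
    except ValueError:                     # malformed port
        return None
    # Keep the port only when it is not the default for the scheme.
    # (Userinfo and IPv6 brackets are not handled in this sketch.)
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if drop_tracking:
        pairs = [(k, v) for k, v in pairs if k.lower() not in TRACKING]
    if placeholder is not None:
        pairs = [(k, placeholder) for k, _ in pairs]
    if require_params and not pairs:
        return None
    pairs.sort()                           # stable output regardless of input order
    # An empty fragment component drops the fragment.
    return urlunsplit((scheme, host, parts.path, urlencode(pairs), ""))
```

For example, under these assumptions `canonicalize_url("HTTPS://Example.COM:443/search?q=1&b=2&utm_source=x#frag")` returns `https://example.com/search?b=2&q=1`.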
wayparam.filters
Filtering is based primarily on path extension (e.g., .png, .css, .js) and optional regex rules.
Modes:
- blacklist only (default)
- whitelist mode (if --ext-whitelist is set)
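The two modes could be expressed as in the sketch below. The signature, the sample blacklist, the treatment of regex rules as exclusion patterns, and the choice to keep extensionless paths in whitelist mode are all assumptions for the example, not confirmed behavior of filters.is_boring.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit

DEFAULT_BLACKLIST = frozenset({".png", ".css", ".js"})  # assumed sample


def is_boring(url, ext_blacklist=DEFAULT_BLACKLIST, ext_whitelist=None, regexes=()):
    # Only the path is inspected, so "?file=x.png" does not count as an extension.
    ext = PurePosixPath(urlsplit(url).path).suffix.lower()
    # Assumed: regex rules are exclusion patterns matched against the full URL.
    if any(re.search(rx, url) for rx in regexes):
        return True
    if ext_whitelist:
        # Whitelist mode: a URL with an extension must be on the list.
        # Extensionless paths are kept (an assumption of this sketch).
        return bool(ext) and ext not in ext_whitelist
    # Blacklist mode (default): drop known static extensions.
    return ext in ext_blacklist
```

Under these assumptions, `is_boring("https://example.com/static/app.min.js?v=3")` is true in the default mode, while a `.php` URL is kept unless a whitelist that omits `.php` is supplied.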
wayparam.output
Key rule:
- keep stdout strictly machine-readable
- send diagnostics to stderr
Formats:
- txt: URL per line
- jsonl: JSON object per line (record)
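A minimal sketch of the two formats and the stdout/stderr split follows. The record fields ("url", "domain") and the helpers format_record and log are assumptions; only the name print_record_stdout appears in this page, and its real signature may differ.

```python
import json
import sys


def format_record(record: dict, fmt: str = "txt") -> str:
    if fmt == "jsonl":
        # One compact JSON object per line; sort_keys keeps output stable.
        return json.dumps(record, sort_keys=True, ensure_ascii=False)
    if fmt == "txt":
        return record["url"]               # assumed field name
    raise ValueError(f"unknown format: {fmt}")


def print_record_stdout(record: dict, fmt: str = "txt", out=None) -> None:
    # Data goes to stdout only, so it can be piped into other tools.
    stream = out if out is not None else sys.stdout
    print(format_record(record, fmt), file=stream)


def log(message: str, err=None) -> None:
    # Diagnostics go to stderr and never mix with the data stream.
    stream = err if err is not None else sys.stderr
    print(message, file=stream)
```

With this split, redirecting stdout to a file captures only records, while progress and error messages stay visible on the terminal.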
wayparam.ratelimit
A small async rate limiter that enforces a global RPS limit across all tasks.
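One way such a limiter could work is by handing out evenly spaced time slots, as in the sketch below. The class name RateLimiter, its constructor argument, and the slot-based design are assumptions for illustration, not the actual wayparam.ratelimit code.

```python
import asyncio
import time


class RateLimiter:
    """Spaces acquisitions at least 1/rps seconds apart across all tasks."""

    def __init__(self, rps: float):
        if rps <= 0:
            raise ValueError("rps must be positive")
        self._interval = 1.0 / rps
        self._next_slot = 0.0              # monotonic time of the next free slot

    async def acquire(self) -> None:
        now = time.monotonic()
        # Reserve a slot before awaiting: there is no await between reading
        # and updating _next_slot, so tasks on one event loop cannot race.
        slot = max(now, self._next_slot)
        self._next_slot = slot + self._interval
        if slot > now:
            await asyncio.sleep(slot - now)
```

Sharing a single instance between all tasks is what makes the limit global: each task calls `await limiter.acquire()` immediately before sending a request.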