notes

Developing with the Internet Archive

Documentation

Internet Archive Developer Portal: IA developer resources; mostly oriented towards collections, not the Wayback Machine.
File formats
- WARC and CDX specifications
- An introduction to WARC
Wayback Machine APIs
- API docs (2013)
- WARC
  - WARC and ARC files appear to be stored at https://archive.org/download/<filename>, where the filename comes from a CDX query, as indicated in the Archive-It documentation. Requests to these URLs return a 403 error, though.
- CDX (Capture Index)
  - Wayback CDX Server API [docs source]
  - Differences between v1 and v2
  - Wayback Machine CDX image shards are privately stored in the waybackcdx collection.
  - sortkey is in SURT form (Sort-friendly URI Reordering Transform), which is implemented in the webarchive-commons org.archive.url package and the internetarchive/surt Python port.
- Memento
  - Fully compliant with the Memento protocol
  - Examples (2013)
- Availability JSON
- Save Page Now
  - Limited to 15 requests per minute; exceeding the limit blocks your IP for 5 minutes (as of 2019)
  - Save Page Now changelog
- URL formats
  - Advanced search
  - Wikipedia documentation
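
The SURT form used for the CDX sortkey can be illustrated with a small sketch. This is not the internetarchive/surt implementation, which applies further canonicalization rules; `simple_surt` is a hypothetical helper that assumes only the basic idea: drop the scheme and a leading "www.", reverse the host labels, and append the path and query after a ")".

```python
from urllib.parse import urlsplit


def simple_surt(url: str) -> str:
    """Return a simplified SURT-style key for a URL (illustrative only)."""
    # Accept bare hosts such as "archive.org" by assuming http.
    parts = urlsplit(url if "://" in url else "http://" + url)

    # urlsplit lowercases the hostname; strip a leading "www." (an assumption
    # about the default canonicalization, simplified here).
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]

    # Reverse the host labels so that keys sort by domain first.
    key = ",".join(reversed(host.split(".")))
    if parts.port is not None:
        key += ":" + str(parts.port)

    key += ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


if __name__ == "__main__":
    for u in ["http://www.Example.com/a/b?x=1", "archive.org"]:
        print(u, "->", simple_surt(u))
```

Because the host labels are reversed, sorting these keys groups all captures of a domain and its subdomains together, which is what makes the form convenient for an index.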

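One way to stay under the Save Page Now limit of 15 requests per minute is to space requests at least 4 seconds apart on the client side. The sketch below is a minimal example of that idea; the `Throttle` class and its parameters are hypothetical names, and it does not call any Internet Archive API itself.

```python
import time


class Throttle:
    """Space calls so that consecutive calls are at least 60/max_per_minute seconds apart."""

    def __init__(self, max_per_minute=15, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time at which the previous call was released

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` immediately before each save request keeps a single process within the limit; it does not coordinate several processes sharing one IP address.
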
Tools

Lists
- IIPC’s Awesome Web Archiving
- Archive Team’s list of tools to restore a site from the Wayback Machine
Tools
- edgi-govdata-archiving/wayback [docs]: Python API to the Wayback Machine.
- ArchiveBox: Self-hosted archival tool.
- OutbackCDX: RocksDB-based CDX server for web archives, used by national libraries with millions of records. Works with the OpenWayback (XML) and pywb (JSON) CDX protocols.
- IA cdx-summary: Python CLI to summarize CDX files.
- IIPC jwarc: WARC parser and writer.
- IIPC urlcanon: Python URL parser, browser-style URL canonicalizer, and SSURT (improved SURT).
Source of IA projects:
- IA Heritrix [API docs] [Wikipedia]: IA’s web crawler.
- webrecorder/pywb [docs]: Provides the basic functionality of a Wayback Machine.
- IIPC OpenWayback [source wiki]: Project to build the Wayback Machine (no longer under development; the project recommends pywb instead).
Rust crates
- wayback-rs [source]: Downloader using the CDX v1 API. Saves pages with the Save Page Now API as an authenticated user. Guesses the body of a redirect and checks it against the digest, to reduce API calls. Handles retries and redirects.
- wayback-urls [source]: URL builder for Wayback Machine CDX v2 (timemap) API.
- wayback-mirror [source]: Simple downloader for Wayback Machine CDX v1 API.
- wayback-archiver [source]: CLI that saves pages with the Save Page Now API. Has good API status code handling.
- warc [source]: Reader and writer for WARC files.
- warc_nom_parser [source]: Small reader for WARC files using nom.
- rust_warc [source]: Small reader for WARC files.
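
The digest check that wayback-rs performs can be sketched in a few lines. This is not the crate's code: it assumes that the CDX digest field is the base32-encoded SHA-1 of the response body, and `cdx_style_digest` and `matches_digest` are hypothetical names.

```python
import base64
import hashlib


def cdx_style_digest(body: bytes) -> str:
    """Return the base32-encoded SHA-1 of a body (assumed CDX digest format)."""
    return base64.b32encode(hashlib.sha1(body).digest()).decode("ascii")


def matches_digest(body: bytes, expected: str) -> bool:
    """Check a candidate body against a digest taken from a CDX record."""
    return cdx_style_digest(body) == expected.strip().upper()
```

If a guessed redirect body matches the digest from the CDX record, the capture does not need to be downloaded, which is how such a check can reduce API calls.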

Other archives

ArchiveTeam
- icka: IRCCloud keep-alive, for ArchiveTeam IRC channels.
Archive-It
- Web Archiving Systems API (WASAPI): Querying and downloading WARCs from Archive-It.
- grab-site: Web crawler that recursively crawls a site from a starting URL and writes WARCs, with an interactive dashboard. Uses a fork of wpull.
- Wpull: Wget-compatible web crawler and downloader (a remake of Wget).
Time Travel: Find mementos in IA, Archive-It, the British Library, archive.today, GitHub, and more.
