Mirroring Dead Websites

Today I checked in on cjdns to see where the project was. It looked about like it did the last time I checked it several months ago. While reading doc/bugs/policy.md I noticed a link that went to archive.org because the original site was dead due to domain name expiration. Thinking about how IPFS could have kept sites like this alive even when the original owner abandoned the site, I dove down a rabbit hole and ended up with a mirror of the site.

After a short web search on how to download websites from archive.org, I ended up at this GitHub page for a Wayback Machine Downloader written in Ruby. I installed it, pointed it at http://roaming-initiative.com, and instructed it to download all timestamps (-s). After a bit of a wait (and about 350MB of data), I had a directory containing all the data archive.org held on roaming-initiative.com, including the domain expiration data.
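Based on the flags mentioned above, the invocation would have looked roughly like this; the output path is the downloader's default, so treat this as a sketch rather than the exact command:

```shell
# Install the downloader gem, then fetch every snapshot archive.org holds.
# -s (--all-timestamps) puts each snapshot in its own timestamped directory
# under ./websites/roaming-initiative.com/ (the tool's default output location).
gem install wayback_machine_downloader
wayback_machine_downloader -s http://roaming-initiative.com
```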

A crude binary search later, I found the last valid crawl (2017-10-31 at 10:50 AM). I threw together a quick Ruby script to walk the entire download and merge all the different timestamp directories into a single combined tree, with hard links to the latest copy of each file. Then I added the entire dataset to the public directory for website mirrors and launched the publish script.
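The merge step can be sketched like this. It assumes the downloader left one directory per snapshot (named like `20171031105000`) under a common root; since those names are zero-padded `YYYYMMDDHHMMSS` strings, sorting them lexically visits snapshots chronologically, so later files win. The directory names and method name here are illustrative, not the original script:

```ruby
require 'fileutils'
require 'find'

# Merge per-timestamp snapshot directories under `source` into a single
# `dest` tree. Files are hard-linked rather than copied, so the combined
# tree adds no extra data; a later snapshot replaces an earlier file.
def merge_snapshots(source, dest)
  Dir.children(source).sort.each do |stamp|      # timestamps sort chronologically
    root = File.join(source, stamp)
    next unless File.directory?(root)

    Find.find(root) do |path|
      next unless File.file?(path)
      rel = path.delete_prefix(File.join(root, ''))  # path relative to snapshot
      out = File.join(dest, rel)
      FileUtils.mkdir_p(File.dirname(out))
      FileUtils.rm_f(out)       # drop any older copy first
      File.link(path, out)      # hard link to the latest copy
    end
  end
end

# Example (paths assumed): merge_snapshots('websites/roaming-initiative.com', 'combined')
```

Hard-linking keeps the combined tree cheap on disk while leaving the original per-timestamp download untouched.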

You can look at the site here, but be warned: the site does not use relative links, so it is broken unless viewed through a localhost IPFS gateway.

Comments