I had a thought this morning. Back in the day I used to filter ads at the DNS level by running a caching name server and feeding it a zone file drawn from a list of dodgy ad sites. The list is published with exactly this use in mind, but comes in the format of an /etc/hosts file; a little tweaking and it serves nicely as a zone file usable by BIND.
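The tweaking is roughly this, as a sketch — the names hosts.txt, blocked-zones.conf and db.null are placeholders for illustration, not anything canonical. Each blocked domain gets a named.conf zone stanza pointing at a shared null zone file:

# sketch: turn '0.0.0.0 ads.example.com' lines into named.conf zone stanzas
grep ^0 hosts.txt | awk '{print $2}' | while read d; do
    printf 'zone "%s" { type master; file "/etc/bind/db.null"; };\n' "$d"
done > /etc/bind/blocked-zones.conf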
Having been thinking about ISP-level internet filtering a great deal of late, I'd been wondering how to reverse engineer the block lists, and this morning I remembered this hosts file. BT censor internet domains by returning an A record pointing to a webserver which redirects to their 'blocked' page, so why not feed the list of hosts contained therein to BT's name servers and grep for responses indicating the site is blocked?
A line of code later...
# append any host whose A record points at one of BT's block-page addresses
for i in $(grep ^0 hosts.txt | awk '{print $2}'); do
    host -t a "$i" | grep '213\.120\.234\.1[40][69]' >> blocked.txt
done && awk '{print $1}' blocked.txt | sort -u > blocked-hosts.txt
No error checking, but this is a quick 'n' dirty proof of concept... We run this three times, adjusting the level of filtering between each run, then once more with a few hundred social networking sites added to the input list, which gives us four output files. Diffs between them reveal which hosts are added at each increment in blocking level; one way to pull those out is sketched below.
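Something like comm(1) over the sorted outputs does for the diffs; the file names here match the listing that follows:

# hosts blocked at 'moderate' that weren't already blocked at 'light'
comm -13 blocked-light.hosts.txt blocked-moderate.hosts.txt > blocked-moderate-increment.txt
# and likewise for 'strict' over 'moderate'
comm -13 blocked-moderate.hosts.txt blocked-strict.hosts.txt > blocked-strict.increment.txt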
$ wc -l *
1345 blocked-light.hosts.txt
1722 blocked-moderate.hosts.txt
377 blocked-moderate-increment.txt
105 blocked-social-strict.hosts.txt
3096 blocked-strict.hosts.txt
1379 blocked-strict.increment.txt
8024 total
Wow! That's about 3200 domains proven to be blocked, across the varying filter levels... Now, where's that Alexa top million sites list...
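The same loop should drop straight onto the Alexa list once it's unzipped; a sketch, assuming the usual top-1m.csv format of one rank,domain pair per line:

# same check against the Alexa top sites list (top-1m.csv: 'rank,domain' per line)
cut -d, -f2 top-1m.csv | while read d; do
    host -t a "$d" | grep '213\.120\.234\.1[40][69]' >> alexa-blocked.txt
done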