Last night, I was innocently building a tracking script for my website when I noticed that I had some visitors that looked like Google but were not, in fact, Google.
One was the Wayback Machine. Another was archive.is. These web capture sites provide saved snapshots of a page at any given point in time, and apparently they do it by pretending to be Google.
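For anyone who wants to catch these impostors in their own logs, here is a rough sketch of one way to do it. It is not the tracking script I was building; it assumes the check Google documents for verifying its crawler: reverse-resolve the visitor's IP, confirm the hostname ends in googlebot.com or google.com, then resolve that hostname forward and make sure it points back at the same IP. The function name and the injectable lookup parameters are my own inventions for illustration, and this version only handles IPv4.

```python
import socket

# Hostname suffixes that Google documents for its crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` reverse-resolves to a Google hostname
    that in turn resolves forward to the same IP.

    `reverse_lookup` and `forward_lookup` are hypothetical hooks so the
    DNS calls can be swapped out (for example, in tests).
    """
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliases, addresses)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses)
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup guards against someone faking their PTR record.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

A visitor whose User-Agent says Googlebot but who fails this check is a lookalike, which is presumably the category the archive sites fall into.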
Previously, we had a tutorial on how to bypass subscription paywalls by spoofing the Googlebot web crawler. For those who were too lazy or ethical to build that Chrome extension, another method for getting around paywalls is to paste the blocked URL into archive.is, and have archive.is do the spoofing for you.
If publishers don’t provide paywall exceptions for public services like the Wayback Machine, who do they provide exceptions for? I did a quick check by swapping out my HTTP request headers to match those of various web crawlers.
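Swapping headers is less work than it sounds. The sketch below is not the exact check I ran; it is a minimal version using Python's standard library. The User-Agent strings are the commonly published ones for each crawler as far as I know (treat them as assumptions and check each vendor's docs), and `build_request`, `fetch_status`, and `compare` are hypothetical helper names. Comparing response sizes is only a crude heuristic for "did I get the full article or the paywall stub."

```python
import urllib.error
import urllib.request

# Commonly published crawler User-Agent strings (assumed; they change over time).
CRAWLER_USER_AGENTS = {
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "bingbot": "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "yandex": "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
    "baidu": "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "facebook": "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
    "twitter": "Twitterbot/1.0",
}


def build_request(url, crawler):
    """Build a GET request that identifies itself as the named crawler."""
    headers = {"User-Agent": CRAWLER_USER_AGENTS[crawler]}
    return urllib.request.Request(url, headers=headers)


def fetch_status(url, crawler, timeout=10):
    """Return (HTTP status, body length in bytes) for one crawler identity."""
    try:
        with urllib.request.urlopen(build_request(url, crawler), timeout=timeout) as resp:
            return resp.status, len(resp.read())
    except urllib.error.HTTPError as err:
        # A 4xx/5xx is still a useful answer: the site refused this crawler.
        return err.code, 0


def compare(url):
    """Fetch the same URL as every crawler and collect the results."""
    return {name: fetch_status(url, name) for name in CRAWLER_USER_AGENTS}
```

Calling `compare()` on an article URL and eyeballing which identities get a much larger response is roughly the experiment described here. A site that verifies crawlers by IP rather than by header would not be fooled by this.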
The Wall Street Journal lets Google’s and Bing’s crawlers through its paywall, and no one else. Not Yandex, not Yahoo, not DuckDuckGo. Not even Baidu.
Then I checked the more-cosmopolitan FT.com. FT exposes its content to Google, Bing, and Yahoo, but not Yandex or DuckDuckGo. It works intermittently with Baidu.
(I didn’t bother to check any other subscription sites because I am lazy and mortal and by most accounts I still have a day job, although I’m really not sure how.)
So here’s the thing. Publishers are optimizing for Google search results, and maybe Bing as an afterthought. As a result, Google’s indexing bots have better access to content than any other web crawler.
A third-rate search engine like Yahoo could actually get better indexing results if it changed its web crawler User-Agent headers to Googlebot. It could also acquire way more users if it changed the name of its website to Google. Finally, it would provide far more shareholder value if it burned itself to the ground and redirected its domain to Google.com.
It’s tough to not be Google.
Update, a few hours later: I just noticed that the Wall Street Journal also lets the Facebook crawler bypass its paywall. But not Twitter! FT.com gives access to both Facebook and Twitter crawlers.
I almost forgot that Facebook has become a force to be reckoned with when it comes to content discovery.