by Isoroku Yamamoto
The Wall Street Journal ended its practice of allowing special access for search engines. This means that a human visitor can no longer bypass the paywall by spoofing Google’s HTTP request headers.
However, subscription-based publications face a problem when users click on a link through Twitter or Facebook on a mobile device. Social media apps implement their own in-app browsers, which generally do not retain cookies. Websites that require a user login must therefore request the login every time the app is reopened.
This makes for a cumbersome user experience. Thus, publications like the Wall Street Journal disable login checks when a page request appears to come from Twitter.
They do this by inspecting HTTP request headers. The important headers are Referer and User-Agent.
When a link is shared on Twitter, the URL is shortened to something like “https://t.co/9Mk58nL3xJ”. This goes to a Twitter server, which redirects the browser to the intended destination. Websites determine whether Twitter initiated the redirect by checking that the HTTP Referer string begins with “https://t.co/”. The rest of the string is ignored.
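As a minimal sketch of the check described above, a server might test the Referer value with a simple prefix match. The function name `isTwitterReferer` is hypothetical, and real publishers may well implement this differently.

```javascript
// Hypothetical helper: not taken from any publisher's actual code.
const TWITTER_SHORTENER_PREFIX = "https://t.co/";

function isTwitterReferer(referer) {
  // A missing header arrives as undefined, so guard against non-strings.
  if (typeof referer !== "string") {
    return false;
  }
  // Only the prefix is compared; the short code after it is ignored.
  return referer.startsWith(TWITTER_SHORTENER_PREFIX);
}

console.log(isTwitterReferer("https://t.co/9Mk58nL3xJ")); // true
console.log(isTwitterReferer("https://www.google.com/")); // false
```

Because only the prefix is compared, any short code after “https://t.co/” passes the check.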
A web request from Twitter further identifies itself through the User-Agent header, which might look something like “Mobile/14C92 Twitter for iPhone”.
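To illustrate how the two headers might be evaluated together, here is a rough sketch. It assumes the server looks for the substring "Twitter for iPhone" in the User-Agent; the actual matching rule is not known, so that substring and the function name `looksLikeTwitterApp` are both assumptions. Header names are read in lowercase, which is how Node's `http` module exposes them.

```javascript
// Hypothetical server-side check; the substring rule is an assumption.
// headers: plain object with lowercased header names as keys.
function looksLikeTwitterApp(headers) {
  const referer = headers["referer"] || "";
  const userAgent = headers["user-agent"] || "";
  // The Referer must start with Twitter's link shortener.
  const fromShortener = referer.startsWith("https://t.co/");
  // The User-Agent must mention the Twitter app.
  const fromApp = userAgent.includes("Twitter for iPhone");
  return fromShortener && fromApp;
}

console.log(looksLikeTwitterApp({
  "referer": "https://t.co/9Mk58nL3xJ",
  "user-agent": "Mobile/14C92 Twitter for iPhone",
})); // true
```

Neither value is authenticated in any way, so the check only establishes what the client claims to be.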
By submitting this information in request headers, any web browser can appear to be the Twitter app. It is easy to do this using a Chrome extension.
The following builds on top of last year’s tutorial for mimicking Google’s web crawler.
1. Use the same manifest.json file as before. Take care to list both the http:// and https:// versions of the sites you are interested in, as many publishers now use SSL.
2. Modify the background.js file. The modified version should look like the one below. Note that this version blocks all cookies.
    // Sites that should be visited as if arriving from Twitter.
    var VIA_TWITTER = ["wsj.com"];

    // Rewrite outgoing request headers: drop cookies, then set Referer and User-Agent.
    function changeRefer(details) {
        // Keep these flags function-scoped rather than global.
        var foundReferer = false;
        var foundUA = false;

        var useTwitter = VIA_TWITTER.some(function(url) {
            return details.url.includes(url);
        });

        var reqHeaders = details.requestHeaders.filter(function(header) {
            // block cookies by default
            return header.name.toLowerCase() !== "cookie";
        }).map(function(header) {
            // Header names are case-insensitive, so compare in lowercase.
            var name = header.name.toLowerCase();
            if (name === "referer") {
                header.value = setRefer(useTwitter);
                foundReferer = true;
            }
            if (name === "user-agent") {
                header.value = setUserAgent(useTwitter);
                foundUA = true;
            }
            return header;
        });

        // append referer
        if (!foundReferer) {
            reqHeaders.push({
                "name": "Referer",
                "value": setRefer(useTwitter)
            });
        }
        // append user agent
        if (!foundUA) {
            reqHeaders.push({
                "name": "User-Agent",
                "value": setUserAgent(useTwitter)
            });
        }
        return {requestHeaders: reqHeaders};
    }

    // Strip Set-Cookie from responses so that no cookies are stored.
    function blockCookies(details) {
        // filter() removes every match; splice() inside a forward loop
        // would skip the entry that follows each removed one.
        var resHeaders = details.responseHeaders.filter(function(header) {
            return header.name.toLowerCase() !== "set-cookie";
        });
        return {responseHeaders: resHeaders};
    }

    function setRefer(useTwitter) {
        if (useTwitter) {
            return "https://t.co/T1323aaaa";
        }
        return "https://www.google.com/";
    }

    function setUserAgent(useTwitter) {
        if (useTwitter) {
            return "Mozilla/5.0 (iPhone; CPU iPhone OS 10_2 like Mac OS X) AppleWebKit/602.1.32 (KHTML, like Gecko) Mobile/14C92 Twitter for iPhone";
        }
        return "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
    }

    chrome.webRequest.onBeforeSendHeaders.addListener(changeRefer, {
        urls: ["<all_urls>"],
        types: ["main_frame"],
    }, ["requestHeaders", "blocking"]);

    chrome.webRequest.onHeadersReceived.addListener(blockCookies, {
        urls: ["<all_urls>"],
        types: ["main_frame"],
    }, ["responseHeaders", "blocking"]);
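HTTP header names are case-insensitive, so a rewrite like the one above is safest when it compares names without regard to case. The following is a sketch of the same idea as a pure function that can be exercised outside the browser, for example in Node. The name `rewriteHeaders` and its parameters are hypothetical and are not part of the extension itself.

```javascript
// Hypothetical pure helper mirroring the listener's header rewrite.
// headers:   array of {name, value} objects, shaped like details.requestHeaders.
// overrides: object mapping header names to the values they should carry.
function rewriteHeaders(headers, overrides) {
  // Index the overrides by lowercased name, remembering the original spelling.
  const wanted = new Map(
    Object.entries(overrides).map(([name, value]) => [name.toLowerCase(), { name, value }])
  );
  const seen = new Set();
  const result = [];

  for (const header of headers) {
    const key = header.name.toLowerCase();
    if (key === "cookie") {
      continue; // drop cookies, as the extension does
    }
    if (wanted.has(key)) {
      // Replace the value but keep the spelling the request already used.
      result.push({ name: header.name, value: wanted.get(key).value });
      seen.add(key);
    } else {
      result.push(header);
    }
  }

  // Append any override whose header was absent from the request.
  for (const [key, entry] of wanted) {
    if (!seen.has(key)) {
      result.push({ name: entry.name, value: entry.value });
    }
  }
  return result;
}

const rewritten = rewriteHeaders(
  [
    { name: "user-agent", value: "Mozilla/5.0" },
    { name: "Cookie", value: "session=abc" },
  ],
  {
    "Referer": "https://t.co/T1323aaaa",
    "User-Agent": "Mobile/14C92 Twitter for iPhone",
  }
);
console.log(rewritten);
```

Unlike the listener, this helper returns new objects instead of mutating the incoming headers, which makes it easier to test in isolation.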
Save both files in the same directory. The updated source code can also be downloaded here.
Now type chrome://extensions/ in the browser address bar.
Reload the old extension, or load it as an unpacked extension if you have not previously done so. Enable the Chrome extension and visit wsj.com.
There is always a tradeoff between security and usability. The fastest way to compromise a computer system is to accommodate lazy users. Or worse yet, accommodate lazy programmers.
Great! Thanks for this update.
// You want these function scoped not global
var foundReferer = false;
var foundUA = false;
Otherwise good stuff ojisan 😉
More tutorials please! 🙂
Any way to turn this into a Greasemonkey script?
Seriously? You’re teaching your readers to steal?
I teach the pitfalls of sacrificing security for usability, because the best way to build a good defense is to learn to play offense. If you decide to use this knowledge to steal something, that’s on you.
I am having a problem with loading the extension. It says “Manifest file is missing or unreadable.” although Chrome is pointing to the correct path
Magnificent. Thank you!
Is it possible to implement this on Android?
I cannot access the Financial Times: has anything changed?
I tried reading this article (http://www.telegraph.co.uk/news/2017/05/01/day-1776-illuminati-modern-day-conspiracy-theorists-favourite/), but the extension didn’t work, even though I added the site to manifest.json. Is there anything I did wrong?
This technique doesn’t work for http://www.foreignaffairs.com. I have tried modifying the manifest.json file, but it doesn’t help. I am curious what kind of sophisticated paywall they are using. Thanks a lot though for your updates, been following your blog regularly.
One thing we all need to bear in mind is that the workaround Mr. Yamamoto has suggested above does not work flawlessly on all paywalls. With some paywalls one can be lucky and the workaround works, but other paywalls are too robust and “smart” to be tricked.
Turns out you can make it even simpler. The only thing you need to spoof is Referer, which is unfortunately not easy for browsers but definitely possible for Chrome extensions.
But who cares about rules. I had an article summarizer that did some internet scraping. I wanted to set it free on the news, so I thought I would implement this hack, and hey, it still works! Hooray! Owe you one, Elaine 😉
https://explaintome.herokuapp.com/
This link is a bummer, nothing there.
It was helpful when it worked (I discovered this independently and chose not to publish it). Now that you’ve made it so easy to exploit, they’ll have to do something about it. I wish you had written about this more in the abstract and not explicitly mentioned the newspaper in question.
If you adjust your referer to facebook.com, you can get into wsj.com without much fanfare.
Has this been patched? Doesn’t seem to be working for me on WSJ .
still works as far as i can tell. eg, try this url: https://t.co/AS1B0xLS27
Are you able to view this article: https://blogs.wsj.com/cio/2017/10/20/in-the-digital-economy-education-level-increasingly-defines-wage-potential/
because it does not work for me. Some paid articles seem to load just fine, but this one still doesn’t.
yes, it works, but i’m also blocking cookies. Try it in incognito mode, does that work?
The Chronicle of Higher Education blocks me from accessing its website when using the extension (after adding https://www.chronicle.com to the list of sites, of course). It says I’m a bot.
And the most intriguing thing is that its articles behind the paywall (“premium content”) are fully indexed by Google.
Maybe it has a clever system that allows only the actual Google Bot to access its content?
The script works for a lot of sites, but not for all sites.
Yeah, I know. But the CHE is the first site with porous paywall that I found on which the script does not work. The real Googlebot can index the site, but we can’t.
Just a thought: have you added the CHE to the manifest.json file?
Elaine, is there any way to modify this for Firefox?
Hi this script used to work for thetimes.co.uk, as of today it doesn’t. Is there any way it can be fixed? Would be most grateful for any help.