How Google’s Web Crawler Bypasses Paywalls

by Isoroku Yamamoto

Update: A newer version of the Chrome extension is available here.

The Wall Street Journal fixed its “paste a headline into Google News” paywall trick. However, Google can still index the content.

Digital publications give search engines preferential access by inspecting HTTP request headers. The two relevant headers are Referer and User-Agent.

Referer identifies the address of the web page that linked to the resource. Previously, when you clicked a link through Google search, the Referer would say https://www.google.com/. This is no longer enough.

More recently, websites started checking User-Agent, a string that identifies the browser or app that made the request. The Wall Street Journal wants to know not only that you came from Google, but also that you are an agent of Google.
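
As a rough illustration of the kind of check described above, here is a minimal sketch of a server-side gate. Everything in it is an assumption made for illustration: the function names looksLikeGoogle and normaliseHeaders are hypothetical, and no publisher's actual logic is known or implied. It only shows why a check built on these two headers trusts whatever the client chooses to send.

```javascript
// Hypothetical server-side gate. All names here are illustrative only.
const GOOGLE_REFERER_PREFIX = "https://www.google.com/";

// HTTP header names are case-insensitive, so lower-case them before lookup.
function normaliseHeaders(headers) {
  const out = {};
  for (const [name, value] of Object.entries(headers)) {
    out[name.toLowerCase()] = value;
  }
  return out;
}

// Returns true when the request merely claims to come from Google:
// a Google Referer plus a User-Agent that mentions Googlebot.
function looksLikeGoogle(headers) {
  const h = normaliseHeaders(headers);
  const referer = h["referer"] || "";
  const userAgent = h["user-agent"] || "";
  return referer.startsWith(GOOGLE_REFERER_PREFIX) &&
         userAgent.includes("Googlebot");
}
```

Both values come straight from the request, so nothing in a check like this proves who actually sent it.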

By providing this information in request headers, anyone can appear to be a Google web crawler. In fact, I will show you how to make a Chrome extension that does just that.

1. Create a file called manifest.json. Paste the following in the file. Add any sites you would like to read to the permissions list.

{
  "name": "Innocuous Chrome Extension",
  "version": "0.1",
  "description": "This is an innocuous chrome extension.",
  "permissions": ["webRequest", "webRequestBlocking",
                  "http://www.ft.com/*",
                  "http://www.wsj.com/*",
                  "https://www.wsj.com/*",
                  "http://www.economist.com/*",
                  "http://www.nytimes.com/*",
                  "https://hbr.org/*",
                  "http://www.newyorker.com/*",
                  "http://www.forbes.com/*",
                  "http://online.barrons.com/*",
                  "http://www.barrons.com/*",
                  "http://www.investingdaily.com/*",
                  "http://realmoney.thestreet.com/*",
                  "http://www.washingtonpost.com/*"
                  ],
  "background": {
    "scripts": ["background.js"]
  },
  "manifest_version": 2
}

2. Create a file called background.js. Paste the following into the file:

// Sites whose cookies are passed through; cookies are stripped everywhere else.
var ALLOW_COOKIES = ["nytimes", "ft.com"];

var GOOGLE_REFERER = "https://www.google.com/";
var GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

function changeRefer(details) {
  // declared with var so they do not leak into the global scope
  var foundReferer = false;
  var foundUA = false;

  // cookies are kept only when the URL matches an entry in ALLOW_COOKIES
  var allowCookies = ALLOW_COOKIES.some(function(site) {
    return details.url.includes(site);
  });

  var reqHeaders = details.requestHeaders.filter(function(header) {
    // block cookies by default (header names are case-insensitive)
    return header.name.toLowerCase() !== "cookie" || allowCookies;
  }).map(function(header) {
    var name = header.name.toLowerCase();
    if (name === "referer") {
      header.value = GOOGLE_REFERER;
      foundReferer = true;
    }
    if (name === "user-agent") {
      header.value = GOOGLEBOT_UA;
      foundUA = true;
    }
    return header;
  });

  // append either header if the request did not already carry it
  if (!foundReferer) {
    reqHeaders.push({name: "Referer", value: GOOGLE_REFERER});
  }
  if (!foundUA) {
    reqHeaders.push({name: "User-Agent", value: GOOGLEBOT_UA});
  }
  console.log(reqHeaders);
  return {requestHeaders: reqHeaders};
}

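function blockCookies(details) {
  // Use filter() rather than splice() inside a forward loop: splicing shifts
  // the array, so the header after each removed one would be skipped and
  // consecutive Set-Cookie headers could slip through.
  var responseHeaders = details.responseHeaders.filter(function(header) {
    return header.name.toLowerCase() !== "set-cookie";
  });
  return {responseHeaders: responseHeaders};
}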

chrome.webRequest.onBeforeSendHeaders.addListener(changeRefer, {
  urls: ["<all_urls>"],
  types: ["main_frame"],
}, ["requestHeaders", "blocking"]);

chrome.webRequest.onHeadersReceived.addListener(blockCookies, {
  urls: ["<all_urls>"],
  types: ["main_frame"],
}, ["responseHeaders", "blocking"]);

Save both files in one directory. These should be the only files in the directory. If you were too lazy to copy and paste, you can download the source code here.

Now type chrome://extensions/ in the browser address bar.

Click Load unpacked extension... (Make sure Developer Mode is checked in the upper right if you do not see the buttons.)

Select the directory where you saved the two files. Enable the Chrome extension and visit wsj.com.

Remember: Any time you introduce an access point for a trusted third party, you inevitably end up allowing access to anybody.
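
Several commenters below point out that a publisher could close this hole by checking where a request really comes from, for example with the reverse DNS lookup that Google documents for verifying Googlebot. The following is a minimal sketch of that idea for Node.js, not a description of what any publisher actually runs. The function name isRealGooglebot and the injectable resolver parameter are assumptions introduced here for illustration, and the sketch handles IPv4 addresses only.

```javascript
const dns = require("dns");

// Hostname suffixes that Google documents for its crawlers.
const GOOGLE_SUFFIXES = [".googlebot.com", ".google.com"];

// Hypothetical helper: resolves to true only if `ip` reverse-resolves to a
// Google hostname AND that hostname resolves forward to the same IP.
// `resolver` defaults to Node's dns.promises and can be swapped for a stub.
async function isRealGooglebot(ip, resolver = dns.promises) {
  let hostnames;
  try {
    hostnames = await resolver.reverse(ip); // PTR lookup
  } catch (err) {
    return false; // no reverse record, so it cannot be verified
  }
  for (const hostname of hostnames) {
    const host = hostname.toLowerCase();
    if (!GOOGLE_SUFFIXES.some((suffix) => host.endsWith(suffix))) {
      continue;
    }
    try {
      // Forward-confirm: the claimed hostname must map back to the same IP.
      const addresses = await resolver.resolve4(host);
      if (addresses.includes(ip)) {
        return true;
      }
    } catch (err) {
      // forward lookup failed; try the next hostname, if any
    }
  }
  return false;
}
```

A request that only spoofs the Googlebot User-Agent fails this check, because the IP address it arrives from does not resolve to a Google hostname.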

94 thoughts on “How Google’s Web Crawler Bypasses Paywalls”

Jay says:

February 19, 2016 at 9:47 am

I thought Google’s rules were to be indexed you couldn’t be behind a paywall. Did I have that wrong, or did that change?

Reply
1. Isoroku says:
  
  February 19, 2016 at 2:44 pm
  
  Google allows it for publishers if you specify “registration or subscription required” in the sitemap.
  
  Reply
111010101001010 says:

February 19, 2016 at 10:45 am

Or you could just go to the article and click the x button in your browser really quickly before it redirects you.

Reply
anonymous says:

February 19, 2016 at 11:06 am

Interesting. I wouldn’t have expected them to make it that easy.

WSJ doesn’t even seem to care about the UA. On the command line e.g.,

curl --referer https://www.google.com/ http://www.wsj.com/articles/your-favourite-article

gives me the full article already.

Reply
1. Elaine says:
  
  February 19, 2016 at 2:34 pm
  
  nice… must be the lack of cookies.
  
  Reply
gurkan says:

February 19, 2016 at 12:06 pm

Until someone starts to filter by IP blocks..

Reply
1. sqdcn says:
  
  February 19, 2016 at 6:13 pm
  
  Not very likely. It’s not practical to record every IP in Google’s network.
  
  Reply
  1. Olivier says:
    
    February 21, 2016 at 5:33 am
    
    There’s a way, with the reverse DNS: https://support.google.com/webmasters/answer/80553?hl=en
    
    Reply
2. zak says:
  
  February 19, 2016 at 10:53 pm
  
  then host your crawler on google cloud
  
  Reply
Cameron says:

February 19, 2016 at 12:49 pm

It doesn’t seem to be working for me. Have they already fixed it? Or have I messed up somewhere?

Reply
1. Elaine says:
  
  February 19, 2016 at 2:35 pm
  
  working last I checked… might need to delete cookies first.
  
  Reply
Lucas says:

February 19, 2016 at 1:18 pm

On my side, there is an error when inspecting your extension. It said there were errors with these parts:

chrome.webRequest.onBeforeSendHeaders.addListener(changeRefer, {
urls: [“”],
types: [“main_frame”],
}, [“requestHeaders”, “blocking”]);

chrome.webRequest.onHeadersReceived.addListener(blockCookies, {
urls: [“”],
types: [“main_frame”],
}, [“responseHeaders”, “blocking”]);

I removed the first element from the urls array, so now it looks like this:

chrome.webRequest.onBeforeSendHeaders.addListener(changeRefer, {
urls: [],
types: [“main_frame”],
}, [“requestHeaders”, “blocking”]);

chrome.webRequest.onHeadersReceived.addListener(blockCookies, {
urls: [],
types: [“main_frame”],
}, [“responseHeaders”, “blocking”]);

And it’s working fine now! Great work

Reply
1. Isoroku says:
  
  February 19, 2016 at 2:38 pm
  
  Thanks… maybe my version only worked on my particular version of chrome.
  
  Reply
2. Isoroku says:
  
  February 19, 2016 at 2:48 pm
  
  Oh, I just realized that I originally had urls: [“<all_urls>”], and the text within <> disappeared. Thank you.
  
  Reply
3. J H says:
  
  February 7, 2017 at 12:48 pm
  
  Fantastic work you guys!
  I edited the code per Lucas’ tip and it works perfectly in incognito!!
  
  Reply
  1. J H says:
    
    February 7, 2017 at 1:14 pm
    
    hmmm… it worked momentarily, but now it won’t. Thoughts?
    
    Reply
alp says:

February 19, 2016 at 1:43 pm

Care to publish the extension so that we can just install? 🙂

Reply
1. Isoroku says:
  
  February 19, 2016 at 2:41 pm
  
  Sorry my friend, the goal of this post is only to educate.
  
  Reply
florinpatan says:

February 19, 2016 at 2:17 pm

Well, if they are smart enough they also check the IP for the request and it’s game over as you won’t be able to get the address of the Googlebot for example 🙂

Reply
Hammy Goonan says:

February 19, 2016 at 2:23 pm

Without wanting to be too obtuse, that’s why Apple can’t give the FBI a back door.

Reply
1. Unimportant information says:
  
  January 12, 2023 at 10:16 pm
  
  Nowadays you don’t need any more than Bluetooth and the app clips on the target device to access anything you want on said targeted device…. Don’t fool yourself.
  
  Reply
Erin Dachtler says:

February 19, 2016 at 2:40 pm

I got an error trying to use this.
The solution seems to be to replace the part about `urls: [“”]` with `urls: [“”]`

Reply
1. Isoroku says:
  
  February 19, 2016 at 2:48 pm
  
  Oops! I just realized that I originally had [“<all_urls>”], and the text within <> disappeared in the provided code. Just fixed it. Thank you for pointing this out.
  
  Reply
  1. Jaa says:
    
    February 21, 2016 at 2:10 pm
    
    Reply
    1. Jaa says:
      
      February 21, 2016 at 2:12 pm
      
      alert(“wow, such hack”);
      
      Reply
Josh McVey (@y3rsh) says:

February 19, 2016 at 3:44 pm

Love it.

Reply
Joe says:

February 19, 2016 at 4:40 pm

This doesn’t seem to be working on WSJ.

Reply
1. Shawn says:
  
  February 19, 2016 at 8:28 pm
  
  Same for me. Works for every site other than wsj. Great blog post!
  
  Reply
  1. Shawn says:
    
    February 19, 2016 at 8:41 pm
    
    And now it’s working
    
    Reply
deanalator says:

February 19, 2016 at 5:02 pm

clear cookies. reload extensions (ctrl-R) then go to wsj

Reply
Astro Jetson says:

February 19, 2016 at 6:07 pm

Any chance you can do similar education for Firefox?

Reply
1. Flandy says:
  
  February 19, 2016 at 11:40 pm
  
  To make this work in Firefox just add
  
  “applications”: {
  “gecko”: {
  “id”: “someidhere”
  }
  },
  
  to manifest.json. Then zip(don’t compress) the two files and rename the file from .zip to .xpi and drag the file into Firefox. Also you might have to change some permissions to allow unsigned extensions. Changing xpinstall.signatures.required to false in about:config might do it.
  
  Reply
  1. Astro Jetson says:
    
    February 21, 2016 at 7:33 am
    
    Thanks professor! 😉
    
    Reply
Karun says:

February 19, 2016 at 9:37 pm

I’ve always read paid articles on WSJ by Googling its name. You just click on Google search results and you can read the full article.

Reply
1. besso (@gregorbeslic) says:
  
  February 20, 2016 at 5:47 am
  
  Like this: https://twitter.com/gregorbeslic/status/701038901536956416
  
  Reply
Pazu says:

February 20, 2016 at 9:12 am

Or use in IE F12, tab Emulation and set it directly there…

Reply
Josh says:

February 21, 2016 at 4:57 am

Though it’s possible for publishers to verify UAs claiming to be googlebot, so this could quite easily be blocked.

Reply
HQ Fanfan says:

February 21, 2016 at 7:53 pm

Thanks Isoroku for the interesting approach 🙂
Any clue on why this would work on some sites (wsj.com, scmp.com for instance) and not on others (gamekult.com, lemonde.fr…) – is it due to the paywall architecture being different? Typically you can access the beginning of the articles on the latter (hence maybe treatment by crawlers isn’t beyond the paywall).

Not a tech guy here, but still interested in understanding. Thanks!

Reply
Pingback: Les liens de la semaine – Édition #172 | French Coding
V says:

February 22, 2016 at 9:19 am

This doesn’t seem to work for the thediplomat.com articles (after their 5 free articles per month limit)

Reply
1. Vibhore Singh says:
  
  May 17, 2017 at 4:06 pm
  
  For that just open the article in incognito mode. Been using the same approach since they limited access to 5 articles
  
  Reply
joewils says:

February 22, 2016 at 12:10 pm

Hope you don’t mind, but I packaged this bit of magic up and posted it to GitHub as a Gist: https://gist.github.com/joewils/fbd487eccb5b09ab79a6

Reply
Alan says:

February 22, 2016 at 12:47 pm

As of today Mon Feb 22 2016 at 12:30 Pacific time Windows Defender (in 8.1) rejects the zip as being malware.

Reply
Kyle says:

February 22, 2016 at 1:19 pm

Can anyone test wsj and new york times for me, can’t seem to get those working.

Reply
1. w says:
  
  February 22, 2016 at 4:44 pm
  
  try deleting cookies and try again, both work here.
  
  Reply
  1. [email protected] says:
    
    February 23, 2016 at 6:43 am
    
    Not working on WSJ.com as of this morning
    
    Reply
    1. Kyle says:
      
      February 24, 2016 at 8:28 am
      
      yeah, tried deleting my cookies as suggested above, still neither wsj nor nytimes work for me.
      
      blogs.wsj is working though
      
      Reply
      1. Kyle says:
        
        February 24, 2016 at 8:44 am
        
        wsj.com is working for me through clearing cookies, still no nytimes.com though
2. Vibhore says:
  
  May 17, 2017 at 4:08 pm
  
  https://www.youtube.com/watch?v=7aai0sgXUvk …This worked for me
  
  Reply
John says:

February 24, 2016 at 12:51 pm

wsj.com not working NYT and FT are working

Reply
1. John says:
  
  February 24, 2016 at 2:54 pm
  
  Spoke too soon. The only site working now is NYT.
  
  Reply
Broker says:

February 25, 2016 at 2:57 pm

Doesn’t work on WSJ.

Reply
jaypinho says:

February 26, 2016 at 11:39 am

No longer working on WSJ.

Reply
1. Elaine says:
  
  February 26, 2016 at 9:39 pm
  
  try deleting cookies, perhaps? it works over here.
  
  Reply
  1. Blah blah take a guess says:
    
    September 4, 2020 at 10:29 pm
    
    Does it still work?
    
    Reply
Pingback: Weekend Reading – February 26, 2016 | Healey.io
faseegh says:

March 15, 2016 at 4:24 am

You are a legend!! Works perfect!

Reply
1. faseeh shams says:
  
  March 22, 2016 at 3:05 am
  
  Hi, it worked perfectly well until yesterday – can’t access it anymore 🙁 – any chance you can update it?
  
  Reply
  1. faseeh shams says:
    
    March 22, 2016 at 3:13 am
    
    It might have been a cache issue, think it’s working. 🙂 thanks again.
    
    Reply
Angry Thinker says:

March 26, 2016 at 5:20 am

The FT now redirects you to https://next.ft.com, which blocks you from getting behind the paywall. I have added https://next.ft.com to the manifest.json file, but to no avail. Any suggestions?

Reply
1. Angry Thinker says:
  
  March 27, 2016 at 2:29 am
  
  Edit: today the FT is accessible again, as is the WSJ.
  
  Reply
Butchmo says:

May 15, 2016 at 1:45 pm

Thanks for this great tip.

Now I’d like to find a way to bypass paywalls on iOS. Any ideas?

We’d need a browser that worked with extensions like this one, or that at least let us change both the referer and the user-agent…

Reply
1. Elaine says:
  
  May 17, 2016 at 9:54 am
  
  You can build extensions for Safari in iOS. I believe Apple gives you the ability to modify headers judging by existing extensions, but I haven’t checked for sure.
  
  Reply
Pingback: LinkedIn vs the Bots | Elaine's Idle Mind
Oliver says:

September 9, 2016 at 9:18 am

FT has made some big changes in their paywall. Does this still work?

Reply
1. Angry Thinker says:
  
  September 9, 2016 at 9:58 am
  
  yes it does Oliver
  
  Reply
2. Angry Thinker says:
  
  September 9, 2016 at 10:03 am
  
  Sorry, no it does not Oliver. Apologies for my earlier wrong answer.
  
  Reply
witcher says:

September 20, 2016 at 10:11 am

how to add https://slon.ru/ to permissions list without error?

Reply
Angry Thinker says:

September 21, 2016 at 11:34 pm

The FT seems to have made some changes that now block the use of the set-up presented here. WSJ for example still works. I wonder if the FT has a mole here.

Reply
1. Butchmo says:
  
  September 22, 2016 at 10:52 am
  
  Weird. It still works for me.
  
  Reply
  1. Steve says:
    
    September 22, 2016 at 11:16 am
    
    FT hasn’t been working properly for me for a few days. This is the message I get: The http://www.ft.com page isn’t working http://www.ft.com redirected you too many times.
    Try clearing your cookies.
    ERR_TOO_MANY_REDIRECTS
    —-
    
    I’m using a bypass extension on Chrome that had been fine for the past month.
    
    Reply
    1. Angry Thinker says:
      
      September 24, 2016 at 1:33 am
      
      I am getting the same as you. I assume your extension isn’t working anymore either. If it is still working, could you tell me which one it is?
      
      Reply
Steve says:

September 25, 2016 at 12:44 pm

I had been using Bypass, an extension in the Chrome store. It was recently blocked. If you know how to sideload an extension, here is the zipped file for Bypass. I need to find a tutorial [for sideloading], blow-by-blow, step-by-step because I can’t figure it out! https://github.com/cezary/bypass

Reply
Steve says:

September 26, 2016 at 8:03 pm

Angry Thinker,

I was able to sideload the file, but I’m having a problem with FT. That was the primary reason I downloaded it. It may work for other sites, though. One thing you could do, or I hope that somebody does, is to contact Cezary through Twitter. He has stated that he intends to maintain, and improve upon, this extension. The ext. doesn’t appear to be blocking cookies, though. It is supposed to. Google may be preventing it from doing so. Hope you can post an answer real soon.

Reply
1. Angry Thinker says:
  
  September 30, 2016 at 9:49 am
  
  Steve I don’t use Twitter so can’t contact Cezary that way. Like you, my main reason for using the extension is for the FT website. I suppose you don’t use Twitter either otherwise you would not ask someone else to contact Cezary. If you have an email address for him I would be quite happy to follow up with him.
  
  Reply
  1. Steve says:
    
    September 30, 2016 at 10:40 am
    
    I sent him an e-mail on the 27th, but he hasn’t responded. There are issues posted on his github page from a month ago. He hasn’t responded to them either. I believe the problem is with FT, not with the Ext. See here for contact info. http://cezarywojtkowski.com/
    
    Reply
Johnny Tall says:

September 29, 2016 at 2:10 pm

I use this Chrome extension and it bypasses all the paywalls just fine:
https://chrome.google.com/webstore/detail/bypass-wsj-other-sites/gfbabigadapckiaabchaolgjfbgickop

Reply
Steve says:

September 30, 2016 at 8:02 am

That extension has the same problem as the one I mentioned. It works fine for all the websites that are listed, EXCEPT FOR the one website that I would use it for: Financial Times. You probably don’t visit that site, so you probably wouldn’t realize that that one doesn’t work. Try it.

Reply
Butchmo says:

October 11, 2016 at 1:51 am

Angry and Steve,

Only today I remembered my manifest.json file has a few extra lines for the FT. Maybe that’s why I can read it with no problems.

It’s like this:

“permissions”: [“webRequest”, “webRequestBlocking”,
“https://www.ft.com/*”,
“http://www.ft.com/*”,
“https://next.ft.com/*”,
“http://next.ft.com/*”,
],

I probably added the extra lines when the FT launched next.ft.com… I don’t even know if the next.ft.com lines are still needed.

Reply
1. Angry Thinker says:
  
  October 11, 2016 at 2:37 am
  
  Thanks for coming back to us Butchmo. I added the extra lines, but still no joy. Are you sure you can still access articles, and not just the headlines? I can access headlines (as before) but still no details per article. I think this time the FT have sealed up their site well & truly.
  
  Reply
  1. Butchmo says:
    
    October 11, 2016 at 11:36 am
    
    Yes, I can read all the articles.
    
    Have you tried clearing the cookies or disabling other extensions you use? I think the problem might not be your custom extension, but something else.
    
    A couple weeks ago it seemed my custom extension stopped working (for the FT), but everything was back to normal after I cleared the cookies.
    
    Reply
    1. Angry Thinker says:
      
      October 12, 2016 at 3:16 am
      
      I whitelisted ft.com in uBlockOrigin & in Opera’s ad blocker. I have access again :-)) Thanks for your help.
      
      Reply
Nmae says:

January 7, 2017 at 7:57 pm

No longer working at WSJ.

Reply
Casio Caro says:

January 12, 2017 at 8:19 am

It does not work for WSJ, any idea?

Reply
Elaine says:

January 13, 2017 at 8:59 am

Please help fix WSJ!

Reply
Willy says:

January 18, 2017 at 1:39 pm

I third this idea – reading WSJ content was such a treat. Please update to enable!

Reply
1. Elaine says:
  
  January 18, 2017 at 2:48 pm
  
  I suspect there will be a fix coming soon…
  
  Reply
Pingback: How the Twitter App Bypasses Paywalls | Elaine's Idle Mind
Klaas Vaak says:

April 29, 2017 at 5:36 am

I cannot access the Financial Times: has anything changed?

Reply
brendafdez says:

September 26, 2017 at 2:21 am

archive.is helps the lazy. Does work with the wsj and pretty much everything else.

Reply
BS says:

April 3, 2018 at 8:33 pm

Anyone manage to gain access to the sites listed?

Reply
JulienF says:

December 13, 2018 at 5:30 am

Doesn’t work on French sites such as: https://www.lemonde.fr/

Reply
Pingback: How to Bypass a Paywall (Articles, Blogs, etc.)

Elaine's Idle Mind

and Devil's Workshop
