How Google’s Web Crawler Bypasses Paywalls

by Isoroku Yamamoto

Update: A newer version of the Chrome extension is available here.

The Wall Street Journal fixed its “paste a headline into Google News” paywall trick. However, Google can still index the content.

Digital publications give search engines preferential access by inspecting HTTP request headers. The two relevant headers are Referer and User-Agent.

Referer identifies the address of the web page that linked to the resource. Previously, when you clicked a link through Google search, the Referer would say https://www.google.com/. This is no longer enough.

More recently, websites started checking User-Agent, a string that identifies the browser or app that made the request. The Wall Street Journal wants to know not only that you came from Google, but also that you are an agent of Google.
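To make the two checks concrete, here is a minimal sketch of what such a server-side test might look like. The function name looksLikeGoogleVisitor and the exact matching rules are assumptions for illustration; no publisher's real logic is public.

```javascript
// Hypothetical server-side check. Real publishers' rules are not public;
// this only illustrates the two header tests described above.
function looksLikeGoogleVisitor(headers) {
  // HTTP header names are case-insensitive, so normalize before lookup.
  const normalized = {};
  for (const [name, value] of Object.entries(headers)) {
    normalized[name.toLowerCase()] = String(value);
  }
  const referer = normalized["referer"] || "";
  const userAgent = normalized["user-agent"] || "";

  // Check 1: did the visitor arrive from a Google page?
  const cameFromGoogle = referer.startsWith("https://www.google.com/");
  // Check 2: does the client claim to be Google's crawler?
  const claimsToBeGooglebot = userAgent.includes("Googlebot");

  return {
    cameFromGoogle,
    claimsToBeGooglebot,
    allowed: cameFromGoogle && claimsToBeGooglebot,
  };
}

// An ordinary browser arriving from a search result passes only the first check.
console.log(looksLikeGoogleVisitor({
  Referer: "https://www.google.com/",
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/48.0",
}));
```

Both values come straight from the request, so the server has no way to tell from the headers alone whether they are genuine.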

By providing this information in request headers, anyone can appear to be a Google web crawler. In fact, I will show you how to make a Chrome extension that does just that.
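Before the extension itself, the idea can be sketched in isolation. The helper below (withCrawlerHeaders is a hypothetical name) takes any set of request headers and returns a copy that carries Google's Referer and the Googlebot User-Agent string. It does not send anything; it only shows the transformation.

```javascript
// The User-Agent string Googlebot announces itself with.
const GOOGLEBOT_UA =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

// Hypothetical helper: returns a copy of `headers` with Referer and
// User-Agent replaced (or added), leaving every other header untouched.
function withCrawlerHeaders(headers) {
  const result = {};
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    // Drop any existing Referer/User-Agent, whatever its capitalization.
    if (lower !== "referer" && lower !== "user-agent") {
      result[name] = value;
    }
  }
  result["Referer"] = "https://www.google.com/";
  result["User-Agent"] = GOOGLEBOT_UA;
  return result;
}

console.log(withCrawlerHeaders({
  Accept: "text/html",
  "user-agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/48.0",
}));
```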

1. Create a file called manifest.json. Paste the following in the file. Add any sites you would like to read to the permissions list.

{
  "name": "Innocuous Chrome Extension",
  "version": "0.1",
  "description": "This is an innocuous chrome extension.",
  "permissions": ["webRequest", "webRequestBlocking",
                  "http://www.ft.com/*",
                  "http://www.wsj.com/*",
                  "https://www.wsj.com/*",
                  "http://www.economist.com/*",
                  "http://www.nytimes.com/*",
                  "https://hbr.org/*",
                  "http://www.newyorker.com/*",
                  "http://www.forbes.com/*",
                  "http://online.barrons.com/*",
                  "http://www.barrons.com/*",
                  "http://www.investingdaily.com/*",
                  "http://realmoney.thestreet.com/*",
                  "http://www.washingtonpost.com/*"
                  ],
  "background": {
    "scripts": ["background.js"]
  },
  "manifest_version": 2
}
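Some sites above are listed once per scheme. Chrome match patterns also accept * as the scheme, which covers both http and https. As a small sketch (the helper name toMatchPatterns is hypothetical, not part of any Chrome API), the host entries for the permissions list could be generated from bare hostnames:

```javascript
// Hypothetical helper: turn bare hostnames into Chrome match patterns
// that cover both http:// and https:// for every path on that host.
function toMatchPatterns(hosts) {
  return hosts.map(function (host) {
    return "*://" + host + "/*";
  });
}

// Paste the printed entries into "permissions" after "webRequest"
// and "webRequestBlocking".
console.log(JSON.stringify(toMatchPatterns(["www.wsj.com", "www.ft.com"]), null, 2));
```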

2. Create a file called background.js. Paste the following into the file:

// Sites whose cookies are kept; requests to every other site lose their Cookie header.
var ALLOW_COOKIES = ["nytimes", "ft.com"];
var GOOGLE_REFERER = "https://www.google.com/";
var GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

function changeRefer(details) {
  var foundReferer = false;
  var foundUA = false;

  // Keep cookies only when the URL matches an entry in ALLOW_COOKIES.
  var allowCookies = ALLOW_COOKIES.some(function(site) {
    return details.url.includes(site);
  });

  var reqHeaders = details.requestHeaders.filter(function(header) {
    // Block cookies by default. Header names are case-insensitive.
    return header.name.toLowerCase() !== "cookie" || allowCookies;
  }).map(function(header) {
    var name = header.name.toLowerCase();
    if (name === "referer") {
      header.value = GOOGLE_REFERER;
      foundReferer = true;
    }
    if (name === "user-agent") {
      header.value = GOOGLEBOT_UA;
      foundUA = true;
    }
    return header;
  });

  // Append either header if the request did not already carry it.
  if (!foundReferer) {
    reqHeaders.push({name: "Referer", value: GOOGLE_REFERER});
  }
  if (!foundUA) {
    reqHeaders.push({name: "User-Agent", value: GOOGLEBOT_UA});
  }
  // Log the rewritten headers to the background page console for debugging.
  console.log(reqHeaders);
  return {requestHeaders: reqHeaders};
}

function blockCookies(details) {
  // Drop every Set-Cookie header so the site cannot store new cookies.
  var responseHeaders = details.responseHeaders.filter(function(header) {
    return header.name.toLowerCase() !== "set-cookie";
  });
  return {responseHeaders: responseHeaders};
}

chrome.webRequest.onBeforeSendHeaders.addListener(changeRefer, {
  urls: ["<all_urls>"],
  types: ["main_frame"],
}, ["requestHeaders", "blocking"]);

chrome.webRequest.onHeadersReceived.addListener(blockCookies, {
  urls: ["<all_urls>"],
  types: ["main_frame"],
}, ["responseHeaders", "blocking"]);

Save both files in one directory. These should be the only files in the directory. If you were too lazy to copy and paste, you can download the source code here.

Now type chrome://extensions/ in the browser address bar.

Click Load unpacked extension... (Make sure Developer Mode is checked in the upper right if you do not see the buttons.)


Select the directory where you saved the two files. Enable the Chrome extension and visit wsj.com.

Remember: Any time you introduce an access point for a trusted third party, you inevitably end up allowing access to anybody.

94 thoughts on “How Google’s Web Crawler Bypasses Paywalls”

  1. I thought Google’s rule was that to be indexed, you couldn’t be behind a paywall. Did I have that wrong, or did that change?

  2. On my side, there is an error when inspecting your extension. It said there were errors with these parts:

    chrome.webRequest.onBeforeSendHeaders.addListener(changeRefer, {
    urls: [""],
    types: ["main_frame"],
    }, ["requestHeaders", "blocking"]);

    chrome.webRequest.onHeadersReceived.addListener(blockCookies, {
    urls: [""],
    types: ["main_frame"],
    }, ["responseHeaders", "blocking"]);

    Removed the first element from the urls array so now it looks like:

    chrome.webRequest.onBeforeSendHeaders.addListener(changeRefer, {
    urls: [],
    types: ["main_frame"],
    }, ["requestHeaders", "blocking"]);

    chrome.webRequest.onHeadersReceived.addListener(blockCookies, {
    urls: [],
    types: ["main_frame"],
    }, ["responseHeaders", "blocking"]);

    And it’s working fine now! Great work

    1. Oh, I just realized that I originally had urls: [“<all_urls>”], and the text within <> disappeared. Thank you.

  3. Well, if they are smart enough they also check the IP for the request and it’s game over as you won’t be able to get the address of the Googlebot for example 🙂

    1. Oops! I just realized that I originally had [“<all_urls>”], and the text within <> disappeared in the provided code. Just fixed it. Thank you for pointing this out.

    1. To make this work in Firefox just add

      "applications": {
      "gecko": {
      "id": "someidhere"
      }
      },

      to manifest.json. Then zip (don’t compress) the two files, rename the file from .zip to .xpi, and drag the file into Firefox. You might also have to change some permissions to allow unsigned extensions. Changing xpinstall.signatures.required to false in about:config might do it.

  4. Thanks Isoroku for the interesting approach 🙂
    Any clue on why this would work on some sites (wsj.com, scmp.com for instance) and not on others (gamekult.com, lemonde.fr…)? Is it due to the paywall architecture being different? Typically you can access the beginning of the articles on the latter (hence maybe what crawlers are shown isn’t beyond the paywall).

    Not a tech guy here, but still interested in understanding. Thanks!

    1. For that just open the article in incognito mode. Been using the same approach since they limited access to 5 articles

        1. yeah, tried deleting my cookies as suggested above, still neither wsj nor nytimes work for me.

          blogs.wsj is working though

  5. Thanks for this great tip.

    Now I’d like to find a way to bypass paywalls on iOS. Any ideas?

    We’d need a browser that worked with extensions like this one, or that at least let us change both the referer and the user-agent…

    1. You can build extensions for Safari in iOS. I believe Apple gives you the ability to modify headers, judging by existing extensions, but I haven’t checked for sure.

  6. The FT seems to have made some changes that now block the use of the set-up presented here. WSJ for example still works. I wonder if the FT has a mole here.

      1. FT hasn’t been working properly for me for a few days. This is the message I get: The http://www.ft.com page isn’t working http://www.ft.com redirected you too many times.
        Try clearing your cookies.
        ERR_TOO_MANY_REDIRECTS
        —-

        I’m using a bypass extension on Chrome that had been fine for the past month.

        1. I am getting the same as you. I assume your extension isn’t working anymore either. If it is still working, could you tell me which one it is?

  7. I had been using Bypass, an extension in the Chrome store. It was recently blocked. If you know how to sideload an extension, here is the zipped file for Bypass. I need to find a tutorial [for sideloading], blow-by-blow, step-by-step because I can’t figure it out! https://github.com/cezary/bypass

  8. Angry Thinker,

    I was able to sideload the file, but I’m having a problem with FT. That was the primary reason I downloaded it. It may work for other sites, though. One thing you could do, or I hope that somebody does, is to contact Cezary through Twitter. He has stated that he intends to maintain, and improve upon, this extension. The ext. doesn’t appear to be blocking cookies, though. It is supposed to. Google may be preventing it from doing so. Hope you can post an answer real soon.

    1. Steve I don’t use Twitter so can’t contact Cezary that way. Like you, my main reason for using the extension is for the FT website. I suppose you don’t use Twitter either otherwise you would not ask someone else to contact Cezary. If you have an email address for him I would be quite happy to follow up with him.

      1. I sent him an e-mail on the 27th, but he hasn’t responded. There are issues posted on his github page from a month ago. He hasn’t responded to them either. I believe the problem is with FT, not with the Ext. See here for contact info. http://cezarywojtkowski.com/

  9. That extension has the same problem as the one I mentioned. It works fine for all the websites that are listed, EXCEPT FOR the one website that I would use it for: Financial Times. You probably don’t visit that site, so you probably wouldn’t realize that that one doesn’t work. Try it.

  10. Angry and Steve,

    Only today I remembered my manifest.json file has a few extra lines for the FT. Maybe that’s why I can read it with no problems.

    It’s like this:

    "permissions": ["webRequest", "webRequestBlocking",
    "https://www.ft.com/*",
    "http://www.ft.com/*",
    "https://next.ft.com/*",
    "http://next.ft.com/*"
    ],

    I probably added the extra lines when the FT launched next.ft.com… I don’t even know if the next.ft.com lines are still needed.

    1. Thanks for coming back to us Butchmo. I added the extra lines, but still no joy. Are you sure you can still access articles, and not just the headlines? I can access headlines (as before) but still no details per article. I think this time the FT have sealed up their site well & truly.

      1. Yes, I can read all the articles.

        Have you tried clearing the cookies or disabling other extensions you use? I think the problem might not be your custom extension, but something else.

        A couple weeks ago it seemed my custom extension stopped working (for the FT), but everything was back to normal after I cleared the cookies.

        1. I whitelisted ft.com in uBlockOrigin & in Opera’s ad blocker. I have access again :-)) Thanks for your help.
