Can the New Google Crawler Teach Us About Website Security?


Dan Ennis, CEO


A few days ago, Google announced that it will start scanning websites using a browser. This is a profound change in approach; some may say a 180-degree turn. For us as security professionals, this change has some very interesting implications, which we’ll discuss in this post.

Why is this change so significant for Google?

For more than a decade, Google analyzed only the static contents of HTML pages, missing a significant portion of each site’s content. Having the crawler simulate the behavior of a user who is browsing the site may seem at first like a step back compared to automatic static analysis of the page. However, this approach has two major advantages that made Google realize it must use it:

  1. All content in the site will be discovered.
  2. Only legal links will be discovered.

The context-aware site scanner that we’ve been working on at Sentrix over the last few years uses this same method to protect sites with a whitelist. A whitelist is a list of the requests that are allowed to access the site. If the site is analyzed correctly, the list is complete, which means that we know every legal request that users may send to the site. Any request that is not on the list is treated, by default, as a hacking attempt.
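To make the whitelist idea concrete, here is a minimal sketch in Python. It is an illustration only, not the Sentrix implementation: the function names, the example URLs, and the choice to key each request by method, path, and parameter names are all assumptions made for this example.

```python
from urllib.parse import urlsplit, parse_qsl


def request_key(method, url):
    """Reduce a request to (method, path, sorted query parameter names).

    Assumption for this sketch: parameter values vary between users,
    so only the parameter names are part of the key.
    """
    parts = urlsplit(url)
    names = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return (method.upper(), parts.path or "/", tuple(sorted(names)))


def build_whitelist(crawled_requests):
    """Build the whitelist from the requests a crawl discovered."""
    return {request_key(method, url) for method, url in crawled_requests}


def is_allowed(whitelist, method, url):
    """A request is allowed only if the crawl produced a matching key."""
    return request_key(method, url) in whitelist


# Hypothetical crawl result for an example site.
crawled = [
    ("GET", "https://example.com/"),
    ("GET", "https://example.com/products?id=7"),
    ("POST", "https://example.com/login"),
]
whitelist = build_whitelist(crawled)

print(is_allowed(whitelist, "GET", "https://example.com/products?id=42"))  # known page, new value
print(is_allowed(whitelist, "GET", "https://example.com/admin.php"))       # never seen by the crawl
```

The first check passes because the crawl saw the same path with the same parameter name; the second is rejected because no crawled request matches it. A real whitelist would likely also constrain parameter values, which this sketch leaves out.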

In fact, the only possible way to create a complete whitelist is by using the crawl-by-browser method. Other methods, such as eavesdropping on incoming traffic, have proven time and time again to be flawed. When listening to incoming user traffic, you capture both legal and illegal requests, initiated by regular users and hackers alike. Distinguishing one from the other is time-consuming, never-ending, and virtually impossible.

The need to discover all of a site’s content, and only its legal links, is what caused Google to recently update its crawling algorithm.

Writing a crawler that imitates a real user is a complex task, and there are many pitfalls while scanning: content that appears only after a certain time or through asynchronous (AJAX) requests, links that are created by complex, stateful JavaScript code, and many others. Content can be produced by various interactions: clicking on links, entering data into forms, pinching or swiping on a mobile device, and so on. This requires a sophisticated algorithm that predicts which interactions may produce additional content; otherwise, a single page could take hours to crawl.

Another concern is identifying identical pages that are loaded from different URLs (even something as innocuous as a randomized cachebuster query parameter, a common practice, can baffle a naive scanner). Furthermore, there is the problem of keeping up with changes to the site. In an ideal world, adhering to the cache-control settings sent by the customer’s web server would be enough. In practice, however, this data is often incorrect or incomplete: pages that are rarely updated are not marked as cacheable, and vice versa.
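As an illustration of the cachebuster problem, the sketch below canonicalizes URLs before comparing them, so that two URLs differing only in a throwaway parameter map to the same page. The parameter names in `CACHEBUSTER_PARAMS` are assumptions for this example; a real scanner would have to learn or configure them per site, and would probably compare page content as well.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: parameter names that carry no meaning for the page content.
CACHEBUSTER_PARAMS = {"_", "cb", "cachebuster"}


def canonical_url(url):
    """Drop cachebuster parameters, sort the rest, and remove the fragment."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in CACHEBUSTER_PARAMS
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )


def dedupe(urls):
    """Return canonical URLs in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for url in urls:
        canon = canonical_url(url)
        if canon not in seen:
            seen.add(canon)
            unique.append(canon)
    return unique


print(dedupe([
    "https://example.com/news?page=2&_=1699999",
    "https://example.com/news?_=42&page=2",
    "https://example.com/news?page=3",
]))
```

With these inputs, the first two URLs collapse into a single entry and only two pages remain to be scanned.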

Moreover, our experience shows that most of the pages in a customer’s site are composed of repeating templates. Take Wikipedia, for example. The English Wikipedia site consists of more than 4,500,000 pages, almost all of which have a menu bar. A smart scan needs to identify this pattern and avoid scanning it redundantly across millions of pages.
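One simple way to picture template detection is to hash each block of a page and count how many pages share it. The sketch below does that; it assumes pages have already been split into blocks (a menu bar, an article body, and so on), and the 80% threshold and all names are hypothetical choices for this example rather than a description of any production scanner.

```python
import hashlib
from collections import Counter


def block_hash(block):
    """Hash a block after collapsing whitespace, so formatting noise is ignored."""
    normalized = " ".join(block.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_template_blocks(pages, min_share=0.8):
    """Return hashes of blocks that appear on at least min_share of the pages.

    pages maps a URL to the list of blocks found on that page.
    """
    counts = Counter()
    for blocks in pages.values():
        # Count each block once per page, even if it repeats within the page.
        counts.update({block_hash(block) for block in blocks})
    threshold = min_share * len(pages)
    return {h for h, n in counts.items() if n >= threshold}


def unique_blocks(blocks, template_hashes):
    """Keep only the blocks that are not part of the shared template."""
    return [block for block in blocks if block_hash(block) not in template_hashes]


# Hypothetical pages that share one menu bar.
pages = {
    "/wiki/A": ["<nav>Home | About</nav>", "<p>Article A</p>"],
    "/wiki/B": ["<nav>Home | About</nav>", "<p>Article B</p>"],
    "/wiki/C": ["<nav>Home |  About</nav>", "<p>Article C</p>"],
}
templates = find_template_blocks(pages)
print(unique_blocks(pages["/wiki/A"], templates))
```

Once the menu bar is recognized as a template, it needs to be scanned only once, and each page contributes only its own article block.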

Lastly, another set of problems lies in overcoming CAPTCHAs, handling Flash and Silverlight content, interacting with HTML5 elements such as <canvas>, and others. This may require the use of artificial intelligence.

Google has always been at the forefront of technological innovation. This new approach to crawling marks a milestone in website analysis, and we expect others to follow. We believe that only a smart site crawler that simulates a user can produce a complete whitelist of site links and thus provide robust protection for the website owner. We have been using this approach in our context-aware website security solution. How would you approach this challenge?