Before jumping right into the solution, let's go over the basics of Internet Bots so that we can decide which solution is best and why.


What are Internet Bots?

Internet Bots are computer programs that perform automated tasks on websites. These tasks can range from scraping information off a website to spamming a site with such a large number of hits that it becomes inaccessible to its real audience.

Examples: Search engines use crawlers to go through websites for indexing. Another example is comparison websites, which compare products or prices by pulling information from other websites.


What are the effects of Internet Bots?

Approximately half of all internet traffic consists of bot activity. Check out this trend from Statista:

Percentage of bot traffic as a share of total online traffic from 2012 to 2016

Just like the thing I relate to the most (cholesterol), bots come in both good and bad varieties.

The Good Fellas

Generally, reputable companies deploy ‘Good Bots’ at large scale. These bots respect the rules that webmasters create to regulate their crawling activity and indexing rate. The rules are defined in a website's robots.txt file for crawlers to read. We can also block particular crawlers from indexing a website altogether; for example, businesses that do not focus on China can block Baidu's crawler, as shown below.
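
As a minimal illustration (the rules below are placeholders, not a recommendation), a robots.txt that lets every other crawler in but keeps Baidu's crawler, Baiduspider, out of the whole site could look like this:

```
# Allow all other crawlers
User-agent: *
Disallow:

# Block Baidu's crawler from the entire site
User-agent: Baiduspider
Disallow: /
```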

Other common bots include social network bots, website monitoring bots, backlink checker bots (e.g. SEMRushBot), aggregator bots (e.g. Feedly), and more.

The Bad Boys

Bad bots, on the other hand, are used for a number of purposes, such as stealing content, scraping reviews and news headlines, submitting forms, and commenting on posts.

One of the most harmful of these is the DDoS attack. A Distributed Denial of Service attack floods a site with so many hits that its services become saturated. This can lead to temporary suspension of services, significant charges from hosting providers, poor SEO, and a bad reputation.

From an analytics point of view, data from these malicious visits gets stored and shows up in Google Analytics, Adobe Analytics, and similar tools. Although Google Analytics does provide a bot-filtering feature to exclude known "bot traffic" from views, it still lets a considerable amount of such traffic through.

Hence, we need a solution that prevents us from making decisions based on incorrect data.


Leveraging reCaptcha v3

A "CAPTCHA" is a Turing test used to tell humans and bots apart. It is easy for humans to solve, but hard for bots and other malicious software to figure out.


Some time back, Google launched version three of its famous reCaptcha service. It succeeded version one, where users had to decipher scrambled text, and version two, where the user was asked to identify certain objects in a set of images.

The latest version helps us differentiate between human and bot behavior without actually asking the user to take a test. This is a huge improvement over past versions, where reCaptcha tests would create significant friction in the user flow. For comparison, think of the earlier versions as Batman & Robin (1997) and version three as Nolan's Batman trilogy.

How does it work?

When a user lands on the website, a score ranging from 0.0 (most likely a bot) to 1.0 (most likely a human) is generated. Over time, reCaptcha learns how users on the website typically behave, which helps the underlying machine learning algorithm generate more accurate risk scores.

This score can be stored in analytics variables (Custom Dimensions in Google Analytics, eVars in Adobe Analytics) and then used to filter out bot traffic in the Google or Adobe suites, as sketched below.
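
One possible way to do this for Universal Analytics (not taken from the original setup) is to forward the score from the back end through the Measurement Protocol; the property ID and the custom dimension index below are placeholders:

```python
import requests

def send_score_to_ga(score, client_id):
    """Send the reCaptcha score to Google Analytics (Universal Analytics)
    as an event carrying a custom dimension, via the Measurement Protocol."""
    payload = {
        "v": "1",                # Measurement Protocol version
        "tid": "UA-XXXXXXX-1",   # placeholder GA property ID
        "cid": client_id,        # anonymous client ID of the visitor
        "t": "event",
        "ec": "recaptcha",       # event category
        "ea": "score",           # event action
        "cd1": str(score),       # custom dimension 1 holds the score (placeholder index)
    }
    requests.post("https://www.google-analytics.com/collect", data=payload, timeout=5)
```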

Apart from this, when a suspected bot or malicious user attempts a login or another high-security action, they can be redirected to two-factor authentication or other verification measures.

This is a two-step process:

  1. Token Creation
  2. Scoring

Token Creation:

A library and a code snippet are embedded in the site code. This code creates a user response token and sends it to the back end. Token creation can be triggered by an event of our choice, be it a form submission, a login, or even a page view, as in the sketch below.
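
As a minimal sketch of this client-side step (the site key, action name, and back-end path are placeholders), the reCaptcha v3 library is loaded on the page, a token is created for the chosen action, and the token is posted to the back end:

```html
<script src="https://www.google.com/recaptcha/api.js?render=YOUR_SITE_KEY"></script>
<script>
  grecaptcha.ready(function () {
    // Create a user response token for the chosen action (here, a page view)
    grecaptcha.execute('YOUR_SITE_KEY', { action: 'pageview' }).then(function (token) {
      // Send the token to our back end for verification
      fetch('/recaptcha/verify', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ token: token })
      });
    });
  });
</script>
```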

Here is a snapshot of a user response token in encrypted form:

user response token

Scoring

The token generated in the first step is sent to Google through an API call from the back end. reCaptcha's adaptive risk analysis engine then returns a response from which the score can be extracted, as sketched below.

Response from Adaptive Risk Analysis engine of reCaptcha
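
As a minimal back-end sketch in Python (the secret key and the 0.5 threshold are placeholders, not official recommendations), the token is posted to Google's siteverify endpoint and the score is read from the JSON response:

```python
import requests

RECAPTCHA_SECRET = "YOUR_SECRET_KEY"  # placeholder secret key

def get_recaptcha_score(token):
    """Verify a reCaptcha v3 token with Google and return the risk score."""
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token},
        timeout=5,
    )
    result = resp.json()
    # A typical response contains: success, score, action, challenge_ts, hostname
    if not result.get("success"):
        return 0.0  # treat a failed verification as most likely a bot
    return result.get("score", 0.0)

def handle_login(token):
    """Illustrative policy: ask for extra verification when the score is low."""
    if get_recaptcha_score(token) < 0.5:  # placeholder threshold
        return "require_two_factor_auth"
    return "allow"
```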

I wrote a Python-based program that opens my portfolio page and scrapes some information from it. Here are the results for the case where a human interacted with the site and the case where the bot did:

Test Cases
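
The original script is not included in the post; for illustration, a comparable bot could be as simple as this Selenium sketch (the URL and the scraped elements are placeholders), which drives a real browser so that the reCaptcha v3 snippet on the page actually runs:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open the page with a real browser so the embedded reCaptcha v3 snippet executes
driver = webdriver.Chrome()
driver.get("https://example.com/portfolio")  # placeholder URL

# Scrape some information, e.g. the text of all second-level headings
headings = [h.text for h in driver.find_elements(By.TAG_NAME, "h2")]
print(headings)

driver.quit()
```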

End Game:

Google doesn't reveal exactly how it builds a behavior profile, so that scammers cannot imitate human behavior. According to two security researchers from the University of Toronto who have studied reCaptcha, the score depends heavily on whether or not you have a Google cookie in your browser. Another source said reCaptcha's API collects software and hardware information, including application and device data, and sends it to Google for analysis.

Google has not addressed the potential privacy concerns this raises. Still, we can consider reCaptcha v3 a way of ensuring a safe, frictionless online experience.

Source: The Register
