How to protect the content from web scraping
Every website admin has two diametrically opposed goals: helping legitimate web crawlers index the website content while protecting that content from illegal web scraping. Here is how it is done.
We have recently explained the basics of web scraping as well as the differences between Big Data scraping and web crawling. In this article, we will provide a short overview of the importance of web scraping prevention and the methods to use with limited or unlimited resources. Information protection is almost as resource-costly as creating the apps and services themselves.
However, devoting a significant amount of resources to protecting a fledgling startup website from scraping is usually wasteful: the company is young and small, so it is not an interesting target for potential attackers. Keep in mind, too, that you will not be able to secure your data from prying eyes completely. In theory you can, yet it will demand so much time and effort that the information you are trying to protect will become irrelevant and your business will go bankrupt.
It is therefore important to evaluate your market position, the value of your website information, the resources available to you at any given point in time, and the measures you can take with these resources to protect your intellectual property with maximum efficiency. Startups should devote the absolute minimum of resources to the protection of their website content and aim their efforts at conquering their market niche instead.
Enterprises, by contrast, usually have much more significant resources at their disposal and must devote a substantial share of them to protecting their know-how, as they cannot afford to rely on pure luck or simply “try a little bit”. Building a formidable defense for your product should start long before product development begins and continue in parallel with it.
Accordingly, there are two levels of website content protection from web scraping:
- Limited resources
a) Protect against SQL injection and similar attacks
b) Maintain a well-written robots.txt
c) Don’t make your web page URLs easy to enumerate. Facebook was scraped intensively because its links ended with /topic/11, /topic/12, /topic/13. Learn from the mistakes of the social media giant and make your URLs unique from the start
d) Limit the number of search results returned
e) Limit the activity from one IP address
f) Demand authentication (helps sift out fraudsters when they share an IP address with legitimate users)
g) Monitor the logs regularly to track any suspicious activity
h) State a fictitious reason for the ban (plus an error code, for legitimate users)
- Literally unlimited resources
a) Make your UI dynamic (JS, AJAX) but exercise caution, as many legitimate search robots don’t render JS, so use it only on your search page
b) Good old CAPTCHA
c) Monitor the UI interactions (how quickly the forms are filled in, where the visitors click the button, whether any CSS or images are downloaded, as web scrapers need only the HTML)
d) Blacklist the IPs of the popular proxy services at once
e) Check the User-Agent of incoming requests
f) Check the Referer
g) Check the cookies (and add special keys to them)
h) Use obfuscation to make your web page code excessively complicated for scrapers without hindering its performance
i) Hide your APIs, endpoints, database entries and any other potential breach points
j) Require authentication for all API calls, even for built-in APIs
k) Make your page markup unique (for different time zones, for different locations, for different visitors)
l) Update your HTML at least once a week
m) Fool the scrapers with honeypot data
n) Write your robots.txt according to the best practices, compose good Terms of Service and have a good lawyer at hand
o) Develop a public API and announce it (when people have a legitimate way to interact with your website, the incentive to try to scrape it will be lower)
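To make one of the cheaper measures concrete, here is a minimal Python sketch of limiting the activity from one IP address (item e in the limited-resources list). It is only an illustration under stated assumptions, not a production-ready implementation: the class name SlidingWindowLimiter, the default limit of 5 requests per 60 seconds and the in-memory storage are all our own hypothetical choices, and a real deployment would more likely keep the counters in a shared store behind the web server.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP.

    Hypothetical helper for illustration only; the limits are assumptions.
    """

    def __init__(self, max_requests=5, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # The clock is injectable so the limiter can be tested without sleeping.
        self.clock = clock
        # Maps each IP address to the timestamps of its recent requests.
        self._hits = defaultdict(deque)

    def allow(self, ip):
        """Return True if this request is within the limit, False otherwise."""
        now = self.clock()
        hits = self._hits[ip]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            # The caller decides what to do: HTTP 429, a CAPTCHA or a ban page.
            return False
        hits.append(now)
        return True
```

A request handler would call allow() with the visitor's IP address before doing any work and refuse to serve the page when it returns False. Keep in mind that several legitimate users may share one IP address, which is why this check works best together with authentication.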
Final thoughts on protecting the content from web scraping
We must end with the same statement we began with. Even devoting significant resources to protecting your data does not mean you are protected for life, as new vulnerabilities are discovered regularly (such as the Meltdown and Spectre vulnerabilities disclosed in January 2018). Ensuring the safety of your data is therefore in no way a one-time deed; it is a continuous process your team should devote significant resources to.
The other approach is the one many companies follow: hiring a trustworthy contractor to deal with the challenge and resting assured that all possible measures have been taken to protect the content from web scraping. Which way to choose is up to you. Stay vigilant and good luck!