Web scraping / Web crawling ethics

Many of us wonder which best practices to follow when undertaking a web scraping project. Although there have been no major legal hurdles to scraping publicly available data worth writing about (other than the one-off Ryanair case), it is advisable to follow a few steps that will keep you on the right side of the law.

1. Never swamp the targeted site to the extent of denying access to other legitimate users. You can avoid this by limiting your access during the site's peak hours and ramping up during off-peak periods: from evening till dawn, on weekends, and on public holidays. Some popular sites such as Google, Yahoo, Amazon, and Facebook warn you if you access their content too fast. Treat that as a signal to slow your scraper down.

2. Never download the same content more than once, as that just wastes the site's bandwidth. Try to download all the content to your local machine in one go and then do the processing there.

3. Try not to be the #1 user of the targeted site. If the owners ever get around to checking their log files, you do not want to be at the top of the list. You may use proxy IPs so that your activity does not appear as that of the site's #1 user.

4. Ask the client whether they have the necessary permission from the site owner to download the data. If the site owner finds value in sharing the data and gives permission, that is a huge plus for the project.

5. If the targeted site requires you to create an account (paid or free) to access the data, do not use aliases. Use your actual information and inform the client upfront, or ask the client to provide access to the website.

6. If the site sends a warning email, respect it. Immediately stop scraping, delete all the data, and end the project. The client will understand.
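
Tips 1 and 2 above can be sketched in a few lines of Python. This is only a rough illustration, not a definitive implementation: the names PoliteFetcher and http_fetch, the two-second default delay, the cache directory, and the User-Agent string are all made up for the example, and treating HTTP 429 and 503 responses as the site's "slow down" warning is my assumption about how such a warning might arrive.

```python
import hashlib
import time
import urllib.error
import urllib.request
from pathlib import Path
from typing import Callable, Optional


def http_fetch(url: str) -> bytes:
    """Download one page with the standard library (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        # Placeholder identity; put your real contact details here.
        headers={"User-Agent": "example-scraper/0.1 (contact: you@example.com)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


class PoliteFetcher:
    """Waits between requests and never downloads the same URL twice."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        cache_dir: str,
        min_delay: float = 2.0,  # illustrative value, tune per site
        sleep: Callable[[float], None] = time.sleep,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.fetch = fetch
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.min_delay = min_delay
        self.sleep = sleep
        self.clock = clock
        self._last_request: Optional[float] = None

    def _cache_path(self, url: str) -> Path:
        # Hash the URL so any address maps to a safe file name.
        return self.cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url: str) -> bytes:
        path = self._cache_path(url)
        if path.exists():
            # Tip 2: already downloaded, so read the local copy instead.
            return path.read_bytes()

        if self._last_request is not None:
            # Tip 1: wait until min_delay seconds have passed since the last hit.
            wait = self.min_delay - (self.clock() - self._last_request)
            if wait > 0:
                self.sleep(wait)

        try:
            body = self.fetch(url)
        except urllib.error.HTTPError as err:
            if err.code in (429, 503):
                # Assumed "too fast" signal: double the delay for later requests.
                self.min_delay *= 2
            raise
        finally:
            self._last_request = self.clock()

        path.write_bytes(body)
        return body
```

To try it, create one fetcher per site, for example PoliteFetcher(http_fetch, "cache"), and call get() for each URL. Repeated calls for the same URL are served from the local cache directory, so the processing step can be rerun as often as needed without touching the site again.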

I hope the tips above are useful. Feel free to share your thoughts and experiences on best practices for web scraping projects.
