When you are regularly scraping websites, you might want to outsource that process. There are clear benefits of using cloud resources for scraping:
- schedule your scrapers
- scale the number of scrapers
- no need for an internet connection with your local machine
- free up computational resources on your local machine
Although I run only a couple of scrapers, the major advantage for me is that I can schedule them without worrying whether my laptop is connected to wifi when I'm not at home.
This article describes the architecture and the steps to set up a free, remote scraper using RStudio Server and AWS. I will not go into too much detail but rather explain the concept behind it.
This also works with Python and on DigitalOcean
For those of you who prefer Python with BeautifulSoup or DigitalOcean, you can build a similar setup. The architecture would be the same, and the necessary steps are very similar.
Step 1: Create an AWS account
First, head over to Amazon Web Services and sign up for a free account. If you sign up for the first time, you get a 12-month trial period with free access to cloud resources. This is also referred to as the free tier.
These free resources do not include much processing power, but they are more than sufficient for our purposes.
Step 2: Install RStudio Server on an EC2 instance
Next, you create an EC2 instance and install RStudio Server on it. There are great YouTube tutorials on how to do this, and it actually takes less than 10 minutes. Check out the two by Manuel using Ubuntu or CentOS. Both tutorials are great and only differ in the operating system of the EC2 instance.
With CentOS, you do not need any terminal commands for the installation. However, you might want to choose Ubuntu, as more help resources and tutorials are available if you want to expand and configure your instance later.
Step 3: Install rvest and your favorite R packages
Now, you can log into RStudio Server in your browser using the public IP address of your EC2 instance (by default on port 8787). Check out any tutorial from the previous step if you don't know how.
In RStudio Server, you have to install all the packages your script needs. To scrape a website, you will most likely use rvest. Additionally, I installed all packages from the tidyverse to clean and pre-process the scraping results.
Note: Before you can install rvest, you might need to install libxml2 first. To do this, log into your instance from the terminal and install both system packages:
```shell
# On Ubuntu:
ssh -i ".ssh/rstudio.pem" ec2-user@<server>.compute.amazonaws.com
sudo apt-get install libssl-dev
sudo apt-get install libxml2-dev

# On CentOS:
ssh -i ".ssh/rstudio.pem" ec2-user@<server>.compute.amazonaws.com
sudo yum install openssl-devel
sudo yum install libxml2-devel
```
Afterwards, you should be able to install rvest successfully and scrape webpages.
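Once rvest is installed, a minimal scraper could look like the following sketch (the URL and the CSS selector are placeholders for illustration; adapt them to the site you want to scrape):

```r
library(rvest)

# Hypothetical example: read a page and extract all second-level headlines.
# Replace the URL and the CSS selector with your actual target.
page <- read_html("https://example.com")

headlines <- page %>%
  html_elements("h2") %>%
  html_text2()
```

The result is a character vector that you can then clean and reshape with your tidyverse packages.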
Step 4: Install cronR on RStudio Server
So far, you are able to scrape websites from your AWS instance. But to leverage the advantages of your cloud instance, install cronR. The cronR package allows you to schedule your scripts and scrapers using crontab entries.
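Assuming your scraper lives at a path like /home/rstudio/scrapers/my_scraper.R (a hypothetical example), scheduling it with cronR could look like this sketch:

```r
library(cronR)

# Build the command that runs the script with Rscript.
cmd <- cron_rscript("/home/rstudio/scrapers/my_scraper.R")

# Schedule the scraper to run every day at 06:00.
cron_add(cmd, frequency = "daily", at = "06:00", id = "my_scraper")

# Inspect or remove scheduled jobs:
cron_ls()
# cron_rm(id = "my_scraper")
```

cronR also ships an RStudio addin, so you can create these schedules through a point-and-click dialog instead of code.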
That’s it! Now you can scrape websites with R autonomously using an AWS instance.
Just upload your scripts to RStudio Server and schedule them with cronR. In addition, you can connect more services to enhance your workflow:
Optional: Connect RStudio Server with GitHub
If you are already using GitHub, this step might seem natural to you. If you are not using GitHub, start considering it. I keep all my scrapers in a private repository and synchronize it with RStudio Server and my local machine.
This way, I make sure that I am always working with the most recent script, which makes it easier to maintain my scrapers.
Optional: Add a database to store your results
Finally, you can set up a database that stores the results of your scraper. With your AWS free tier usage, you can set up a MySQL, PostgreSQL, MariaDB, or Oracle database.
I use a MySQL database to store my scraper results. Every time a scraper finishes, the results are added to the database. This way, my results end up in permanent storage that is unaffected if I pause or terminate the EC2 instance.
Another benefit is that I can directly access the most recent entries from further services. For example, dashboards and visualizations in Google Data Studio, Tableau, Plotly, etc. are always up to date. And I can, of course, also access the database from my local machine with programs like Sequel Pro.
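As a sketch of such a setup (host, database name, and credentials are placeholders, and I assume the DBI and RMariaDB packages for the connection), appending the results of a finished scraper run could look like:

```r
library(DBI)

# Hypothetical connection details: replace host, dbname, and user with
# the values of your own database instance; the password is read from
# an environment variable rather than hard-coded in the script.
con <- dbConnect(
  RMariaDB::MariaDB(),
  host     = "your-database-host",
  dbname   = "scraping",
  user     = "admin",
  password = Sys.getenv("DB_PASSWORD")
)

# Append the latest scraping results (a data frame) to a permanent table.
dbWriteTable(con, "results", results_df, append = TRUE)

dbDisconnect(con)
```

With append = TRUE, each scheduled run adds new rows instead of overwriting the table, so the database accumulates the full history of your scrapes.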
This article has also been published at HackerNoon (Medium.com).