Introduction – what will we use?
Have you ever needed to use a service that offers no API, so you had to click through its pages manually? Or have you wanted to automate such a process?
This is where Watir comes in. Watir is an open source Ruby library built for automated testing, but that's not its only use! We can also use it to build a web scraper that simulates a human clicking through a page to perform an action: log in, post a comment, download some data, and a lot more besides.
One of its key features is that it uses Selenium WebDriver under the hood, which means we can scrape rich, dynamic pages built with JavaScript. In the past I tried to build a web scraper with the Mechanize gem. It's perfect for simple, static pages, but since it doesn't execute JavaScript it struggles with pages that rely heavily on JavaScript or AJAX.
Another advantage of Watir is that it can take screenshots. Why is that helpful? Imagine that your application tries to parse a page but fails somewhere along the way. What now? How do you handle the error? We can take a screenshot and upload it to S3 or save it locally! With Mechanize that would be impossible: it parses HTML with Nokogiri and never renders the page, so there is nothing to take a screenshot of.
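To make the screenshot idea concrete, here is a minimal sketch. The helper name with_failure_screenshot and the default file name are my own inventions, not part of the gem. It only assumes that the browser object responds to screenshot.save(path), which is how Watir::Browser exposes screenshots.

```ruby
# Runs the given block and, if it raises, saves a screenshot before
# re-raising the original error.
# `browser` is assumed to respond to `screenshot.save(path)`,
# as a Watir::Browser instance does.
def with_failure_screenshot(browser, path: 'failure.png')
  yield browser
rescue StandardError
  browser.screenshot.save(path) # keep evidence of what the page looked like
  raise                         # let the caller still see the original error
end

# Hypothetical usage with a real browser (requires the watir gem):
#   with_failure_screenshot(browser, path: 'login_error.png') do |b|
#     b.goto('https://www.facebook.com/')
#   end
```

Uploading the saved file to S3 would be a separate step after the save call.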
Two modes
There are two ways to run Watir: in a normal browser, e.g. Chrome or Firefox, or in “headless” mode. What does that mean?
“Headless” mode lets you load a page without a monitor. On most UNIX systems this requires Xvfb to be pre-installed on your machine (for example, if you're using Ubuntu). In this mode, Watir uses PhantomJS, a headless browser, to load and run the page without opening a window. If you want to drive Chrome instead, you need to install chromedriver.
Another great feature is the mobile/device testing mode. It lets you load a page the way an iPhone, iPad or another mobile device would. It can be a great way to test whether a page is responsive and scales well.
In this article, I'll show some of Watir's features. I built a simple Ruby gem that lets us sign in, sign up, invite a friend or like a page on Facebook. I'll describe each part of the gem and explain how it works.
The full source code can be found here.
Source Code
Let’s start.
    def initialize(email, password)
      @email = email
      @password = password
    end
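There isn't much to comment on here: we just assign the email and password to instance variables. The later methods call email, password and logged_in directly, so the class presumably also defines readers for them.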
    def browser
      @_browser ||= Watir::Browser.new(:chrome)
    end
The browser method memoizes the Watir instance. Here you can specify which browser to run: chrome, firefox, etc. If you pass phantomjs, it will run in headless mode.
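If memoization with ||= is new to you, the following sketch shows the effect in isolation. LazyBrowser and its factory block are hypothetical names used only for illustration. The gem itself hard-codes Watir::Browser.new(:chrome), while this version takes the constructor as a block so the browser choice can be swapped.

```ruby
# Illustrative wrapper: builds the browser on first use and reuses it afterwards.
class LazyBrowser
  # `factory` is any block that builds a browser, for example
  #   LazyBrowser.new { Watir::Browser.new(:firefox) }  # requires the watir gem
  def initialize(&factory)
    @factory = factory
  end

  def browser
    # ||= only calls the factory while @_browser is still nil
    @_browser ||= @factory.call
  end
end
```

Calling browser twice returns the same object, so only one browser window is ever opened per instance.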
    def login
      return true if @logged_in
      browser.goto('https://www.facebook.com/')
      form = browser.form(id: 'login_form')
      return false unless form.exist?
      form.text_field(name: 'email').set(email)
      form.text_field(name: 'pass').set(password)
      form.input(value: 'Log In').click
      sleep(2)
      @logged_in = main_page?
    end
The login method logs into Facebook with the credentials passed when the instance was initialized.
As you can see, we use goto, which navigates the browser to the URL passed as its parameter.
The form method searches for a form matching the passed params. In this case, we look for a form with id: login_form.
One important thing here: if you look up an element that doesn't exist and call methods on it, your script will wait for that element (30 seconds by default) and everything will be blocked. The best approach is to call the exist? method first to check that the element really exists.
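The check-before-acting advice can be wrapped in a small guard. This is only a sketch: fill_if_present is a hypothetical helper that is not in the gem, and it assumes the container (a Watir form, for instance) responds to text_field(name: ...) and that the returned element responds to exist? and set.

```ruby
# Fills a text field only if it is actually on the page.
# Returns true when the value was set, false when the field is missing,
# so the caller never blocks waiting for an element that isn't there.
def fill_if_present(container, name, value)
  field = container.text_field(name: name)
  return false unless field.exist?

  field.set(value)
  true
end

# Hypothetical usage inside login:
#   return false unless fill_if_present(form, 'email', email)
```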
The text_field method looks for an input in the selected form that matches the passed params, and the set method fills that input with the passed value.
As you can guess, the click method clicks on an element.
Why do I call sleep to wait for 2 seconds? To give all the elements, JavaScript and other assets time to load.
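A fixed sleep is either too long or too short, so one alternative is to poll for a condition with a timeout. The helper below is a plain-Ruby sketch with a made-up name (wait_for). The gem uses sleep, and Watir also ships its own wait helpers, which may well be the better fit in real code.

```ruby
# Polls the block until it returns a truthy value or `timeout` seconds pass.
# Returns the block's value on success and false on timeout.
def wait_for(timeout: 5, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep(interval)
  end
end

# Hypothetical replacement for `sleep(2)` in login:
#   @logged_in = wait_for(timeout: 5) { main_page? }
```

The condition is always checked at least once, so the helper returns immediately when the page is already loaded.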
    def main_page?
      browser.element(id: 'userNavigationLabel').exist?
    end
The main_page? method checks whether the user navigation element exists. If it does, we have logged in successfully!
    def registration_params_valid?(params)
      return false unless params.keys.uniq.sort == REGISTRATION_INPUTS.uniq.sort
      return false if params.values.map(&:blank?).include?(true)
      return false if EMAIL_REGEX.match(params[:email]).nil?
      true
    end
The registration_params_valid? method checks that all of the sign-up form's fields have been filled in and that the passed email address is valid.
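The method above relies on two constants and on blank?, which comes from ActiveSupport. To try the same checks in plain Ruby, here is a standalone sketch. The values of REGISTRATION_INPUTS and EMAIL_REGEX below are my assumptions, since the article doesn't show the gem's definitions, and blank? is swapped for to_s.strip.empty?.

```ruby
# Assumed field list; the gem's real constant is not shown in the article.
REGISTRATION_INPUTS = %i[first_name last_name email day month year sex].freeze
# Assumed, deliberately simple pattern: something@something.something
EMAIL_REGEX = /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/

def registration_params_valid?(params)
  # Exactly the expected keys, no more and no fewer
  return false unless params.keys.sort == REGISTRATION_INPUTS.sort
  # No nil, empty or whitespace-only values (plain-Ruby stand-in for blank?)
  return false if params.values.any? { |value| value.to_s.strip.empty? }

  EMAIL_REGEX.match?(params[:email])
end

params = {
  first_name: 'Jane', last_name: 'Doe', email: 'jane@example.com',
  day: '1', month: 'Jan', year: '1990', sex: 'female'
}
puts registration_params_valid?(params)                       # prints true
puts registration_params_valid?(params.merge(email: 'nope'))  # prints false
```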
    def create_account(**args)
      raise unless registration_params_valid?(args)
      browser.goto('https://www.facebook.com/')
      form = browser.form(id: 'reg')
      form.text_field(name: 'firstname').set(args[:first_name])
      form.text_field(name: 'lastname').set(args[:last_name])
      form.text_field(name: 'reg_email__').set(email)
      form.text_field(name: 'reg_email_confirmation__').set(email)
      form.text_field(name: 'reg_passwd__').set(password)
      form.select_list(name: 'birthday_day').select(args[:day])
      form.select_list(name: 'birthday_month').select(args[:month])
      form.select_list(name: 'birthday_year').select(args[:year])
      form.radio(name: 'sex', value: sex(args[:sex])).set
      form.button(name: 'websubmit').click
    end
The create_account method tries to sign up on Facebook. It first runs registration_params_valid? to check that the params are valid and raises an error if they aren't. Then it goes to Facebook's main page and fills in the sign-up form.
    def sex(value)
      value.downcase.strip == 'male' ? '2' : '1'
    end
This method normalizes the parameter and returns the value that the sex radio input in the sign-up form expects.
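A quick way to see what the mapping does is to call it with a few inputs. The method is copied from the gem so the snippet runs on its own. Note that anything other than "male" falls through to '1'.

```ruby
def sex(value)
  value.downcase.strip == 'male' ? '2' : '1'
end

puts sex(' Male ')  # prints 2 (case and surrounding spaces are ignored)
puts sex('female')  # prints 1
puts sex('other')   # prints 1 (every non-"male" value maps to '1')
```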
    def search(query)
      login unless logged_in
      form = browser.form(action: '/search/web/direct_search.php')
      form.inputs.last.to_subtype.clear
      sleep(0.5)
      form.inputs.last.to_subtype.set(query)
      form.button(type: 'submit').click
    end
This method searches for the given query, but first checks whether we're logged in. If not, it logs in and then runs the search. I use sleep here because Watir sometimes clicked too fast, before all the elements had loaded.
    def perform(query, options = {})
      login unless logged_in
      search(query)
      browser.link(href: "/search/#{options[:name]}/?q=#{query}&ref=top_filter").click
      button = browser.button(class_name: options[:class_name])
      button.click if button.exist?
    end
This method performs an action: inviting a friend, liking a page, etc. We need to pass it a query and options: the name of the results tab to switch to and the class name of the button that should be clicked. But remember, only the first matching button will be clicked.
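One thing to watch in perform is that the query is interpolated into the link's href as-is, so a query containing spaces or an ampersand produces a different string than a URL-encoded link would. The sketch below builds the same path with the query escaped. search_tab_path is a hypothetical helper, and whether the links on the page use this exact encoding is an assumption you would need to verify in the browser.

```ruby
require 'uri'

# Builds the href of a search results tab with the query form-encoded.
# Hypothetical helper; the gem interpolates the raw query instead.
def search_tab_path(tab, query)
  "/search/#{tab}/?q=#{URI.encode_www_form_component(query)}&ref=top_filter"
end

puts search_tab_path('pages', 'nopio')
# prints /search/pages/?q=nopio&ref=top_filter
puts search_tab_path('people', 'ruby & rails')
# prints /search/people/?q=ruby+%26+rails&ref=top_filter
```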
    def like_page(name)
      perform(name, name: 'pages', class_name: 'PageLikeButton')
    end
It uses the perform method, passing the query along with the tab to switch to and the class name of the button to click.
    def invite_friend(name)
      perform(name, name: 'people', class_name: 'FriendRequestAdd')
    end
This works the same way as the like_page method, but it sends a friend request instead.
Testing
Well, that's all of the methods. You can download the source and test it yourself. How?
Just clone the gem into a directory, run bundle install and then:
    $ bundle console
    > scraper = NopioScraper::Facebook.new('your_email', 'your password')
    > scraper.like_page('nopio')
And that's all! Remember that I didn't cover any unexpected cases here, like browser popups or alerts. Every browser behaves differently, so it's hard to predict how yours will behave.
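For JavaScript alerts specifically, a small guard could look like the sketch below. dismiss_alert is a hypothetical helper, not part of the gem. It assumes the browser exposes alert with exists?, text and ok, as Watir's alert object does, and it does nothing about other kinds of popups such as permission prompts.

```ruby
# Accepts a JavaScript alert if one is open and returns its text.
# Returns nil when no alert is present.
def dismiss_alert(browser)
  alert = browser.alert
  return nil unless alert.exists?

  text = alert.text # read the message before closing the dialog
  alert.ok
  text
end

# Hypothetical usage before interacting with the page:
#   message = dismiss_alert(browser)
#   puts "Closed alert: #{message}" if message
```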
As you can see, web scraping and browser automation have almost no limits. You can write code that does almost anything; it's up to you!
Here you can find the full documentation, with more information and examples:
I hope that you liked this article and that it might be useful to you! Happy web scraping!