Data scraping

Scraping data from HTML

Required gems:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

Open HTML file:

base_url = ''
page = Nokogiri::HTML(URI.open(base_url))

Selection method 1, using XPath:

page_content = page.xpath("//div[@class='page']/a")

Selection method 2, using CSS selectors:

page_content = page.css(".page a")

This method is often preferred because the selectors are shorter, and a result can be narrowed further with another selection:

page_content = Nokogiri::HTML(URI.open(base_url))
page_table = page_content.css("table")
table_special_row = page_table.css("td.special")

When many pages must be accessed, pause the script for a few seconds between requests so you don’t overburden the site.

sleep 4

It’s also recommended to download site data if many requests will be needed:

File.open('page.html', 'w') { |f| f.write(URI.open(base_url).read) }
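
The two pieces of advice above can be combined: keep a local copy of each page, and only pause when a real request was made. This is a minimal sketch, not a definitive implementation. The helper name cached_fetch, the cache path, and the delay are hypothetical choices, and the block passed in is expected to do the actual download.

```ruby
require 'fileutils'

# Returns the contents of `path` if it already exists. Otherwise runs the
# block to download the data, saves it to `path`, and pauses for `delay`
# seconds before returning. (Hypothetical helper, not part of open-uri.)
def cached_fetch(path, delay: 4)
  return File.read(path) if File.exist?(path)

  body = yield                          # e.g. URI.open(url).read
  FileUtils.mkdir_p(File.dirname(path)) # create the cache directory if needed
  File.write(path, body)
  sleep delay                           # only pause after a real request
  body
end

# Possible usage, assuming open-uri is loaded and base_url is set:
#   html = cached_fetch('cache/page.html') { URI.open(base_url).read }
```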

Saving files

Required gems:

require 'rubygems'
require 'open-uri'

Create a directory:

Dir.mkdir('pictures') unless Dir.exist?('pictures')

Save the image files whose URLs are listed in a text file:

# open the text file, strip its BOM, and loop through its lines
File.open('pictures.txt', "r:BOM|UTF-8").readlines.each do |line|
  # remove whitespace from the URL
  url = line.gsub(/\s+/, "")
  # remove the URL prefix so only the file name remains
  filename = url.gsub('', '')

  # use begin/rescue to handle 404 errors so the script is not aborted
  begin
    picture = URI.open(url)
    File.open("pictures/#{filename}", 'wb') do |f|
      f.write(picture.read)
    end
    puts "saved #{filename}"
  rescue OpenURI::HTTPError
    puts "error saving #{filename}"
  end
end
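
When the file name is simply the last segment of each URL, it can be derived without naming the prefix to strip. A sketch using only the standard library; filename_from_url and the example URL are hypothetical.

```ruby
require 'uri'

# Derives a local file name from the last path segment of a URL,
# ignoring any query string. (Hypothetical helper for illustration.)
def filename_from_url(url)
  File.basename(URI.parse(url.strip).path)
end

puts filename_from_url("http://example.com/images/cat.jpg?size=large\n")
```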

Useful resources


CSV.open creates a CSV file. Each csv << line writes one row:

require 'csv'

CSV.open("scrape.csv", "w") do |csv|
  csv << ["value 1", "value 2", "value 3", "value 4"]
  csv << ["value 5", "value 6", "value 7", "value 8"]
end

Strings and arrays built from scraped data may arrive in different encodings, so call encode on each value of the row:

csv << [string.encode('UTF-8'), hash['key'].encode('UTF-8')]
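
If a value contains bytes that cannot be converted, encode raises an error. One way to keep the script running is to replace such bytes, sketched here with a hypothetical to_utf8 helper and made-up values. CSV.generate builds the CSV in memory, so the result can be inspected before anything is written to disk.

```ruby
require 'csv'

# Converts a value to a valid UTF-8 string, replacing bytes that are
# invalid or have no UTF-8 equivalent with "?". (Hypothetical helper.)
def to_utf8(value)
  value.to_s
       .encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
       .scrub('?') # also cleans strings already tagged as UTF-8
end

# "café" as Latin-1 bytes, as a page with an older encoding might deliver it
latin1 = "caf\xE9".dup.force_encoding('ISO-8859-1')

output = CSV.generate do |csv|
  csv << [to_utf8(latin1), to_utf8(42)]
end

puts output
```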


strip removes leading and trailing whitespace:

"   hello   ".strip    #=> "hello"
"\tgoodbye\r\n".strip  #=> "goodbye"
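
Scraped text often contains non-breaking spaces (from &nbsp; in the HTML), which strip leaves in place. A sketch of one workaround, using the Unicode-aware [[:space:]] character class; strip_all is a hypothetical helper name.

```ruby
# Removes leading and trailing whitespace, including Unicode whitespace
# such as the non-breaking space (U+00A0). (Hypothetical helper.)
def strip_all(text)
  text.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

cell = "\u00A0 hello \u00A0"

puts cell.strip.length       # the non-breaking spaces are still counted
puts strip_all(cell).length  # only the letters remain
```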


gsub replaces values in a string:

page_table_rows = page.css('tr')
page_table_rows.each do |row|
  row_string = row.to_s                # = <tr><td>Value 1</td><td>Value 2</td></tr>
  row_string.gsub!('<tr><td>', '')     # = Value 1</td><td>Value 2</td></tr>
  row_string.gsub!('</td><td>', ', ')  # = Value 1, Value 2</td></tr>
  row_string.gsub!('</td></tr>', '')   # = Value 1, Value 2
end
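
The same replacements can be tried on a plain string, which makes each step easy to check. The sketch below also shows gsub with a regular expression, which removes every remaining tag in one call; the sample row is made up.

```ruby
row_string = "<tr><td>Value 1</td><td>Value 2</td></tr>"

# Turn the cell boundaries into separators first ...
with_commas = row_string.gsub('</td><td>', ', ')

# ... then remove every remaining tag with a regular expression.
plain = with_commas.gsub(/<[^>]+>/, '')

puts plain
```

Unlike gsub!, gsub returns a new string and leaves the original untouched, so the intermediate values stay available for inspection.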


Use match and the MatchData object it returns to get values (each pair of parentheses becomes a numbered group):

"22/7/2014".match('(0[1-9]|[12][0-9]|3[01])[- /.]([1-9]|1[012])[- /.]((?:19|20)\d\d)')
# => #<MatchData "22/7/2014" 1:"22" 2:"7" 3:"2014">
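
The captured groups can be read one by one or all at once with captures. A sketch with a simplified, hypothetical pattern that uses named groups and then builds a Date from them; it assumes the date is written in day/month/year order.

```ruby
require 'date'

# Simplified pattern with named groups (assumes day/month/year order).
date_pattern = Regexp.new('(?<day>\d{1,2})[- /.](?<month>\d{1,2})[- /.](?<year>\d{4})')

match = "Published on 22/7/2014".match(date_pattern)

if match
  # captures returns the groups in order: day, month, year
  day, month, year = match.captures.map(&:to_i)
  date = Date.new(year, month, day)

  puts match[:year]  # groups can also be read by name
  puts date.iso8601
end
```

match returns nil when nothing matches, which is why the result is checked before the groups are read.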


