
Scraping Marketing Leads using Ruby and Selenium

Over my life I’ve started (and stopped!) dozens of small businesses and side hustles. Central to each business venture is finding marketing leads. In some cases this is simple. For example, when I was a magician I would market locally in party supply stores and online. By forming a relationship with these businesses I was able to leave business cards, which did two things: first, it hopefully earned me phone calls for magic shows, and second, it thoroughly dates me today. But how would I do this for a local business that needs to actively reach out and market itself online? There are some very clever things we can do.

In this case my approach is to find clients from local business directories. In my area there are several public listings for local businesses, but manually capturing this data would be exhausting. I’m a software creator; can I not create something to help with this problem? Yes.

I’m going to teach you how I captured it all automatically using Ruby in under 2 hours.

49 Pages!? That’s a bit too overwhelming to manually process.

There are 49 pages of businesses which would take ages to manually capture. This is where Ruby comes to the rescue!

Specifically, I chose Selenium WebDriver wrapped in a Ruby script that I wrote in the root directory of a new Rails project.

Design Choices

Sometimes the tool choice can impact the final result as much or even more than the actual effort and design, so why did I decide on these tools? Here is my reasoning:

  1. Simplicity: Ruby is easy to read, and this means it’s easier to write good code. Good code is easy to change, and realistically this code may need changing a year or two down the road (in case I want to update my data). I feel this will be very easy to adjust as the HTML changes and to adapt to other business directories.

  2. Speed of Setup: All the gems we need are included by default in a new Rails 7 project. This makes it an absolute breeze to start development on my new project/experiment/great idea. As with any creative endeavour we are constantly battling resistance, which means getting started efficiently may mean the difference between this getting completed or not. All we have to do is type: rails new contact_builder. I don’t need any Rails features, but it makes the getting-started process very easy. I know that I can always cut the Ruby code out and create a new Gemfile once the whole thing works (see the sketch after this list), but I like the overkill for now.

  3. Familiarity: I spent a decade at Shopify using Ruby to build things over the internet. Shopify is a pretty large Ruby shop and has showered me with an immense amount of information about this language. They say “go with what you know” and in this case it will make me faster at coding this up.

  4. Visual feedback: Using Selenium allows me to visually inspect what the browser is doing at each phase of the script. Rather than focussing exclusively on console logs, I can augment them with the actual webpage I’m trying to scrape. It’s much easier to reason about where the script is when you can see the page open in Chrome. It will also indicate very clearly if we get blacklisted or the server starts timing out.
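
For reference, if I ever did cut the script loose from Rails, the standalone Gemfile would be tiny. Here is a sketch (selenium-webdriver is the real gem name; everything else is preference):

# Gemfile: a minimal standalone setup once the Rails scaffolding is gone
source 'https://rubygems.org'

gem 'selenium-webdriver' # the only third-party gem this scraper relies on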

Hands to Keyboard

Where to begin? Thankfully, running a Selenium WebDriver script opens a browser window, so if you’re attentive you will see a hint of what it is doing before the script auto-exits. The first thing I added was the main page: some boilerplate to require and grab a Selenium WebDriver instance, plus a sleep after navigating to the main page.

require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new
@driver = Selenium::WebDriver.for :chrome, options: options

def scrape_pages
  root_url = 'https://example.com'

  @driver.navigate.to root_url
  sleep 5 # long enough to eyeball the page before the browser closes

  @driver.quit
end

scrape_pages

Simple enough, but this doesn’t do anything - why!?

Start simple. So many things might go wrong that I try to make sure each step works before advancing. This ensures that when something does go wrong, you can rule out these more basic stages.
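
For example, a cheap sanity check I could drop into the method above (hypothetical; it wasn’t in my actual script) is printing the page title after navigating:

@driver.navigate.to root_url
sleep 5
puts @driver.title # seeing the directory's title here confirms navigation worked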

Divide and Conquer

When coaching and mentoring I try to encourage a smaller, incremental approach where it makes sense. Break more complex concepts into more basic elements the way first-principles thinking does and it will be much easier to reason about the parts.

(P.S. You can book me for coaching at igotanoffer.com)

In this case I mean: rather than grabbing all my business info I will first generate a CSV with each detail page URL.

Here is what that would look like:

We have a directory index with a link to the next page and links to details about each business. Rather than traverse each child now (especially while perfecting the script, and encountering failure) wouldn’t it be better to split this into two steps? This is the approach I took.

require 'selenium-webdriver'
require 'csv'

options = Selenium::WebDriver::Chrome::Options.new
@driver = Selenium::WebDriver.for :chrome, options: options

def scrape_pages
  root_url = 'https://directory.example.com/'
  retries = 0

  business_details_link_text = '' # Add the text for the link to the details here.
  next_page_class = '' # Add the class for the "next" page button here.
  loop do
    begin
      @driver.navigate.to root_url
      @driver.find_elements(partial_link_text: business_details_link_text).each do |link|
        CSV.open("pages.csv", "a+") do |csv|
          csv << ["url", "completed"] if csv.count.eql? 0
          csv << ["#{link.attribute('href')}", "false"]
        end
      end
    
      root_url = @driver.find_element(class: next_page_class).attribute('href')
      puts "Next url, sleeping for 2 seconds..."
      sleep 2
      retries = 0 # reset retries after successful scrape

    rescue Selenium::WebDriver::Error::NoSuchElementError, Net::ReadTimeout => e
      retries += 1
      if retries <= 3
        puts "Attempt #{retries} failed, retrying..."
        retry
      else
        puts "Failed to scrape data from #{root_url} after 3 attempts"
        break
      end
    end
  end

  # Note: the driver stays open here so scrape_details can reuse the same session
end

In this code I’m adding the csv library so I can save each details-page URL to disk. This also lets me resume the work in case it fails partway through. Next I find the link text for the business details and the CSS class for the next-page link and save both as strings.

I have wrapped the loop work within a begin/rescue clause because there might be a situation where the network fails, or Selenium can’t find the element in question. I’m sure there are more edge cases to account for but these were the main bugs I encountered where I knew I wanted to just retry.

Within the loop I open up a CSV with a simple format: url and completed status.

url,completed
https://directory.example.com/Index/PageStructure/biz-name,false

On to the finer details

With my first csv in hand (or on disk . . . or in memory?!) I’m ready to tackle the details.

Your needs may vary but I’d like to save my business info into a simple data structure that looks like this:

contact = {
  business_name: '',
  address: '',
  phone: '',
  email: '',
  website: '',
  social: '',
  contact_person: '',
  contact_title: '',
}

I’m happy smashing all social media links into one field. I can always process the CSV a third time offline but the essential role of this step is to get the details off the internet and into a spreadsheet.
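
To illustrate, here is a rough sketch of that offline pass, assuming the comma-separated social field the final script below produces (details.csv and the column names match the script; details_split.csv is a made-up output name):

require 'csv'

# Offline pass: split the comma-joined "social" column into one column per link.
rows = CSV.read('details.csv', headers: true)
keep = rows.headers - ['social']
max_links = rows.map { |r| r['social'].to_s.split(', ').size }.max.to_i

CSV.open('details_split.csv', 'w') do |csv|
  csv << keep + (1..max_links).map { |i| "social_#{i}" }
  rows.each do |r|
    csv << keep.map { |h| r[h] } + r['social'].to_s.split(', ')
  end
end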

As a starting point I open the CSV from the previous step (after making a copy locally) and iterate through each page, choosing to skip to the next record if the completed column is set to true. The next part is a lot more complicated to explain - thankfully we split this up from the earlier part!

The detail pages of the business directory each have a div that looks like the snippet below. In particular, the business details sit in an unnamed dd following a named span, e.g. <span id="Address">Address</span>, which is perfect for selecting with Selenium but terrible for capturing the data from the next element. Or is it?

The HTML in question; notice how the address, phone, etc. live in unnamed dd elements. Darn!

<div class="col-md-12">
  <dl class="some-class">
    <dt>
      <span id="Address">Address</span>
    </dt>
    <dd>
      123 Anytown <br />
      New York, New York<br />
    </dd>

    <dt>
      <span id="Phone">Phone:</span>
    </dt>
    <dd>
      555.555.5555
    </dd>

    <dt>
      <span id="Email">Email:</span>
    </dt>
    <dd>
      <a href="mailto:example@example.com" >email text</a>
    </dd>

    <dt>
      <span id="Website">Website:</span>
    </dt>
    <dd>
      <a href="http://www.example.com" target="_blank">http://www.example.com</a>
    </dd>

    <dt>
      <span id="Details">Business details:</span>
    </dt>
    <dd>
       <p>Here are the details</p>
    </dd>
  </dl>
</div>

The rest of the script tries to find business names (if they exist; not all are listed), then it walks the table of business details as a set of dt and dd pairs. Collecting the table this way allows us to check which element we’re looking at based on the text of the previous dt. How handy!

From here the script is quite simple. Each detail is saved into the contact data structure (if it exists) and the entire contact is appended to details.csv. If the CSV is empty we first write the keys as a header row, then the values.

Finally, if we get to this stage, we mark the row as completed in the original CSV.

The script closes with some error catching, retry logic and completion notification.

There is one small bug: at the end, the script repeats the final page a few times and saves that data into the CSV. I manually removed the duplicates, but this could of course be improved within the code.
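
For what it’s worth, the cleanest fix lives in scrape_pages: treat a missing next-page link as the natural end of pagination instead of letting the retry logic re-scrape the last page. A sketch of the idea:

# Replace the next-page lookup inside the loop with an explicit end-of-pages check
begin
  root_url = @driver.find_element(class: next_page_class).attribute('href')
rescue Selenium::WebDriver::Error::NoSuchElementError
  puts 'No next-page link found; assuming this was the last page.'
  break # exit the loop cleanly instead of re-appending the final page's links
end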

def scrape_details
  # Load the list of pages from the csv file
  pages = CSV.read('pages.csv', headers: true)

  pages.each do |p|
    next if p['completed'] == 'true' # skip if already completed
    url = p['url']
    puts "Navigating to #{url}"
    retries = 0

    begin
      contact = {
        business_name: '',
        address: '',
        phone: '',
        email: '',
        website: '',
        social: '',
        contact_person: '',
        contact_title: '',
      }
      
      @driver.navigate.to url
      # sleep 2

      # This part will need to be customized for each site
      business_name, address, phone, email, website, contact_person, contact_title, social = nil
      begin
        business_name = @driver.find_element(:css, '#id-if-this-exists').text
      rescue Selenium::WebDriver::Error::NoSuchElementError
      end
      
      begin
        contact_person = @driver.find_element(:css, '.tablename.class td:nth-child(1)').text
      rescue Selenium::WebDriver::Error::NoSuchElementError
      end
      
      begin
        contact_title = @driver.find_element(:css, '.tablename.class td:nth-child(2)').text
      rescue Selenium::WebDriver::Error::NoSuchElementError
      end
      
      # Grab the definition list by its CSS class, then walk its dt/dd pairs in order
      dl = @driver.find_element(:css, '.dl-horizontal')
      dt_dd_pairs = dl.find_elements(:xpath, './/dt | .//dd')
      
      dt_dd_pairs.each_with_index do |element, i|
        begin
          case element.text.strip
          when 'Address', 'Address:' # the sample HTML omits the colon on this label
            address = dt_dd_pairs[i+1].text
          when 'Phone:'
            phone = dt_dd_pairs[i+1].text
          when 'Email:'
            email = dt_dd_pairs[i+1].find_element(:tag_name, 'a').attribute('href').gsub('mailto:', '')
          when 'Website:'
            website = dt_dd_pairs[i+1].find_element(:tag_name, 'a').attribute('href')
          when 'Social:'
            social = dt_dd_pairs[i+1].find_elements(:tag_name, 'a').map { |a| a.attribute('href') }.join(', ')
          end
        rescue Selenium::WebDriver::Error::NoSuchElementError
          next
        end
      end

      puts business_name, address, phone, email, website, contact_person, contact_title, social

      # sleep 2
      contact[:business_name] = business_name if business_name
      contact[:address] = address if address
      contact[:phone] = phone if phone
      contact[:email] = email if email
      contact[:website] = website if website
      contact[:contact_person] = contact_person if contact_person
      contact[:contact_title] = contact_title if contact_title
      contact[:social] = social if social

    CSV.open("details.csv", "a+") do |csv|
      csv << contact.keys if csv.count.eql? 0
      csv << contact.values
    end
    
    # Mark as completed in the original CSV after successful scraping
    p['completed'] = true
    CSV.open("pages.csv", "w+") do |csv|
      csv << pages.headers
      pages.each do |row|
        csv << row
      end
    end
    
  rescue Selenium::WebDriver::Error::NoSuchElementError, Net::ReadTimeout => e
    retries += 1
    if retries <= 3
      puts "Attempt #{retries} failed, retrying..."
      retry
    else
      puts "Failed to scrape data from #{url} after 3 attempts"
    end
  end

  rescue Selenium::WebDriver::Error::NoSuchElementError
    puts 'Done finding each contact details!'
    break
  end

  @driver.quit
end

scrape_pages
scrape_details

@driver.quit # close the browser once both passes are finished

And there we go! At the end we can just run both methods (or one at a time) to generate the list of pages and then grab all the details. A couple of hours later I had a CSV with the details of as many local businesses as I could find.