Web Scraping in Node.js

node.jSTL

8/11/2016

Who am I?

Paul Zerkel

CTO @
Consulting @

A Quick Background on Scraping

  • Building a read-only API on top of a User Interface
  • Originally called screen scraping
  • Document scraping is common too (ex: Tabula)

Not Covering...

  • Web Crawling
  • Text Analysis

Web Scraping Involves:

  • Receiving data from web server responses
  • Processing the data in some fashion to make sense of it
  • Storing or acting on the value of the received data

Common Examples

  • Alerting users to changes in web content
  • Reformatting content for users
  • Aggregating data and storing information
  • Accessing financial data on behalf of users
  • Populating a niche search engine

Web Scraping Advantages

  • Low barrier to entry compared to screen scraping
  • Common response content types provide some structure to the data
  • Developers can start with common tools

Potential Problems

  • UI changes can ruin your scraping efforts
  • Performance leaves a lot to be desired
  • Client-side rendering is trickier to deal with
  • Terms of Service or robots.txt may prohibit crawling and scraping
  • Rate limiting, session tracking, IP blocking, etc can slow you down

Example: GET + RegEx

  • Easy way to check for the presence of text
  • Just the standard libraries
  • Probably easy to outgrow
  • Consider using a library such as got to make requests easier

const got = require('got');

got('http://localhost:8080', {
  headers: {
    'User-Agent': 'nodejstl-bot',
  }})
  .then(response => {
    // "GET + RegEx": test the body for the text you care about
    if (/some text/i.test(response.body)) { ... }
  });
  

Example: Build a DOM With Cheerio

  • Libraries exist to build a static DOM from the response body
  • Cheerio creates a DOM, selectors, and a jQuery-style API

const cheerio = require('cheerio');

// assuming you have a valid response
let $ = cheerio.load(response.body);
let address = $('address');

More on Cheerio

  • Built as a lightweight and fast jsdom replacement
  • Aims for API compatibility with jQuery
  • Can manipulate the DOM and render back to HTML

let streetAddress = $('[itemprop="streetAddress"]');
$('.team tbody tr').each((index, element) => { ... });

Browser Automation

  • Great for dynamic content, logging in, navigation, etc
  • Includes or drives a fully functional browser
  • Browser plugins (macros), Selenium, PhantomJS/Casper, Nightmare, and more

Nightmare

  • Browser automation library
  • Based on Electron
  • API to interact and extract data from the page
  • Pass data between Node and the browser

const nightmare = require('nightmare')();
nightmare.goto('http://localhost:8080/products')
  .end()
  .then(() => { ... });

Controlling Nightmare

  • API includes verbs such as `goto`, `click`, `wait`, etc
  • Run your own JavaScript via `evaluate` or `inject`
  • `evaluate` can return data to your Node process
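A minimal sketch of that round trip, assuming a page at a placeholder URL with a `.product` element to scrape:

```js
const nightmare = require('nightmare')();

nightmare
  .goto('http://localhost:8080/products')
  .wait('.product')    // wait until the selector appears
  .evaluate(() => {
    // runs inside the page; the return value is serialized back to Node
    return document.querySelector('.product h2').textContent;
  })
  .end()               // shut down the Electron instance
  .then(title => { ... });
```

Note that the `evaluate` callback is serialized into the page, so it cannot close over variables from your Node scope; pass them as extra arguments to `evaluate` instead.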

Gotchas

  • Set a user-agent on your client
  • Be nice to your target sites
  • Schedule scraping for off hours
  • Throttle requests
  • Obey robots.txt
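Throttling can be as simple as pausing between requests. A minimal sketch, where `scrape` is a placeholder for your own request logic (such as a got call):

```js
// wait between requests so you don't hammer the target site
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// `scrape` is any function that takes a URL and returns a promise
async function scrapePolitely(urls, scrape, delayMs = 2000) {
  const results = [];
  for (const url of urls) {
    results.push(await scrape(url)); // one request at a time
    await sleep(delayMs);            // pause before the next request
  }
  return results;
}
```

Running requests sequentially with a delay keeps you under most rate limits; a queue library adds concurrency control if you need more throughput.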

Thanks!