Web Scraping in Node.js

node.jSTL

8/11/2016

Who am I?

Paul Zerkel

CTO @
Consulting @

A Quick Background on Scraping

  • Building a read-only API on top of a User Interface
  • Originally called screen scraping
  • Document scraping is common too (ex: Tabula)

Not Covering...

  • Web Crawling
  • Text Analysis

Web Scraping Involves:

  • Receiving data from web server responses
  • Processing the data in some fashion to make sense of it
  • Storing or acting on the value of the received data

Common Examples

  • Alerting users to changes in web content
  • Reformatting content for users
  • Aggregating data and storing information
  • Accessing financial data on behalf of users
  • Populating a niche search engine

Web Scraping Advantages

  • Low barrier to entry compared to screen scraping
  • Common response content types provide some structure to the data
  • Developers can start with common tools

Potential Problems

  • UI changes can ruin your scraping efforts
  • Performance leaves a lot to be desired
  • Client-side rendering is trickier to deal with
  • Terms of Service or robots.txt may prohibit crawling and scraping
  • Rate limiting, session tracking, IP blocking, etc can slow you down

Example: GET + RegEx

  • Easy way to check for the presence of text
  • Just the standard libraries
  • Probably easy to outgrow
  • Consider using a library such as got to make requests easier

const got = require('got');

got('http://localhost:8080', {
  headers: {
    'User-Agent': 'nodejstl-bot',
  }})
  .then(response => {
    // "GET + RegEx": test the body for the text you care about
    if (/some text/i.test(response.body)) { ... }
  });
  

Example: Build a DOM With Cheerio

  • Libraries exist to build a static DOM from the response body
  • Cheerio creates a DOM, selectors, and a jQuery-style API

const cheerio = require('cheerio');

// assuming you have a valid response
let $ = cheerio.load(response.body);
let address = $('address');

More on Cheerio

  • Built as a lightweight and fast jsdom replacement
  • Aims for API compatibility with jQuery
  • Can manipulate the DOM and render back to HTML

let streetAddress = $('[itemprop="streetAddress"]');
$('.team tbody tr').each((index, element) => { ... });

Browser Automation

  • Great for dynamic content, logging in, navigation, etc
  • Includes or drives a fully functional browser
  • Browser plugins (macros), Selenium, PhantomJS/Casper, Nightmare, and more

Nightmare

  • Browser automation library
  • Based on Electron
  • API to interact and extract data from the page
  • Pass data between Node and the browser

const nightmare = require('nightmare')();
nightmare.goto('http://localhost:8080/products')
  .end()
  .then(() => { ... });

Controlling Nightmare

  • API includes verbs such as `goto`, `click`, `wait`, etc
  • Run your own JavaScript via `evaluate` or `inject`
  • `evaluate` can return data to your Node process
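A minimal sketch of that round trip, assuming a page at a placeholder URL with a `.product` element to scrape:

```js
const nightmare = require('nightmare')();

nightmare
  .goto('http://localhost:8080/products')
  .wait('.product')    // wait until the selector appears
  .evaluate(() => {
    // runs inside the page; the return value is serialized back to Node
    return document.querySelector('.product h2').textContent;
  })
  .end()               // shut down the Electron instance
  .then(title => { ... });
```

Note that the `evaluate` callback is serialized into the page, so it cannot close over variables from your Node scope; pass them as extra arguments to `evaluate` instead.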

Gotchas

  • Set a user-agent on your client
  • Be nice to your target sites
  • Schedule scraping for off hours
  • Throttle requests
  • Obey robots.txt
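Throttling can be as simple as pausing between requests. A minimal sketch, where `scrape` is a placeholder for your own request logic (such as a got call):

```js
// wait between requests so you don't hammer the target site
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// `scrape` is any function that takes a URL and returns a promise
async function scrapePolitely(urls, scrape, delayMs = 2000) {
  const results = [];
  for (const url of urls) {
    results.push(await scrape(url)); // one request at a time
    await sleep(delayMs);            // pause before the next request
  }
  return results;
}
```

Running requests sequentially with a delay keeps you under most rate limits; a queue library adds concurrency control if you need more throughput.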

Thanks!