Web Scraping in Node.js
node.jSTL
8/11/2016
Who am I?
Paul Zerkel
CTO @
Consulting @
A Quick Background on Scraping
- Building a read-only API on top of a User Interface
- Originally called screen scraping
- Document scraping is common too (ex: Tabula)
Not Covering...
- Web Crawling
- Text Analysis
Web Scraping Involves:
- Receiving data from web server responses
- Processing that data to make sense of it
- Storing or acting on the extracted values
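
A minimal sketch of those three steps, using got and cheerio (both covered later) against a hypothetical product page and selector:

const got = require('got');
const cheerio = require('cheerio');
const fs = require('fs');

// 1. receive data from the web server
got('http://localhost:8080/products')
  .then(response => {
    // 2. process the response to make sense of it
    const $ = cheerio.load(response.body);
    const price = $('.price').first().text();
    // 3. store the extracted value
    fs.appendFileSync('prices.log', price + '\n');
  });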
Common Examples
- Alerting on changes to web content
- Reformatting content for users
- Aggregating data and storing information
- Accessing financial data on behalf of users
- Populating a niche search engine
Web Scraping Advantages
- Low barrier to entry compared to screen scraping
- Common response content types provide some structure to the data
- Developers can start with common tools
Potential Problems
- UI changes can ruin your scraping efforts
- Performance leaves a lot to be desired
- Client-side rendering is trickier to deal with
- Terms of Service or robots.txt may prohibit crawling and scraping
- Rate limiting, session tracking, IP blocking, etc. can slow you down
Example: GET + RegEx
- An easy way to check for the presence of text
- Possible with just the standard libraries (sketched after the got example below)
- Probably easy to outgrow
- Consider using a library such as got to make the requests easier
const got = require('got');

got('localhost:8080', {
  headers: {
    'User-Agent': 'nodejstl-bot',
  },
})
.then(response => { ... });
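
The same check with nothing but the standard http module; a sketch where the URL and the pattern are placeholders:

const http = require('http');

http.get('http://localhost:8080', res => {
  let body = '';
  res.on('data', chunk => { body += chunk; });
  res.on('end', () => {
    // easy way to check for the presence of text
    if (/for sale/i.test(body)) {
      console.log('found a match');
    }
  });
});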
Example: Build a DOM With Cheerio
- Libraries exist to build a static DOM from the response body
- Cheerio creates a DOM and provides selectors and a jQuery-style API
// assuming you have a valid response
const cheerio = require('cheerio');

let $ = cheerio.load(response.body);
let address = $('address');
More on Cheerio
- Built as a lightweight and fast JSDOM replacement
- Aims for API compatibility with jQuery
- Can manipulate the DOM and render back to HTML
let streetAddress = $('[itemprop="streetAddress"]');
$('.team tbody tr').each((index, element) => { ... });
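
Putting those together to pull rows out of a table; a sketch that assumes a hypothetical .team table with name and position cells:

const cheerio = require('cheerio');
const $ = cheerio.load(response.body);

const players = [];
$('.team tbody tr').each((index, element) => {
  const cells = $(element).find('td');
  players.push({
    name: cells.eq(0).text().trim(),
    position: cells.eq(1).text().trim(),
  });
});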
Browser Automation
- Great for dynamic content, logging in, navigation, etc.
- Includes or drives a fully functional browser
- Browser plugins (macros), Selenium, PhantomJS/Casper, Nightmare, and more
Nightmare
- Browser automation library
- Based on Electron
- API to interact and extract data from the page
- Pass data between Node and the browser
const nightmare = require('nightmare')();

nightmare.goto('http://localhost:8080/products')
  .end()
  .then(() => { ... });
Controlling Nightmare
- API includes verbs such as `goto`, `click`, `wait`, etc.
- Run your own JavaScript via `evaluate` or `inject`
- `evaluate` can return data to your Node process
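
A sketch of that flow, with the URL and selector as placeholders:

const nightmare = require('nightmare')();

nightmare
  .goto('http://localhost:8080/products')
  .wait('.product')
  .evaluate(() => {
    // runs inside the page; the return value is passed back to Node
    return document.querySelectorAll('.product').length;
  })
  .end()
  .then(count => console.log(`found ${count} products`));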
Gotchas
- Set a user-agent on your client
- Be nice to your target sites
- Schedule scraping for off hours
- Throttle requests
- Obey robots.txt
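
One way to throttle: run requests sequentially with a delay in between; a sketch with placeholder URLs and interval:

const got = require('got');

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));
const urls = ['http://localhost:8080/a', 'http://localhost:8080/b'];

urls.reduce((chain, url) =>
  chain
    .then(() => got(url, { headers: { 'User-Agent': 'nodejstl-bot' } }))
    .then(() => delay(5000)), // five seconds between requests
  Promise.resolve());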