This blog is part 1 of a web scraping series.
The Bolster Research Team evaluated the performance and behavior of various Node.js libraries for web scraping. Throughout this post, we will see how to use different Node.js libraries to implement web scraping.
In subsequent posts, we will analyze various Node.js libraries and provide tips on how to use web scraping technology to strengthen your cybersecurity program.
What is Web Scraping?
Before we dive into web scraping, it’s important to have background knowledge of HTML (HyperText Markup Language), the DOM (Document Object Model), and JS (JavaScript). We recommend familiarizing yourself with developer resources to get started!
But let’s jump into it!
Web scraping is about extracting information from web pages. A website can consist of various types of information, including text, images, audio, videos, scripts, and forms.
It’s important to distinguish between web scraping and other types of data-extraction techniques.
Crawling vs web scraping
Crawling is how we search for information; scraping is how we extract it. Web crawling means moving through links or URLs to discover pages, while web scraping means extracting information from a particular page or website.
Consider the following example: you want to find a person’s contact information from a website. Crawling can help find a specific page, like a contact page or about us page, and scraping can help get the contact information of the person.
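To make the distinction concrete, here is a minimal sketch of the contact-page example using only built-in JavaScript features. Everything in it is an illustrative assumption rather than part of any library: the in-memory SITE map stands in for a real website so that no network access is needed, and crawl and scrapeEmails are hypothetical helper names. The regular expressions are deliberately simple and would not be robust enough for real-world HTML.

```javascript
// Hypothetical in-memory "website": path -> HTML (stands in for real HTTP requests).
const SITE = new Map([
  ["/", '<a href="/products">Products</a> <a href="/about">About us</a>'],
  ["/products", '<h1>Products</h1> <a href="/">Home</a>'],
  ["/about", '<h1>About us</h1> <a href="/contact">Contact</a>'],
  ["/contact", "<h1>Contact</h1> <p>Jane Doe: jane.doe@example.com</p>"],
]);

// Crawling: follow links breadth-first until a path matching `pattern` is found.
function crawl(site, startPath, pattern) {
  const queue = [startPath];
  const visited = new Set();
  while (queue.length > 0) {
    const path = queue.shift();
    if (visited.has(path) || !site.has(path)) continue; // skip repeats and dead links
    visited.add(path);
    if (pattern.test(path)) return path;
    // Queue every link found on this page for a later visit.
    for (const match of site.get(path).matchAll(/href="([^"]+)"/g)) {
      queue.push(match[1]);
    }
  }
  return null; // no matching page is reachable from startPath
}

// Scraping: extract email addresses from a single page's HTML.
function scrapeEmails(html) {
  return html.match(/[\w.+-]+@[\w-]+\.[\w.-]+/g) ?? [];
}

const contactPath = crawl(SITE, "/", /contact/);
console.log(contactPath); // the page the crawler found
console.log(scrapeEmails(SITE.get(contactPath))); // the data the scraper extracted
```

The crawler only decides which page to visit next; the scraper only looks at the content of one page. Keeping the two steps separate mirrors the distinction drawn above.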
How does automation factor into the web scraping conversation?
When reading about web crawling and scraping, we often encounter the term “web automation”. Once scraping is carried out, we can automate tasks like form submission, data extraction, testing, and validation. We will discuss some web automation techniques in future blogs.
We will use various Node.js libraries to demonstrate quick implementations of scraping. In this article, we will scrape the content of the title tag using each of these libraries.
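As a point of comparison before bringing in any library, the task itself can be sketched with nothing but a regular expression. This is a naive baseline under stated assumptions, not a recommended approach: SAMPLE_HTML is a made-up page, scrapeTitle is a hypothetical helper name, and the pattern can be fooled by comments, scripts, or other elements named title (for example inside inline SVG). That fragility is a large part of why DOM-aware libraries are worth using.

```javascript
// Hypothetical sample page; in practice the HTML would come from an HTTP response.
const SAMPLE_HTML = `<!DOCTYPE html>
<html>
  <head><title>  Example Domain </title></head>
  <body><h1>Hello</h1></body>
</html>`;

// Return the trimmed text of the first <title> element, or null if there is none.
function scrapeTitle(html) {
  // [\s\S]*? matches lazily across newlines, so multi-line titles are handled.
  const match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return match ? match[1].trim() : null;
}

console.log(scrapeTitle(SAMPLE_HTML));
```

A library that builds a real DOM replaces the pattern matching with a document query, which is the approach the rest of this series takes.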
Web scraping and JSDOM
As per the official documentation, jsdom is a pure-JavaScript implementation of many web standards, notably the WHATWG DOM and HTML Standards, for use with Node.js. In general, the goal of the project is to emulate enough of a subset of a web browser to be useful for testing and scraping real-world web applications.