  • How to scrape a website in Node.js?


    Sometimes we need to scrape data from a website. It could be any live website, for example https://www.amazon.com.

     

    There are 3 steps involved in scraping a website:-

     

    1- Getting the HTML content of a website's page.
    2- Parsing the HTML content into an object.
    3- Finally, extracting the needed information from the page.

     

    Here is how I did it:-

     

    1- Getting HTML content:

    To load the webpage, we make an HTTP call to the website, which returns the HTML source of the page.
    We use the request module for this.

    To install:

    npm install request

    This module helps us make HTTP requests (GET, POST, PUT, etc.) to any website.

    Ex:

    var request = require('request');

    request({
      method: 'GET',
      url: 'http://www.lipsum.com/'
    }, function(err, response, body) {
      if (err) return console.error(err);
      // body now holds the HTML source of the page
    });
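
The request package has been deprecated since 2020, so on newer Node.js versions this step can also be done without installing anything. The sketch below assumes Node.js 18 or later, where fetch is available globally. fetchHtml is a hypothetical helper name used only for illustration, not something from the original code.

```javascript
// Minimal sketch: load a page's HTML with the built-in fetch (assumes Node.js 18+).
// fetchHtml is a hypothetical helper name used only for this example.
async function fetchHtml(url) {
  const response = await fetch(url);
  // fetch rejects only on network failures, so check the HTTP status explicitly.
  if (!response.ok) {
    throw new Error('Request failed with status ' + response.status);
  }
  // Resolves to the HTML source as a string, like `body` in the request callback.
  return response.text();
}

// Usage:
// fetchHtml('http://www.lipsum.com/')
//   .then(function (body) { console.log(body.length); })
//   .catch(console.error);
```

The returned string can be passed to the parsing step in the same way as the body argument of the request callback.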


    2- Parsing HTML source

    Now that we have the HTML source, we need to parse it into an object. For this we use the cheerio module.

    Installing cheerio:

    npm install cheerio

    Ex:

    var cheerio = require('cheerio');
    var request = require('request');

    request({
      method: 'GET',
      url: 'http://www.lipsum.com/'
    }, function(err, response, body) {
      if (err) return console.error(err);
      var $ = cheerio.load(body); // $ contains the parsed object
    });

    3- Extracting content from object

    Cheerio provides a set of methods to traverse the parsed object and extract information from it.

    In the example below, I am scraping the well-known Lorem Ipsum website.
     

    var cheerio = require('cheerio');
    var request = require('request');
    
    request({
      method: 'GET',
      url: 'http://www.lipsum.com/'
    }, function(err, response, body) {
      if (err) return console.error(err);
      var $ = cheerio.load(body);

      // Descendant <div> of #Panes that is a first child (the first box)
      var firstBox = $('#Panes').find('div:first-child');

      console.log(firstBox.find('h2').text());
      console.log("=================");
      console.log(firstBox.find('p').text());
    });

     

    Result:

     

    What is Lorem Ipsum?
    =================
    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
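
Text pulled out of real pages often carries the line breaks and indentation of the HTML source, and .text() called on several matched elements returns their text combined into one string. A minimal sketch of a clean-up step follows. normalizeText is a hypothetical helper written for this example, not part of cheerio.

```javascript
// Hypothetical helper: collapse runs of whitespace and trim the ends of scraped text.
function normalizeText(text) {
  return text.replace(/\s+/g, ' ').trim();
}

console.log(normalizeText('\n   What is   Lorem Ipsum?\n  '));
// prints "What is Lorem Ipsum?"
```

It could be applied to the strings returned by firstBox.find('h2').text() and firstBox.find('p').text() before printing them.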

     
