Puppeteer vs. Selenium: What to Choose for Web Scraping?

Posted by Pichai Nurjanah

Are you trying to decide between Puppeteer and Selenium for web scraping? Both are powerful browser automation frameworks, and making the right choice depends on your specific scraping needs and available resources.

To help you make an informed decision, we’ve highlighted the key differences between Puppeteer and Selenium in the table below. Afterward, we will delve into the details and provide a scraping example for each framework to demonstrate their effectiveness in extracting data from web pages.

Criteria | Puppeteer | Selenium
Compatible Languages | Only JavaScript is officially supported, with unofficial PHP and Python ports | Java, Python, C#, Ruby, PHP, JavaScript, and Kotlin
Browser Support | Chromium, with experimental Firefox support | Chrome, Safari, Firefox, Opera, Edge, and Internet Explorer
Performance | About 60% faster than Selenium in our tests | Fast, but slower than Puppeteer
Operating System Support | Windows, Linux, and macOS | Windows, Linux, macOS, and Solaris
Architecture | Event-driven architecture with headless browser instances | JSONWire protocol on the web driver to control the browser instance
Prerequisites | The npm package alone is enough | Selenium bindings (for the selected programming language) plus browser web drivers
Community | Smaller community compared to Selenium | Well-established documentation and a large community

Let’s look at Puppeteer first.

Puppeteer

Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. It is designed for automating tasks in Chrome or Chromium, such as taking screenshots, generating PDFs, and navigating pages.

Puppeteer can also be used for testing web pages by simulating user interactions like clicking buttons, filling out forms, and verifying the results displayed.
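
For example, capturing a full-page screenshot takes only a few lines. Here is a minimal sketch (the URL and the output filename are placeholders):

const puppeteer = require('puppeteer');

async function takeScreenshot() {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    // fullPage captures the entire scrollable page, not just the viewport
    await page.screenshot({ path: 'example.png', fullPage: true });

    await browser.close();
}

takeScreenshot();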

Advantages of Puppeteer

  • Ease of Use: Simple and straightforward to use.
  • Bundled with Chromium: No additional setup is required.
  • Headless Mode: Runs in headless mode by default but can be configured to run in full browser mode.
  • Event-Driven Architecture: Eliminates the need for manual sleep calls in your code.
  • Comprehensive Capabilities: Can take screenshots, generate PDFs, and automate all browser actions.
  • Performance Management: Offers tools for recording runtime and load performance to optimize and debug your scraper (see the tracing sketch after this list).
  • SPA Crawling: Capable of crawling Single Page Applications (SPAs) and generating pre-rendered content (server-side rendering).
  • Script Recording: Allows creating Puppeteer scripts by recording actions on the browser using the DevTools console.
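
To illustrate the performance tooling mentioned above, here is a minimal sketch that records a Chrome DevTools trace while a page loads; the URL and the output path are placeholders. The resulting trace.json can be opened in the Performance panel of Chrome DevTools for analysis.

const puppeteer = require('puppeteer');

async function tracePageLoad() {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Record a DevTools performance trace, including a screenshot per frame
    await page.tracing.start({ path: 'trace.json', screenshots: true });
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });
    await page.tracing.stop();

    await browser.close();
}

tracePageLoad();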

Disadvantages of Puppeteer

  • Limited Browser Support: Supports fewer browsers compared to Selenium.
  • JavaScript Focused: Primarily supports JavaScript, although unofficial ports for Python and PHP exist.

Web Scraping Example with Puppeteer

Let’s go through a Puppeteer web scraping tutorial to extract items from the Crime and Thriller category of the Danube website.

[Screenshot: Danube Store, Crime and Thrillers category]

To get started, import the Puppeteer module and create an asynchronous function to run the Puppeteer code:

const puppeteer = require('puppeteer'); 

async function main() { 
    // Launch a headless browser instance 
    const browser = await puppeteer.launch({ headless: true });

    // Create a new page object 
    const page = await browser.newPage();

    // Navigate to the target URL and wait until the loading finishes
    await page.goto('https://danube-webshop.herokuapp.com/', { waitUntil: 'networkidle2' });

    // Wait for the left-side bar to load
    await page.waitForSelector('ul.sidebar-list');

    // Click on the first element and wait for the navigation to finish
    await Promise.all([
        page.waitForNavigation(),
        page.click("ul[class='sidebar-list'] > li > a"),
    ]);

    // Wait for the book previews to load
    await page.waitForSelector("li[class='preview']");

    // Extract the book previews
    const books = await page.evaluateHandle(
        () => [...document.querySelectorAll("li[class='preview']")]
    );

    // Extract the relevant data using page.evaluate
    const processed_data = await page.evaluate(elements => {
        let data = [];
        elements.forEach(element => {
            let title = element.querySelector("div.preview-title").innerHTML;
            let author = element.querySelector("div.preview-author").innerHTML;
            let rating = element.querySelector("div.preview-details > p.preview-rating").innerHTML;
            let price = element.querySelector("div.preview-details > p.preview-price").innerHTML;

            let result = { title, author, rating, price };
            data.push(result);
        });
        return data;
    }, books);

    // Print out the extracted data
    console.log(processed_data);

    // Close the page and browser respectively
    await page.close();
    await browser.close();
}

// Run the main function to scrape the data
main();

Expected Output

When you run the code, the output should resemble the following:

[
    {
        title: 'Does the Sun Also Rise?',
        author: 'Ernst Doubtingway',
        rating: '★★★★☆',
        price: '$9.95'
    },
    {
        title: 'The Insiders',
        author: 'E. S. Hilton',
        rating: '★★★★☆',
        price: '$9.95'
    },
    {
        title: 'A Citrussy Clock',
        author: 'Bethany Urges',
        rating: '★★★★★',
        price: '$9.95'
    }
]

Another Example of Using Puppeteer

In addition to scraping data from web pages, Puppeteer can be used for a variety of automation tasks. One common use case is to generate a PDF of a webpage. Let’s walk through an example where Puppeteer is used to generate a PDF from a web page.

Generating a PDF with Puppeteer

Step 1: Import Puppeteer and Create an Asynchronous Function

const puppeteer = require('puppeteer');

async function generatePDF() {
    // Launch a headless browser instance
    const browser = await puppeteer.launch({ headless: true });

    // Create a new page object
    const page = await browser.newPage();

    // Navigate to the target URL
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    // Generate a PDF from the web page
    await page.pdf({
        path: 'example.pdf', // Output file path
        format: 'A4',        // Paper format
        printBackground: true, // Include background graphics
    });

    // Close the page and browser respectively
    await page.close();
    await browser.close();
}

// Run the function to generate the PDF
generatePDF();

Additional Puppeteer Options

Puppeteer provides several options for generating PDFs that can be customized to suit your needs. Here are some of the options you can use:

  • path: The file path to save the PDF.
  • format: The paper format (e.g., ‘A4’, ‘Letter’).
  • printBackground: Whether to include the background graphics.
  • landscape: Set to true for landscape orientation.
  • margin: Specify margins for the PDF (top, right, bottom, left).

Example with Additional Options:

const puppeteer = require('puppeteer');

async function generatePDF() {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    await page.pdf({
        path: 'example.pdf',
        format: 'A4',
        printBackground: true,
        landscape: true,
        margin: {
            top: '20px',
            right: '20px',
            bottom: '20px',
            left: '20px',
        },
    });

    await page.close();
    await browser.close();
}

generatePDF();

Example Output

Running the above code will create a PDF file named example.pdf in the current directory with the contents of the web page https://example.com.

Puppeteer is a versatile tool for web automation tasks, from scraping data to generating PDFs. Its ease of use and powerful features make it an excellent choice for automating a wide range of browser activities. Whether you’re scraping data, generating reports, or testing web pages, Puppeteer provides the tools you need to get the job done efficiently.

Selenium

Selenium is an open-source end-to-end testing and web automation tool often used for web scraping. Its main components include Selenium IDE, Selenium WebDriver, and Selenium Grid.

  • Selenium IDE: Used to record actions before automating them.
  • Selenium WebDriver: Executes commands in the browser.
  • Selenium Grid: Enables parallel execution across multiple machines (a minimal connection sketch follows this list).
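
To give an idea of how Grid fits in, here is a minimal sketch that drives a browser through a Grid hub from Python; it assumes a Grid is already running at the placeholder URL:

from selenium import webdriver

options = webdriver.ChromeOptions()

# Connect to a running Selenium Grid hub instead of a local browser
driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",
    options=options,
)

driver.get("https://example.com")
print(driver.title)
driver.quit()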

Advantages of Selenium

  • Ease of Use: Straightforward to use once the bindings and drivers are in place.
  • Language Support: Supports various programming languages such as Python, Java, JavaScript, Ruby, and C#.
  • Browser Automation: Can automate browsers like Firefox, Edge, Safari, and even custom QtWebKit browsers.
  • Scalability: Possible to scale Selenium to hundreds of instances using cloud servers with different browser settings.
  • Cross-Platform: Operates on Windows, macOS, and Linux.

Disadvantages of Selenium

  • Complex Setup: Getting started can be involved, since Selenium requires language-specific bindings plus a web driver for each browser you target.

Web Scraping Example with Selenium

As with Puppeteer, let’s go through a tutorial on web scraping with Selenium using the same target site. We’ll extract the book previews from the Crime and Thriller category of the Danube website.

[Screenshot: Danube Store, Crime and Thrillers category]

Step 1: Import the Necessary Modules and Configure Selenium

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")

Step 2: Initialize the Chrome WebDriver

driver = webdriver.Chrome(options=options)

Step 3: Navigate to the Target Website

url = "https://danube-webshop.herokuapp.com/"
driver.get(url)

Step 4: Click on the Crime & Thrillers Category and Extract Book Previews

time.sleep(1)
crime_n_thrillers = driver.find_element(By.CSS_SELECTOR, "ul[class='sidebar-list'] > li")
crime_n_thrillers.click()
time.sleep(1)
books = driver.find_elements(By.CSS_SELECTOR, "div.shop-content li.preview")

Step 5: Define a Function to Extract Data from Each Book Preview

def extract(element):
    title = element.find_element(By.CSS_SELECTOR, "div.preview-title").text
    author = element.find_element(By.CSS_SELECTOR, "div.preview-author").text
    rating = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-rating").text
    price = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-price").text
    return {"title": title, "author": author, "rating": rating, "price": price}

Step 6: Loop Through the Previews, Extract the Data, and Quit the Driver

extracted_data = []
for element in books:
    data = extract(element)
    extracted_data.append(data)

print(extracted_data)
driver.quit()

Expected Output

Running the above code will produce an output similar to the following:

[
    {'title': 'Does the Sun Also Rise?', 'author': 'Ernst Doubtingway', 'rating': '★★★★☆', 'price': '$9.95'},
    {'title': 'The Insiders', 'author': 'E. S. Hilton', 'rating': '★★★★☆', 'price': '$9.95'},
    {'title': 'A Citrussy Clock', 'author': 'Bethany Urges', 'rating': '★★★★★', 'price': '$9.95'}
]

Additional Selenium Example: Taking a Screenshot

In addition to scraping data, Selenium can also be used to take screenshots of web pages. Here’s an example of how to take a screenshot of a web page using Selenium.

Step 1: Import the Necessary Modules and Configure Selenium

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")

Step 2: Initialize the Chrome WebDriver

driver = webdriver.Chrome(options=options)

Step 3: Navigate to the Target Website

url = "https://example.com"
driver.get(url)

Step 4: Take a Screenshot

driver.save_screenshot("example_screenshot.png")

Step 5: Quit the Driver

driver.quit()
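
Note that save_screenshot captures only the current viewport. If you need more of the page in a single image, one option is to enlarge the window before taking the shot. A small sketch, with arbitrary dimensions:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")

# A taller window means a taller viewport, so more of the page is captured
driver.set_window_size(1920, 2000)
driver.save_screenshot("example_screenshot_tall.png")

driver.quit()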

Selenium is a versatile tool for web automation tasks, including web scraping and taking screenshots. Its support for multiple programming languages and browsers, along with its scalability, makes it a powerful choice for various automation needs. Whether you’re extracting data or generating reports, Selenium provides the capabilities to automate your tasks efficiently.

Puppeteer vs. Selenium: Speed Comparison

Is Puppeteer faster than Selenium? The answer is yes: Puppeteer is generally faster than Selenium.

To compare the speed of Puppeteer and Selenium, we used the Danube-store sandbox and ran the scripts presented above 20 times, averaging the execution times.

Selenium Speed Test

We used the time module in Python to measure the execution time of the Selenium script: one timestamp is recorded at the start of the run and another at the end, and the difference between them gives the total execution time.

Here is the complete script used for Selenium:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

def extract(element):
    title = element.find_element(By.CSS_SELECTOR, "div.preview-title").text
    author = element.find_element(By.CSS_SELECTOR, "div.preview-author").text
    rating = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-rating").text
    price = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-price").text
    return {"title": title, "author": author, "rating": rating, "price": price}

# Start the timer
start_time = time.time()

options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Create a new instance of the Chrome driver
driver = webdriver.Chrome(options=options)

url = "https://danube-webshop.herokuapp.com/"
driver.get(url)

# Click on the Crime & Thrillers category
time.sleep(1)
crime_n_thrillers = driver.find_element(By.CSS_SELECTOR, "ul[class='sidebar-list'] > li")
crime_n_thrillers.click()
time.sleep(1)

# Extract the book previews
books = driver.find_elements(By.CSS_SELECTOR, "div.shop-content li.preview")

extracted_data = []
for element in books:
    data = extract(element)
    extracted_data.append(data)

print(extracted_data)

# End the timer
end_time = time.time()
print(f"The whole script took: {end_time - start_time:.4f} seconds")

driver.quit()

Puppeteer Speed Test

For the Puppeteer script, we used the Date object in the same way: one timestamp at the start of the run and one at the end, with the difference giving the total execution time.

Here is the complete script used for Puppeteer:

const puppeteer = require('puppeteer');

async function main() {
    const start = Date.now();

    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://danube-webshop.herokuapp.com/', { waitUntil: 'networkidle2' });

    await page.waitForSelector('ul.sidebar-list');

    await Promise.all([
        page.waitForNavigation(),
        page.click("ul[class='sidebar-list'] > li > a"),
    ]);

    await page.waitForSelector("li[class='preview']");
    const books = await page.evaluateHandle(
        () => [...document.querySelectorAll("li[class='preview']")]
    );

    const processed_data = await page.evaluate(elements => {
        let data = [];
        elements.forEach(element => {
            let title = element.querySelector("div.preview-title").innerHTML;
            let author = element.querySelector("div.preview-author").innerHTML;
            let rating = element.querySelector("div.preview-details > p.preview-rating").innerHTML;
            let price = element.querySelector("div.preview-details > p.preview-price").innerHTML;

            let result = { title, author, rating, price };
            data.push(result);
        });
        return data;
    }, books);

    console.log(processed_data);
    await page.close();
    await browser.close();

    const end = Date.now();
    console.log(`Execution time: ${(end - start) / 1000} seconds`);
}

main();

Performance Test Results

The performance tests showed that Puppeteer is about 60% faster than Selenium. This speed advantage makes Puppeteer a more suitable choice for projects requiring high-speed web scraping and automation, especially when working with Chromium-based browsers.

Speed Results Summary:

[Chart: Puppeteer vs. Selenium speed test results, with Puppeteer completing the run roughly 60% faster on average]

For projects that need fast, efficient web scraping at scale, Puppeteer is therefore the stronger choice.

Puppeteer vs. Selenium: Which Is Better?

So which is better for web scraping, Puppeteer or Selenium? There is no single answer, since the right choice depends on factors such as long-term library support, cross-browser requirements, and your specific scraping needs.

Puppeteer is faster, but it supports fewer browsers and fewer programming languages than Selenium.

Conclusion

Although using Puppeteer or Selenium is a good option for web scraping, scaling up and optimizing your web scraping project can be challenging because advanced anti-bot measures can detect and block these libraries. The best way to avoid this is by using a web scraping API, like OneProxy.

Using Puppeteer with Proxy Servers

To use Puppeteer with a proxy server, you can pass the proxy settings in the args option when launching the browser instance. Here’s an example:

const puppeteer = require('puppeteer');

async function main() {
    const proxyServer = 'http://your-proxy-server:port';
    
    const browser = await puppeteer.launch({
        headless: true,
        args: [`--proxy-server=${proxyServer}`]
    });

    const page = await browser.newPage();
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    // Perform your web scraping tasks here

    await browser.close();
}

main();
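
If your proxy requires a username and password, Puppeteer can supply the credentials with page.authenticate before the first navigation. A brief sketch (the server address and credentials are placeholders):

const puppeteer = require('puppeteer');

async function mainWithAuth() {
    const browser = await puppeteer.launch({
        headless: true,
        args: ['--proxy-server=http://your-proxy-server:port'],
    });

    const page = await browser.newPage();

    // Respond to the proxy's authentication challenge
    await page.authenticate({ username: 'your-username', password: 'your-password' });

    await page.goto('https://example.com', { waitUntil: 'networkidle2' });
    await browser.close();
}

mainWithAuth();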

Using Selenium with Proxy Servers

To use Selenium with a proxy server, you can build a Proxy object and attach it to the browser options (Selenium 4 removed the older desired_capabilities argument, so the proxy is set on the options instead). Here’s an example:

from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType

proxy = Proxy()
proxy.proxy_type = ProxyType.MANUAL
proxy.http_proxy = "your-proxy-server:port"
proxy.ssl_proxy = "your-proxy-server:port"

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.proxy = proxy  # Attach the proxy configuration to the browser options

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")

# Perform your web scraping tasks here

driver.quit()

Using proxy servers with Puppeteer and Selenium can help bypass IP-based restrictions and reduce the risk of getting blocked, enhancing the efficiency of your web scraping tasks. OneProxy’s rotating proxies can further optimize this process, providing a seamless scraping experience.

Frequently Asked Questions (FAQ)

What are Puppeteer and Selenium?

Puppeteer and Selenium are both browser automation frameworks used for web scraping, testing, and automating browser tasks. Puppeteer is a Node.js library that controls Chrome or Chromium over the DevTools Protocol, while Selenium is an open-source tool that supports various browsers and programming languages through its WebDriver API.

Which is faster, Puppeteer or Selenium?

Puppeteer is generally faster than Selenium. However, the speed difference can vary depending on the specific tasks and configurations used in your web scraping or automation projects.

What are the main advantages of Puppeteer?

Puppeteer is known for its ease of use, speed, and ability to automate tasks in headless mode by default. It supports Chromium and has a strong event-driven architecture that eliminates the need for manual sleep calls in code.

What are the limitations of Puppeteer?

Puppeteer supports fewer browsers compared to Selenium and primarily focuses on JavaScript, though unofficial ports for other languages like Python and PHP exist.

How do I use a proxy server with Puppeteer?

You can configure Puppeteer to use a proxy server by passing the proxy settings in the args option when launching the browser.

What are the main advantages of Selenium?

Selenium supports multiple programming languages (Python, Java, JavaScript, Ruby, C#) and can automate various browsers, including Firefox, Edge, Safari, and custom browsers like QtWebKit. It also allows for extensive scalability through techniques like setting up cloud servers with different browser settings.

What are the drawbacks of Selenium?

Selenium can be more complex to set up compared to Puppeteer, especially when configuring it for different browsers and environments.

How do I use a proxy server with Selenium?

You can set up a proxy server in Selenium by building a Proxy object and attaching it to the browser options.

How was the speed comparison performed?

We ran the same web scraping tasks on the Danube-store sandbox using both Puppeteer and Selenium. Each script was executed 20 times, and the average execution times were calculated to compare the performance.

What were the results of the speed comparison?

The results showed that Puppeteer is about 60% faster than Selenium, making it a better choice for high-speed web scraping and automation tasks.

How can OneProxy help with web scraping?

OneProxy can help you avoid getting blocked. It handles anti-bot bypassing and provides rotating proxies, headless browsers, automatic retries, and more, ensuring a seamless web scraping experience.
