Playwright: Web Scraping & Testing

Master browser automation for testing & scraping


Playwright is a powerful, modern browser automation framework that revolutionizes web scraping and end-to-end testing.

Developed by Microsoft, it provides a unified API for automating Chromium, Firefox, and WebKit browsers with unprecedented reliability and speed.

[Image: Playwright UI]

What is Playwright?

Playwright is an open-source browser automation framework that enables developers to write reliable end-to-end tests and build sophisticated web scraping solutions. Unlike traditional automation tools, Playwright was built from the ground up to handle modern web applications with dynamic content, single-page applications (SPAs), and complex JavaScript frameworks.

The framework addresses the core problem that plagued previous automation tools: flakiness. Playwright introduces auto-waiting mechanisms that automatically wait for elements to be actionable before performing operations, eliminating the need for arbitrary timeouts and sleep statements that made tests unreliable.

Key Features

Cross-Browser Support: Playwright supports all major browser engines - Chromium (including Chrome and Edge), Firefox, and WebKit (Safari). This means you can write your automation script once and run it across different browsers without modification, ensuring your web applications work consistently everywhere.
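
For example, the same script can be pointed at all three engines by iterating over the browser types; here is a minimal Python sketch (example.com stands in for a real target):

from playwright.sync_api import sync_playwright

# Run the same check in Chromium, Firefox, and WebKit
with sync_playwright() as p:
    for browser_type in [p.chromium, p.firefox, p.webkit]:
        browser = browser_type.launch(headless=True)
        page = browser.new_page()
        page.goto('https://example.com')
        print(browser_type.name, page.title())
        browser.close()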

Auto-Waiting: One of Playwright’s most powerful features is its built-in auto-waiting mechanism. Before performing any action, Playwright automatically waits for elements to be visible, enabled, stable, and not obscured. This eliminates race conditions and makes tests dramatically more reliable compared to tools like Selenium where explicit waits are often necessary.
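
As a small illustration (the URL and selector are placeholders), no manual waits or sleeps are needed before interacting with a freshly loaded page:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com/app')  # placeholder URL
    # click() automatically waits until the button is visible, enabled,
    # stable, and not covered by another element before clicking.
    page.click('button#submit')
    browser.close()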

Network Interception: Playwright allows you to intercept, modify, and mock network requests and responses. This is invaluable for testing edge cases, simulating slow networks, blocking unnecessary resources during scraping, or mocking API responses without needing a backend.

Mobile Emulation: Test mobile web applications by emulating various mobile devices with specific viewport sizes, user agents, and touch events. Playwright includes device descriptors for popular phones and tablets.

Powerful Selectors: Beyond CSS and XPath selectors, Playwright supports text selectors, role-based selectors for accessibility testing, and even experimental React and Vue selectors for component-based frameworks.
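
A few of these locator styles side by side, assuming a page object from the sync API (the selectors themselves are illustrative):

# CSS selector
page.locator('.product-card').first.click()

# Text selector: matches an element containing the given text
page.get_by_text('Add to cart').click()

# Role-based selector, useful for accessibility-oriented tests
page.get_by_role('button', name='Submit').click()

# XPath is still available when needed
page.locator('xpath=//h2[@class="title"]').click()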

Installation and Setup

Setting up Playwright is straightforward across different programming languages.

Python Installation

For Python projects, Playwright can be installed via pip and includes both synchronous and asynchronous APIs (if you prefer a faster, more modern Python package manager, see our guide on uv - Python Package, Project, and Environment Manager):

# Install Playwright package
pip install playwright

# Install browsers (Chromium, Firefox, WebKit)
playwright install

# For specific browser only
playwright install chromium

For a comprehensive reference of Python syntax and commonly used commands while working with Playwright, refer to our Python Cheatsheet.

JavaScript/TypeScript Installation

For Node.js projects, install Playwright via npm or yarn:

# Using npm
npm init playwright@latest

# Using yarn
yarn create playwright

# Manual installation
npm install -D @playwright/test
npx playwright install

The npm init playwright command provides an interactive setup that configures your project with example tests, configuration files, and a GitHub Actions workflow.

Basic Configuration

Create a playwright.config.ts (TypeScript) or playwright.config.js (JavaScript) file:

import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  timeout: 30000,
  retries: 2,
  workers: 4,
  use: {
    headless: true,
    viewport: { width: 1280, height: 720 },
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
  projects: [
    {
      name: 'chromium',
      use: { ...devices['Desktop Chrome'] },
    },
    {
      name: 'firefox',
      use: { ...devices['Desktop Firefox'] },
    },
    {
      name: 'webkit',
      use: { ...devices['Desktop Safari'] },
    },
  ],
});

Web Scraping with Playwright

Playwright excels at web scraping, especially for modern websites with dynamic content that traditional scraping libraries struggle with.

Basic Scraping Example

Here’s a comprehensive Python example demonstrating core scraping concepts:

from playwright.sync_api import sync_playwright
import json

def scrape_website():
    with sync_playwright() as p:
        # Launch browser
        browser = p.chromium.launch(headless=True)
        
        # Create context for isolation
        context = browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
        )
        
        # Open new page
        page = context.new_page()
        
        # Navigate to URL
        page.goto('https://example.com/products')
        
        # Wait for content to load
        page.wait_for_selector('.product-item')
        
        # Extract data
        products = page.query_selector_all('.product-item')
        
        data = []
        for product in products:
            title = product.query_selector('h2').inner_text()
            price = product.query_selector('.price').inner_text()
            url = product.query_selector('a').get_attribute('href')
            
            data.append({
                'title': title,
                'price': price,
                'url': url
            })
        
        # Clean up
        browser.close()
        
        return data

# Run scraper
results = scrape_website()
print(json.dumps(results, indent=2))

Handling Dynamic Content

Modern websites often load content dynamically via JavaScript. Playwright handles this seamlessly:

import asyncio
from playwright.async_api import async_playwright

async def scrape_dynamic_content():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        
        await page.goto('https://example.com/infinite-scroll')
        
        # Scroll to load more content
        for _ in range(5):
            await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            await page.wait_for_timeout(2000)
        
        # Wait for network to be idle
        await page.wait_for_load_state('networkidle')
        
        # Extract all loaded items
        items = await page.query_selector_all('.item')
        data = [await item.inner_text() for item in items]
        
        await browser.close()
        return data

# Async entry point
results = asyncio.run(scrape_dynamic_content())

Converting Scraped Content to Markdown

After extracting HTML content with Playwright, you often need to convert it to a more usable format. For comprehensive guides on converting HTML to Markdown, see our articles on Converting HTML to Markdown with Python: A Comprehensive Guide which compares 6 different Python libraries, and Convert HTML content to Markdown using LLM and Ollama for AI-powered conversion. If you’re working with Word documents instead, check out our guide on Converting Word Documents to Markdown.
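
As a rough sketch, the HTML captured by Playwright can be handed to an HTML-to-Markdown library such as markdownify; the URL and the article selector below are assumptions:

from playwright.sync_api import sync_playwright
from markdownify import markdownify  # pip install markdownify

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com/article')  # placeholder URL
    html = page.inner_html('article')  # grab just the article body
    browser.close()

markdown_text = markdownify(html, heading_style='ATX')
print(markdown_text)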

Authentication and Session Management

When scraping requires authentication, Playwright makes it easy to save and reuse browser state:

def login_and_save_session():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        
        # Login
        page.goto('https://example.com/login')
        page.fill('input[name="username"]', 'your_username')
        page.fill('input[name="password"]', 'your_password')
        page.click('button[type="submit"]')
        
        # Wait for navigation after login
        page.wait_for_url('**/dashboard')
        
        # Save authenticated state
        context.storage_state(path='auth_state.json')
        
        browser.close()

def scrape_with_saved_session():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        # Reuse saved authentication state
        context = browser.new_context(storage_state='auth_state.json')
        page = context.new_page()
        
        # Already authenticated!
        page.goto('https://example.com/protected-data')
        # ... scrape protected content
        
        browser.close()

This approach is particularly useful when working with APIs or building MCP servers for AI integrations. For a complete guide on implementing web scraping in AI tool integration, see our article on Building MCP Servers in Python: WebSearch & Scrape.

End-to-End Testing

Playwright’s primary use case is writing robust end-to-end tests for web applications.

Writing Your First Test

Here’s a complete test example in TypeScript:

import { test, expect } from '@playwright/test';

test('user can add item to cart', async ({ page }) => {
  // Navigate to homepage
  await page.goto('https://example-shop.com');
  
  // Search for product
  await page.fill('[data-testid="search-input"]', 'laptop');
  await page.press('[data-testid="search-input"]', 'Enter');
  
  // Wait for search results
  await expect(page.locator('.product-card').first()).toBeVisible();
  
  // Click first product
  await page.locator('.product-card').first().click();
  
  // Verify product page loaded
  await expect(page).toHaveURL(/\/product\/.+/);
  
  // Add to cart
  await page.click('[data-testid="add-to-cart"]');
  
  // Verify cart updated
  const cartCount = page.locator('[data-testid="cart-count"]');
  await expect(cartCount).toHaveText('1');
});

Page Object Model

For larger test suites, use the Page Object Model pattern to improve maintainability:

// pages/LoginPage.ts
import { Page } from '@playwright/test';

export class LoginPage {
  constructor(private page: Page) {}
  
  async navigate() {
    await this.page.goto('/login');
  }
  
  async login(username: string, password: string) {
    await this.page.fill('[name="username"]', username);
    await this.page.fill('[name="password"]', password);
    await this.page.click('button[type="submit"]');
  }
  
  async getErrorMessage() {
    return await this.page.locator('.error-message').textContent();
  }
}

// tests/login.spec.ts
import { test, expect } from '@playwright/test';
import { LoginPage } from '../pages/LoginPage';

test('login with invalid credentials shows error', async ({ page }) => {
  const loginPage = new LoginPage(page);
  await loginPage.navigate();
  await loginPage.login('invalid@email.com', 'wrongpass');
  
  const error = await loginPage.getErrorMessage();
  expect(error).toContain('Invalid credentials');
});

Advanced Features

Codegen - Automatic Test Generation

Playwright’s Codegen tool generates tests by recording your interactions with a web page:

# Open Codegen
playwright codegen example.com

# With specific browser
playwright codegen --browser firefox example.com

# With saved authentication state
playwright codegen --load-storage=auth.json example.com

As you interact with the page, Codegen generates code in real-time. This is incredibly useful for quickly prototyping tests or learning Playwright’s selector syntax.
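
The exact output depends on the chosen target language and on what you record; with --target python it looks roughly like this (illustrative only):

from playwright.sync_api import Playwright, sync_playwright

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/")
    # One line is emitted per recorded interaction
    page.get_by_role("link", name="More information").click()
    context.close()
    browser.close()

with sync_playwright() as playwright:
    run(playwright)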

Trace Viewer for Debugging

When tests fail, understanding why can be challenging. Playwright’s Trace Viewer provides a timeline view of your test execution:

// Enable tracing in config
use: {
  trace: 'on-first-retry',
}

After a test fails and retries, view the trace:

playwright show-trace trace.zip

The Trace Viewer shows screenshots at every action, network activity, console logs, and DOM snapshots, making debugging straightforward.
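
Traces can also be recorded manually from the library API, which is handy for scraping scripts that don't run under the test runner; the URL and output path below are placeholders:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context()
    # Capture screenshots, DOM snapshots, and sources for every action
    context.tracing.start(screenshots=True, snapshots=True, sources=True)

    page = context.new_page()
    page.goto('https://example.com')

    # Write the trace to disk; inspect it with `playwright show-trace trace.zip`
    context.tracing.stop(path='trace.zip')
    browser.close()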

Network Interception and Mocking

Intercept and modify network traffic for testing edge cases:

def test_with_mocked_api():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        
        # Mock API response
        def handle_route(route):
            if 'api/products' in route.request.url:
                route.fulfill(
                    status=200,
                    body=json.dumps({
                        'products': [
                            {'id': 1, 'name': 'Test Product', 'price': 99.99}
                        ]
                    })
                )
            else:
                route.continue_()
        
        page.route('**/*', handle_route)
        
        page.goto('https://example.com')
        # Page now uses mocked data
        
        browser.close()

Mobile Testing

Test your responsive designs on various devices:

from playwright.sync_api import sync_playwright

def test_mobile():
    with sync_playwright() as p:
        # Use device descriptor
        iphone_13 = p.devices['iPhone 13']
        
        browser = p.webkit.launch()
        context = browser.new_context(**iphone_13)
        page = context.new_page()
        
        page.goto('https://example.com')
        
        # Interact as mobile user
        page.locator('#mobile-menu-button').click()
        
        browser.close()

Best Practices

For Web Scraping

  1. Use Headless Mode in Production: Headless browsing is faster and uses fewer resources
  2. Implement Rate Limiting: Respect target websites with delays between requests (see the sketch after this list)
  3. Handle Errors Gracefully: Network issues, timeouts, and selector changes happen
  4. Rotate User Agents: Avoid detection by varying browser fingerprints
  5. Respect robots.txt: Check and follow website scraping policies
  6. Use Context Isolation: Create separate browser contexts for parallel scraping
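
Below is a minimal sketch combining several of these points (rate limiting, a rotating user agent, graceful error handling, and per-URL context isolation); the delays and user-agent strings are arbitrary examples:

import random
import time
from playwright.sync_api import sync_playwright

# Arbitrary example pool; swap in user agents appropriate for your targets
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

def scrape_many(urls):
    results = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        for url in urls:
            # Fresh context per URL keeps cookies and storage isolated
            context = browser.new_context(user_agent=random.choice(USER_AGENTS))
            page = context.new_page()
            try:
                page.goto(url, timeout=30000)
                results.append({'url': url, 'title': page.title()})
            except Exception as e:
                # Handle errors gracefully instead of crashing the whole run
                results.append({'url': url, 'error': str(e)})
            finally:
                context.close()
            time.sleep(random.uniform(1, 3))  # simple rate limiting
        browser.close()
    return results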

When converting scraped content to markdown format, consider leveraging LLM-based conversion tools or Python libraries specialized for HTML-to-Markdown conversion for cleaner output.

For Testing

  1. Use Data-testid Attributes: More stable than CSS classes which change frequently
  2. Avoid Hard Waits: Use Playwright’s built-in waiting mechanisms instead of sleep()
  3. Keep Tests Independent: Each test should be able to run in isolation
  4. Use Fixtures: Share setup code between tests efficiently (see the example after this list)
  5. Run Tests in Parallel: Leverage Playwright’s worker threads for speed
  6. Record Traces on Failure: Enable trace recording for easier debugging
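
For Python projects using the pytest-playwright plugin, a shared fixture might look like the sketch below; the login flow and selectors are placeholders:

import pytest

# Requires the pytest-playwright plugin, which provides the `page` fixture
@pytest.fixture
def logged_in_page(page):
    # Shared setup: log in once per test that requests this fixture
    page.goto('https://example.com/login')
    page.fill('input[name="username"]', 'test_user')
    page.fill('input[name="password"]', 'test_password')
    page.click('button[type="submit"]')
    page.wait_for_url('**/dashboard')
    return page

def test_dashboard_greets_user(logged_in_page):
    # The test starts from an already-authenticated page
    assert logged_in_page.is_visible('[data-testid="greeting"]')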

Performance Optimization

# Disable unnecessary resources
def fast_scraping():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()
        
        # Block images and stylesheets to speed up scraping
        def block_resources(route):
            if route.request.resource_type in ['image', 'stylesheet', 'font']:
                route.abort()
            else:
                route.continue_()
        
        page.route('**/*', block_resources)
        page.goto('https://example.com')
        
        browser.close()

Comparing Playwright with Alternatives

Playwright vs Selenium

Playwright Advantages:

  • Built-in auto-waiting eliminates flaky tests
  • Faster execution due to modern architecture
  • Better network interception and mocking
  • Superior debugging tools (Trace Viewer)
  • Simpler API with less boilerplate
  • Multiple browsers with single installation

Selenium Advantages:

  • More mature ecosystem with extensive community
  • Supports more programming languages
  • Wider browser compatibility including older versions

Playwright vs Puppeteer

Playwright Advantages:

  • True cross-browser support (Firefox, WebKit, Chromium)
  • Better API design based on Puppeteer lessons
  • More powerful debugging tools
  • Microsoft backing and active development

Puppeteer Advantages:

  • Slightly smaller footprint
  • Chrome DevTools Protocol expertise

For most new projects, Playwright is the recommended choice due to its modern architecture and comprehensive feature set. If you’re working with Go instead of Python or JavaScript and need web scraping capabilities, check out our guide on Beautiful Soup Alternatives for Go for comparable scraping tools in the Go ecosystem.

Common Use Cases

Data Extraction for AI/LLM Applications

Playwright is excellent for gathering training data or creating web search capabilities for AI models. When building MCP (Model Context Protocol) servers, Playwright can handle the web scraping component while LLMs process the extracted content.
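
A rough sketch of the extraction half of such a pipeline: render the page, pull out the visible text, and hand it off for downstream processing (the URL is a placeholder and the LLM step is left out):

from playwright.sync_api import sync_playwright

def extract_page_text(url: str) -> str:
    """Return the rendered, visible text of a page for downstream processing."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')
        text = page.inner_text('body')  # visible text only; scripts and styles excluded
        browser.close()
    return text

# The returned text can then be chunked, embedded, or sent to an LLM.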

Automated Testing in CI/CD

Integrate Playwright tests into your continuous integration pipeline:

# .github/workflows/playwright.yml
name: Playwright Tests
on:
  push:
    branches: [ main, master ]
  pull_request:
    branches: [ main, master ]
jobs:
  test:
    timeout-minutes: 60
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - uses: actions/setup-node@v3
      with:
        node-version: 18
    - name: Install dependencies
      run: npm ci
    - name: Install Playwright Browsers
      run: npx playwright install --with-deps
    - name: Run Playwright tests
      run: npx playwright test
    - uses: actions/upload-artifact@v3
      if: always()
      with:
        name: playwright-report
        path: playwright-report/
        retention-days: 30

Website Monitoring

Monitor your production websites for uptime and functionality:

import schedule
import time
from playwright.sync_api import sync_playwright

def monitor_website():
    with sync_playwright() as p:
        try:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto('https://your-site.com', timeout=30000)
            
            # Check critical elements
            assert page.is_visible('.header')
            assert page.is_visible('#main-content')
            
            print("✓ Website is healthy")
        except Exception as e:
            print(f"✗ Website issue detected: {e}")
            # Send alert
        finally:
            browser.close()

# Run every 5 minutes
schedule.every(5).minutes.do(monitor_website)

while True:
    schedule.run_pending()
    time.sleep(1)

Troubleshooting Common Issues

Browser Installation Issues

If browsers fail to download:

# Set custom download location
PLAYWRIGHT_BROWSERS_PATH=/custom/path playwright install

# Clear cache and reinstall
playwright uninstall
playwright install

Timeout Errors

Increase timeouts for slow networks or complex pages:

page.goto('https://slow-site.com', timeout=60000)  # 60 seconds
page.wait_for_selector('.element', timeout=30000)  # 30 seconds
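
Defaults can also be raised once per context or page instead of on every call:

context.set_default_timeout(60000)           # all pages created from this context
page.set_default_navigation_timeout(60000)   # goto, reload, wait_for_url
page.set_default_timeout(30000)              # clicks, fills, wait_for_selector, ...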

Selector Not Found

Use Playwright Inspector to identify correct selectors:

PWDEBUG=1 pytest test_file.py

This opens the inspector where you can hover over elements to see their selectors.

Conclusion

Playwright represents the cutting edge of browser automation technology, combining powerful features with excellent developer experience. Whether you’re building a web scraping pipeline, implementing comprehensive test coverage, or creating automated workflows, Playwright provides the tools and reliability you need.

Its auto-waiting mechanisms eliminate flaky tests, cross-browser support ensures your applications work everywhere, and powerful debugging tools make troubleshooting straightforward. As web applications continue to grow in complexity, Playwright’s modern architecture and active development make it an excellent choice for any browser automation needs.

For Python developers working on data pipelines or web scraping projects, Playwright integrates seamlessly with modern package managers and works well alongside pandas, requests, and other data science tools. The ability to extract structured data from complex modern websites makes it invaluable for AI applications, research projects, and business intelligence. When combined with HTML-to-Markdown conversion tools and proper content processing, Playwright becomes a complete solution for extracting, transforming, and utilizing web data at scale.
