Building MCP Servers in Python: WebSearch & Scrape Guide

Build MCP servers for AI assistants with Python examples

The Model Context Protocol (MCP) is revolutionizing how AI assistants interact with external data sources and tools. In this guide, we’ll explore how to build MCP servers in Python, with examples focused on web search and scraping capabilities.

What is the Model Context Protocol?

Model Context Protocol (MCP) is an open protocol introduced by Anthropic to standardize how AI assistants connect to external systems. Instead of building custom integrations for each data source, MCP provides a unified interface that allows:

  • AI assistants (like Claude, ChatGPT, or custom LLM applications) to discover and use tools
  • Developers to expose data sources, tools, and prompts through a standardized protocol
  • Seamless integration without reinventing the wheel for each use case

The protocol operates on a client-server architecture where:

  • MCP Clients (AI assistants) discover and use capabilities
  • MCP Servers expose resources, tools, and prompts
  • Communication happens via JSON-RPC over stdio or HTTP/SSE (see the example message below)
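
For example, when a client invokes a tool, it sends an ordinary JSON-RPC 2.0 request over the transport. A tools/call message for a hypothetical search_web tool looks roughly like this:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_web",
    "arguments": {
      "query": "model context protocol",
      "max_results": 3
    }
  }
}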

If you’re interested in implementing MCP servers in other languages, check out our guide on implementing MCP server in Go, which covers the protocol specifications and message structure in detail.

Why Build MCP Servers in Python?

Python is an excellent choice for MCP server development because:

  1. Rich Ecosystem: Libraries like requests, beautifulsoup4, selenium, and playwright make web scraping straightforward
  2. MCP SDK: Official Python SDK (mcp) provides robust server implementation support
  3. Rapid Development: Python’s simplicity allows quick prototyping and iteration
  4. AI/ML Integration: Easy integration with AI libraries like langchain, openai, and data processing tools
  5. Community Support: Large community with extensive documentation and examples

Setting Up Your Development Environment

First, create a virtual environment and install the required dependencies. Using virtual environments is essential for Python project isolation - if you need a refresher, check out our venv Cheatsheet for detailed commands and best practices.

# Create and activate virtual environment
python -m venv mcp-env
source mcp-env/bin/activate  # On Windows: mcp-env\Scripts\activate

# Install MCP SDK and web scraping libraries
pip install mcp requests beautifulsoup4 playwright lxml
playwright install  # Install browser drivers for Playwright

Modern Alternative: If you prefer faster dependency resolution and installation, consider using uv, a modern Python package and environment manager that can be significantly faster than pip for large projects.
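
For example, the setup above translates roughly to the following uv commands:

# Equivalent setup with uv
uv venv mcp-env
source mcp-env/bin/activate  # On Windows: mcp-env\Scripts\activate
uv pip install mcp requests beautifulsoup4 playwright lxml
playwright install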

Building a Basic MCP Server

Let’s start with a minimal MCP server structure. If you’re new to Python or need a quick reference for syntax and common patterns, our Python Cheatsheet provides a comprehensive overview of Python fundamentals.

import asyncio
from mcp.server import Server
from mcp.types import Tool, TextContent
import mcp.server.stdio

# Create server instance
app = Server("websearch-scraper")

@app.list_tools()
async def list_tools() -> list[Tool]:
    """Define available tools"""
    return [
        Tool(
            name="search_web",
            description="Search the web for information",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query"
                    },
                    "max_results": {
                        "type": "number",
                        "description": "Maximum number of results",
                        "default": 5
                    }
                },
                "required": ["query"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    """Handle tool execution"""
    if name == "search_web":
        query = arguments["query"]
        max_results = arguments.get("max_results", 5)
        
        # Implement search logic here
        results = await perform_web_search(query, max_results)
        
        return [TextContent(
            type="text",
            text=f"Search results for '{query}':\n\n{results}"
        )]
    
    raise ValueError(f"Unknown tool: {name}")

async def perform_web_search(query: str, max_results: int) -> str:
    """Placeholder for actual search implementation"""
    return f"Found {max_results} results for: {query}"

async def main():
    """Run the server"""
    async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            app.create_initialization_options()
        )

if __name__ == "__main__":
    asyncio.run(main())

Implementing Web Search Functionality

Now let’s implement a real web search tool using DuckDuckGo (which doesn’t require API keys):

import asyncio
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus

async def search_duckduckgo(query: str, max_results: int = 5) -> list[dict]:
    """Search DuckDuckGo and parse results"""
    
    url = f"https://html.duckduckgo.com/html/?q={quote_plus(query)}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.text, 'html.parser')
        results = []
        
        for result in soup.select('.result')[:max_results]:
            title_elem = result.select_one('.result__title')
            snippet_elem = result.select_one('.result__snippet')
            url_elem = result.select_one('.result__url')
            
            if title_elem and snippet_elem:
                results.append({
                    "title": title_elem.get_text(strip=True),
                    "snippet": snippet_elem.get_text(strip=True),
                    "url": url_elem.get_text(strip=True) if url_elem else "N/A"
                })
        
        return results
        
    except Exception as e:
        raise RuntimeError(f"Search failed: {e}") from e

def format_search_results(results: list[dict]) -> str:
    """Format search results for display"""
    if not results:
        return "No results found."
    
    formatted = []
    for i, result in enumerate(results, 1):
        formatted.append(
            f"{i}. {result['title']}\n"
            f"   {result['snippet']}\n"
            f"   URL: {result['url']}\n"
        )
    
    return "\n".join(formatted)

Adding Web Scraping Capabilities

Let’s add a tool to scrape and extract content from web pages. When scraping HTML content for use with LLMs, you may also want to convert it to Markdown format for better processing. For this purpose, check out our comprehensive guide on converting HTML to Markdown with Python, which compares 6 different libraries with benchmarks and practical recommendations.

from playwright.async_api import async_playwright
import asyncio

async def scrape_webpage(url: str, selector: str | None = None) -> dict:
    """Scrape content from a webpage using Playwright"""
    
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
        try:
            await page.goto(url, timeout=30000)
            
            # Wait for content to load
            await page.wait_for_load_state('networkidle')
            
            if selector:
                # Extract specific element
                element = await page.query_selector(selector)
                content = await element.inner_text() if element else "Selector not found"
            else:
                # Extract main content
                content = await page.inner_text('body')
            
            title = await page.title()
            
            return {
                "title": title,
                "content": content[:5000],  # Limit content length
                "url": url,
                "success": True
            }
            
        except Exception as e:
            return {
                "error": str(e),
                "url": url,
                "success": False
            }
        finally:
            await browser.close()

# Add scraper tool to the MCP server
@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="search_web",
            description="Search the web using DuckDuckGo",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "max_results": {"type": "number", "default": 5}
                },
                "required": ["query"]
            }
        ),
        Tool(
            name="scrape_webpage",
            description="Scrape content from a webpage",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to scrape"},
                    "selector": {
                        "type": "string",
                        "description": "Optional CSS selector for specific content"
                    }
                },
                "required": ["url"]
            }
        )
    ]
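
If you would rather return Markdown than raw text from scraped pages (often easier for an LLM to work with), the rendered HTML can be converted before it leaves the tool. A minimal sketch, assuming the markdownify package (one of the libraries compared in the guide linked above):

from markdownify import markdownify as md

async def scrape_as_markdown(url: str) -> str:
    """Render a page with Playwright and convert its HTML to Markdown"""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        try:
            await page.goto(url, timeout=30000)
            html = await page.content()  # full rendered HTML
            return md(html, strip=["script", "style"])[:8000]
        finally:
            await browser.close()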

Complete MCP Server Implementation

Here’s a complete MCP server with both search and scrape capabilities (see the best-practices section below for hardening it for production):

#!/usr/bin/env python3
"""
MCP Server for Web Search and Scraping
Provides tools for searching the web and extracting content from pages
"""

import asyncio
import logging
from typing import Any
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
from playwright.async_api import async_playwright

from mcp.server import Server
from mcp.types import Tool, TextContent
import mcp.server.stdio

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("websearch-scraper")

# Create server
app = Server("websearch-scraper")

# Search implementation
async def search_web(query: str, max_results: int = 5) -> str:
    """Search DuckDuckGo and return formatted results"""
    url = f"https://html.duckduckgo.com/html/?q={quote_plus(query)}"
    headers = {"User-Agent": "Mozilla/5.0"}
    
    try:
        response = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        results = []
        
        for result in soup.select('.result')[:max_results]:
            title = result.select_one('.result__title')
            snippet = result.select_one('.result__snippet')
            link = result.select_one('.result__url')
            
            if title and snippet:
                results.append({
                    "title": title.get_text(strip=True),
                    "snippet": snippet.get_text(strip=True),
                    "url": link.get_text(strip=True) if link else ""
                })
        
        # Format results
        if not results:
            return "No results found."
        
        formatted = [f"Found {len(results)} results for '{query}':\n"]
        for i, r in enumerate(results, 1):
            formatted.append(f"\n{i}. **{r['title']}**")
            formatted.append(f"   {r['snippet']}")
            formatted.append(f"   {r['url']}")
        
        return "\n".join(formatted)
        
    except Exception as e:
        logger.error(f"Search failed: {e}")
        return f"Search error: {str(e)}"

# Scraper implementation
async def scrape_page(url: str, selector: str | None = None) -> str:
    """Scrape webpage content using Playwright"""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
        try:
            await page.goto(url, timeout=30000)
            await page.wait_for_load_state('networkidle')
            
            title = await page.title()
            
            if selector:
                element = await page.query_selector(selector)
                content = await element.inner_text() if element else "Selector not found"
            else:
                content = await page.inner_text('body')
            
            # Limit content length
            content = content[:8000] + "..." if len(content) > 8000 else content
            
            result = f"**{title}**\n\nURL: {url}\n\n{content}"
            return result
            
        except Exception as e:
            logger.error(f"Scraping failed: {e}")
            return f"Scraping error: {str(e)}"
        finally:
            await browser.close()

# MCP Tool definitions
@app.list_tools()
async def list_tools() -> list[Tool]:
    """List available MCP tools"""
    return [
        Tool(
            name="search_web",
            description="Search the web using DuckDuckGo. Returns titles, snippets, and URLs.",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    },
                    "max_results": {
                        "type": "number",
                        "description": "Maximum number of results (default: 5)",
                        "default": 5
                    }
                },
                "required": ["query"]
            }
        ),
        Tool(
            name="scrape_webpage",
            description="Extract content from a webpage. Can target specific elements with CSS selectors.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "selector": {
                        "type": "string",
                        "description": "Optional CSS selector to extract specific content"
                    }
                },
                "required": ["url"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: Any) -> list[TextContent]:
    """Handle tool execution"""
    try:
        if name == "search_web":
            query = arguments["query"]
            max_results = arguments.get("max_results", 5)
            result = await search_web(query, max_results)
            return [TextContent(type="text", text=result)]
        
        elif name == "scrape_webpage":
            url = arguments["url"]
            selector = arguments.get("selector")
            result = await scrape_page(url, selector)
            return [TextContent(type="text", text=result)]
        
        else:
            raise ValueError(f"Unknown tool: {name}")
    
    except Exception as e:
        logger.error(f"Tool execution failed: {e}")
        return [TextContent(
            type="text",
            text=f"Error executing {name}: {str(e)}"
        )]

async def main():
    """Run the MCP server"""
    logger.info("Starting WebSearch-Scraper MCP Server")
    
    async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            app.create_initialization_options()
        )

if __name__ == "__main__":
    asyncio.run(main())

Configuring Your MCP Server

To use your MCP server with Claude Desktop or other MCP clients, create a configuration file:

For Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "websearch-scraper": {
      "command": "python",
      "args": [
        "/path/to/your/mcp_server.py"
      ],
      "env": {}
    }
  }
}

Location:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

Testing Your MCP Server

Create a test script to verify functionality:

import asyncio

# Import the tool implementations from your server module
# (assumed to be saved as mcp_server.py, as in the config above)
from mcp_server import search_web, scrape_page

async def test_mcp_server():
    """Test MCP server locally"""
    
    # Test search
    print("Testing web search...")
    results = await search_web("Python MCP tutorial", 3)
    print(results)
    
    # Test scraper
    print("\n\nTesting webpage scraper...")
    content = await scrape_page("https://example.com")
    print(content[:500])

if __name__ == "__main__":
    asyncio.run(test_mcp_server())
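
For interactive end-to-end testing over the real protocol, you can also point the MCP Inspector (the project's Node-based debugging tool) at your server and call the tools from its web UI:

npx @modelcontextprotocol/inspector python mcp_server.py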

Advanced Features and Best Practices

1. Rate Limiting

Implement rate limiting to avoid overwhelming target servers:

import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, max_requests: int = 10, time_window: int = 60):
        self.max_requests = max_requests
        self.time_window = time_window
        self.requests = defaultdict(list)
    
    async def acquire(self, key: str):
        now = time.time()
        self.requests[key] = [
            t for t in self.requests[key] 
            if now - t < self.time_window
        ]
        
        if len(self.requests[key]) >= self.max_requests:
            raise Exception("Rate limit exceeded")
        
        self.requests[key].append(now)

limiter = RateLimiter(max_requests=10, time_window=60)
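
The limiter can then guard each tool before it touches the network; for example, a thin wrapper around the existing search implementation might look like this:

async def rate_limited_search(query: str, max_results: int = 5) -> str:
    """Reject the request early if the search rate limit is exhausted"""
    try:
        await limiter.acquire("search_web")
    except Exception as e:
        return f"Rate limit error: {e}"
    return await search_web(query, max_results)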

2. Caching

Add caching to improve performance:

import hashlib

# functools.lru_cache does not work with async functions (it would cache the
# coroutine object rather than the result), so use a simple in-memory dict
_search_cache: dict[str, str] = {}

async def cached_search(query: str, max_results: int) -> str:
    key = hashlib.sha256(f"{query}:{max_results}".encode()).hexdigest()
    if key not in _search_cache:
        _search_cache[key] = await search_web(query, max_results)
    return _search_cache[key]
3. Error Handling

Implement robust error handling:

from enum import Enum

class ErrorType(Enum):
    NETWORK_ERROR = "network_error"
    PARSE_ERROR = "parse_error"
    RATE_LIMIT = "rate_limit_exceeded"
    INVALID_INPUT = "invalid_input"

def handle_error(error: Exception, error_type: ErrorType) -> str:
    logger.error(f"{error_type.value}: {str(error)}")
    return f"Error ({error_type.value}): {str(error)}"

4. Input Validation

Validate user inputs before processing:

from urllib.parse import urlparse

def validate_url(url: str) -> bool:
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

def validate_query(query: str) -> bool:
    return len(query.strip()) > 0 and len(query) < 500
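
These checks belong at the top of call_tool, before any network activity; for example:

# Inside call_tool, before dispatching to the tool implementations
if name == "scrape_webpage" and not validate_url(arguments["url"]):
    return [TextContent(type="text", text="Error: invalid URL")]
if name == "search_web" and not validate_query(arguments["query"]):
    return [TextContent(type="text", text="Error: invalid or empty query")]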

Deployment Considerations

Using SSE Transport for Web Deployment

For web-based deployments, use SSE (Server-Sent Events) transport. If you’re considering serverless deployment, you might be interested in comparing AWS Lambda performance across JavaScript, Python, and Golang to make an informed decision about your runtime:

from mcp.server.sse import SseServerTransport
from starlette.applications import Starlette
from starlette.routing import Mount, Route
import uvicorn

# Transport that receives client POSTs on /messages/ and streams responses over SSE
sse = SseServerTransport("/messages/")

async def handle_sse(request):
    """Run the MCP server over an incoming SSE connection"""
    # request._send follows the pattern documented in the MCP Python SDK examples
    async with sse.connect_sse(request.scope, request.receive, request._send) as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

starlette_app = Starlette(
    routes=[
        Route("/sse", endpoint=handle_sse),
        Mount("/messages/", app=sse.handle_post_message),
    ]
)

if __name__ == "__main__":
    uvicorn.run(starlette_app, host="0.0.0.0", port=8000)

AWS Lambda Deployment

MCP servers can also be deployed as AWS Lambda functions, particularly when using the SSE transport rather than stdio; the Lambda runtime comparison linked above can help you weigh Python against other runtimes for that setup.

Docker Deployment

Create a Dockerfile for containerized deployment:

FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN playwright install chromium
RUN playwright install-deps

# Copy application
COPY mcp_server.py .

CMD ["python", "mcp_server.py"]

Performance Optimization

Async Operations

Use asyncio for concurrent operations:

async def search_multiple_queries(queries: list[str]) -> list[str]:
    """Search multiple queries concurrently"""
    tasks = [search_web(query) for query in queries]
    results = await asyncio.gather(*tasks)
    return results

Connection Pooling

Reuse connections for better performance:

import aiohttp

session = None

async def get_session():
    global session
    if session is None:
        session = aiohttp.ClientSession()
    return session

async def fetch_url(url: str) -> str:
    session = await get_session()
    async with session.get(url) as response:
        return await response.text()
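
Because the session is a module-level singleton, close it when the server shuts down so connections are not leaked:

async def close_session():
    """Close the shared aiohttp session, e.g. on server shutdown"""
    global session
    if session is not None:
        await session.close()
        session = None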

Security Best Practices

  1. Input Sanitization: Always validate and sanitize user inputs
  2. URL Whitelisting: Consider implementing URL whitelisting for scraping (see the sketch after this list)
  3. Timeout Controls: Set appropriate timeouts to prevent resource exhaustion
  4. Content Limits: Limit the size of scraped content
  5. Authentication: Implement authentication for production deployments
  6. HTTPS: Use HTTPS for SSE transport in production
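
A whitelist check for the scraper can be a few lines; the allowed domains below are placeholders:

from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "docs.python.org"}  # placeholder domains

def is_allowed(url: str) -> bool:
    """Allow only URLs whose host matches an approved domain"""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)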

Working with Different LLM Providers

While MCP was developed by Anthropic for Claude, the protocol is designed to work with any LLM. If you’re building MCP servers that interact with multiple AI providers and need structured outputs, you’ll want to review our comparison of structured output across popular LLM providers including OpenAI, Gemini, Anthropic, Mistral, and AWS Bedrock.

Conclusion

Building MCP servers in Python opens up powerful possibilities for extending AI assistants with custom tools and data sources. The web search and scraping capabilities demonstrated here are just the beginning—you can extend this foundation to integrate databases, APIs, file systems, and virtually any external system.

The Model Context Protocol is still evolving, but its standardized approach to AI tool integration makes it an exciting technology for developers building the next generation of AI-powered applications. Whether you’re creating internal tools for your organization or building public MCP servers for the community, Python provides an excellent foundation for rapid development and deployment.