In this article, we’re going to learn how to create a web scraping tool using Python. Web scraping is like sending a robot to a website to gather information such as links, images, or text. In this project, we will focus on links: we’ll build a Python tool that extracts all the links from a given webpage.

Let’s break it down from the beginning and go step by step so that you can understand how it works!

What is Web Scraping?

Imagine you’re visiting a website and you want to collect all the links on the page. Instead of manually copying every single link, you can use a web scraper, a program that gathers them for you automatically. Web scraping helps you collect data from the internet quickly.

Tools We Will Use

For this project, we’ll need two important Python tools:

  1. Requests: This library sends HTTP requests so our program can download a webpage’s content.
  2. BeautifulSoup: This library parses the downloaded HTML so we can find things inside it, such as links or text.

These tools make it easy for our program to visit a website and find what we need.
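
To see how the two fit together before we build the full tool, here is a minimal sketch. It fetches a page with requests and inspects it with BeautifulSoup; the address https://example.com is just a placeholder, so swap in any page you like:

import requests
from bs4 import BeautifulSoup

# Download the page (example.com is only a placeholder address)
response = requests.get("https://example.com", timeout=10)

# Parse the HTML and peek at a couple of elements
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string)           # the page title
print(soup.find("a", href=True))   # the first link tag, if there is one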

Now, let’s walk through the steps to create a web scraper that extracts all the links (both internal and external) from a given website.

Step 1: Install Necessary Libraries

Before we write any code, we need to make sure you have Python installed on your computer. After that, you need to install the libraries we talked about.

  • To install requests, type this in your terminal or command prompt:
pip install requests

  • To install BeautifulSoup, install the beautifulsoup4 package (this is the name BeautifulSoup is distributed under):
pip install beautifulsoup4
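
If you want to confirm that both packages installed correctly, a quick check like this should print their version numbers without raising an ImportError:

import requests
import bs4

# If either import fails, the corresponding package is not installed
print(requests.__version__)
print(bs4.__version__)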

Now we’re ready to write some code!

Step 2: Write the Python Script

Here is the code that will help you scrape all the links from a webpage:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import socket

def validate_url(url):
    # Check if the URL has a scheme (http or https), if not, add 'https://'
    parsed_url = urlparse(url)
    if not parsed_url.scheme:
        print(f"Invalid URL '{url}': No scheme supplied. Adding 'https://' automatically.")
        url = 'https://' + url
    return url

def is_valid_domain(url):
    try:
        # Extract domain and check if it can be resolved
        domain = urlparse(url).netloc
        socket.gethostbyname(domain)  # Resolves the domain
        return True
    except socket.gaierror:
        return False

def normalize_domain(domain):
    # Remove a leading 'www.' for consistency in domain comparison
    domain = domain.lower()
    return domain[4:] if domain.startswith('www.') else domain

def extract_links(url):
    try:
        # Send a request to fetch the content of the webpage (the timeout avoids hanging forever)
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Check for request errors

        # Parse the page content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Sets to store unique internal and external links
        internal_links = set()
        external_links = set()

        # Normalize the base domain
        base_domain = normalize_domain(urlparse(url).netloc)

        # Find all anchor tags
        for link in soup.find_all('a', href=True):
            href = link['href']
            full_url = urljoin(url, href)  # Handle relative URLs

            # Check if the full_url is a proper URL
            parsed_full_url = urlparse(full_url)
            if parsed_full_url.scheme in ['http', 'https']:
                normalized_link_domain = normalize_domain(parsed_full_url.netloc)

                # Compare normalized domains to classify as internal or external
                if normalized_link_domain == base_domain:
                    internal_links.add(full_url)
                else:
                    external_links.add(full_url)

        return internal_links, external_links

    except requests.exceptions.RequestException as e:
        print(f"Error fetching the page: {e}")
        return set(), set()

def display_links(internal_links, external_links):
    total_links = len(internal_links) + len(external_links)
    print(f"Found {total_links} links.")

    print(f"\nInternal links ({len(internal_links)}):")
    for idx, link in enumerate(internal_links, 1):
        print(f"{idx}. {link}")

    print(f"\nExternal links ({len(external_links)}):")
    for idx, link in enumerate(external_links, 1):
        print(f"{idx}. {link}")

def save_links_to_file(internal_links, external_links, internal_file, external_file):
    with open(internal_file, 'w') as internal_f:
        for link in internal_links:
            internal_f.write(link + '\n')

    with open(external_file, 'w') as external_f:
        for link in external_links:
            external_f.write(link + '\n')

    print(f"\nLinks have been saved to '{internal_file}' and '{external_file}'.")

# Main function to execute the program
if __name__ == "__main__":
    url = input("Enter the URL of the webpage: ")

    # Validate and normalize the URL
    url = validate_url(url)

    # Check if the domain is valid
    if not is_valid_domain(url):
        print(f"Error: Could not resolve the domain of '{url}'. Please check the URL.")
    else:
        print("Extracting links...")

        internal_file = 'internal_links.txt'
        external_file = 'external_links.txt'

        # Extract and process the links
        internal_links, external_links = extract_links(url)

        # Display the links in a structured format
        display_links(internal_links, external_links)

        # Save the links to files
        save_links_to_file(internal_links, external_links, internal_file, external_file)

Step 3: How the Script Works

Now let’s break the code down into simple steps:

  1. Input the URL: When you run the program, it asks for a website link. It then checks whether the link includes a scheme such as https://. If not, it adds https:// for you.
  2. Validate the Domain: It checks that the website’s domain actually exists by trying to resolve it. If it can’t be resolved, the program tells you there is a problem.
  3. Find the Links: The script downloads the page and finds all the anchor tags using BeautifulSoup. It then checks whether each link belongs to the same website (internal) or another website (external); see the short sketch after this list.
  4. Show the Links: It prints the total number of links found and lists them separately as internal and external links.
  5. Save the Links to Files: The program saves all the internal links into a file called internal_links.txt and the external links into external_links.txt.
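
To make step 3 a little more concrete, here is a small standalone sketch. The page address and the two href values are made up purely for illustration; it shows how urljoin resolves a relative link and how comparing normalized domains classifies a link as internal or external:

from urllib.parse import urljoin, urlparse

def normalize_domain(domain):
    # Same idea as in the script: drop a leading 'www.' before comparing
    domain = domain.lower()
    return domain[4:] if domain.startswith('www.') else domain

page_url = "https://example.com/blog"   # hypothetical page being scraped
base_domain = normalize_domain(urlparse(page_url).netloc)

# One relative href and one absolute href, as they might appear on the page
for href in ["/about", "https://www.python.org/"]:
    full_url = urljoin(page_url, href)  # turns '/about' into a full URL
    link_domain = normalize_domain(urlparse(full_url).netloc)
    label = "internal" if link_domain == base_domain else "external"
    print(f"{full_url} -> {label}")

Running this prints https://example.com/about as internal and https://www.python.org/ as external, which is exactly the decision extract_links makes for every anchor tag it finds.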

Step 4: Test the Script

You can test the script by running it from your terminal or command prompt. Simply save the script above with a .py extension (for example: web_scraper.py) and then run it.

python web_scraper.py

After you enter a URL, it will show the links on the page and save them to text files.
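
For reference, a run against a made-up site might look roughly like this. The URL, counts, and links below are purely illustrative; your output will depend on the page you scrape:

Enter the URL of the webpage: example.com
Invalid URL 'example.com': No scheme supplied. Adding 'https://' automatically.
Extracting links...
Found 5 links.

Internal links (2):
1. https://example.com/about
2. https://example.com/contact

External links (3):
1. https://www.python.org/
2. https://pypi.org/project/requests/
3. https://www.crummy.com/software/BeautifulSoup/

Links have been saved to 'internal_links.txt' and 'external_links.txt'.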

Conclusion

Now you know how to build a web scraping tool in Python! This simple program collects the links from a webpage, categorizes them as internal or external, and saves them for later use. I hope this article helps you understand how to extract links from a website. Thank you for reading!

Author

Hi, I'm Yagyavendra Tiwari, a computer engineer with a strong passion for programming. I'm excited to share my programming knowledge with everyone here and help educate others in this field.
