In this article, we’re going to learn how to create a web scraping tool using Python. Web scraping is like sending a robot to a website to gather information, such as links, images, or text. In this project, we will focus on gathering links from a website. Essentially, we will create a Python tool that extracts all the links from a given website.
Let’s break it down from the beginning and go step by step so that you can understand how it works!
What is Web Scraping?
Imagine you’re visiting a website, and you want to collect all the links that are on the page. Instead of manually copying every single link, a web scraper is a program that does it for you automatically. Web scraping helps gather data from the internet quickly.
Tools We Will Use
For this project, we’ll need two important Python tools:
- Requests: This library helps us connect to websites.
- BeautifulSoup: This tool helps us read and understand the webpage content, like finding links or text.
These tools make it easy for our program to visit a website and find what we need.
Now, let’s walk through the steps to create a web scraper that extracts all the links (both internal and external) from a given website.
Step 1: Install Necessary Libraries
Before we write any code, we need to make sure you have Python installed on your computer. After that, you need to install the libraries we talked about.
- To install requests, type this in your terminal or command prompt:
pip install requests
- To install BeautifulSoup, you need to install it as part of a library called
beautifulsoup4
:
pip install beautifulsoup4
Now we’re ready to write some code!
Step 2: Write the Python Script
Here is the code that will help you scrape all the links from a webpage:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import socket
def validate_url(url):
# Check if the URL has a scheme (http or https), if not, add 'https://'
parsed_url = urlparse(url)
if not parsed_url.scheme:
print(f"Invalid URL '{url}': No scheme supplied. Adding 'https://' automatically.")
url = 'https://' + url
return url
def is_valid_domain(url):
try:
# Extract domain and check if it can be resolved
domain = urlparse(url).netloc
socket.gethostbyname(domain) # Resolves the domain
return True
except socket.gaierror:
return False
def normalize_domain(domain):
# Removes 'www.' for consistency in domain comparison
return domain.lower().lstrip('www.')
def extract_links(url):
try:
# Send a request to fetch the content of the webpage
response = requests.get(url)
response.raise_for_status() # Check for request errors
# Parse the page content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Sets to store unique internal and external links
internal_links = set()
external_links = set()
# Normalize the base domain
base_domain = normalize_domain(urlparse(url).netloc)
# Find all anchor tags
for link in soup.find_all('a', href=True):
href = link['href']
full_url = urljoin(url, href) # Handle relative URLs
# Check if the full_url is a proper URL
parsed_full_url = urlparse(full_url)
if parsed_full_url.scheme in ['http', 'https']:
normalized_link_domain = normalize_domain(parsed_full_url.netloc)
# Compare normalized domains to classify as internal or external
if normalized_link_domain == base_domain:
internal_links.add(full_url)
else:
external_links.add(full_url)
return internal_links, external_links
except requests.exceptions.RequestException as e:
print(f"Error fetching the page: {e}")
return set(), set()
def display_links(internal_links, external_links):
total_links = len(internal_links) + len(external_links)
print(f"Found {total_links} links.")
print(f"\nInternal links ({len(internal_links)}):")
for idx, link in enumerate(internal_links, 1):
print(f"{idx}. {link}")
print(f"\nExternal links ({len(external_links)}):")
for idx, link in enumerate(external_links, 1):
print(f"{idx}. {link}")
def save_links_to_file(internal_links, external_links, internal_file, external_file):
with open(internal_file, 'w') as internal_f:
for link in internal_links:
internal_f.write(link + '\n')
with open(external_file, 'w') as external_f:
for link in external_links:
external_f.write(link + '\n')
print(f"\nLinks have been saved to '{internal_file}' and '{external_file}'.")
# Main function to execute the program
if __name__ == "__main__":
url = input("Enter the URL of the webpage: ")
# Validate and normalize the URL
url = validate_url(url)
# Check if the domain is valid
if not is_valid_domain(url):
print(f"Error: Could not resolve the domain of '{url}'. Please check the URL.")
else:
print("Extracting links...")
internal_file = 'internal_links.txt'
external_file = 'external_links.txt'
# Extract and process the links
internal_links, external_links = extract_links(url)
# Display the links in a structured format
display_links(internal_links, external_links)
# Save the links to files
save_links_to_file(internal_links, external_links, internal_file, external_file)
Step 3: How the Script Works
Now let’s break the code down into simple steps:
- Input the URL: When you run the program, it asks for a website link. It then checks if you’ve added
https://
to the link. If not, it adds it for you. - Validate the Domain: It makes sure that the website is real. If it can’t find the website, it will tell you there is a problem.
- Find the Links: The script visits the website and finds all the links using the
BeautifulSoup
tool. It checks whether each link belongs to the same website (internal) or another website (external). - Show the Links: It will print the total number of links found and divide them into internal and external links.
- Save the Links to Files: The program will save all the internal links into a file called
internal_links.txt
and the external links intoexternal_links.txt
.
Step 4: Test the Script
You can test the script by running it from your terminal or command prompt. Simply save the script above with a .py
extension (for example: web_scraper.py
) and then run it.
python web_scraper.py
After you enter a URL, it will show the links on the page and save them to text files.
Conclusion
Now you know how to build a web scraping tool in Python! This simple program can collect links from any website, categorize them as internal or external, and save them for later use. I hope this article helps you understand how to extract links from a website. Thank you for reading! For more Python projects, click here.