Python Web Scraping: A Beginner’s Guide to Scraping Job Listings
Are you tired of manually scrolling through endless job postings to find the perfect opportunity? Look no further! In this tutorial, I’ll walk you through the steps of scraping job listings from JobInventory.com using Python.
First, we'll use the requests library to send a GET request to the website and retrieve the HTML content. Then, we'll use BeautifulSoup to parse the HTML and extract the relevant job listing information, such as the job title, company name, and job description.
Next, we’ll explore how to use Python’s regular expressions module to clean up the extracted data and prepare it for further analysis. We’ll also cover how to store the data in a CSV file for future use.
But wait, there's more! We'll also dive into handling pagination, so you can collect results across multiple pages rather than just the first.
So, grab a cup of coffee and get ready to dive into the world of web scraping with Python. By the end of this tutorial, you’ll have the skills to scrape job listings on JobInventory.com and beyond!
Setup
To get started with scraping job listings, we’ll first need to install a few Python packages. Open up your terminal or command prompt and run the following command:
pip install requests beautifulsoup4 pandas
These packages will allow us to send HTTP requests, parse HTML, and store the scraped data in a CSV file.
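To confirm the installation, you can print each package's version from a Python shell (a quick, optional sanity check):

import requests
import bs4
import pandas as pd

print(requests.__version__, bs4.__version__, pd.__version__)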
Scraping a single page
Now that we have our dependencies installed, let’s dive into the code.
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
# Define the search query and location
search_query = "data scientist"
location = "New York City, NY"
# Construct the URL
url = f"http://www.jobinventory.com/search?q={search_query}&l={location}"
# Send a GET request to the URL
response = requests.get(url, timeout=30)  # a timeout guards against a hung connection
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
# Find all the job listings on the page
job_listings = soup.find_all("li", class_="resultBlock")
# Define empty lists to store the job details
titles = []
companies = []
locations = []
descriptions = []
# Loop through each job listing and extract the relevant details
for job in job_listings:
    title = job.find("div", class_="title").text.strip()
    company = job.find("span", class_="company").text.strip()
    location = job.find("div", class_="state").text.split("\xa0-\xa0")[-1].strip()
    description = job.find("div", class_="description").text.strip()
    titles.append(title)
    companies.append(company)
    locations.append(location)
    descriptions.append(description)
# Clean up the job descriptions using regular expressions
regex = re.compile(r"\s+")
# Collapse whitespace runs, then keep the text after the first " - " separator
# (maxsplit=1 with [-1] avoids an IndexError when no separator is present)
clean_descriptions = [regex.sub(" ", d).split(" - ", 1)[-1] for d in descriptions]
# Create a Pandas DataFrame to store the job details
df = pd.DataFrame(
    {
        "Title": titles,
        "Company": companies,
        "Location": locations,
        "Description": clean_descriptions,
    }
)
# Export the DataFrame to a CSV file
df.to_csv("job_listings.csv", index=False)
print("Scraping complete! The results are saved in 'job_listings.csv'.")
df
Scraping complete! The results are saved in 'job_listings.csv'.
|   | Title | Company | Location | Description |
|---|---|---|---|---|
0 | Lead Data Scientist | Tiro | New York, NY | Lead Data Scientist Enigma is seekingand visua... |
1 | Data Scientist | Smith Hanley Associates | New York, NY | Title: Data Scientist Location: Newengineering... |
2 | Data Scientist | Averity | New York, NY | like to become a Data Scientist at a global in... |
3 | Data Scientist | Revelio Labs | New York, NY | for: Revelio Labs is looking for a creative Se... |
4 | Lead Data Scientist | Thomas | New York, NY | looking for a Lead Data Scientist to lead and ... |
5 | Data Scientist | Eliassen Group | Jersey City, NJ | The client is seeking a Neo4j data scientist/e... |
6 | Sr. Data Scientist | CVS | New York, NY | hiring for the following role in New York, NY:... |
7 | Data Scientist | E-Frontiers | New York, NY | Data Scientist The Company is aExperience in a... |
8 | Staff Data Scientist | Harnham | New York, NY | Staff Data Scientist AdTech Companyimplement. ... |
9 | Data Scientist, Modeling | Gro Intelligence | New York, NY | addresses agriculture, food, and our climate o... |
10 | Senior Data Scientist | Teachers Insurance and Annuity Association - TIAA | New York, NY | reporting, interpretation of data analyses to ... |
11 | Lead NLP Data Scientist - Remote | WFH | Get It Recruit - Real Estate | Jersey City, NJ | of data science! We are looking for a talented... |
12 | Senior Research Scientist | NYU Langone Health | New York, NY | Senior Research Scientist will help manage, pr... |
13 | Senior Data Scientist | Equation Staffing | New York, NY | B2B. They are looking for a Senior Data Scient... |
14 | Assistant Research Scientist | NYU Langone Health | New York, NY | Investigator. The Research Scientist will mana... |
15 | Principal Data Scientist | Harnham | New York, NY | Principal Data Scientist AdTech StartupLead th... |
16 | Senior Data Scientist | Oliwska Grupa Konsultingowa | New York, NY | for an experienced applied data scientist to j... |
17 | Senior Data Scientist | Storm3 | New York, NY | achieving faster outcomes. We are seeking a dr... |
18 | Data Scientist | Verizon | New York, NY | in a complex, multi-functional, Agile team env... |
19 | Data Scientist Series, MTA Data & Analytics | MTA | New York, NY | Data Scientist Series, MTA012 479 3 Senior Dat... |
In this code, we first define the search query and location variables. We then construct the URL by concatenating these variables with the base URL of JobInventory.com.
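As an aside, the f-string leaves the spaces and commas in the query unencoded. The requests library re-quotes the URL for you, so the code works as written, but you can make the encoding explicit by passing the query as params; this sketch is equivalent to the f-string approach above:

# Let requests handle the URL encoding of the query parameters
response = requests.get(
    "http://www.jobinventory.com/search",
    params={"q": search_query, "l": location},
    timeout=30,
)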
We then send a GET request to the URL using the requests library and parse the HTML content using BeautifulSoup. We find all the job listings on the page by searching for the li elements with the resultBlock class.
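If you want to check what a single listing block looks like before writing extraction code, you can pretty-print the first match (assuming the page returned at least one result):

# Peek at the first listing's HTML to verify the selectors
if job_listings:
    print(job_listings[0].prettify()[:500])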
Next, we define empty lists to store the job details and loop through each job listing, extracting the relevant details using the find method. We append the extracted details to their respective lists.
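One caveat: if a listing is missing one of these elements, find returns None and calling .text raises an AttributeError. A small helper along these lines (a sketch, not part of the code above) makes the extraction more defensive:

def safe_text(tag, name, class_):
    """Return the stripped text of a child element, or '' if it is missing."""
    element = tag.find(name, class_=class_)
    return element.text.strip() if element else ""

# Usage inside the loop, e.g.:
# title = safe_text(job, "div", "title")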
To clean up the job descriptions, we compile a regular expression pattern that matches one or more whitespace characters and use the sub method to replace each run with a single space. Splitting on the first " - " then drops the prefix (typically the job's location) that precedes each description.
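To see what this cleanup does, here it is applied to a made-up description string (the input is illustrative, not real site data):

import re

regex = re.compile(r"\s+")
raw = "New York, NY\xa0-\xa0We are   hiring\na data scientist"
print(regex.sub(" ", raw).split(" - ", 1)[-1])
# Output: We are hiring a data scientist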
Finally, we create a Pandas DataFrame to store the job details and export it to a CSV file using the to_csv method.
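You can also read the file straight back with pandas to confirm the export worked:

# Reload the CSV as a quick sanity check
df_check = pd.read_csv("job_listings.csv")
print(df_check.shape)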
And there you have it! With just a few lines of Python code, you can scrape job listings from JobInventory.com and store them in a CSV file for further analysis.
But what if there are multiple pages of job listings?
Scraping multiple pages
We can handle pagination by modifying our code as follows:
import requests
from bs4 import BeautifulSoup
import re
import time
import pandas as pd
# Define the search query and location
search_query = "data scientist"
location = "New York City, NY"
# Construct the base URL
base_url = "http://www.jobinventory.com"
# Define empty lists to store the job details
titles = []
companies = []
locations = []
descriptions = []
# Loop through each page of job listings
max_pages = 5
page_num = 1
while page_num <= max_pages:
    # Construct the URL for the current page
    url = f"{base_url}/search?q={search_query}&l={location}&start={page_num}"
    # Send a GET request to the URL
    response = requests.get(url, timeout=30)
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")
    # Find all the job listings on the page
    job_listings = soup.find_all("li", class_="resultBlock")
    # If there are no job listings on the current page, we have reached the
    # end of the results
    if not job_listings:
        break
    # Loop through each job listing and extract the relevant details
    for job in job_listings:
        title = job.find("div", class_="title").text.strip()
        company = job.find("span", class_="company").text.strip()
        # Use a distinct name here: reusing "location" would overwrite the
        # search location that the URL above depends on
        job_location = job.find("div", class_="state").text.split("\xa0-\xa0")[-1].strip()
        description = job.find("div", class_="description").text.strip()
        titles.append(title)
        companies.append(company)
        locations.append(job_location)
        descriptions.append(description)
    # Pause briefly between pages to be polite to the server
    time.sleep(1)
    # Increment the page number
    page_num += 1
# Clean up the job descriptions using regular expressions
regex = re.compile(r"\s+")
# Collapse whitespace runs, then keep the text after the first " - " separator
clean_descriptions = [regex.sub(" ", d).split(" - ", 1)[-1] for d in descriptions]
# Create a Pandas DataFrame to store the job details
df = pd.DataFrame(
    {
        "Title": titles,
        "Company": companies,
        "Location": locations,
        "Description": clean_descriptions,
    }
)
# Export the DataFrame to a CSV file
df.to_csv("job_listings_multiple.csv", index=False)
print("Scraping complete! Check 'job_listings_multiple.csv' for the results.")
df
Scraping complete! Check 'job_listings_multiple.csv' for the results.
|   | Title | Company | Location | Description |
|---|---|---|---|---|
0 | Lead Data Scientist | Tiro | New York, NY | Lead Data Scientist Enigma is seekingand visua... |
1 | Data Scientist | Smith Hanley Associates | New York, NY | Title: Data Scientist Location: Newengineering... |
2 | Data Scientist | Averity | New York, NY | like to become a Data Scientist at a global in... |
3 | Data Scientist | Revelio Labs | New York, NY | for: Revelio Labs is looking for a creative Se... |
4 | Lead Data Scientist | Thomas | New York, NY | looking for a Lead Data Scientist to lead and ... |
... | ... | ... | ... | ... |
95 | Senior Data Scientist | Wonder | New York, NY | written and verbal) to collaborate with busine... |
96 | Data Scientist, Product Experimentation | Captions | New York, NY | , or a related discipline. * 3-5 years of prov... |
97 | Staff Data Scientist, Marketplace | CookUnity | New York, NY | with engineering. * Provide mentorship and gui... |
98 | Sr. Product Data Scientist (NY) | Philo | New York, NY | streaming service. You'll be working closely w... |
99 | Staff Data Scientist, Marketplace | CookUnity | New York, NY | with engineering. * Provide mentorship and gui... |
100 rows × 4 columns
In this modified code, we first define the search query and location variables, as well as the base URL of JobInventory.com. We also define empty lists to store the job details.
We then loop through each page of job listings, up to a maximum of five pages, by incrementing the start parameter of the URL. If a page contains no job listings, we have reached the end of the results and break out of the loop.
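To make the pagination scheme concrete, this snippet shows the URL the loop requests for each page (printed rather than fetched):

# Preview the URL for each page of results
for page_num in range(1, max_pages + 1):
    print(f"{base_url}/search?q={search_query}&l={location}&start={page_num}")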
We then loop through each job listing on the current page, extracting the relevant details using the find method and appending them to their respective lists. Note that inside this loop we store the job's location in job_location, since reusing the name location would overwrite the search location used to build each page's URL.
After we have scraped all the job listings, we clean up the job descriptions using regular expressions, create a Pandas DataFrame to store the job details, and export it to a CSV file.
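One thing to watch for: pagination can surface the same posting more than once (rows 97 and 99 in the output above are identical), so it is worth dropping duplicates before exporting:

# Drop postings that appear on more than one page
df = df.drop_duplicates().reset_index(drop=True)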
And there you have it! With these modifications, we can scrape job listings from JobInventory.com across multiple pages.
Happy scraping!