
Python Web Scraping: A Beginner’s Guide to Scraping Job Listings

Are you tired of manually scrolling through endless job postings to find the perfect opportunity? Look no further! In this tutorial, I’ll walk you through the steps of scraping job listings from JobInventory.com using Python.

First, we’ll use the requests library to send a GET request to the website and retrieve the HTML content. Then, we’ll use BeautifulSoup to parse the HTML and extract the relevant job listing information, such as the job title, company name, and job description.

Next, we’ll explore how to use Python’s regular expressions module to clean up the extracted data and prepare it for further analysis. We’ll also cover how to store the data in a CSV file for future use.

But wait, there's more! We'll also dive into handling pagination so you can capture every page of results. And, as a bonus, we'll show you how to extract keywords from the job descriptions with a simple natural-language-processing pass.

So, grab a cup of coffee and get ready to dive into the world of web scraping with Python. By the end of this tutorial, you’ll have the skills to scrape job listings on JobInventory.com and beyond!

Setup

To get started with scraping job listings, we’ll first need to install a few Python packages. Open up your terminal or command prompt and run the following command:

Shell
pip install requests beautifulsoup4 pandas

These packages will allow us to send HTTP requests, parse HTML, and store the scraped data in a CSV file.
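
To confirm the installs worked, you can print each library's version (a quick sanity check, nothing more):

Python
import bs4
import pandas
import requests

# Each of these should print a version rather than raising ImportError
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
print("pandas", pandas.__version__)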

Scraping a single page

Now that we have our dependencies installed, let’s dive into the code.

Python
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# Define the search query and location
search_query = "data scientist"
location = "New York City, NY"

# Construct the URL
url = f"http://www.jobinventory.com/search?q={search_query}&l={location}"

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Find all the job listings on the page
job_listings = soup.find_all("li", class_="resultBlock")

# Define empty lists to store the job details
titles = []
companies = []
locations = []
descriptions = []

# Loop through each job listing and extract the relevant details
for job in job_listings:
    title = job.find("div", class_="title").text.strip()
    company = job.find("span", class_="company").text.strip()
    job_location = (
        job.find("div", class_="state").text.split("\xa0-\xa0")[-1].strip()
    )
    description = job.find("div", class_="description").text.strip()

    titles.append(title)
    companies.append(company)
    locations.append(job_location)
    descriptions.append(description)

# Collapse runs of whitespace into single spaces, then keep the text after
# the first " - " separator (fall back to the full string if it's absent)
regex = re.compile(r"\s+")
clean_descriptions = [regex.sub(" ", d).split(" - ", 1)[-1] for d in descriptions]

# Create a Pandas DataFrame to store the job details
df = pd.DataFrame(
    {
        "Title": titles,
        "Company": companies,
        "Location": locations,
        "Description": clean_descriptions,
    }
)

# Export the DataFrame to a CSV file
df.to_csv("job_listings.csv", index=False)

print("Scraping complete! The results are saved in \"job_listings.csv\".")

df
Scraping complete! The results are saved in 'job_listings.csv'.
    Title | Company | Location | Description
 0  Lead Data Scientist | Tiro | New York, NY | Lead Data Scientist Enigma is seekingand visua...
 1  Data Scientist | Smith Hanley Associates | New York, NY | Title: Data Scientist Location: Newengineering...
 2  Data Scientist | Averity | New York, NY | like to become a Data Scientist at a global in...
 3  Data Scientist | Revelio Labs | New York, NY | for: Revelio Labs is looking for a creative Se...
 4  Lead Data Scientist | Thomas | New York, NY | looking for a Lead Data Scientist to lead and ...
 5  Data Scientist | Eliassen Group | Jersey City, NJ | The client is seeking a Neo4j data scientist/e...
 6  Sr. Data Scientist | CVS | New York, NY | hiring for the following role in New York, NY:...
 7  Data Scientist | E-Frontiers | New York, NY | Data Scientist The Company is aExperience in a...
 8  Staff Data Scientist | Harnham | New York, NY | Staff Data Scientist AdTech Companyimplement. ...
 9  Data Scientist, Modeling | Gro Intelligence | New York, NY | addresses agriculture, food, and our climate o...
10  Senior Data Scientist | Teachers Insurance and Annuity Association - TIAA | New York, NY | reporting, interpretation of data analyses to ...
11  Lead NLP Data Scientist - Remote | WFH | Get It Recruit - Real Estate | Jersey City, NJ | of data science! We are looking for a talented...
12  Senior Research Scientist | NYU Langone Health | New York, NY | Senior Research Scientist will help manage, pr...
13  Senior Data Scientist | Equation Staffing | New York, NY | B2B. They are looking for a Senior Data Scient...
14  Assistant Research Scientist | NYU Langone Health | New York, NY | Investigator. The Research Scientist will mana...
15  Principal Data Scientist | Harnham | New York, NY | Principal Data Scientist AdTech StartupLead th...
16  Senior Data Scientist | Oliwska Grupa Konsultingowa | New York, NY | for an experienced applied data scientist to j...
17  Senior Data Scientist | Storm3 | New York, NY | achieving faster outcomes. We are seeking a dr...
18  Data Scientist | Verizon | New York, NY | in a complex, multi-functional, Agile team env...
19  Data Scientist Series, MTA Data & Analytics | MTA | New York, NY | Data Scientist Series, MTA012 479 3 Senior Dat...

In this code, we first define the search query and location variables. We then build the request URL by interpolating them into JobInventory.com's base search URL with an f-string.

We then send a GET request to the URL using the requests library and parse the HTML content using BeautifulSoup. We find all the job listings on the page by searching for the li elements with the resultBlock class.

Next, we define empty lists to store the job details and loop through each job listing, extracting the relevant details using the find method. We append the extracted details to their respective lists.
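
For reference, here is a simplified sketch of the markup those find calls assume, with the extraction run against it. The class names match the ones used above; the surrounding structure is illustrative, and the live site's HTML may differ in its details:

Python
from bs4 import BeautifulSoup

# Illustrative markup only; the real page wraps listings in more structure
sample_html = """
<li class="resultBlock">
  <div class="title">Data Scientist</div>
  <span class="company">Acme Analytics</span>
  <div class="state">Featured\xa0-\xa0New York, NY</div>
  <div class="description">Build models and dashboards for clients.</div>
</li>
"""

job = BeautifulSoup(sample_html, "html.parser").find("li", class_="resultBlock")
print(job.find("div", class_="title").text.strip())     # Data Scientist
print(job.find("span", class_="company").text.strip())  # Acme Analytics
print(job.find("div", class_="state").text.split("\xa0-\xa0")[-1].strip())  # New York, NY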

To clean up the job descriptions, we compile a regular expression that matches one or more whitespace characters and use its sub method to collapse each run into a single space. We then split on the first " - " and keep everything after it, which drops the prefix that precedes the actual description text.
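
Here is that cleanup applied to a made-up sample string, so you can see both steps in isolation:

Python
import re

# Collapse any run of whitespace (spaces, tabs, newlines) into one space
regex = re.compile(r"\s+")

messy = "New York, NY - Senior   Data\nScientist\twith   Python experience"
print(regex.sub(" ", messy).split(" - ", 1)[-1])
# Senior Data Scientist with Python experience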

Finally, we create a Pandas DataFrame to store the job details and export it to a CSV file using the to_csv method.
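
If you want to sanity-check the export, read the file straight back in:

Python
import pandas as pd

# Reload the exported file and confirm the shape and columns survived
df_check = pd.read_csv("job_listings.csv")
print(df_check.shape)
print(df_check.columns.tolist())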

And there you have it! With just a few lines of Python code, you can scrape job listings from JobInventory.com and store them in a CSV file for further analysis.

But what if there are multiple pages of job listings?

Scraping multiple pages

We can handle pagination by modifying our code as follows:

Python
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# Define the search query and location
search_query = "data scientist"
location = "New York City, NY"

# Construct the base URL
base_url = "http://www.jobinventory.com"

# Define empty lists to store the job details
titles = []
companies = []
locations = []
descriptions = []

# Loop through each page of job listings
max_pages = 5
page_num = 1

while page_num <= max_pages:
    # Construct the URL for the current page
    url = f"{base_url}/search?q={search_query}&l={location}&start={page_num}"

    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")

    # Find all the job listings on the page
    job_listings = soup.find_all("li", class_="resultBlock")

    # If there are no job listings on the current page, we have reached the
    # end of the results
    if not job_listings:
        break

    # Loop through each job listing and extract the relevant details; the
    # per-job location gets its own name so it doesn't clobber the search
    # location used in the URL above
    for job in job_listings:
        title = job.find("div", class_="title").text.strip()
        company = job.find("span", class_="company").text.strip()
        job_location = (
            job.find("div", class_="state").text.split("\xa0-\xa0")[-1].strip()
        )
        description = job.find("div", class_="description").text.strip()

        titles.append(title)
        companies.append(company)
        locations.append(job_location)
        descriptions.append(description)

    # Increment the page number
    page_num += 1

# Collapse runs of whitespace into single spaces, then keep the text after
# the first " - " separator (fall back to the full string if it's absent)
regex = re.compile(r"\s+")
clean_descriptions = [regex.sub(" ", d).split(" - ", 1)[-1] for d in descriptions]

# Create a Pandas DataFrame to store the job details
df = pd.DataFrame(
    {
        "Title": titles,
        "Company": companies,
        "Location": locations,
        "Description": clean_descriptions,
    }
)

# Export the DataFrame to a CSV file
df.to_csv("job_listings_multiple.csv", index=False)

print("Scraping complete! Check 'job_listings_multiple.csv' for the results.")

df
Scraping complete! Check 'job_listings_multiple.csv' for the results.
    Title | Company | Location | Description
 0  Lead Data Scientist | Tiro | New York, NY | Lead Data Scientist Enigma is seekingand visua...
 1  Data Scientist | Smith Hanley Associates | New York, NY | Title: Data Scientist Location: Newengineering...
 2  Data Scientist | Averity | New York, NY | like to become a Data Scientist at a global in...
 3  Data Scientist | Revelio Labs | New York, NY | for: Revelio Labs is looking for a creative Se...
 4  Lead Data Scientist | Thomas | New York, NY | looking for a Lead Data Scientist to lead and ...
..  ... | ... | ... | ...
95  Senior Data Scientist | Wonder | New York, NY | written and verbal) to collaborate with busine...
96  Data Scientist, Product Experimentation | Captions | New York, NY | , or a related discipline. * 3-5 years of prov...
97  Staff Data Scientist, Marketplace | CookUnity | New York, NY | with engineering. * Provide mentorship and gui...
98  Sr. Product Data Scientist (NY) | Philo | New York, NY | streaming service. You'll be working closely w...
99  Staff Data Scientist, Marketplace | CookUnity | New York, NY | with engineering. * Provide mentorship and gui...

100 rows × 4 columns

In this modified code, we first define the search query and location variables, as well as the base URL of JobInventory.com. We also define empty lists to store the job details.

We then loop through the pages of job listings, up to the limit set by max_pages (five here), incrementing the start parameter of the URL on each pass. If the current page contains no job listings, we have reached the end of the results and break out of the loop.
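
One refinement worth considering when looping over pages (not shown in the script above): pause briefly between requests and identify your client, so you don't hammer the server. A minimal sketch, with a hypothetical example URL standing in for the per-page search URL:

Python
import time

import requests

# Hypothetical URL; in the loop above, this is the paginated search URL
url = "http://www.jobinventory.com/search?q=data+scientist&l=New+York+City%2C+NY&start=2"

# A descriptive User-Agent and a short delay are simple courtesies
headers = {"User-Agent": "job-scraper-tutorial/0.1"}
response = requests.get(url, headers=headers, timeout=10)
time.sleep(1)  # wait a second before requesting the next page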

We then loop through each job listing on the current page, extracting the relevant details using the find method and appending them to their respective lists.

After we have scraped all the job listings, we clean up the job descriptions using regular expressions, create a Pandas DataFrame to store the job details, and export it to a CSV file.

And there you have it! With these modifications, we can scrape job listings from JobInventory.com across multiple pages.
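
As for the bonus promised at the start, here is a minimal keyword-extraction sketch. It assumes NLTK is installed (pip install nltk) and uses its English stopword list with a plain word counter; the sample description is made up:

Python
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords

# Fetch the stopword list on first run
nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

# Made-up text; in practice, loop over the DataFrame's Description column
description = (
    "We are looking for a data scientist with experience in Python, "
    "machine learning, and SQL to join our analytics team in New York."
)

# Lowercase, keep alphabetic tokens, drop stopwords, count what remains
words = re.findall(r"[a-z]+", description.lower())
keywords = Counter(w for w in words if w not in stop_words)
print(keywords.most_common(5))

Run the same counter over every row of the Description column to surface the most common skills across all the listings you scraped.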

Happy scraping!