Python Web Scraping: A Beginner’s Guide to Scraping Job Listings
Are you tired of manually scrolling through endless job postings to find the perfect opportunity? Look no further! In this tutorial, I’ll walk you through the steps of scraping job listings from JobInventory.com using Python.
First, we'll use the requests library to send a GET request to the website and retrieve the HTML content. Then, we'll use BeautifulSoup to parse the HTML and extract the relevant job listing information, such as the job title, company name, and job description.
Next, we’ll explore how to use Python’s regular expressions module to clean up the extracted data and prepare it for further analysis. We’ll also cover how to store the data in a CSV file for future use.
But wait, there's more! We'll also dive into handling pagination, so you can collect results across multiple pages rather than just the first.
So, grab a cup of coffee and get ready to dive into the world of web scraping with Python. By the end of this tutorial, you’ll have the skills to scrape job listings on JobInventory.com and beyond!
Setup
To get started with scraping job listings, we’ll first need to install a few Python packages. Open up your terminal or command prompt and run the following command:
pip install requests beautifulsoup4 pandas
These packages will allow us to send HTTP requests, parse HTML, and store the scraped data in a CSV file.
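To confirm the installation, you can print each package's version from a Python shell (a quick, optional sanity check):

import requests
import bs4
import pandas as pd

print(requests.__version__, bs4.__version__, pd.__version__)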
Scraping a single page
Now that we have our dependencies installed, let’s dive into the code.
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
# Define the search query and location
search_query = "data scientist"
location = "New York City, NY"
# Construct the URL
url = f"http://www.jobinventory.com/search?q={search_query}&l={location}"
# Send a GET request to the URL
response = requests.get(url, timeout=30)  # a timeout guards against a hung connection
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
# Find all the job listings on the page
job_listings = soup.find_all("li", class_="resultBlock")
# Define empty lists to store the job details
titles = []
companies = []
locations = []
descriptions = []
# Loop through each job listing and extract the relevant details
for job in job_listings:
    title = job.find("div", class_="title").text.strip()
    company = job.find("span", class_="company").text.strip()
    location = job.find("div", class_="state").text.split("\xa0-\xa0")[-1].strip()
    description = job.find("div", class_="description").text.strip()
    titles.append(title)
    companies.append(company)
    locations.append(location)
    descriptions.append(description)
# Clean up the job descriptions using regular expressions
regex = re.compile(r"\s+")
# Collapse whitespace runs, then keep the text after the first " - " separator
# (maxsplit=1 with [-1] avoids an IndexError when no separator is present)
clean_descriptions = [regex.sub(" ", d).split(" - ", 1)[-1] for d in descriptions]
# Create a Pandas DataFrame to store the job details
df = pd.DataFrame(
    {
        "Title": titles,
        "Company": companies,
        "Location": locations,
        "Description": clean_descriptions,
    }
)
# Export the DataFrame to a CSV file
df.to_csv("job_listings.csv", index=False)
print("Scraping complete! The results are saved in 'job_listings.csv'.")
df
Scraping complete! The results are saved in 'job_listings.csv'.
|   | Title | Company | Location | Description |
|---|---|---|---|---|
0 | Lead Data Scientist | Tiro | New York, NY | Lead Data Scientist Enigma is seekingand visua... |
1 | Data Scientist | Smith Hanley Associates | New York, NY | Title: Data Scientist Location: Newengineering... |
2 | Data Scientist | Averity | New York, NY | like to become a Data Scientist at a global in... |
3 | Data Scientist | Revelio Labs | New York, NY | for: Revelio Labs is looking for a creative Se... |
4 | Lead Data Scientist | Thomas | New York, NY | looking for a Lead Data Scientist to lead and ... |
5 | Data Scientist | Eliassen Group | Jersey City, NJ | The client is seeking a Neo4j data scientist/e... |
6 | Sr. Data Scientist | CVS | New York, NY | hiring for the following role in New York, NY:... |
7 | Data Scientist | E-Frontiers | New York, NY | Data Scientist The Company is aExperience in a... |
8 | Staff Data Scientist | Harnham | New York, NY | Staff Data Scientist AdTech Companyimplement. ... |
9 | Data Scientist, Modeling | Gro Intelligence | New York, NY | addresses agriculture, food, and our climate o... |
10 | Senior Data Scientist | Teachers Insurance and Annuity Association - TIAA | New York, NY | reporting, interpretation of data analyses to ... |
11 | Lead NLP Data Scientist - Remote | WFH | Get It Recruit - Real Estate | Jersey City, NJ | of data science! We are looking for a talented... |
12 | Senior Research Scientist | NYU Langone Health | New York, NY | Senior Research Scientist will help manage, pr... |
13 | Senior Data Scientist | Equation Staffing | New York, NY | B2B. They are looking for a Senior Data Scient... |
14 | Assistant Research Scientist | NYU Langone Health | New York, NY | Investigator. The Research Scientist will mana... |
15 | Principal Data Scientist | Harnham | New York, NY | Principal Data Scientist AdTech StartupLead th... |
16 | Senior Data Scientist | Oliwska Grupa Konsultingowa | New York, NY | for an experienced applied data scientist to j... |
17 | Senior Data Scientist | Storm3 | New York, NY | achieving faster outcomes. We are seeking a dr... |
18 | Data Scientist | Verizon | New York, NY | in a complex, multi-functional, Agile team env... |
19 | Data Scientist Series, MTA Data & Analytics | MTA | New York, NY | Data Scientist Series, MTA012 479 3 Senior Dat... |
In this code, we first define the search query and location variables. We then construct the URL by concatenating these variables with the base URL of JobInventory.com.
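As an aside, the f-string leaves the spaces and commas in the query unencoded. The requests library re-quotes the URL for you, so the code works as written, but you can make the encoding explicit by passing the query as params; this sketch is equivalent to the f-string approach above:

# Let requests handle the URL encoding of the query parameters
response = requests.get(
    "http://www.jobinventory.com/search",
    params={"q": search_query, "l": location},
    timeout=30,
)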
We then send a GET request to the URL using the requests library and parse the HTML content using BeautifulSoup. We find all the job listings on the page by searching for the li elements with the resultBlock class.
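If you want to check what a single listing block looks like before writing extraction code, you can pretty-print the first match (assuming the page returned at least one result):

# Peek at the first listing's HTML to verify the selectors
if job_listings:
    print(job_listings[0].prettify()[:500])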
Next, we define empty lists to store the job details and loop through each job listing, extracting the relevant details using the find method. We append the extracted details to their respective lists.
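One caveat: if a listing is missing one of these elements, find returns None and calling .text raises an AttributeError. A small helper along these lines (a sketch, not part of the code above) makes the extraction more defensive:

def safe_text(tag, name, class_):
    """Return the stripped text of a child element, or '' if it is missing."""
    element = tag.find(name, class_=class_)
    return element.text.strip() if element else ""

# Usage inside the loop, e.g.:
# title = safe_text(job, "div", "title")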
To clean up the job descriptions, we compile a regular expression pattern that matches one or more whitespace characters and use the sub method to replace each run with a single space. Splitting on the first " - " then drops the prefix (typically the job's location) that precedes each description.
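To see what this cleanup does, here it is applied to a made-up description string (the input is illustrative, not real site data):

import re

regex = re.compile(r"\s+")
raw = "New York, NY\xa0-\xa0We are   hiring\na data scientist"
print(regex.sub(" ", raw).split(" - ", 1)[-1])
# Output: We are hiring a data scientist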
Finally, we create a Pandas DataFrame to store the job details and export it to a CSV file using the to_csv method.
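You can also read the file straight back with pandas to confirm the export worked:

# Reload the CSV as a quick sanity check
df_check = pd.read_csv("job_listings.csv")
print(df_check.shape)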
And there you have it! With just a few lines of Python code, you can scrape job listings from JobInventory.com and store them in a CSV file for further analysis.
But what if there are multiple pages of job listings?
Scraping multiple pages
We can handle pagination by modifying our code as follows:
import requests
from bs4 import BeautifulSoup
import re
import time
import pandas as pd
# Define the search query and location
search_query = "data scientist"
location = "New York City, NY"
# Construct the base URL
base_url = "http://www.jobinventory.com"
# Define empty lists to store the job details
titles = []
companies = []
locations = []
descriptions = []
# Loop through each page of job listings
max_pages = 5
page_num = 1
while page_num <= max_pages:
    # Construct the URL for the current page
    url = f"{base_url}/search?q={search_query}&l={location}&start={page_num}"
    # Send a GET request to the URL
    response = requests.get(url, timeout=30)
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")
    # Find all the job listings on the page
    job_listings = soup.find_all("li", class_="resultBlock")
    # If there are no job listings on the current page, we have reached the
    # end of the results
    if not job_listings:
        break
    # Loop through each job listing and extract the relevant details
    for job in job_listings:
        title = job.find("div", class_="title").text.strip()
        company = job.find("span", class_="company").text.strip()
        # Use a distinct name here: reusing "location" would overwrite the
        # search location that the URL above depends on
        job_location = job.find("div", class_="state").text.split("\xa0-\xa0")[-1].strip()
        description = job.find("div", class_="description").text.strip()
        titles.append(title)
        companies.append(company)
        locations.append(job_location)
        descriptions.append(description)
    # Pause briefly between pages to be polite to the server
    time.sleep(1)
    # Increment the page number
    page_num += 1
# Clean up the job descriptions using regular expressions
regex = re.compile(r"\s+")
# Collapse whitespace runs, then keep the text after the first " - " separator
clean_descriptions = [regex.sub(" ", d).split(" - ", 1)[-1] for d in descriptions]
# Create a Pandas DataFrame to store the job details
df = pd.DataFrame(
    {
        "Title": titles,
        "Company": companies,
        "Location": locations,
        "Description": clean_descriptions,
    }
)
# Export the DataFrame to a CSV file
df.to_csv("job_listings_multiple.csv", index=False)
print("Scraping complete! Check 'job_listings_multiple.csv' for the results.")
df
Scraping complete! Check 'job_listings_multiple.csv' for the results.
|   | Title | Company | Location | Description |
|---|---|---|---|---|
0 | Lead Data Scientist | Tiro | New York, NY | Lead Data Scientist Enigma is seekingand visua... |
1 | Data Scientist | Smith Hanley Associates | New York, NY | Title: Data Scientist Location: Newengineering... |
2 | Data Scientist | Averity | New York, NY | like to become a Data Scientist at a global in... |
3 | Data Scientist | Revelio Labs | New York, NY | for: Revelio Labs is looking for a creative Se... |
4 | Lead Data Scientist | Thomas | New York, NY | looking for a Lead Data Scientist to lead and ... |
... | ... | ... | ... | ... |
95 | Senior Data Scientist | Wonder | New York, NY | written and verbal) to collaborate with busine... |
96 | Data Scientist, Product Experimentation | Captions | New York, NY | , or a related discipline. * 3-5 years of prov... |
97 | Staff Data Scientist, Marketplace | CookUnity | New York, NY | with engineering. * Provide mentorship and gui... |
98 | Sr. Product Data Scientist (NY) | Philo | New York, NY | streaming service. You'll be working closely w... |
99 | Staff Data Scientist, Marketplace | CookUnity | New York, NY | with engineering. * Provide mentorship and gui... |
100 rows × 4 columns
In this modified code, we first define the search query and location variables, as well as the base URL of JobInventory.com. We also define empty lists to store the job details.
We then loop through each page of job listings, up to a maximum of five pages, by incrementing the start parameter of the URL. If a page contains no job listings, we have reached the end of the results and break out of the loop.
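To make the pagination scheme concrete, this snippet shows the URL the loop requests for each page (printed rather than fetched):

# Preview the URL for each page of results
for page_num in range(1, max_pages + 1):
    print(f"{base_url}/search?q={search_query}&l={location}&start={page_num}")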
We then loop through each job listing on the current page, extracting the relevant details using the find method and appending them to their respective lists. Note that inside this loop we store the job's location in job_location, since reusing the name location would overwrite the search location used to build each page's URL.
After we have scraped all the job listings, we clean up the job descriptions using regular expressions, create a Pandas DataFrame to store the job details, and export it to a CSV file.
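One thing to watch for: pagination can surface the same posting more than once (rows 97 and 99 in the output above are identical), so it is worth dropping duplicates before exporting:

# Drop postings that appear on more than one page
df = df.drop_duplicates().reset_index(drop=True)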
And there you have it! With these modifications, we can scrape job listings from JobInventory.com across multiple pages.
Happy scraping!