Job market analysis in Python

This walkthrough pulls live job postings into pandas and runs three analyses on a normalized dataset rather than scraped HTML: hiring by role category, salary disclosure rate by category, and remote share over time. Every snippet is complete and runnable against the live API with a free key.

Setup

Two dependencies: requests to pull the data and pandas to shape it. Set your API key as an environment variable so it never lands in the code.

bash

# Python 3.10+
pip install requests pandas

Then export JLA_API_KEY=jla_live_your_key_here. A free key from the dashboard gives you 5,000 requests a month, which is plenty for analysis.
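
If the variable is missing, os.environ["JLA_API_KEY"] fails with a bare KeyError. A small guard can turn that into a readable message. The sketch below is one way to do it; load_api_key is a hypothetical helper name used only for illustration, not part of any library.

```python
import os


def load_api_key(env=None, var="JLA_API_KEY"):
    """Return the API key from the environment, or raise with a clear hint.

    load_api_key is a hypothetical helper for this walkthrough.
    `env` defaults to os.environ; passing a dict makes the function testable.
    """
    env = os.environ if env is None else env
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(f"{var} is not set; run: export {var}=<your key>")
    return key
```

Calling load_api_key() once at startup surfaces a missing key before any request is made.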

Pulling a dataset

The fetch function pages through /v1/jobs with offset pagination, capping the pull at a row limit so an exploratory run stays cheap. The one piece that matters in practice is rate-limit handling: when the API returns 429, it tells you exactly how long to wait in the Retry-After header, so you honor that rather than guessing a backoff.

fetch.py

# fetch.py — pull a bounded dataset with polite rate-limit handling.
import os
import time

import requests

API_BASE = "https://api.joblistingsapi.com/v1"
API_KEY = os.environ["JLA_API_KEY"]
SESSION = requests.Session()
SESSION.headers["X-API-Key"] = API_KEY


def fetch_jobs(max_rows: int = 1000, page_size: int = 100, **filters) -> list[dict]:
    """Page through /v1/jobs with offset, capping the pull at max_rows.

    The free tier allows 10 requests/minute and 5,000/month, so a 1,000-row
    pull is ten requests — well inside the free budget. On a 429 we honor the
    Retry-After header instead of guessing a backoff.
    """
    jobs: list[dict] = []
    offset = 0

    while len(jobs) < max_rows:
        limit = min(page_size, max_rows - len(jobs))
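        res = SESSION.get(
            f"{API_BASE}/jobs",
            params={"limit": limit, "offset": offset, **filters},
            timeout=30,  # requests has no default timeout; avoid hanging forever
        )

        # Rate limited: wait as long as the server asks, then retry the same
        # page. Fall back to 5 seconds if Retry-After is missing or is not a
        # plain integer (HTTP also allows a date in this header).
        if res.status_code == 429:
            retry_after = res.headers.get("Retry-After", "")
            wait = int(retry_after) if retry_after.isdigit() else 5
            time.sleep(wait)
            continue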

        res.raise_for_status()
        body = res.json()
        batch = body["jobs"]
        if not batch:
            break  # reached the end of the dataset before max_rows

        jobs.extend(batch)
        offset += len(batch)

        # Be a good citizen between pages even when not rate limited.
        time.sleep(0.2)

    return jobs[:max_rows]


if __name__ == "__main__":
    rows = fetch_jobs(max_rows=1000, country="GB")
    print(f"pulled {len(rows)} jobs")

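Capping at 1,000 to 2,000 rows keeps the example inside the free tier: a 1,000-row pull is ten requests, against a budget of 10 per minute and 5,000 per month. A 2,000-row pull is twenty requests, so on the free tier it will likely hit the per-minute limit partway through and pause on the 429 branch before finishing. Raise the cap only when you move to a paid plan and need the volume.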
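
The 429 branch in fetch_jobs reads Retry-After as a number of seconds. HTTP also allows that header to carry a date, so a more defensive parser may be worth having. The sketch below handles both forms using only the standard library; retry_after_seconds is a hypothetical helper name, and whether this API ever sends the date form is not something the text above states.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, default=5.0, now=None):
    """Convert a Retry-After header value into seconds to wait.

    Accepts an integer number of seconds or an HTTP-date. Returns `default`
    when the value is missing or cannot be parsed.
    """
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        # Treat a date without an offset as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no wait is needed.
    return max(0.0, (when - now).total_seconds())
```

In the fetch loop this would replace the int(...) conversion, for example time.sleep(retry_after_seconds(res.headers.get("Retry-After"))).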

Into a DataFrame

pd.json_normalize flattens the nested records. Two fields need attention: salary is a nested object that is null on most postings, so we lift its fields to flat columns and let absent pay become NaN; and listed_at is a timestamp we parse so we can resample by time later.

analyze.py

# analyze.py — normalize the rows into a flat DataFrame.
import pandas as pd
from fetch import fetch_jobs

rows = fetch_jobs(max_rows=1000, country="GB")
df = pd.json_normalize(rows)

# Salary is a nested object that is null on most postings. Pull the fields we
# need to flat columns; rows without disclosed pay become NaN, which is correct.
for field in ("min", "max", "currency"):
    col = f"salary.{field}"
    df[f"salary_{field}"] = df[col] if col in df.columns else pd.NA

# Parse the listing timestamp so we can resample by time. Some postings have no
# listed_at; coerce keeps those as NaT instead of raising.
df["listed_at"] = pd.to_datetime(df["listed_at"], errors="coerce", utc=True)

df["has_salary"] = df["salary_min"].notna()
print(df[["title", "company", "role_category", "has_salary"]].head())
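
To see what the salary lift does without calling the API, the same logic can be run on a couple of hand-written records. This is a minimal sketch: the records and their values are made up, and lift_salary is a hypothetical wrapper around the loop above. It covers both a mixed batch and a batch where no posting disclosed pay, which is the case the "col in df.columns" check exists for.

```python
import pandas as pd

# Toy records shaped like the description above; all values are invented.
rows = [
    {"title": "Data Engineer",
     "salary": {"min": 50000, "max": 65000, "currency": "GBP"}},
    {"title": "Account Manager", "salary": None},
]


def lift_salary(records):
    """Flatten records and copy salary.* into flat salary_* columns."""
    df = pd.json_normalize(records)
    for field in ("min", "max", "currency"):
        col = f"salary.{field}"
        # If no record had a salary object, the dotted column never exists.
        df[f"salary_{field}"] = df[col] if col in df.columns else pd.NA
    df["has_salary"] = df["salary_min"].notna()
    return df


mixed = lift_salary(rows)
none_disclosed = lift_salary([{"title": "Nurse", "salary": None}])
print(mixed[["title", "salary_min", "has_salary"]])
print(none_disclosed[["title", "salary_min", "has_salary"]])
```

In both cases has_salary ends up False for the postings with no salary object, so downstream code does not need to special-case an all-null pull.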

Three analyses

Each of these is a few lines on the DataFrame above. They are the questions people most often bring to posting data.

Postings by role category

A straight value_counts on the normalized role_category column tells you where hiring is concentrated. Because the taxonomy is normalized on every record, "Eng", "SWE", and "Software Engineer" have already resolved to one slug — you are not counting title variants.

python

# 1. Postings by role category — where is hiring concentrated?
# value_counts on the normalized role_category column shows which functions
# are actively hiring. Because the taxonomy is normalized, "Eng", "SWE", and
# "Software Engineer" have already resolved to one category — no string cleanup.
by_category = df["role_category"].value_counts(dropna=False)
print(by_category)
# engineering                  312
# sales-account-management     198
# business-operations          164
# healthcare                    93
# NaN                           33  (postings without a resolved category)
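
Raw counts depend on how many rows were pulled, so shares are often easier to compare across runs. As a small variation, value_counts accepts normalize=True; the sketch below runs it on invented toy data, keeping unresolved categories in the denominator with dropna=False.

```python
import pandas as pd

# Made-up categories for illustration; None stands for an unresolved category.
toy = pd.DataFrame(
    {"role_category": ["engineering", "engineering", "healthcare", None]}
)

share = (
    toy["role_category"]
    .value_counts(normalize=True, dropna=False)  # fractions instead of counts
    .mul(100)
    .round(1)
)
print(share)
```

With dropna=False the shares sum to 100 across all postings, including those without a category.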

Salary disclosure rate by category

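Group by role category and take the mean of the boolean has_salary column. The mean of a 0/1 series is the share that is true, so multiplied by 100 it is the percentage of postings in each category that disclosed pay. By default groupby drops postings whose role_category is NaN; pass dropna=False to keep them as their own group. Disclosure varies a lot by function, which is what makes this cut worth running.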

python

# 2. Salary disclosure rate by role category — who actually posts pay?
# Group by category and take the mean of the boolean has_salary column: the
# mean of a 0/1 series is the share that is True, i.e. the disclosure rate.
disclosure = (
    df.groupby("role_category")["has_salary"]
    .mean()
    .sort_values(ascending=False)
    .mul(100)
    .round(1)
)
print(disclosure)
# engineering    34.2
# data           29.8
# sales          18.1
# (each value is the percent of postings in that category that disclosed pay)
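
A rate computed over a handful of postings is noisy, so it can help to show the group size next to it. The sketch below does that on invented toy data using named aggregation; the column names disclosure_pct and postings are illustrative choices.

```python
import pandas as pd

# Invented data: four postings per category.
toy = pd.DataFrame(
    {
        "role_category": ["engineering"] * 4 + ["sales"] * 4,
        "has_salary": [True, False, True, True, False, False, True, False],
    }
)

summary = (
    toy.groupby("role_category")["has_salary"]
    .agg(disclosure_pct="mean", postings="size")  # rate and group size together
    .assign(disclosure_pct=lambda d: d["disclosure_pct"].mul(100).round(1))
    .sort_values("disclosure_pct", ascending=False)
)
print(summary)
```

Filtering on postings (for example, keeping groups above some minimum size) is one way to avoid over-reading small categories.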

Remote share over time

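Resample the postings into weekly buckets and take the mean of is_remote. Because the column is boolean, its weekly mean is the remote share of that week's postings. resample("W") labels each bucket by the Sunday that ends the week, and the first and last buckets usually cover only part of a week, so treat their values with more caution.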

python

# 3. Remote share over time — is remote rising or falling week to week?
# Resample the postings into weekly buckets and take the mean of is_remote,
# which (being boolean) gives the remote share of each week's postings.
weekly_remote = (
    df.dropna(subset=["listed_at"])
    .set_index("listed_at")
    .sort_index()["is_remote"]
    .resample("W")
    .mean()
    .mul(100)
    .round(1)
)
print(weekly_remote)
# Caveat: the live dataset is a rolling 21-day window, so this resample only
# spans ~3 weeks. For a real time series, snapshot on a schedule (see below).
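
To check how the weekly buckets are formed, the same resample can be run on a few hand-picked dates. This is a sketch on invented data. It casts is_remote to float before averaging and reports the number of postings per week alongside the share; if the real column contains missing values, decide how to handle them before this step.

```python
import pandas as pd

# Invented postings: four in the week ending Sun 2024-01-07, two in the next.
toy = pd.DataFrame(
    {
        "listed_at": pd.to_datetime(
            ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05",
             "2024-01-09", "2024-01-10"],
            utc=True,
        ),
        "is_remote": [True, False, True, True, False, True],
    }
)

weekly = (
    toy.set_index("listed_at")
    .sort_index()["is_remote"]
    .astype(float)          # True/False -> 1.0/0.0
    .resample("W")          # weekly buckets labeled by the ending Sunday
    .agg(["mean", "count"])
    .rename(columns={"mean": "remote_pct", "count": "postings"})
)
weekly["remote_pct"] = weekly["remote_pct"].mul(100).round(1)
print(weekly)
```

The postings column makes partial or thin weeks visible, which matters when the whole series is only about three buckets long.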

Caveats

Two honest limits shape what these numbers can and cannot tell you.

The live window is 21 days. The endpoint holds a rolling 21-day active window, so the remote-over-time resample spans only about three weeks. One pull cannot show trends over months: snapshot the dataset on a schedule (a daily or weekly cron writing each pull to storage) and build the time series from your own snapshots. The data is a current census, not a historical archive.

Salary is self-reported by the employer. The salary fields are populated only when the posting states pay, and only roughly 15–25% of postings do. Disclosure rates therefore measure how often employers in a category choose to post pay, which is not the same as the underlying pay distribution. Never read an absent salary as a low one; it is simply undisclosed.
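
The snapshot approach from the first caveat can be sketched with the standard library alone. The code below writes each pull to a dated JSON file and later merges all files, keeping one copy of each posting. save_snapshot and load_snapshots are hypothetical helpers, the file naming is an arbitrary choice, and the "id" field used for de-duplication is an assumption: check the schema for the actual unique identifier.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def save_snapshot(rows, directory, day=None):
    """Write one pull to <directory>/jobs-YYYY-MM-DD.json and return the path."""
    day = day or date.today()
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"jobs-{day.isoformat()}.json"
    path.write_text(json.dumps(rows), encoding="utf-8")
    return path


def load_snapshots(directory, key="id"):
    """Merge all snapshots, keeping the first-seen copy of each posting.

    Assumes every posting has a stable unique field named by `key`;
    "id" is a guess, not a confirmed schema field.
    """
    seen = {}
    for path in sorted(Path(directory).glob("jobs-*.json")):
        for row in json.loads(path.read_text(encoding="utf-8")):
            seen.setdefault(row[key], row)
    return list(seen.values())


# Demo with two invented daily pulls that overlap on one posting.
with tempfile.TemporaryDirectory() as tmp:
    save_snapshot([{"id": "a1", "is_remote": True}], tmp, date(2024, 1, 1))
    save_snapshot(
        [{"id": "a1", "is_remote": True}, {"id": "b2", "is_remote": False}],
        tmp,
        date(2024, 1, 2),
    )
    merged = load_snapshots(tmp)
print(len(merged), "unique postings")
```

A scheduled job would call save_snapshot(fetch_jobs(...), "snapshots") once per run, and the analysis would start from pd.json_normalize(load_snapshots("snapshots")) instead of a single live pull.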

For ready-made aggregates you do not have to compute yourself — live counts by source, category, and country — see the Job Market Pulse, or the docs for the full schema and filters. To put the same data on a page instead of in a notebook, see building a job board with Next.js.