
Building a LinkedIn Scraper (and Turning It Into an MCP Integration)

Aman Kumar Singh · Software Engineer Intern · 6 min read
TL;DR:

We recently built a LinkedIn scraping tool that can extract profiles, company data, and connections without getting flagged.
Sounds simple, right? Well, three weeks and countless "unusual activity detected" warnings later, we learned that scraping LinkedIn is less about parsing HTML and more about outsmarting its anti-bot systems.

Full source code: github.com/vertexcover-io/linkedin-spider

Why Build This?

We built this to automate our sales pipeline — to find potential leads, connect with them, and communicate with them directly on LinkedIn.

The official API is too limited for that. You can't filter profiles the way you need to, or access detailed company data.

So we built linkedin-spider, a Python scraper with an MCP server on top that plugs into Claude Code.
This setup lets Claude search for leads, gather insights, and even assist with outreach — all through natural language.


The Authentication Nightmare

LinkedIn does not like bots — and it makes that very clear.

At first, we thought authentication would be simple: just grab the li_at cookie, inject it, and start scraping.
Wrong.

LinkedIn's enhanced anti-bot system quickly detected cookie-based sessions, leading to random logouts and account challenges.
So we pivoted to email/password-based authentication with session persistence.

The system now:

  • Tries multiple authentication methods in order of reliability
  • Stores session cookies in a persistent Chrome profile
  • Minimizes repeated logins across runs
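
Here's a minimal sketch of that ordered fallback, assuming the individual auth steps are passed in as callables; the real project's method names and structure may differ:

from typing import Callable

from selenium.webdriver.remote.webdriver import WebDriver

AuthMethod = Callable[[WebDriver], bool]

def authenticate(driver: WebDriver, methods: list[AuthMethod]) -> bool:
    """Try each authentication method in order of reliability, stopping at the first success."""
    for method in methods:
        try:
            if method(driver):
                return True
        except Exception:
            continue  # fall through to the next, less reliable method
    return False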

Fixing Chrome Profile Persistence

In the early versions, Chrome profiles were stored in the working directory.
Users who ran the scraper from different folders were asked to log in every time — defeating the purpose of "persistence."

The fix? A centralized profile directory that stays consistent across runs.
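
Roughly, that means resolving the profile path from the user's home directory instead of the current working directory. A sketch (the .linkedin-spider folder name is an assumption, not necessarily the project's actual layout):

from pathlib import Path

def chrome_profile_dir() -> Path:
    # Same path regardless of where the CLI is launched from.
    profile_dir = Path.home() / ".linkedin-spider" / "chrome-profile"
    profile_dir.mkdir(parents=True, exist_ok=True)
    return profile_dir

The directory is then handed to Chrome via the standard --user-data-dir flag so session cookies survive between runs.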


Handling OTP Challenges in the CLI

Nothing tests your patience like LinkedIn's OTP challenges — especially when you're running a headless CLI tool.

To handle this, we implemented an interactive pause during setup:

def _handle_verification_code_challenge(self) -> bool:
    print("\nEmail verification code required")
    print("Please check your email for the verification code.")
    verification_code = input("Enter the 6-digit verification code: ").strip()

The CLI now detects OTP prompts, waits for user input, submits the code, and resumes.
A simple feature, but getting it right across multiple challenge types was surprisingly tricky.
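
For illustration, submitting the code once the user has typed it in might look like this; the input name and button selector below are placeholder guesses, not LinkedIn's actual markup:

from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def submit_verification_code(driver: WebDriver, code: str) -> None:
    # Wait for the OTP field to appear, type the code, and submit the form.
    field = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.NAME, "pin"))
    )
    field.send_keys(code)
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()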


Scraping: Where Premium Users Broke Everything

It turns out LinkedIn's HTML isn't the same for all users — premium accounts get entirely different structures for the same pages.

Here's how we handled it:

def _extract_name_and_url(self, container: WebElement) -> tuple[str, str]:
    name_selectors = [
        'a[data-view-name="search-result-lockup-title"]',  # Free accounts
        'span.entity-result__title-text a',                # Premium variant 1
        'a.app-aware-link span[aria-hidden="true"]',       # Premium variant 2
    ]

The scraper cycles through multiple selectors for each data point — name, headline, location, image URLs, and more.
Not the most elegant solution, but robust across user types.
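
The core pattern is a simple fallback loop; here's a simplified sketch (the helper name is ours, not the project's):

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement

def first_match(container: WebElement, selectors: list[str]) -> WebElement | None:
    for selector in selectors:
        try:
            return container.find_element(By.CSS_SELECTOR, selector)
        except NoSuchElementException:
            continue  # try the next layout variant
    return None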


Search Filters: The Recursive Rabbit Hole

LinkedIn's search filters are deeply nested and dynamically generated.
Filtering by location, for example, involves dropdowns, autocompletes, and generated IDs.

To manage this, we built a dedicated SearchFilterHandler that automates:

  • Location filters with autocomplete
  • Industry and company selectors
  • Current company searches
  • Connection degree filtering
  • Follower/connection-of filters

Each required reverse-engineering LinkedIn's front-end logic and replicating it through Selenium.
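
As a rough illustration of what one of those handlers does for the location filter, here's a sketch; every selector below is a placeholder, since the real markup is dynamically generated:

from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def apply_location_filter(driver: WebDriver, location: str) -> None:
    # Type into the autocomplete box, wait for the generated suggestion list,
    # then pick the first suggestion.
    box = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable(
            (By.CSS_SELECTOR, "input[placeholder='Add a location']")
        )
    )
    box.send_keys(location)
    suggestion = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "div[role='listbox'] li"))
    )
    suggestion.click()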


Proxy Support: IP Bans

When we authenticated the scraper from a server, LinkedIn frequently threw image-based challenges that are impossible to solve in a headless browser.

The solution? A residential proxy.
It mimics normal browsing behavior from a typical user IP, drastically reducing challenges and OTP prompts during authentication.
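
Wiring the proxy in uses Chrome's standard --proxy-server flag; the endpoint below is a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def proxied_driver(proxy_url: str) -> webdriver.Chrome:
    options = Options()
    options.add_argument(f"--proxy-server={proxy_url}")
    return webdriver.Chrome(options=options)

# e.g. proxied_driver("http://residential-proxy.example.com:8000")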


Stealth Mode: Hiding the Automation

Selenium is easily detectable. LinkedIn can identify automated browsers via navigator properties, missing APIs, or automation flags.

To counter that, we added a stealth mode layer using Chrome DevTools Protocol. It:

  • Masks webdriver properties
  • Spoofs browser and OS information
  • Injects anti-detection scripts on each page load
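
The core trick is injecting a script before any of the page's own scripts run. A minimal sketch using Selenium's CDP hook, simplified to just the webdriver property:

from selenium import webdriver

def apply_stealth(driver: webdriver.Chrome) -> None:
    # Runs before each page's own scripts, so navigator.webdriver never reads true.
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {
            "source": (
                "Object.defineProperty(navigator, 'webdriver', "
                "{ get: () => undefined });"
            )
        },
    )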

Human Behavior Simulation

Bots act like machines; humans don't.

We built a HumanBehavior module that:

  • Adds random delays between actions (0.5–2s by default)
  • Simulates typing character-by-character with natural pauses
  • Scrolls incrementally rather than jumping
  • Adds mouse movements before clicks

These subtle touches reduce automation fingerprints and improve long-term reliability.
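
A minimal sketch of the idea; the class below mirrors the defaults mentioned above but is illustrative rather than the project's exact code:

import random
import time

from selenium.webdriver.remote.webelement import WebElement

class HumanBehavior:
    def pause(self, low: float = 0.5, high: float = 2.0) -> None:
        time.sleep(random.uniform(low, high))

    def type_like_human(self, field: WebElement, text: str) -> None:
        # Send one character at a time with a small, variable inter-key delay.
        for char in text:
            field.send_keys(char)
            time.sleep(random.uniform(0.05, 0.2))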


The MCP Integration

MCP (Model Context Protocol) by Anthropic allows AI assistants to interact with external data sources.

We built an MCP server around linkedin-spider, exposing scraping tools to assistants like Claude.
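
At its core, that means registering the scraper's functions as MCP tools. Here's a minimal sketch using the official Python SDK's FastMCP helper; the tool name and signature are assumptions rather than the project's exact interface:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("linkedin-spider")

@mcp.tool()
def search_profiles(query: str, max_results: int = 10, location: str | None = None) -> list[dict]:
    """Search LinkedIn profiles and return structured results."""
    # The real server calls into the scraper here; this stub just shows the shape.
    return []

if __name__ == "__main__":
    mcp.run()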

Claude can now answer queries like:

"Find 10 product managers in San Francisco working at Series B startups."

Example call:

linkedin-mcp search_profiles \
  --query "product manager San Francisco Series B" \
  --max_results 10 \
  --location "San Francisco"

[
  {
    "name": "Anand Raghavan",
    "headline": "VP Products, AI at Cisco",
    "location": "San Francisco Bay Area",
    "experience": [
      {
        "title": "Cisco",
        "company": "Cisco",
        "company_url": "https://www.linkedin.com/company/1063/",
        "duration": "Full-time · 2 yrs 4 mos",
        "location": "San Francisco Bay Area"
      }
    ]
  }
]

What I'd Do Differently

  • Decouple architecture: Separate driver management from scraping logic earlier
  • Async execution: Move to async/await with concurrent drivers for bulk ops
  • Error recovery: Add retry logic and exponential backoff (see the sketch after this list)
  • Dynamic selectors: Build a remote selector update mechanism
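
For the error-recovery point, the shape we have in mind is a standard retry wrapper with exponential backoff and jitter; the parameters below are illustrative:

import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(operation: Callable[[], T], retries: int = 4, base_delay: float = 1.0) -> T:
    for attempt in range(retries):
        try:
            return operation()
        except Exception:
            if attempt == retries - 1:
                raise
            # 1s, 2s, 4s, ... plus a little jitter to avoid looking mechanical.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise AssertionError("unreachable")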

Current Limitations

  • Cookie-based auth remains unreliable
  • Pagination caps restrict full dataset scraping
  • Rate limits still apply despite delays and proxies
  • Captchas still require manual intervention

Try It Yourself

pip install linkedin-spider[cli]

linkedin-spider-cli search -q "software engineer" -n 10 --email your@email.com --password yourpass

Full source code: github.com/vertexcover-io/linkedin-spider
Licensed under MIT — please use responsibly.


Final Thoughts

Building this scraper taught me that web scraping is 30% engineering and 70% understanding the adversarial dynamics between scrapers and platforms.

LinkedIn doesn't want to be scraped — but can't entirely block it without breaking its own UX.
That gap is where projects like this thrive.

With great scraping power comes great responsibility.
Don't spam. Don't exploit. Don't cross ethical lines.
Use tools like this for research, automation, and legitimate data needs only.