Why Playwright MCP is Not the Right Choice for UI Testing

Playwright MCP might seem like a natural fit for AI-driven UI testing, but its fundamental architecture creates problems that make it impractical for real-world testing workflows. Here's why.

Long Horizon Team

Engineering

The Model Context Protocol (MCP) has opened exciting possibilities for AI agents to interact with external tools. Playwright MCP—which exposes browser automation capabilities to AI agents—seems like a natural fit for UI testing. After all, if an agent can browse the web, surely it can test your application too?

The reality is more complicated. While Playwright MCP is impressive for ad-hoc browser interactions, its fundamental architecture makes it poorly suited for serious UI testing workflows. Here's why.

The Multi-Tool-Call Problem

Playwright MCP works by exposing individual browser actions as separate tool calls. Want to test a login flow? That's a tool call to navigate, a tool call to find the username field, a tool call to type, another to find the password field, another to type, another to click the submit button, and more calls to verify the result.
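
Concretely, the agent's side of that login flow looks something like the sequence below. The tool names are approximations of what a Playwright MCP server exposes (exact names vary by version); the shape is what matters: one model round trip per action.

```typescript
// Illustrative tool-call sequence for a single login flow.
// Tool names are approximate; each entry costs a full round trip to the model.
const toolCalls = [
  { tool: 'browser_navigate', args: { url: 'https://app.example.com/login' } },
  { tool: 'browser_snapshot', args: {} }, // observe the page
  { tool: 'browser_type', args: { element: 'username field', text: 'alice' } },
  { tool: 'browser_type', args: { element: 'password field', text: 'hunter2' } },
  { tool: 'browser_click', args: { element: 'submit button' } },
  { tool: 'browser_snapshot', args: {} }, // verify the result
];
```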

Each tool call requires a round trip to the AI model: the agent observes the result, reasons about it, and decides on the next action. A simple login test that takes milliseconds in traditional Playwright becomes a multi-second ordeal with MCP.
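
For contrast, here is the same flow as an ordinary Playwright test. The URL, labels, and expected text are hypothetical, but the structure is standard: every step executes in-process, with no model in the loop.

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical login test: the whole flow runs locally in milliseconds,
// with zero model round trips between steps.
test('user can log in', async ({ page }) => {
  await page.goto('https://app.example.com/login');
  await page.getByLabel('Username').fill('alice');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Log in' }).click();
  await expect(page.getByText('Welcome back')).toBeVisible();
});
```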

For a comprehensive test suite covering dozens of user flows, this latency compounds into minutes or hours of execution time. What should be a quick feedback loop becomes an exercise in patience.
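
To put rough numbers on it: suppose each round trip takes about two seconds and a typical flow needs 15 tool calls. That is 30 seconds per flow, so a 100-flow suite runs for the better part of an hour, before accounting for retries or agent missteps.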

Non-Repeatable Test Execution

Here's a fundamental problem: when you ask an AI agent to "test the checkout flow," it decides how to test it. Run the same prompt tomorrow, and the agent might take completely different steps. It might click different elements, fill forms in a different order, or skip steps it previously included.

This non-determinism is fine for exploration but catastrophic for testing. The entire point of a test suite is repeatability—you want to know that the same checks run every time, so you can trust that a passing suite means the same thing today as it did yesterday.

With Playwright MCP, you're not running tests. You're asking an agent to improvise a testing session. The results are interesting but not reliable.

The Auditability Gap

When a traditional test passes, you can inspect exactly what happened: which assertions ran, what values were checked, what the expected vs actual results were. The test code itself is the audit trail.
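
A test like the sketch below (selectors and expected values hypothetical) leaves no ambiguity about what "passing" means: every check is written down, version-controlled, and identical on every run.

```typescript
import { test, expect } from '@playwright/test';

// The assertions are the audit trail: anyone can read exactly which
// values were checked and against what expectations.
test('checkout confirmation is correct', async ({ page }) => {
  await page.goto('https://app.example.com/checkout/confirmation');
  await expect(page.getByRole('heading', { name: 'Order confirmed' })).toBeVisible();
  await expect(page.getByTestId('order-total')).toHaveText('$42.00');
  await expect(page.getByTestId('order-id')).toHaveText(/^ORD-\d{6}$/);
});
```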

With Playwright MCP, the agent tells you "I tested the feature and it works." But what did it actually check? Did it verify the success message appeared? Did it confirm the database was updated? Did it check that the email was sent? You have to trust the agent's summary.

This trust problem becomes acute in regulated industries or when debugging production issues. "The agent said it worked" isn't an acceptable answer when you need to prove compliance or understand why something broke.

No Shareable Evidence

Modern development workflows rely on sharing test results. CI/CD pipelines display test reports. PR reviews include links to test runs. QA teams share execution evidence with stakeholders.

Playwright MCP produces... a chat transcript. There's no structured report to attach to a PR. No shareable link showing what was tested. No artifact that proves to your team lead or your client that the feature actually works.

You could screenshot the chat, but that's hardly professional evidence. And it certainly doesn't integrate with your existing test reporting infrastructure.
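
Traditional Playwright, by contrast, emits structured artifacts with a couple of lines of configuration: an HTML report for humans and JUnit XML for CI dashboards. The paths below are illustrative.

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  reporter: [
    ['html', { outputFolder: 'playwright-report' }], // shareable HTML report
    ['junit', { outputFile: 'results/junit.xml' }],  // CI-friendly XML
  ],
});
```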

Regression Testing at Scale

Perhaps the most damning limitation shows up in regression testing. A mature application might have hundreds of user flows that need verification before each release. With traditional test automation, you run the suite and get results in minutes.
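
That speed comes from parallelism that is built into the runner. A sketch of the relevant settings, with illustrative values:

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  fullyParallel: true, // run tests within each file in parallel
  workers: 4,          // parallel browser workers on this machine
  // In CI, the suite can also be split across machines,
  // e.g. `npx playwright test --shard=1/4`.
});
```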

With Playwright MCP, you'd need to prompt the agent to test each flow individually. Given the latency of multi-tool-call execution, running a comprehensive regression suite could take hours. And because each run is non-deterministic, you can't even be sure you're testing the same things as last time.

This makes Playwright MCP impractical for any team that needs reliable, repeatable regression testing—which is to say, any team shipping production software.

What Playwright MCP Is Good For

To be fair, Playwright MCP has legitimate use cases. It's excellent for:

  • Ad-hoc exploration of web applications
  • One-off data extraction tasks
  • Prototyping automation workflows
  • Helping agents understand web page content

But testing—real testing that you can rely on—requires a different approach.

A Better Approach: Agent-Planned, Deterministically Executed

The solution isn't to abandon AI in testing—it's to use AI where it excels while maintaining the properties that make tests valuable.

At Long Horizon, we take a different approach. AI agents plan and write the tests, bringing their understanding of user intent and edge cases. But the tests themselves are deterministic Playwright scripts that execute quickly and repeatably.

This gives you the best of both worlds (a concrete sketch follows the list below):

  • AI-powered test planning that understands what to test and why
  • Fast, deterministic execution that runs the same way every time
  • Comprehensive audit trails with screenshots, logs, and network traces
  • Shareable test reports that integrate with your existing workflows
  • Scalable regression testing that runs your full suite in minutes, not hours
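
As a sketch of the pattern (not a description of our internal implementation): the agent's output is an ordinary spec file that gets committed and replayed like any other test. The intent, selectors, and assertions below are all illustrative.

```typescript
import { test, expect } from '@playwright/test';

// Step 1 (AI, once): an agent turns an intent such as "verify a new user
// can sign up" into this spec and commits it to the repo.
// Step 2 (deterministic, every run): CI replays the committed spec with
// plain Playwright, so every run performs the same steps and assertions.
test('new user can sign up', async ({ page }) => {
  await page.goto('https://app.example.com/signup');
  await page.getByLabel('Email').fill('new-user@example.com');
  await page.getByRole('button', { name: 'Create account' }).click();
  await expect(page.getByText('Check your inbox')).toBeVisible();
});
```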

Conclusion

Playwright MCP is a cool technology that solves real problems—just not the problem of UI testing. Its architecture fundamentally conflicts with what testing requires: speed, repeatability, auditability, and scalability.

If you're evaluating AI-powered testing solutions, look for approaches that use AI for planning and analysis while maintaining deterministic execution. Your future self—debugging a production issue at 2 AM—will thank you for having reliable, auditable test evidence instead of a chat log saying "the agent said it worked."
