How We Score MCP Servers
Our enterprise readiness framework evaluates MCP servers across 6 categories weighted by what actually matters when deploying AI agents in production. Authentication is weighted highest because it's the #1 blocker for multi-tenant enterprise deployment.
Overview
Each MCP server is scored out of 100 points across 6 categories. The categories are weighted by their importance to enterprise platform operators: authentication and schema quality carry the most weight (25 pts each) because they determine whether the server can be deployed securely for multiple users with proper guardrails.
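The weighting scheme above can be sketched as a simple aggregation; category keys and the clamping helper here are illustrative, not the scoring engine's actual names:

```python
# Category weights from the rubric; each category starts at its
# maximum and per-check deductions are subtracted from it.
CATEGORY_WEIGHTS = {
    "authentication": 25,
    "schema_quality": 25,
    "permission_granularity": 20,
    "llm_safety": 15,
    "data_exposure": 10,
    "maintenance_trust": 5,
}

def total_score(category_scores: dict[str, int]) -> int:
    """Sum per-category scores, clamping each to [0, category weight]."""
    return sum(
        max(0, min(score, CATEGORY_WEIGHTS[name]))
        for name, score in category_scores.items()
    )

# The six weights sum to the 100-point total.
assert sum(CATEGORY_WEIGHTS.values()) == 100
```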
Our analysis combines automated source code inspection with heuristic evaluation of tool definitions, input schemas, and transport capabilities. We clone each server's repository, extract tool definitions, and analyze them against our scoring rubric. We check for HTTP/SSE transport support, MCP OAuth spec implementation, schema completeness, permission separation, and description quality.
Authentication & Identity
25/100 pts
How does the server handle credentials? Can a platform pass per-user tokens on each request, or does it read a single credential at startup? What are the operational implications for multi-tenant deployment?
This category scores how well the server supports multi-tenant deployment. Deductions stack: a stdio-only server with env var credentials and no OAuth hits all three checks for 0/25. An HTTP/SSE server without OAuth scores 18/25. A server implementing the full MCP OAuth spec scores 25/25. Authentication is weighted highest because it determines whether a platform can deploy one shared instance or must spawn one process per user.
| ID | Check | Severity | Score |
|---|---|---|---|
| A1 | Stdio-only transport: platform cannot pass per-request tokens | critical | -11 |
| A2 | No MCP OAuth spec: platform must handle token management externally | high | -7 |
| A3 | Env var credentials: no rotation without restart | high | -7 |
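The stacking deductions for checks A1-A3 can be modeled as follows; this is a simplified sketch of the rubric, and the function and flag names are our own:

```python
def auth_score(stdio_only: bool, has_oauth: bool, env_var_creds: bool) -> int:
    """Score the Authentication & Identity category (max 25).

    Deductions stack, mirroring checks A1-A3:
      A1 stdio-only transport        -11 (critical)
      A2 no MCP OAuth spec support    -7 (high)
      A3 env var credentials          -7 (high)
    """
    score = 25
    if stdio_only:
        score -= 11      # A1: platform cannot pass per-request tokens
    if not has_oauth:
        score -= 7       # A2: token management handled externally
    if env_var_creds:
        score -= 7       # A3: no rotation without restart
    return max(score, 0)

# Worked examples from the text:
assert auth_score(stdio_only=True,  has_oauth=False, env_var_creds=True) == 0   # all three checks
assert auth_score(stdio_only=False, has_oauth=False, env_var_creds=False) == 18 # HTTP/SSE, no OAuth
assert auth_score(stdio_only=False, has_oauth=True,  env_var_creds=False) == 25 # full MCP OAuth spec
```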
Tool Schema Quality
25/100 pts
Do all tool parameters have explicit JSON Schema types? Are required fields declared on write operations? Are there constraints like maxLength or enum on dangerous parameters? Are there raw shell/eval/SQL surfaces?
Tool schemas are the primary guardrail in an MCP deployment. They constrain what an LLM can send to a tool before it executes. Without types, required fields, and constraints, a prompt injection attack can pass arbitrary data to write, delete, or execute operations. Schema quality is the difference between a tool that's safe to expose to an AI agent and one that's a liability.
| ID | Check | Severity | Score |
|---|---|---|---|
| S1 | Tools have no input schema defined | high | -2 |
| S2 | Required fields missing on write/delete operations | high | -2 |
| S3 | Parameters lack constraints (maxLength, enum, pattern) | medium | -4 |
| S4 | Raw shell/eval/SQL execution surface | critical | -5 |
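To illustrate what checks S1-S3 look for, here is a well-constrained input schema for a hypothetical delete tool (the tool and field names are ours), alongside a simplified version of the S2 check:

```python
# A delete tool whose schema constrains what the LLM can send (S1-S3 pass).
delete_record = {
    "name": "delete_record",
    "inputSchema": {
        "type": "object",
        "properties": {
            "record_id": {"type": "string", "maxLength": 64},            # S3: length constraint
            "table": {"type": "string", "enum": ["drafts", "archive"]},  # S3: enum whitelist
        },
        "required": ["record_id", "table"],                              # S2: required fields
    },
}

def missing_required_on_write(tool: dict) -> bool:
    """Check S2: flag write/delete tools that declare no required fields."""
    is_write = any(verb in tool["name"] for verb in ("create", "update", "delete"))
    schema = tool.get("inputSchema") or {}
    return is_write and not schema.get("required")

assert not missing_required_on_write(delete_record)
```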
Permission Granularity
20/100 pts
Are read and write tools separate? Can destructive operations (delete, admin) be independently disabled? Do tool descriptions specify what resources they can access?
Least privilege is critical when an AI agent decides which tools to call. If read and write operations are bundled together, the platform cannot give an agent read-only access. If destructive operations (delete, purge, admin) are mixed with regular CRUD, they cannot be independently disabled. Granular tool separation lets enterprises expose only the capabilities each agent role needs.
| ID | Check | Severity | Score |
|---|---|---|---|
| P1 | Read and write operations combined in single tool | medium | -4 |
| P2 | Destructive operations not isolated from regular tools | high | -5 |
| P3 | Tool descriptions lack scope boundaries | low | -3 |
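One way to see why separation matters: when each tool carries its own risk classification, a platform can derive a read-only tool set per agent role, but only if read, write, and destructive operations are not bundled into single tools. The registry and role taxonomy below are illustrative:

```python
# Hypothetical tool registry with a per-tool risk classification.
TOOLS = {
    "list_issues":  "read",
    "get_issue":    "read",
    "create_issue": "write",
    "delete_issue": "destructive",
}

def allowed_tools(role: str) -> set[str]:
    """Derive the tool set an agent role may see, under least privilege."""
    permitted = {
        "viewer": {"read"},
        "editor": {"read", "write"},
        "admin":  {"read", "write", "destructive"},
    }[role]
    return {name for name, risk in TOOLS.items() if risk in permitted}

assert allowed_tools("viewer") == {"list_issues", "get_issue"}
assert "delete_issue" not in allowed_tools("editor")
```

If `list_issues` and `delete_issue` were a single combined tool (check P1), no such filtering would be possible.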
LLM Safety
15/100 pts
Are all tool descriptions present, specific, and free of manipulative language? Could any description cause the LLM to select the wrong tool or follow hidden instructions? Are there overlapping descriptions that create ambiguity?
LLMs select tools based on their descriptions. A vague or missing description means the LLM is guessing which tool to call, potentially triggering a destructive operation when a read was intended. Hidden instructions in descriptions can hijack the LLM's behavior in multi-server deployments. Overlapping descriptions create ambiguity that leads to wrong tool selection. This combines what used to be separate Description Quality and Injection Resistance categories.
| ID | Check | Severity | Score |
|---|---|---|---|
| L1 | Tool missing description entirely | high | -1 |
| L2 | Description too vague (under 20 characters) | medium | -1 |
| L3 | Description contains hidden instructions or prompt injection | critical | -5 |
| L4 | Overlapping descriptions between tools | medium | -2 |
Data Exposure
10/100 pts
Do list operations have pagination? Do tool responses return minimal fields or entire records? Could error messages leak credentials or internal paths?
Tool responses flow back into LLM context windows, which may be logged, cached, or shared across users. A list operation without pagination could dump an entire customer table into context. Returning full records with PII when only an ID was needed is a data leak. Error messages that include credentials or internal paths are exploitable. The platform can mitigate some of this with response filtering, but the server should minimize exposure at the source.
| ID | Check | Severity | Score |
|---|---|---|---|
| D1 | List operations lack pagination controls | medium | -3 |
| D2 | Responses return full records without field selection | low | -2 |
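Check D1 can be approximated by looking for pagination parameters in list-tool schemas; the parameter names below are common conventions, not an exhaustive list:

```python
PAGINATION_PARAMS = {"limit", "page", "page_size", "cursor", "offset"}

def lacks_pagination(tool: dict) -> bool:
    """Check D1: flag list/search tools whose schema has no pagination control."""
    if not tool["name"].startswith(("list_", "search_")):
        return False
    props = (tool.get("inputSchema") or {}).get("properties", {})
    return not (PAGINATION_PARAMS & props.keys())

unbounded = {"name": "list_customers",
             "inputSchema": {"type": "object", "properties": {}}}
bounded = {"name": "list_customers",
           "inputSchema": {"type": "object",
                           "properties": {"limit": {"type": "integer", "maximum": 100}}}}

assert lacks_pagination(unbounded)       # could dump an entire table into context
assert not lacks_pagination(bounded)
```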
Maintenance & Trust
5/100 pts
Is the server maintained by the service vendor or a community contributor? Is it actively updated? Does it have a changelog?
Vendor-maintained servers signal long-term reliability: the company behind the service has an interest in keeping the MCP server working and secure. Community-maintained servers may be abandoned. This is the least differentiating category because great community servers exist and vendor servers can still have technical issues, but it matters for enterprise procurement decisions.
| ID | Check | Severity | Score |
|---|---|---|---|
| M1 | Community-maintained without vendor backing | low | -2 |
Grade Thresholds
Scores are out of 100 and map directly to letter grades.
| Grade | Assessment |
|---|---|
| A | Enterprise-ready. Supports per-user OAuth, strong schemas, granular permissions. Deploy with standard review. |
| B | Good baseline. Has HTTP transport or strong schemas but gaps in one area. Deployable with platform-level controls. |
| C | Significant gaps. Likely stdio-only with weak schemas. Needs compensating controls and restricted scope. |
| D | Major concerns. Shared credentials, missing schemas, mixed permissions. Requires substantial remediation. |
| F | Critical issues. Not ready for enterprise deployment without a complete security wrapper. |
Audit Process
1. Source Code Enumeration
We clone each server's GitHub repository and extract tool definitions from the source code. This identifies tool names, descriptions, input schemas, and transport capabilities without requiring API keys or running the server.
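A simplified version of this extraction step, sketched as a regex scan over TypeScript-SDK-style `server.tool(...)` registrations. Real tool definitions vary widely across SDKs and languages; this only illustrates the static approach, and the sample source is invented:

```python
import re

TS_SOURCE = '''
server.tool("list_issues", "List issues in a repository", schema, handler);
server.tool("delete_issue", "Delete an issue permanently", schema, handler);
'''

# Match server.tool("name", "description", ...) registrations.
TOOL_PATTERN = re.compile(r'server\.tool\(\s*"([^"]+)"\s*,\s*"([^"]+)"')

def extract_tools(source: str) -> list[tuple[str, str]]:
    """Statically extract (name, description) pairs without running the server."""
    return TOOL_PATTERN.findall(source)

assert extract_tools(TS_SOURCE) == [
    ("list_issues", "List issues in a repository"),
    ("delete_issue", "Delete an issue permanently"),
]
```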
2. Transport & Auth Detection
We scan the source for HTTP/SSE transport implementations (SSEServerTransport, StreamableHTTPServerTransport) and MCP OAuth spec support (.well-known/oauth-authorization-server). This determines whether the server can support per-user authentication.
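This detection step amounts to a pattern scan over the codebase; a minimal sketch using the marker strings named above (the sample sources are invented):

```python
TRANSPORT_MARKERS = ("SSEServerTransport", "StreamableHTTPServerTransport")
OAUTH_MARKER = ".well-known/oauth-authorization-server"

def detect_capabilities(source: str) -> dict[str, bool]:
    """Detect HTTP/SSE transport and MCP OAuth spec support from source text."""
    return {
        "http_transport": any(m in source for m in TRANSPORT_MARKERS),
        "oauth": OAUTH_MARKER in source,
    }

stdio_only_src = 'const transport = new StdioServerTransport();'
http_src = ('app.get("/.well-known/oauth-authorization-server", handler);\n'
            'const transport = new StreamableHTTPServerTransport();')

assert detect_capabilities(stdio_only_src) == {"http_transport": False, "oauth": False}
assert detect_capabilities(http_src) == {"http_transport": True, "oauth": True}
```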
3. Heuristic Scoring
We analyze each tool's schema for completeness (types, required fields, constraints), classify its risk level (read/write/admin), check its description for quality and injection patterns, and evaluate permissions for granularity.
Disputes & Rescans
If you maintain an MCP server and believe a finding is incorrect or your server has been updated, you can request a rescan. Email security@completeflow.ai with the server name and specific finding(s).