Framework v2

How We Score MCP Servers

Our enterprise readiness framework evaluates MCP servers across 6 categories weighted by what actually matters when deploying AI agents in production. Authentication is weighted highest because it's the #1 blocker for multi-tenant enterprise deployment.

Overview

Each MCP server is scored out of 100 points across 6 categories. The categories are weighted by their importance to enterprise platform operators: authentication and schema quality carry the most weight (25 pts each) because they determine whether the server can be deployed securely for multiple users with proper guardrails.

Our analysis combines automated source code inspection with heuristic evaluation of tool definitions, input schemas, and transport capabilities. We clone each server's repository, extract tool definitions, and analyze them against our scoring rubric. We check for HTTP/SSE transport support, MCP OAuth spec implementation, schema completeness, permission separation, and description quality.

| Category | Points (of 100) |
| --- | --- |
| Authentication & Identity | 25 |
| Tool Schema Quality | 25 |
| Permission Granularity | 20 |
| LLM Safety | 15 |
| Data Exposure | 10 |
| Maintenance & Trust | 5 |
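
In code form, the weighting is a simple lookup; a minimal sketch (the category keys are illustrative, the point values come from the table above):

```ts
// Category weights from the table above; they must sum to the 100-point total.
const CATEGORY_WEIGHTS = {
  authenticationIdentity: 25,
  toolSchemaQuality: 25,
  permissionGranularity: 20,
  llmSafety: 15,
  dataExposure: 10,
  maintenanceTrust: 5,
} as const;

const TOTAL = Object.values(CATEGORY_WEIGHTS).reduce((a, b) => a + b, 0);
console.assert(TOTAL === 100, "category weights must sum to 100");
```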

Authentication & Identity

25 of 100 pts

How does the server handle credentials? Can a platform pass per-user tokens on each request, or does it read a single credential at startup? What are the operational implications for multi-tenant deployment?

This category scores how well the server supports multi-tenant deployment. Deductions stack: a stdio-only server with env var credentials and no OAuth hits all three checks for 0/25. An HTTP/SSE server without OAuth scores 18/25. A server implementing the full MCP OAuth spec scores 25/25. Authentication is weighted highest because it determines whether a platform can deploy one shared instance or must spawn one process per user.

| ID | Check | Severity | Score |
| --- | --- | --- | --- |
| A1 | Stdio-only transport: platform cannot pass per-request tokens | critical | -11 |
| A2 | No MCP OAuth spec: platform must handle token management externally | high | -7 |
| A3 | Env var credentials: no rotation without restart | high | -7 |
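
A minimal sketch of how the stacking works (the check IDs and deductions come from the table above; the function name and shapes are illustrative, not our actual implementation):

```ts
// Deductions for failed authentication checks, keyed by check ID.
const AUTH_DEDUCTIONS: Record<string, number> = { A1: 11, A2: 7, A3: 7 };

// Start from the category's 25 points, subtract each failed check, floor at 0.
function scoreAuthentication(failedChecks: string[]): number {
  const deducted = failedChecks.reduce(
    (pts, id) => pts - (AUTH_DEDUCTIONS[id] ?? 0),
    25
  );
  return Math.max(0, deducted);
}

scoreAuthentication(["A1", "A2", "A3"]); // stdio-only + no OAuth + env creds -> 0
scoreAuthentication(["A2"]);             // HTTP/SSE but no OAuth spec       -> 18
scoreAuthentication([]);                 // full MCP OAuth spec              -> 25
```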

Tool Schema Quality

25 of 100 pts

Do all tool parameters have explicit JSON Schema types? Are required fields declared on write operations? Are there constraints like maxLength or enum on dangerous parameters? Are there raw shell/eval/SQL surfaces?

Tool schemas are the primary guardrail in an MCP deployment. They constrain what an LLM can send to a tool before it executes. Without types, required fields, and constraints, a prompt injection attack can pass arbitrary data to write, delete, or execute operations. Schema quality is the difference between a tool that's safe to expose to an AI agent and one that's a liability.

| ID | Check | Severity | Score |
| --- | --- | --- | --- |
| S1 | Tools have no input schema defined | high | -2 |
| S2 | Required fields missing on write/delete operations | high | -2 |
| S3 | Parameters lack constraints (maxLength, enum, pattern) | medium | -4 |
| S4 | Raw shell/eval/SQL execution surface | critical | -5 |
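
For illustration, here is what a schema that passes these checks might look like next to one that gets flagged (the tool and field names are hypothetical):

```ts
// Passes S1-S3: explicit types, required fields declared on a delete,
// and constraints on the dangerous parameters.
const goodSchema = {
  type: "object",
  properties: {
    table: { type: "string", enum: ["notes", "tags"] },          // S3: enum constraint
    recordId: { type: "string", pattern: "^[a-z0-9-]{1,64}$" },  // S3: pattern bounds input
  },
  required: ["table", "recordId"],                               // S2: required on a delete
  additionalProperties: false,
};

// Flagged: no types, nothing required, and a raw SQL surface (S1, S2, S4).
const badSchema = {
  type: "object",
  properties: {
    query: { description: "SQL to run" }, // untyped, unconstrained, raw SQL
  },
};
```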

Permission Granularity

20 of 100 pts

Are read and write tools separate? Can destructive operations (delete, admin) be independently disabled? Do tool descriptions specify what resources they can access?

Least privilege is critical when an AI agent decides which tools to call. If read and write operations are bundled together, the platform cannot give an agent read-only access. If destructive operations (delete, purge, admin) are mixed with regular CRUD, they cannot be independently disabled. Granular tool separation lets enterprises expose only the capabilities each agent role needs.

| ID | Check | Severity | Score |
| --- | --- | --- | --- |
| P1 | Read and write operations combined in single tool | medium | -4 |
| P2 | Destructive operations not isolated from regular tools | high | -5 |
| P3 | Tool descriptions lack scope boundaries | low | -3 |
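
A sketch of the separation we look for, using hypothetical note tools (the `annotations` hints follow the MCP tool-annotation fields; a platform can use them to gate destructive calls):

```ts
// Separate tools per operation, so a platform can expose only the read tool
// to a read-only agent role and disable the destructive one independently.
const tools = [
  {
    name: "notes_list",
    description: "List notes in the authenticated user's workspace. Read-only.",
    annotations: { readOnlyHint: true, destructiveHint: false },
  },
  {
    name: "notes_delete",
    description: "Permanently delete a note by ID in the user's workspace.",
    annotations: { readOnlyHint: false, destructiveHint: true },
  },
];

// Platform-side allowlist: a read-only role never sees the destructive tool.
const readOnlyRole = tools.filter((t) => t.annotations.readOnlyHint);
```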

LLM Safety

15 of 100 pts

Are all tool descriptions present, specific, and free of manipulative language? Could any description cause the LLM to select the wrong tool or follow hidden instructions? Are there overlapping descriptions that create ambiguity?

LLMs select tools based on their descriptions. A vague or missing description means the LLM is guessing which tool to call, potentially triggering a destructive operation when a read was intended. Hidden instructions in descriptions can hijack the LLM's behavior in multi-server deployments. Overlapping descriptions create ambiguity that leads to wrong tool selection. This combines what used to be separate Description Quality and Injection Resistance categories.

| ID | Check | Severity | Score |
| --- | --- | --- | --- |
| L1 | Tool missing description entirely | high | -1 |
| L2 | Description too vague (under 20 characters) | medium | -1 |
| L3 | Description contains hidden instructions or prompt injection | critical | -5 |
| L4 | Overlapping descriptions between tools | medium | -2 |
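
A simplified sketch of the kind of heuristics involved (the patterns shown are examples, not our full rule set; L4 is checked separately by comparing descriptions pairwise):

```ts
// Example injection phrases; real scans use a much larger pattern set.
const INJECTION_PATTERNS = [
  /ignore (all|previous) instructions/i,
  /do not (tell|inform) the user/i,
  /always (call|use) this tool/i,
];

function checkDescription(desc: string | undefined): string[] {
  if (!desc) return ["L1: missing description"];
  const findings: string[] = [];
  if (desc.trim().length < 20) findings.push("L2: too vague (<20 chars)");
  if (INJECTION_PATTERNS.some((p) => p.test(desc))) {
    findings.push("L3: hidden instructions / injection pattern");
  }
  return findings;
}
```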

Data Exposure

10 of 100 pts

Do list operations have pagination? Do tool responses return minimal fields or entire records? Could error messages leak credentials or internal paths?

Tool responses flow back into LLM context windows, which may be logged, cached, or shared across users. A list operation without pagination could dump an entire customer table into context. Returning full records with PII when only an ID was needed is a data leak. Error messages that include credentials or internal paths are exploitable. The platform can mitigate some of this with response filtering, but the server should minimize exposure at the source.

| ID | Check | Severity | Score |
| --- | --- | --- | --- |
| D1 | List operations lack pagination controls | medium | -3 |
| D2 | Responses return full records without field selection | low | -2 |
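
For example, a list tool that passes both checks might declare pagination and field-selection parameters like this (hypothetical tool; the parameter names are illustrative):

```ts
// D1: pagination controls; D2: caller chooses fields instead of full records.
const listCustomersSchema = {
  type: "object",
  properties: {
    limit: { type: "integer", minimum: 1, maximum: 100, default: 25 },
    cursor: { type: "string", description: "Opaque cursor from the previous page" },
    fields: {
      type: "array",
      items: { type: "string", enum: ["id", "name", "createdAt"] },
      description: "Fields to return; defaults to id only",
    },
  },
  additionalProperties: false,
};
```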

Maintenance & Trust

5 of 100 pts

Is the server maintained by the service vendor or a community contributor? Is it actively updated? Does it have a changelog?

Vendor-maintained servers signal long-term reliability: the company behind the service has an interest in keeping the MCP server working and secure. Community-maintained servers may be abandoned. This is the least differentiating category because great community servers exist and vendor servers can still have technical issues, but it matters for enterprise procurement decisions.

| ID | Check | Severity | Score |
| --- | --- | --- | --- |
| M1 | Community-maintained without vendor backing | low | -2 |

Grade Thresholds

Scores are out of 100 and map directly to letter grades.

| Grade | Score | Assessment |
| --- | --- | --- |
| A | 85 - 100 | Enterprise-ready. Supports per-user OAuth, strong schemas, granular permissions. Deploy with standard review. |
| B | 70 - 84 | Good baseline. Has HTTP transport or strong schemas but gaps in one area. Deployable with platform-level controls. |
| C | 55 - 69 | Significant gaps. Likely stdio-only with weak schemas. Needs compensating controls and restricted scope. |
| D | 40 - 54 | Major concerns. Shared credentials, missing schemas, mixed permissions. Requires substantial remediation. |
| F | 0 - 39 | Critical issues. Not ready for enterprise deployment without a complete security wrapper. |
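
The mapping is a straightforward threshold lookup; a minimal sketch using the ranges above:

```ts
// Map a 0-100 score to a letter grade using the thresholds in the table.
function toGrade(score: number): "A" | "B" | "C" | "D" | "F" {
  if (score >= 85) return "A";
  if (score >= 70) return "B";
  if (score >= 55) return "C";
  if (score >= 40) return "D";
  return "F";
}
```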

Audit Process

1. Source Code Enumeration

We clone each server's GitHub repository and extract tool definitions from the source code. This identifies tool names, descriptions, input schemas, and transport capabilities without requiring API keys or running the server.

2. Transport & Auth Detection

We scan the source for HTTP/SSE transport implementations (SSEServerTransport, StreamableHTTPServerTransport) and MCP OAuth spec support (.well-known/oauth-authorization-server). This determines whether the server can support per-user authentication.
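
Conceptually, this step is a source-wide marker search; a simplified sketch (the marker strings come from this section, the scanning code itself is illustrative):

```ts
import { readFileSync } from "node:fs";

// Markers named above; their presence in source suggests the capability.
const MARKERS = {
  httpSse: ["SSEServerTransport", "StreamableHTTPServerTransport"],
  oauth: [".well-known/oauth-authorization-server"],
};

function detectCapabilities(sourceFiles: string[]) {
  const source = sourceFiles.map((f) => readFileSync(f, "utf8")).join("\n");
  return {
    httpSse: MARKERS.httpSse.some((m) => source.includes(m)),
    oauth: MARKERS.oauth.some((m) => source.includes(m)),
  };
}
```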

3. Heuristic Scoring

Each tool's schema is analyzed for completeness (types, required fields, constraints); its risk level is classified as read, write, or admin; descriptions are checked for quality and injection patterns; and permission granularity is evaluated across the tool set.
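
Risk classification, for instance, can be approximated from tool names; a hedged sketch (the keyword lists are illustrative, and real classification also inspects the schemas themselves):

```ts
type Risk = "read" | "write" | "admin";

// Illustrative keyword heuristics for classifying a tool's risk level.
function classifyRisk(toolName: string): Risk {
  const name = toolName.toLowerCase();
  if (/(delete|purge|drop|admin|grant)/.test(name)) return "admin";
  if (/(create|update|write|set|send|execute)/.test(name)) return "write";
  return "read";
}

classifyRisk("notes_list");   // "read"
classifyRisk("notes_update"); // "write"
classifyRisk("notes_purge");  // "admin"
```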

Disputes & Rescans

If you maintain an MCP server and believe a finding is incorrect or your server has been updated, you can request a rescan. Email security@completeflow.ai with the server name and specific finding(s).