How We Score MCP Servers
Our enterprise readiness framework evaluates MCP servers across 6 categories weighted by what actually matters when deploying AI agents in production. Authentication is weighted highest because it's the #1 blocker for multi-tenant enterprise deployment.
Overview
Each MCP server is scored out of 100 points across 6 categories. The categories are weighted by their importance to enterprise platform operators: authentication and schema quality carry the most weight (25 pts each) because they determine whether the server can be deployed securely for multiple users with proper guardrails.
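The weighting scheme above can be sketched as a simple aggregation; category keys and the clamping helper here are illustrative, not the scoring engine's actual names:

```python
# Category weights from the rubric; each category starts at its
# maximum and per-check deductions are subtracted from it.
CATEGORY_WEIGHTS = {
    "authentication": 25,
    "schema_quality": 25,
    "permission_granularity": 20,
    "llm_safety": 15,
    "data_exposure": 10,
    "maintenance_trust": 5,
}

def total_score(category_scores: dict[str, int]) -> int:
    """Sum per-category scores, clamping each to [0, category weight]."""
    return sum(
        max(0, min(score, CATEGORY_WEIGHTS[name]))
        for name, score in category_scores.items()
    )

# The six weights sum to the 100-point total.
assert sum(CATEGORY_WEIGHTS.values()) == 100
```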
Our analysis combines automated source code inspection with heuristic evaluation of tool definitions, input schemas, and transport capabilities. We clone each server's repository, extract tool definitions, and analyze them against our scoring rubric. We check for HTTP/SSE transport support, MCP OAuth spec implementation, schema completeness, permission separation, and description quality.
Authentication & Identity
25/100 pts
How does the server handle credentials? Can a platform pass per-user tokens on each request, or does it read a single credential at startup? What are the operational implications for multi-tenant deployment?
This category scores how well the server supports multi-tenant deployment. Deductions stack: a stdio-only server with env var credentials and no OAuth hits all three checks for 0/25. An HTTP/SSE server without OAuth scores 18/25. A server implementing the full MCP OAuth spec scores 25/25. Authentication is weighted highest because it determines whether a platform can deploy one shared instance or must spawn one process per user.
| ID | Check | Severity | Score |
|---|---|---|---|
| A1 | Stdio-only transport: platform cannot pass per-request tokens | critical | -11 |
| A2 | No MCP OAuth spec: platform must handle token management externally | high | -7 |
| A3 | Env var credentials: no rotation without restart | high | -7 |
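The stacking deductions for checks A1-A3 can be modeled as follows; this is a simplified sketch of the rubric, and the function and flag names are our own:

```python
def auth_score(stdio_only: bool, has_oauth: bool, env_var_creds: bool) -> int:
    """Score the Authentication & Identity category (max 25).

    Deductions stack, mirroring checks A1-A3:
      A1 stdio-only transport        -11 (critical)
      A2 no MCP OAuth spec support    -7 (high)
      A3 env var credentials          -7 (high)
    """
    score = 25
    if stdio_only:
        score -= 11      # A1: platform cannot pass per-request tokens
    if not has_oauth:
        score -= 7       # A2: token management handled externally
    if env_var_creds:
        score -= 7       # A3: no rotation without restart
    return max(score, 0)

# Worked examples from the text:
assert auth_score(stdio_only=True,  has_oauth=False, env_var_creds=True) == 0   # all three checks
assert auth_score(stdio_only=False, has_oauth=False, env_var_creds=False) == 18 # HTTP/SSE, no OAuth
assert auth_score(stdio_only=False, has_oauth=True,  env_var_creds=False) == 25 # full MCP OAuth spec
```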
Tool Schema Quality
25/100 pts
Do all tool parameters have explicit JSON Schema types? Are required fields declared on write operations? Are there constraints like maxLength or enum on dangerous parameters? Are there raw shell/eval/SQL surfaces?
Tool schemas are the primary guardrail in an MCP deployment. They constrain what an LLM can send to a tool before it executes. Without types, required fields, and constraints, a prompt injection attack can pass arbitrary data to write, delete, or execute operations. Schema quality is the difference between a tool that's safe to expose to an AI agent and one that's a liability.
| ID | Check | Severity | Score |
|---|---|---|---|
| S1 | Tools have no input schema defined | high | -2 |
| S2 | Required fields missing on write/delete operations | high | -2 |
| S3 | Parameters lack constraints (maxLength, enum, pattern) | medium | -4 |
| S4 | Raw shell/eval/SQL execution surface | critical | -5 |
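To illustrate what checks S1-S3 look for, here is a well-constrained input schema for a hypothetical delete tool (the tool and field names are ours), alongside a simplified version of the S2 check:

```python
# A delete tool whose schema constrains what the LLM can send (S1-S3 pass).
delete_record = {
    "name": "delete_record",
    "inputSchema": {
        "type": "object",
        "properties": {
            "record_id": {"type": "string", "maxLength": 64},            # S3: length constraint
            "table": {"type": "string", "enum": ["drafts", "archive"]},  # S3: enum whitelist
        },
        "required": ["record_id", "table"],                              # S2: required fields
    },
}

def missing_required_on_write(tool: dict) -> bool:
    """Check S2: flag write/delete tools that declare no required fields."""
    is_write = any(verb in tool["name"] for verb in ("create", "update", "delete"))
    schema = tool.get("inputSchema") or {}
    return is_write and not schema.get("required")

assert not missing_required_on_write(delete_record)
```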
Permission Granularity
20/100 pts
Are read and write tools separate? Can destructive operations (delete, admin) be independently disabled? Do tool descriptions specify what resources they can access?
Least privilege is critical when an AI agent decides which tools to call. If read and write operations are bundled together, the platform cannot give an agent read-only access. If destructive operations (delete, purge, admin) are mixed with regular CRUD, they cannot be independently disabled. Granular tool separation lets enterprises expose only the capabilities each agent role needs.
| ID | Check | Severity | Score |
|---|---|---|---|
| P1 | Read and write operations combined in single tool | medium | -4 |
| P2 | Destructive operations not isolated from regular tools | high | -5 |
| P3 | Tool descriptions lack scope boundaries | low | -3 |
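One way to see why separation matters: when each tool carries its own risk classification, a platform can derive a read-only tool set per agent role, but only if read, write, and destructive operations are not bundled into single tools. The registry and role taxonomy below are illustrative:

```python
# Hypothetical tool registry with a per-tool risk classification.
TOOLS = {
    "list_issues":  "read",
    "get_issue":    "read",
    "create_issue": "write",
    "delete_issue": "destructive",
}

def allowed_tools(role: str) -> set[str]:
    """Derive the tool set an agent role may see, under least privilege."""
    permitted = {
        "viewer": {"read"},
        "editor": {"read", "write"},
        "admin":  {"read", "write", "destructive"},
    }[role]
    return {name for name, risk in TOOLS.items() if risk in permitted}

assert allowed_tools("viewer") == {"list_issues", "get_issue"}
assert "delete_issue" not in allowed_tools("editor")
```

If `list_issues` and `delete_issue` were a single combined tool (check P1), no such filtering would be possible.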
LLM Safety
15/100 pts
Are all tool descriptions present, specific, and free of manipulative language? Could any description cause the LLM to select the wrong tool or follow hidden instructions? Are there overlapping descriptions that create ambiguity?
LLMs select tools based on their descriptions. A vague or missing description means the LLM is guessing which tool to call, potentially triggering a destructive operation when a read was intended. Hidden instructions in descriptions can hijack the LLM's behavior in multi-server deployments. Overlapping descriptions create ambiguity that leads to wrong tool selection. This combines what used to be separate Description Quality and Injection Resistance categories.
| ID | Check | Severity | Score |
|---|---|---|---|
| L1 | Tool missing description entirely | high | -1 |
| L2 | Description too vague (under 20 characters) | medium | -1 |
| L3 | Description contains hidden instructions or prompt injection | critical | -5 |
| L4 | Overlapping descriptions between tools | medium | -2 |
Data Exposure
10/100 pts
Do list operations have pagination? Do tool responses return minimal fields or entire records? Could error messages leak credentials or internal paths?
Tool responses flow back into LLM context windows, which may be logged, cached, or shared across users. A list operation without pagination could dump an entire customer table into context. Returning full records with PII when only an ID was needed is a data leak. Error messages that include credentials or internal paths are exploitable. The platform can mitigate some of this with response filtering, but the server should minimize exposure at the source.
| ID | Check | Severity | Score |
|---|---|---|---|
| D1 | List operations lack pagination controls | medium | -3 |
| D2 | Responses return full records without field selection | low | -2 |
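Check D1 can be approximated by looking for pagination parameters in list-tool schemas; the parameter names below are common conventions, not an exhaustive list:

```python
PAGINATION_PARAMS = {"limit", "page", "page_size", "cursor", "offset"}

def lacks_pagination(tool: dict) -> bool:
    """Check D1: flag list/search tools whose schema has no pagination control."""
    if not tool["name"].startswith(("list_", "search_")):
        return False
    props = (tool.get("inputSchema") or {}).get("properties", {})
    return not (PAGINATION_PARAMS & props.keys())

unbounded = {"name": "list_customers",
             "inputSchema": {"type": "object", "properties": {}}}
bounded = {"name": "list_customers",
           "inputSchema": {"type": "object",
                           "properties": {"limit": {"type": "integer", "maximum": 100}}}}

assert lacks_pagination(unbounded)       # could dump an entire table into context
assert not lacks_pagination(bounded)
```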
Maintenance & Trust
5/100 pts
Is the server maintained by the service vendor or a community contributor? Is it actively updated? Does it have a changelog?
Vendor-maintained servers signal long-term reliability: the company behind the service has an interest in keeping the MCP server working and secure. Community-maintained servers may be abandoned. This is the least differentiating category because great community servers exist and vendor servers can still have technical issues, but it matters for enterprise procurement decisions.
| ID | Check | Severity | Score |
|---|---|---|---|
| M1 | Community-maintained without vendor backing | low | -2 |
Grade Thresholds
Scores are out of 100 and map directly to letter grades.
| Grade | Assessment |
|---|---|
| A | Enterprise-ready. Supports per-user OAuth, strong schemas, granular permissions. Deploy with standard review. |
| B | Good baseline. Has HTTP transport or strong schemas but gaps in one area. Deployable with platform-level controls. |
| C | Significant gaps. Likely stdio-only with weak schemas. Needs compensating controls and restricted scope. |
| D | Major concerns. Shared credentials, missing schemas, mixed permissions. Requires substantial remediation. |
| F | Critical issues. Not ready for enterprise deployment without a complete security wrapper. |
Audit Process
1. Source Code Enumeration
We clone each server's GitHub repository and extract tool definitions from the source code. This identifies tool names, descriptions, input schemas, and transport capabilities without requiring API keys or running the server.
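A simplified version of this extraction step, sketched as a regex scan over TypeScript-SDK-style `server.tool(...)` registrations. Real tool definitions vary widely across SDKs and languages; this only illustrates the static approach, and the sample source is invented:

```python
import re

TS_SOURCE = '''
server.tool("list_issues", "List issues in a repository", schema, handler);
server.tool("delete_issue", "Delete an issue permanently", schema, handler);
'''

# Match server.tool("name", "description", ...) registrations.
TOOL_PATTERN = re.compile(r'server\.tool\(\s*"([^"]+)"\s*,\s*"([^"]+)"')

def extract_tools(source: str) -> list[tuple[str, str]]:
    """Statically extract (name, description) pairs without running the server."""
    return TOOL_PATTERN.findall(source)

assert extract_tools(TS_SOURCE) == [
    ("list_issues", "List issues in a repository"),
    ("delete_issue", "Delete an issue permanently"),
]
```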
2. Transport & Auth Detection
We scan the source for HTTP/SSE transport implementations (SSEServerTransport, StreamableHTTPServerTransport) and MCP OAuth spec support (.well-known/oauth-authorization-server). This determines whether the server can support per-user authentication.
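This detection step amounts to a pattern scan over the codebase; a minimal sketch using the marker strings named above (the sample sources are invented):

```python
TRANSPORT_MARKERS = ("SSEServerTransport", "StreamableHTTPServerTransport")
OAUTH_MARKER = ".well-known/oauth-authorization-server"

def detect_capabilities(source: str) -> dict[str, bool]:
    """Detect HTTP/SSE transport and MCP OAuth spec support from source text."""
    return {
        "http_transport": any(m in source for m in TRANSPORT_MARKERS),
        "oauth": OAUTH_MARKER in source,
    }

stdio_only_src = 'const transport = new StdioServerTransport();'
http_src = ('app.get("/.well-known/oauth-authorization-server", handler);\n'
            'const transport = new StreamableHTTPServerTransport();')

assert detect_capabilities(stdio_only_src) == {"http_transport": False, "oauth": False}
assert detect_capabilities(http_src) == {"http_transport": True, "oauth": True}
```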
3. Heuristic Scoring
We analyze each tool's schema for completeness (types, required fields, constraints), classify its risk level (read/write/admin), check its description for quality and injection patterns, and evaluate permissions for granularity.
Disputes & Rescans
If you maintain an MCP server and believe a finding is incorrect or your server has been updated, you can request a rescan. Email security@completeflow.ai with the server name and specific finding(s).