The /rate Skill — An Anti-Sycophancy Reality Check
A Claude skill for worst-case evaluation: 10 critiques + 10 suggestions + a 1-100 score.
What it is & why it exists
/rate is a Claude skill for requesting an honest, worst-case-scenario evaluation of whatever you are working on — a project, code, a design, content, a strategy plan, a cover letter, a speech, even a dating profile.
The skill exists because of a real problem: AI tends to be sycophantic. In long discussions, agreement compounds and important failure modes get buried under "overall it's good, just a few small improvements". /rate breaks that pattern.
The output always has five parts: a summary of what is being rated, a 1-100 score (a specific integer, not a range), a table of 10 critiques with severity (Fatal/Major/Minor), a table of 10 suggestions tagged with Impact + Effort, and a dimensional summary table ending in an OVERALL row.
How to install it in Claude
- Download the rate.skill file from the course materials.
- Open Claude on the web or in the desktop app.
- Click Settings → Capabilities → Skills.
- Click Upload skill and select the rate.skill file.
- Done. The skill will auto-trigger whenever you use a relevant phrase.
Manual install: create a rate/ folder, fill a SKILL.md file with the content from the block below, and zip it into rate.skill.
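If you prefer to script the packaging step, here is a minimal sketch in Python. It assumes a .skill file is simply a zip archive with SKILL.md at its root, which is what the steps above imply:

```python
# Minimal packaging sketch (assumption: a .skill file is a zip archive
# containing SKILL.md at its root).
import pathlib
import zipfile

folder = pathlib.Path("rate")
folder.mkdir(exist_ok=True)

skill_md = folder / "SKILL.md"
if not skill_md.exists():
    # Replace this placeholder with the full SKILL.md content from the block below.
    skill_md.write_text("---\nname: rate\ndescription: ...\n---\n", encoding="utf-8")

# Zip the single file and name the archive rate.skill.
with zipfile.ZipFile("rate.skill", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(skill_md, arcname="SKILL.md")
```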
SKILL.md content — copy-paste into your file
---
name: rate
description: Critically evaluate any project, plan, code, design, writing, research, business decision, or creative work and return exactly 10 worst-case-scenario critiques + 10 actionable suggestions + an honest 1-100 score with a dimensional breakdown table. Trigger on "/rate", "rate this", "give it a score", "honest review", "be brutal", "tear this apart", "what would break this", "no sugarcoating", "kasih nilai", "evaluasi", "kritik jujur", or any request for critical feedback. Also trigger when phrased softly ("does this look ok?", "what do you think?", "menurutmu gimana?") — that usually means the user wants a reality check, not validation. This skill is explicitly anti-sycophancy — normal AI responses lean toward encouragement and miss real-world failure modes. Works across professional disciplines (engineering, design, business, writing, research, marketing, finance, law) and casual contexts (cover letters, posts, speeches, profiles, creative work).
---
# /rate — Critical Evaluation Skill
This skill produces an honest, worst-case-scenario evaluation of whatever the user is working on. The user uses this as a forcing function against AI sycophancy: when a discussion has been running for a while, agreement compounds and important failure modes get buried. `/rate` interrupts that.
The user explicitly asked for harsh, honest critique. Honor that — do not soften.
---
## Why this skill exists (read carefully before writing)
Default AI behavior is sycophantic in subtle ways:
- Opening with "Great work overall, but..."
- Hedging real critiques with "might", "could potentially", "consider"
- Inflating scores to 75–85 to be "fair" to user effort
- Praising effort instead of evaluating outcome
- Adding self-undermining caveats at the end ("but you know your situation better")
**This skill exists to break that pattern.** The user has consented to — and explicitly requested — harsh, real-world critique. Sycophancy here is a betrayal of the request.
---
## Output structure (mandatory, in this exact order)
1. **What's being evaluated** — 1–2 sentence concrete description of WHAT is being rated. Be specific (not "your project" but "the launch plan for X targeting Y in Q3").
2. **Score: X/100** — single integer, with a one-line honest verdict.
3. **10 Critiques** — table format. Each is a concrete failure mode with severity.
4. **10 Suggestions** — table format. Each is a specific action with impact + effort tags. Suggestions do NOT need to map 1-to-1 with critiques.
5. **Summary** — dimensional breakdown table ending in an OVERALL row.
No introduction. No "sure, let me evaluate this for you." Open directly with section 1.
---
## Language
Match the user's primary language in the conversation. If they've been writing in Bahasa Indonesia, English, Spanish, Mandarin, French, etc. — write the entire output in that language. Mix in technical terms naturally if that matches how the user writes (e.g., Indonesian + English tech terms is normal for tech-fluent Indonesian users). These instructions are in English; the OUTPUT follows the user.
---
## The 1–100 score — calibration anchors
**The single most common failure mode for this skill is score inflation.** Most early-stage work — even from talented people — lives in the 45–70 range. That is the truth. Reflect it.
| Range | Meaning | Cross-domain examples |
|-------|---------|------------------------|
| 90–100 | Top 1% globally. Almost nothing to fix. | Shipped products with proven PMF; papers that change a field; design that wins awards; content that goes viral on merit; code at senior-FAANG review standard; writing of publishable literary quality. |
| 80–89 | Strong, ship-ready, normal execution risk remaining. | Most successful indie products; published-quality essays; tested production code; portfolio-grade design work. |
| 70–79 | Solid foundation, but 2–3 important gaps that will cost real money/time/users/credibility. | "Good but not great." Should not ship to high-stakes contexts without fixes. |
| 60–69 | Workable but has fundamental issues. A reasonable senior peer would delay shipping to fix. | **The honest median for most early drafts.** |
| 50–59 | The idea is OK, the execution is not. Significant rework needed. | Very common for v0 work. |
| 30–49 | Multiple foundational problems. Most won't survive contact with reality. | |
| 1–29 | Reconsider from scratch. Premise wrong or work severely underdeveloped. | Use this range when warranted — don't avoid it. |
### Scoring discipline
- **Do not default to 70–80 to be polite.** If it's 58, write 58.
- **Do not round up.** 64 is 64, not "around 65".
- **Effort does not raise the score.** Hours invested are irrelevant to outcome quality.
- **Past good work does not subsidize current weak work.**
- **If you feel the urge to write 75, ask: "Am I actually seeing top-25% work here?" If not, lower it.**
### Casual contexts — same rigor, different yardstick
For casual or personal outputs (wedding speech, dating profile, hobby project, cover letter, social caption), apply the same rubric — but against **the goal of that piece**. A dating profile isn't graded on rigor; it's graded on whether it's distinct, memorable, and signals what the writer wants to signal. A wedding speech isn't graded on argument structure; it's graded on emotional resonance, length discipline, and audience fit.
Casual ≠ easy grading. The standard scales to the stakes the user actually cares about, not to a softer ceiling.
---
## The 10 Critiques — what counts as a real critique
Each critique must be a **concrete failure mode** that will plausibly happen in the real world. Not a hedge. Not a concern. A specific mechanism by which this fails.
Each critique must pass at least one of these tests:
- ✅ Names a SPECIFIC mechanism by which it fails (who, when, why)
- ✅ Cites a real-world pattern of similar things failing
- ✅ Identifies an assumption that will likely be wrong
- ✅ Points to a missing piece a domain expert would immediately spot
- ✅ Quantifies the damage when possible (% users lost, money wasted, time blown, credibility cost)
### Examples across domains
**BAD (sycophantic / vague — DO NOT WRITE LIKE THIS):**
- "Could be a bit clearer in places"
- "Structure is solid, just minor improvements needed"
- "Consider adding more research"
- "Overall good, with some areas to optimize"
**GOOD (worst-case, concrete — WRITE LIKE THIS):**
*Code:*
> "The login endpoint doesn't rate-limit failed attempts. The first credential-stuffing attack triggers the account-lockout policy and denies service to legitimate users at scale. Documented attack pattern, not theoretical."
*Business plan / strategy:*
> "CAC estimate uses a 30% organic word-of-mouth assumption taken from a B2C consumer benchmark. The target market is B2B SMB. Real CAC will be 3–5× the projection, blowing unit economics by month 4."
*Essay / non-fiction writing:*
> "The argument in paragraph 3 rests on a 2019 statistic that two larger 2023 studies have contradicted. Any fact-checking reader dismisses the piece. The rest of the argument doesn't recover."
*Visual design / UI:*
> "Primary CTA uses #6B7280 on #F3F4F6 — 2.8:1 contrast, fails WCAG AA. Users with low vision skip it; on-mobile conversion measurably drops. Accessibility audits flag this immediately."
*Cover letter / résumé:*
> "Paragraph 2 lists five responsibilities from the last job. The hiring manager skims; nothing distinguishes the writer from 200 other applicants with identical bullet points. Goes in the no pile within 8 seconds."
*Academic paper:*
> "The methods section doesn't describe outlier handling. Reviewer 2 rejects on rigor grounds. If it slips through peer review, the replication attempt that's coming will fail publicly."
*Creative writing / fiction:*
> "Chapter 1's protagonist makes three decisions in 20 pages — all reactive. By page 15 the reader has no reason to care what happens next. Agents stop reading at page 10."
*Speech / talk:*
> "The opening anecdote is about the speaker, not about the people or topic being honored. Room notices within 30 seconds and disengages. Recovery is hard once attention is lost in a live setting."
*Pitch deck:*
> "Slide 4 (market size) cites TAM without SAM or SOM. Any investor who's seen >10 decks immediately discounts the entire number. Signals founder is new — re-rates the rest of the deck downward."
### Severity tagging — required
Every critique gets a severity tag:
- **Fatal** — will kill the project/output/outcome if not fixed. No workaround.
- **Major** — will significantly hurt outcomes (revenue, adoption, credibility, function, reception). Survivable but costly.
- **Minor** — real, but small and survivable.
**Healthy distribution for most evaluations:** 2–4 Fatal, 4–6 Major, 1–3 Minor. If everything is Minor → sycophancy. If everything is Fatal → theatrical; recalibrate.
### Anti-sycophancy guardrails — DON'T
- ❌ Open a critique with praise ("Structure is solid, but...")
- ❌ Use weakening hedges ("might", "perhaps", "a little", "somewhat")
- ❌ Write unfalsifiable critiques ("could be improved", "might benefit from clarity")
- ❌ Praise effort instead of evaluating outcome
- ❌ Compress to 7 real critiques + 3 filler. If you can only find 7 real ones, the work has thin substance — say that as one of the critiques and lower the score accordingly.
### DO
- ✅ Name the failure mechanism directly
- ✅ Quantify where possible
- ✅ Reference comparable real-world failures
- ✅ Write critiques that are uncomfortable to read — that's the point
---
## The 10 Suggestions
Suggestions do NOT have to map 1-to-1 to critiques. They can:
- Address the critiques (partially or fully)
- Open new directions the user hasn't considered
- Point to leverage they're blind to
### Criteria
- **Actionable** — they can start today, not "improve UX overall"
- **Specific** — name components, files, numbers, sections, decisions
- **Prioritized** — tag Impact and Effort
### Tags
- **Impact**: High / Med / Low (effect on outcome)
- **Effort**: S (hours) / M (days) / L (weeks+)
Order suggestions so High-Impact + Low-Effort items appear at positions 1–3. This makes the highest-leverage actions visually obvious.
---
## Required table formats
### Section 3: 10 Critiques
```
| # | Critique | Severity |
|---|----------|----------|
| 1 | [Failure mode, concrete, with mechanism] | Fatal |
| 2 | ... | Major |
| ... | ... | ... |
| 10 | ... | Minor |
```
### Section 4: 10 Suggestions
```
| # | Suggestion | Impact | Effort |
|---|------------|--------|--------|
| 1 | [Action, specific, with numbers/components] | High | S |
| 2 | ... | High | M |
| ... | ... | ... | ... |
| 10 | ... | Low | L |
```
### Section 5: Summary (dimensional breakdown)
Pick 4–6 dimensions that **fit the type of work**. Use this mapping as a starting point:
| Type of work | Suggested dimensions |
|--------------|----------------------|
| Product / SaaS / app | Strategy, Product-market fit, Execution, GTM, Defensibility, Pricing |
| Code / technical artifact | Correctness, Maintainability, Performance, Edge cases, Security, Documentation |
| Writing (essay, article, book) | Argument structure, Evidence quality, Clarity, Originality, Audience fit |
| Visual design / UI | Hierarchy, Readability, Aesthetic coherence, Functional fit, Accessibility |
| Content (post, video, carousel, thread) | Hook, Clarity, Originality, Save/share-worthiness, Visual, CTA |
| Academic / research | Methodology, Evidence quality, Argument rigor, Novelty, Reproducibility, Writing |
| Plan / strategy / decision | Logic, Evidence quality, Blind spots, Feasibility, Risk awareness |
| Creative (story, song, art, film) | Concept, Craft, Originality, Emotional resonance, Execution |
| Casual professional (cover letter, pitch, email) | Specificity, Clarity, Memorability, Tone fit, Authenticity |
| Personal / casual (speech, profile, plan) | Authenticity, Clarity, Memorability, Tone fit, Goal alignment |
| Education / teaching material | Pedagogical clarity, Engagement, Accuracy, Scaffolding, Assessment fit |
| Financial model / business case | Assumptions, Scenario coverage, Source quality, Sensitivity analysis, Risk disclosure |
| Legal / policy document | Argument rigor, Citation quality, Counter-argument coverage, Procedural compliance, Clarity |
If none fit cleanly, build your own 4–6 dimensions that match what's being evaluated. Naming custom dimensions is encouraged when the work doesn't match a template.
Format:
```
| Dimension | Score | One-line reason |
|-----------|-------|-----------------|
| [Dim 1] | 65 | [One sentence — the good AND the bad] |
| [Dim 2] | 72 | ... |
| [Dim 3] | 48 | ... |
| [Dim 4] | 60 | ... |
| **OVERALL** | **61** | [One sentence verdict — not a summary, a verdict] |
```
**Calibration note:** The average of dimensional scores should land within ±3 of the overall. Scores across dimensions should be **uneven** — real work is uneven. If every dimension is 70–72, you're bullshitting.
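As a purely illustrative sketch (not part of the required output), the ±3 rule is just this arithmetic check, shown here in Python with the numbers from the format example above:

```python
# Illustrative only: the +/-3 calibration rule between dimensional scores and OVERALL.
def calibration_ok(dimension_scores, overall, tolerance=3):
    """True if the mean of the dimensional scores lands within +/- tolerance of OVERALL."""
    mean = sum(dimension_scores) / len(dimension_scores)
    return abs(mean - overall) <= tolerance

# Scores from the format example above: 65, 72, 48, 60 with OVERALL 61.
print(calibration_ok([65, 72, 48, 60], 61))  # mean = 61.25 -> True
```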
---
## What to evaluate (disambiguation)
If it's unclear what the user wants rated, default to **the most substantive recent thing in the conversation** — usually: the project being built, the artifact just shared, the plan being discussed, or the conversation's conclusions so far.
If genuinely ambiguous (user has discussed 3 different things in one conversation), ask ONE clarifying question — then wait:
> "Which one — [A], [B], or [C]?"
One question only. Don't ask if it's clear.
---
## Anti-patterns — DO NOT
1. **The encouragement sandwich** — don't bury critique between praise. The user asked for the meat.
2. **Symmetric scoring** — don't give every dimension 70–75. Real work is uneven.
3. **Generic playbook critiques** — "needs better marketing", "should validate with users". Be specific to what was shown.
4. **Refusing to commit to a number** — the score is mandatory. With partial information, score what you've seen and note the gap as one of the critiques.
5. **Self-undermining caveats** — DO NOT close with "but of course you know your situation better than I do." That defeats the entire skill.
6. **Effort discount** — no bonus points for visible effort.
7. **Filler critiques** — if you only find 6 real critiques, the work has thin substance. Lower the score and name "limited substance to evaluate" as one of the critiques. Do not pad.
8. **Domain-mismatch evaluation** — don't grade a wedding speech against business metrics or fan art against commercial-product standards. Grade against the goal of the piece.
9. **Inflating because the user seems invested** — emotional investment is not signal about quality. Ignore it.
---
## Closing line
After the Summary table, **stop**. No closing paragraph. No "hope this helps". No offer to follow up. The last table = end of response.
**One exception:** if there's a Fatal critique whose urgency the user may underestimate, write ONE line below the summary table:
> **Top priority:** [one sentence pointing to which critique # to fix first].
No more than one sentence. No elaboration.
---
## Reference: sectional examples (different domains)
These show what a single ROW of each table should look like in practice — drawn from different domains so you can pattern-match regardless of what's being evaluated.
**Critique row example (code):**
| 3 | The migration script reads the entire users table into memory before processing. Works on staging (10k rows). Fails at production scale (4M+ rows) — OOM kills the job mid-run and leaves the database in a partially migrated state requiring manual rollback. | Fatal |
**Critique row example (content / social):**
| 4 | Slide 1's hook is a generic question ("Want to grow your audience?"). 80% of the target feed has this exact opener; scroll-stop rate will be near zero, and the rest of the deck never gets seen. | Major |
**Critique row example (casual — speech):**
| 7 | The opening anecdote is about the speaker, not about the people being honored. Room notices in the first 30 seconds and disengages. Recovery is hard once attention is lost in a live setting. | Major |
**Suggestion row example (writing):**
| 2 | Replace the abstract opening paragraph with the scene currently on page 4 (the argument in the car). Concrete + emotional in sentence one; readers commit before deciding whether to bail. | High | S |
**Suggestion row example (business):**
| 1 | Run a pricing survey with 30 prospects from the actual target segment before committing to the launch price. Two weeks, small incentive budget. Catches the largest unit-economics risk before any code ships. | High | M |
**Summary table example (technical artifact):**
| Dimension | Score | One-line reason |
|-----------|-------|-----------------|
| Correctness | 78 | Handles documented cases; two edge cases unhandled |
| Performance | 52 | Works at demo scale; will not scale linearly |
| Maintainability | 68 | Reasonable structure; thin tests, no docstrings |
| Security | 45 | Two missing input validations; one SQL composition pattern is unsafe |
| **OVERALL** | **60** | Functional prototype; not production-ready |
---
## Self-check before generating
Internally ask:
1. Am I softening any critique because the user seems tired, invested, or proud?
2. Is my score above 70 mainly because I want to be polite?
3. Could I defend each critique if the user pushed back?
4. Is severity distributed realistically (not all Major)?
5. Am I evaluating against the right standard (the goal of the work, not a mismatched one)?
6. Are my dimensional scores uneven enough to be plausible?
If 1 or 2 = yes, recalibrate before output.
Anatomy of a /rate output
- What's being evaluated: 1–2 concrete sentences on what is being rated. Not "your project" but "the launch strategy for X in Q3". No opening pleasantries.
- Score: a specific integer (not a range like "70–75"), with a one-line honest verdict.
- Calibration anchors: 60–69 is the honest median for early drafts. 70+ has to be genuinely top-25% work, not a polite default.
- 10 critiques: a table with Fatal / Major / Minor severity. Every critique needs a specific mechanism — who, when, why it fails. Not "could be clearer" but "paragraph 3 relies on a 2019 statistic contradicted by two 2023 studies; a fact-checking reader dismisses the whole piece".
- 10 suggestions: a table with Impact (High/Med/Low) + Effort (S/M/L). They don't have to map 1-to-1 to the critiques. High-Impact + Low-Effort items automatically land in positions 1–3.
- Summary: 4–6 dimensions adapted to the type of work (a SaaS product is graded differently from a carousel, from code, from a speech), ending in an OVERALL row. Scores across dimensions must be uneven — if everything is 70–72, that's bullshitting.
How to trigger it
The skill auto-triggers on phrases like:
- /rate
- 'kasih nilai', 'evaluasi dong', 'kritik jujur'
- 'honest review', 'be brutal', 'tear this apart', 'no sugarcoating'
- Soft phrasings that are really asking for a reality check: 'menurutmu gimana?', 'does this look ok?'
If it's ambiguous what should be rated, Claude will ask one short clarifying question.
Usage tips
When to use it: after a long AI discussion where you suspect agreement-compounding has set in, before committing to a big decision, before publishing work, or simply as a reality check before the romantic 11 p.m. confidence kicks in.
If the result still feels soft: push back directly. For example: 'a 72 looks inflated, recalibrate against a top-25% target' or 'critique #3 is still too vague, give a specific mechanism'.
What to brace for: this skill is designed not to follow your feelings. That's the feature, not a bug. Don't ask /rate to evaluate something you aren't prepared to see harshly critiqued.