1) High‑level probability model
Definitions
- F: Task is feasible with current tech + data (binary)
- Q: Input quality (docs, data, APIs) on 0–1 scale
- C: Clarity/constraint of the task spec on 0–1 scale
- H: Human‑in‑the‑loop availability (0 none → 1 always)
- S: Success (task completed to spec within guardrails)
Top‑line equation
P(S) = P(S\,|\,F)\,P(F) + P(S\,|\,\neg F)\,[1 - P(F)]
We aim to maximize P(F) at intake and keep P(S|¬F) ≈ 0 via guardrails (decline or re‑scope), so in practice P(S) ≈ P(S|F)·P(F).
Success conditional on feasibility
P(S\,|\,F) \approx w_Q Q + w_C C + w_H H - w_R R
Where R is risk/ambiguity (0–1). We tune weights by domain; defaults: w_Q=0.35, w_C=0.35, w_H=0.20, w_R=0.10.
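The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the clipping to [0, 1] are assumptions added here. Clipping matters because with the default weights the linear score ranges from −0.10 to 0.90.

```python
# Default weights from the text; R enters with a negative sign.
DEFAULT_WEIGHTS = {"Q": 0.35, "C": 0.35, "H": 0.20, "R": 0.10}


def p_success_given_feasible(q, c, h, r, weights=DEFAULT_WEIGHTS):
    """Linear score w_Q*Q + w_C*C + w_H*H - w_R*R for P(S | F)."""
    score = (
        weights["Q"] * q
        + weights["C"] * c
        + weights["H"] * h
        - weights["R"] * r
    )
    # Assumption: clip to [0, 1], since the linear form is not a
    # probability by construction.
    return min(1.0, max(0.0, score))


def p_success(p_feasible, p_s_given_f, p_s_given_not_f=0.0):
    """Law of total probability over F and not-F.

    p_s_given_not_f defaults to 0, matching the guardrail goal in the text.
    """
    return p_s_given_f * p_feasible + p_s_given_not_f * (1.0 - p_feasible)
```

Passing a different `weights` dictionary corresponds to the per-domain tuning mentioned above.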
Feasibility prior (Bayes)
P(F\,|\,E) = \dfrac{P(E\,|\,F)\,P(F)}{P(E\,|\,F)\,P(F) + P(E\,|\,\neg F)\,[1 - P(F)]}
Evidence E updates the intake prior P(F) to the posterior P(F | E). It includes: historical wins in domain, API coverage, benchmark tasks, and prototype spikes.
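As a sketch of the Bayes update above, the function below takes a prior P(F) and the two likelihoods P(E | F) and P(E | ¬F). The function name and the example numbers are illustrative assumptions, not figures from any real engagement.

```python
def feasibility_posterior(prior, p_e_given_f, p_e_given_not_f):
    """Return P(F | E) from P(F), P(E | F), and P(E | not F)."""
    numerator = p_e_given_f * prior
    denominator = numerator + p_e_given_not_f * (1.0 - prior)
    if denominator == 0.0:
        # The evidence is impossible under both hypotheses, so no update exists.
        raise ValueError("evidence has zero probability under F and not-F")
    return numerator / denominator


# Illustrative numbers only: a prototype spike that succeeds 80% of the
# time when the task is feasible and 20% of the time when it is not.
posterior = feasibility_posterior(prior=0.6, p_e_given_f=0.8, p_e_given_not_f=0.2)
print(round(posterior, 3))  # 0.48 / 0.56 ≈ 0.857
```

Evidence that is equally likely under F and ¬F leaves the prior unchanged, which is a quick sanity check on any likelihood estimates.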
2) Decision policy (accept vs. re‑scope vs. decline)
Accept
Rule: \( \hat P(S) \ge T_S \) and positive expected value.
EV = p_S \cdot V - (1 - p_S) \cdot C_{fail} - C_{build} \;\;\Rightarrow\;\; EV > 0
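Here p_S is the estimate \(\hat P(S)\), V is the value of a successful task, C_{fail} is the cost of a failure, and C_{build} is the build cost (both distinct from the clarity score C).
Typical threshold T_S ∈ [0.7, 0.85] depending on criticality.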
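The accept rule can be sketched as follows. The parameter names (`value`, `cost_fail`, `cost_build`) are assumed readings of V, C_fail, and C_build, and the monetary figures in the example are made up for illustration. The break-even probability comes from setting EV = 0 and solving for p_S.

```python
def expected_value(p_s, value, cost_fail, cost_build):
    """EV = p_S * V - (1 - p_S) * C_fail - C_build."""
    return p_s * value - (1.0 - p_s) * cost_fail - cost_build


def break_even_probability(value, cost_fail, cost_build):
    """Smallest p_S with EV >= 0, from solving EV = 0 for p_S."""
    return (cost_fail + cost_build) / (value + cost_fail)


def accept(p_s, value, cost_fail, cost_build, threshold=0.7):
    """Accept only if both the threshold and the EV check pass."""
    return p_s >= threshold and expected_value(p_s, value, cost_fail, cost_build) > 0


# Illustrative numbers only (arbitrary currency units).
print(expected_value(0.75, value=100, cost_fail=40, cost_build=30))  # 35.0
print(break_even_probability(value=100, cost_fail=40, cost_build=30))  # 0.5
print(accept(0.75, value=50, cost_fail=40, cost_build=30))  # False: EV is -2.5
```

The last call shows why both conditions are needed: the estimate clears the threshold, but the payoff is too small to cover the build and failure costs.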
Re‑scope
Increase C, improve Q, add human checks (H) to raise \(\hat P(S)\).
Decline
Low P(F) or unacceptable R; we avoid moonshots without a research track.
3) Scenarios
A) Lead Intake (Pro Services)
APIs exist, data is clean, task is constrained. Handoff to calendar + CRM.
Assume: P(F)=0.9, Q=0.85, C=0.8, H=0.6, R=0.2
P(S|F) ≈ .35(.85)+.35(.8)+.20(.6)−.10(.2)= 0.2975+0.28+0.12−0.02 = 0.6775
P(S) ≈ 0.6775·0.9 + 0·0.1 = 0.6098
Re‑scope to add human review on qualification (raise H→0.8) and tighten prompts (C→0.9): then P(S|F) ≈ 0.35(.85)+0.35(.9)+0.20(.8)−.10(.2) = 0.2975+0.315+0.16−0.02 = 0.7525 → P(S) ≈ 0.677.
B) Policy Retrieval (Healthcare)
Retrieval from payer policies with strict accuracy; HITL required.
Assume: P(F)=0.8, Q=0.7, C=0.9, H=0.9, R=0.3
P(S|F) ≈ .35(.7)+.35(.9)+.20(.9)−.10(.3)= 0.245+0.315+0.18−0.03 = 0.71
P(S) ≈ 0.71·0.8 = 0.568
Tighten sources and templates (Q→0.85), keep HITL: P(S|F)≈.35(.85)+.35(.9)+.20(.9)−.10(.3)=0.2975+0.315+0.18−0.03=0.7625 → P(S)≈0.61. This is below the typical T_S range of 0.7–0.85, so accept only if EV is positive and the engagement's threshold is set at or below 0.6; otherwise split into narrower intents.
C) Generative Media Summaries
Loose inputs, subjective outputs; success relies on C & H.
Assume: P(F)=0.7, Q=0.6, C=0.75, H=0.7, R=0.35
P(S|F) ≈ .35(.6)+.35(.75)+.20(.7)−.10(.35)= 0.21+0.2625+0.14−0.035 = 0.5775
P(S) ≈ 0.5775·0.7 = 0.404
Recommendation: re‑scope to niche formats (tight C), add checklists (raise H), or decline.
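The scenario arithmetic can be reproduced with a short script. It is a sketch using the default weights and assuming P(S|¬F) = 0, as in the worked numbers; the scenario labels are shorthand introduced here.

```python
W_Q, W_C, W_H, W_R = 0.35, 0.35, 0.20, 0.10  # default weights from the text


def estimate(p_f, q, c, h, r):
    """Return (P(S|F), P(S)), assuming P(S | not F) = 0."""
    p_s_given_f = W_Q * q + W_C * c + W_H * h - W_R * r
    return p_s_given_f, p_s_given_f * p_f


# (P(F), Q, C, H, R) for each scenario and its re-scoped variant.
scenarios = {
    "A baseline": (0.9, 0.85, 0.80, 0.6, 0.20),
    "A re-scoped": (0.9, 0.85, 0.90, 0.8, 0.20),
    "B baseline": (0.8, 0.70, 0.90, 0.9, 0.30),
    "B re-scoped": (0.8, 0.85, 0.90, 0.9, 0.30),
    "C baseline": (0.7, 0.60, 0.75, 0.7, 0.35),
}

for name, params in scenarios.items():
    cond, total = estimate(*params)
    meets = total >= 0.7  # lower end of the typical threshold range
    print(f"{name:12s} P(S|F)={cond:.4f}  P(S)={total:.3f}  meets 0.7: {meets}")
```

None of the five rows reaches 0.7, which is why each scenario ends in a re-scope, a relaxed threshold, or a decline rather than a straight accept.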
4) Evaluation metrics
Task‑level outcomes
| | Feasible (F) | Not Feasible (¬F) |
| --- | --- | --- |
| Agent succeeds | True Positive | False Positive (guardrails aim ≈ 0) |
| Agent fails | False Negative (re‑scope) | True Negative (decline) |
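One way to tally the four cells from logged outcomes is sketched below. The record shape, a `(feasible, succeeded)` pair per task, and the sample data are assumptions for illustration; how feasibility is judged after the fact is left open.

```python
from collections import Counter

# Cell labels follow the table: keys are (feasible, succeeded).
LABELS = {
    (True, True): "true_positive",
    (True, False): "false_negative",
    (False, True): "false_positive",
    (False, False): "true_negative",
}


def tally(outcomes):
    """Count tasks per cell; cells with no tasks are reported as 0."""
    counts = Counter(LABELS[(bool(f), bool(s))] for f, s in outcomes)
    return {label: counts.get(label, 0) for label in LABELS.values()}


def success_rate_when_feasible(counts):
    """Observed estimate of P(S | F): TP / (TP + FN), or None with no data."""
    feasible = counts["true_positive"] + counts["false_negative"]
    return counts["true_positive"] / feasible if feasible else None


# Hypothetical log of five tasks.
sample = [(True, True), (True, True), (True, False), (False, False), (False, True)]
counts = tally(sample)
print(counts)
print(success_rate_when_feasible(counts))  # 2 of 3 feasible tasks succeeded
```

Comparing the observed rate against the modeled P(S|F) per domain is one way to check whether the weights need retuning.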
Program KPIs
- Acceptance rate (projects meeting threshold)
- Observed success rate (by domain & spec tightness)
- Median time‑to‑first‑value
- Human review effort / task
- Cost per successful task vs. baseline
5) Estimate your project’s success (interactive)
Adjust the sliders to explore how input quality (Q), clarity (C), human review (H), risk (R), and feasibility prior P(F) affect the success probability \(\hat P(S)\).
Weights
Defaults: w_Q=0.35, w_C=0.35, w_H=0.20, w_R=0.10
Results
Decision uses threshold T_S=0.7 by default. Tweak your sliders to see how scoping changes the outcome.
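The same exploration can be run in reverse: fix the threshold and solve for the lever value needed to reach it. The sketch below does this for human review H, assuming the linear model with default weights and P(S|¬F) = 0; the function name is hypothetical. A result above 1 means H alone cannot close the gap.

```python
def required_h(target, p_f, q, c, r, w_q=0.35, w_c=0.35, w_h=0.20, w_r=0.10):
    """Solve p_f * (w_q*q + w_c*c + w_h*h - w_r*r) = target for h."""
    return (target / p_f + w_r * r - w_q * q - w_c * c) / w_h


# Lead-intake inputs (P(F)=0.9, Q=0.85, C=0.8, R=0.2) against T_S = 0.7.
h_needed = required_h(0.7, p_f=0.9, q=0.85, c=0.8, r=0.2)
print(f"{h_needed:.3f}")  # about 1.101, so not reachable with H <= 1

# Tightening the spec first (C = 0.9) brings the requirement into range.
h_needed_tight = required_h(0.7, p_f=0.9, q=0.85, c=0.9, r=0.2)
print(f"{h_needed_tight:.3f}")  # about 0.926
```

The same rearrangement works for Q or C by dividing by the corresponding weight instead of w_H.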
6) Our promise
We only take probable use‑cases
We partner where feasibility is high and the value is clear. Otherwise, we propose a research spike or politely decline.
Transparent math
Every proposal includes our current \(\hat P(S)\), the levers to raise it (Q, C, H), risk factors R, and an EV check. If the numbers don’t add up, we won’t recommend it.