The geometric distribution models the number of trials required to get the first success in a sequence of independent Bernoulli trials, each with the same probability of success (p). It is a special case of the negative binomial distribution where the number of successes (r) is fixed at 1.
Probability Mass Function (PMF):
If X∼Geometric(p), where:
- p is the probability of success in each trial, and
- X is the number of trials needed to get the first success,
the PMF is given by:
P(X = x) = (1−p)^{x−1}⋅p for x = 1,2,3,…
where
(1−p)^{x−1}: probability of x−1 consecutive failures (each failure has probability 1−p).
p: probability of success on the xth trial.
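As a quick sketch, the PMF can be evaluated directly from the formula and compared with R's built-in dgeom(). R parametrizes the geometric distribution by the number of failures before the first success, so dgeom() has to be called with x − 1. The helper name geom_pmf and the values p = 0.3 and x = 1,…,5 are illustrative assumptions.

```r
# Hypothetical helper: PMF of X, the trial on which the first success occurs
geom_pmf <- function(x, p) {
  (1 - p)^(x - 1) * p
}

p <- 0.3   # illustrative probability of success
x <- 1:5   # trial numbers to evaluate

manual  <- geom_pmf(x, p)          # formula from the text
builtin <- dgeom(x - 1, prob = p)  # dgeom counts failures, hence x - 1

print(round(manual, 4))
print(isTRUE(all.equal(manual, builtin)))
```

Both vectors should agree, which confirms that the only difference between the two forms is the shift by one.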
Cumulative Distribution Function (CDF)
The cumulative probability that the first success occurs on or before the kth trial is:
P(X ≤ k) = 1−(1−p)^k
This gives the total probability for X = 1,2,…,k.
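The closed form follows because the first success comes after trial k only if all of the first k trials fail, which has probability (1−p)^k. The sketch below checks this numerically by summing the PMF over x = 1,…,k and comparing it with the closed form and with pgeom(), which counts failures and therefore takes k − 1. The values p = 0.3 and k = 5 are illustrative assumptions.

```r
p <- 0.3   # illustrative probability of success
k <- 5     # illustrative number of trials

pmf_sum     <- sum((1 - p)^((1:k) - 1) * p)  # P(X = 1) + ... + P(X = k)
closed_form <- 1 - (1 - p)^k                 # CDF formula from the text
builtin     <- pgeom(k - 1, prob = p)        # pgeom counts failures, hence k - 1

print(c(pmf_sum, closed_form, builtin))
```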
Mean (Expected Value)
E(X) = 1/p
Variance
Var(X) = (1−p)/p^2
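One way to sanity-check the mean and variance is to evaluate the defining sums numerically. The sketch below truncates the infinite series at x = 2000, an arbitrary cutoff chosen because the remaining tail is negligible for the illustrative value p = 0.3.

```r
p <- 0.3                    # illustrative probability of success
x <- 1:2000                 # truncation point (an assumption); the tail beyond it is negligible
pmf <- (1 - p)^(x - 1) * p  # P(X = x) for each x

mean_x <- sum(x * pmf)               # approximates E(X)
var_x  <- sum(x^2 * pmf) - mean_x^2  # approximates Var(X) = E(X^2) - E(X)^2

print(c(mean_x, 1 / p))          # numeric sum vs. 1/p
print(c(var_x, (1 - p) / p^2))   # numeric sum vs. (1-p)/p^2
```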
Applications
The geometric distribution is widely used in scenarios where we are interested in the number of attempts required for the first success, in other words, a run of failures followed by a single success:
- Healthcare: Number of patients screened before finding the first one eligible for a clinical trial.
- Quality Control: Number of items tested before finding the first defective item.
- Call Centers: Number of calls made before the first successful connection.
- Sports: Number of attempts required to score a goal.
Key Insights
- The geometric distribution is memoryless, meaning the probability of success on the next trial is independent of the number of previous failures:
P(X>k+n∣X>k) = P(X>n)
This property makes it useful for modeling “waiting times” in systems where the past does not affect future outcomes.
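The property follows from P(X > m) = (1−p)^m: the conditional probability is (1−p)^{k+n} / (1−p)^k = (1−p)^n. A small numerical check is sketched below; the helper name surv and the values p = 0.3, k = 4 and n = 3 are illustrative assumptions.

```r
p <- 0.3   # illustrative probability of success
k <- 4     # failures already observed
n <- 3     # additional trials considered

# Hypothetical helper: P(X > m) = (1 - p)^m; pgeom counts failures, hence m - 1
surv <- function(m) pgeom(m - 1, prob = p, lower.tail = FALSE)

conditional   <- surv(k + n) / surv(k)  # P(X > k + n | X > k)
unconditional <- surv(n)                # P(X > n)

print(c(conditional, unconditional))
```

The two values should match for any choice of k and n, not only the ones used here.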
Let us look at a problem to understand this distribution better.
1. A new diagnostic test has a 30% success rate in detecting a rare disease (p=0.3). What is the probability that the test detects the disease on the 4th trial?
Exactly on the 4th trial means P(X=4). Using the formula for PMF:
P(X = k) = (1−p)^{k−1}⋅p
P(X = 4) = (1−0.3)^{4−1}⋅0.3 = (0.7)^3⋅0.3 = 0.343⋅0.3 = 0.1029.
The probability that the test detects the disease on the 4th trial is 10.29%.
2. Using the same diagnostic test (p=0.3), what is the probability that the disease is detected within the first 5 trials?
Within 5 trials means P(X≤5). We use the CDF formula: P(X≤k) = 1−(1−p)^k
P(X ≤ 5) = 1 − (1 − 0.3)^5 = 1 − (0.7)^5 = 1 − 0.16807 = 0.83193.
The probability that the disease is detected within the first 5 trials is 83.19%.
3. Using the same diagnostic test (p=0.3), what is the probability that the disease is detected after the first 4 trials?
After the first 4 trials means P(X>4). We use the CDF formula together with the complement rule.
The probability of X > 4 is the complement of P(X ≤ 4), which is: P(X > 4) = 1 − P(X ≤ 4)
Using the CDF formula P(X ≤ k) = 1 − (1 − p)^k:
P(X ≤ 4) = 1 − (1 − 0.3)^4 = 1 − (0.7)^4 = 1 − 0.2401 = 0.7599
P(X > 4) = 1 − 0.7599 = 0.2401.
The probability that the disease is detected after the first 4 trials is P(X > 4) = 0.2401, or 24.01%.
Using R, we can reproduce all three results. R's dgeom() and pgeom() count the number of failures before the first success, so each call below passes the trial number minus 1.
# Parameters
p <- 0.3 # Probability of success
k <- 4 # Trial number
# 1. Probability of success on the 4th trial
P_X_eq_4 <- dgeom(k - 1, prob = p)
P_X_eq_4 # 0.1029
# 2. Cumulative probability of success within the first 5 trials
P_X_leq_5 <- pgeom(5 - 1, prob = p)
P_X_leq_5 # 0.83193
# 3. Probability of detecting the disease after the first 4 trials
P_X_gt_4 <- 1 - pgeom(k - 1, prob = p)
P_X_gt_4 # 0.2401
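As an optional cross-check, the three probabilities can also be estimated by simulation. This is only a sketch: the seed and the number of simulated runs are arbitrary assumptions, and the estimates will differ slightly from the exact values 0.1029, 0.83193 and 0.2401.

```r
set.seed(1)     # arbitrary seed, for reproducibility only
p <- 0.3        # probability of success, as in the problems
n_sim <- 1e6    # number of simulated runs (an assumption)

# rgeom() returns the number of failures, so add 1 to get the trial of the first success
x <- rgeom(n_sim, prob = p) + 1

est_eq_4  <- mean(x == 4)  # estimates P(X = 4)
est_leq_5 <- mean(x <= 5)  # estimates P(X <= 5)
est_gt_4  <- mean(x > 4)   # estimates P(X > 4)

print(c(est_eq_4, est_leq_5, est_gt_4))
```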