Welcome back to the AI Bayeslab Statistics series.
Today, we continue to explore statistics on sampling distributions, specifically examining situations where sample proportions are used to make inferences about population proportions. The design of the case is as follows:
1. Case Introduction: Market Research Reliability Assessment
A company conducts customer satisfaction surveys in two regions:
Region X: n₁ = 100, p̂₁ = 69%
Region Y: n₂ = 100, p̂₂ = 60.5%
Objective:
Before comparing proportions, we must check if the variances in satisfaction levels are equal (variance homogeneity test). This ensures data can be pooled for further analysis.

As the example above illustrates, many real-life decisions come down to whether or not people endorse a particular attitude. Our primary focus today is how to statistically analyze satisfaction survey results that yield a "yes" or "no" outcome. This type of data is closely tied to the binomial distribution, because each response either exhibits a particular characteristic or does not, and this holds for both Region X and Region Y.
Note: Formulas related to the binomial distribution (figure not reproduced here).
If the satisfaction-level variances of Region X and Region Y are found to be equal, the two regions are comparable in this respect, and their data can be merged for additional analysis. A test for homogeneity of variances is used to make that assessment.
Additionally, we can assess the difference in satisfaction proportions between the two regions directly. At a specified significance level α, we calculate a confidence interval (CI) for the difference p₁ − p₂.
If this interval includes zero, it indicates that there is no significant difference in the overall proportions of the two groups.
If both endpoints are negative, it implies that (p₁ < p₂);
Conversely, if both endpoints are positive, it suggests (p₁ > p₂).
This way of judging significance from a confidence interval applies equally to hypothesis tests on means.
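The three reading rules above can be written as a small decision function. This is only an illustrative sketch: the function name `interpret_difference_ci` and the sample intervals are hypothetical and do not come from the survey data.

```python
def interpret_difference_ci(lower, upper):
    """Read a confidence interval for p1 - p2 using the sign rules above."""
    if lower > upper:
        raise ValueError("lower endpoint must not exceed upper endpoint")
    if lower < 0 and upper < 0:
        return "p1 < p2"
    if lower > 0 and upper > 0:
        return "p1 > p2"
    # The interval contains zero
    return "no significant difference"

# Hypothetical intervals, chosen only to exercise each branch
for ci in [(-0.20, -0.05), (0.03, 0.18), (-0.04, 0.21)]:
    print(ci, "->", interpret_difference_ci(*ci))
```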
2. Statistical Background: Binomial Proportions & Normal Approximation Condition
Since survey outcomes are binary ("satisfied" or "not satisfied"), the data follow a binomial distribution:
Expected value (mean): E(X) = np
Variance: D(X) = np(1-p)
Sample proportion variance:

\text{Var}(\hat{p}) = \frac{p(1-p)}{n}
Normal Approximation Condition:
For large samples (n p̂ > 5 and n(1 − p̂) > 5), the binomial distribution is well approximated by a normal distribution:

\hat{p} \sim N \left( p, \frac{p(1-p)}{n} \right)
Here is the complete explanation and formulas regarding the sampling distribution of proportions, expected values, and variances, conditions for normal approximation, and confidence intervals:
1) Definition of Symbols
Population proportion: p (unknown parameter)
Sample proportion: p' = X / n (the same quantity written p̂ elsewhere in this article), where X is the number of individuals with a particular characteristic in the sample, and n is the sample size.
Random variable X: Follows a binomial distribution B(n, p), i.e., X∼B(n,p).
2) Expected Value and Variance
Expected value and variance of X :

E(X) = np
D(X) = np(1-p)
Expected value and variance of the sample proportion p' :

E(p') = E\left(\frac{X}{n}\right) = \frac{1}{n}E(X) = p
D(p') = D\left(\frac{X}{n}\right) = \frac{1}{n^2}D(X) = \frac{p(1-p)}{n}
Standard deviation (standard error):

\sigma_{p'} = \sqrt{\frac{p(1-p)}{n}}
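One way to check these identities is to compute the mean and variance of p' directly from the binomial probabilities and compare them with the closed forms. The sketch below does that with the standard library only; the helper names `binom_pmf` and `proportion_moments` are illustrative, and p = 0.69 is treated as a known population value purely for the demonstration.

```python
from math import comb, sqrt

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def proportion_moments(n, p):
    """Exact mean and variance of p' = X / n, summed over the binomial pmf."""
    mean = sum((k / n) * binom_pmf(k, n, p) for k in range(n + 1))
    var = sum((k / n - mean) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))
    return mean, var

n, p = 100, 0.69  # illustrative values; p is assumed known here
mean, var = proportion_moments(n, p)
print(f"E(p') = {mean:.6f}   formula p        = {p:.6f}")
print(f"D(p') = {var:.6f}   formula p(1-p)/n = {p * (1 - p) / n:.6f}")
print(f"standard error = {sqrt(var):.6f}")
```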
3) Conditions for Normal Approximation
When the sample size is sufficiently large, the binomial distribution can be approximated by a normal distribution.
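The AI block visualization example illustrates this (figure: the binomial distribution overlaid with the normal curve for increasing sample sizes).
As the sample size increases, the shape of the binomial distribution moves toward the normal curve; at n = 100 the two visibly overlap.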
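The same convergence can be seen numerically. The sketch below sums, over all possible counts k, the absolute gap between the binomial probability and the matching normal density N(np, np(1-p)). The gap measure, the function names, and the choice p = 0.69 are assumptions made for illustration; they are not the method behind the original figure.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def approximation_gap(n, p):
    """Sum over k of |binomial pmf - normal density| with matched mean and variance."""
    normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return sum(abs(binom_pmf(k, n, p) - normal.pdf(k)) for k in range(n + 1))

p = 0.69  # illustrative success probability
gaps = {n: approximation_gap(n, p) for n in (10, 30, 100)}
for n, gap in gaps.items():
    print(f"n = {n:>3}: total absolute gap = {gap:.4f}")
```

A shrinking gap as n grows is the numerical counterpart of the two curves overlapping in the plot.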

In proportion hypothesis testing, the usual rule of thumb for when the normal approximation may be used is:

np' > 5 \quad \text{and} \quad n(1-p') > 5
Under these conditions:
X approximately follows a normal distribution:

N(np, np(1-p))
p' approximately follows a normal distribution:

N\left(p, \frac{p(1-p)}{n}\right)
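Applied to the survey, the rule of thumb is a two-line check. The following sketch uses the article's figures for Region X and Region Y; the function name `normal_approx_ok` and the `threshold` parameter are illustrative choices.

```python
def normal_approx_ok(n, p_hat, threshold=5):
    """Rule of thumb: both n*p' and n*(1 - p') must exceed the threshold."""
    successes = n * p_hat
    failures = n * (1 - p_hat)
    return successes > threshold and failures > threshold

regions = {"Region X": (100, 0.69), "Region Y": (100, 0.605)}
checks = {name: normal_approx_ok(n, p_hat) for name, (n, p_hat) in regions.items()}
for name, ok in checks.items():
    print(f"{name}: normal approximation {'acceptable' if ok else 'not advised'}")
```

Both regions pass comfortably, since the smallest of the four expected counts is 100 × 0.31 = 31.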
4) Confidence Interval ("1-α" Confidence Level)
Under the normal approximation conditions, the confidence interval for the population proportion p is:

p' \pm z_{\alpha/2} \cdot \sqrt{\frac{p'(1-p')}{n}}
Where:
z_{\alpha/2} is the critical value from the standard normal distribution (e.g., for α = 0.05, z_{0.025} ≈ 1.96).
Interval formula:

\left[ p' - z_{\alpha/2} \sqrt{\frac{p'(1-p')}{n}},\; p' + z_{\alpha/2} \sqrt{\frac{p'(1-p')}{n}} \right]
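As a worked illustration, the interval formula can be evaluated for each region separately. This is a minimal sketch using Python's standard library; `proportion_ci` is an illustrative name, and the critical value comes from `statistics.NormalDist` rather than a rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p_hat, n, alpha=0.05):
    """Normal-approximation interval: p' +/- z_{alpha/2} * sqrt(p'(1-p')/n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

ci_x = proportion_ci(0.69, 100)   # Region X
ci_y = proportion_ci(0.605, 100)  # Region Y
print(f"Region X: [{ci_x[0]:.4f}, {ci_x[1]:.4f}]")
print(f"Region Y: [{ci_y[0]:.4f}, {ci_y[1]:.4f}]")
```

With these inputs the intervals are roughly [0.599, 0.781] for Region X and [0.509, 0.701] for Region Y.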
5) Notes
Continuity correction: For small samples or when p is close to 0 or 1, consider Yates' continuity correction.
Alternative methods: If the normal approximation conditions are not met, use exact binomial methods or bootstrap methods.
Standard error estimation: In practice, replace the unknown p with p'.
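To make the "exact binomial methods" note concrete, here is a sketch of one such method, the Clopper-Pearson interval, solved by bisection with the standard library only. The choice of Clopper-Pearson and all function names are assumptions for illustration. The method needs an integer count, so the example uses Region X (69 of 100); Region Y's 60.5% of 100 does not correspond to a whole number of respondents.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def bisect_root(f, lo, hi, tol=1e-10):
    """Root of a monotone function f on [lo, hi], found by bisection."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for p given k successes in n trials."""
    # Lower limit: the p at which P(X >= k) equals alpha / 2
    lower = 0.0 if k == 0 else bisect_root(
        lambda p: 1 - binom_cdf(k - 1, n, p) - alpha / 2, 0.0, 1.0)
    # Upper limit: the p at which P(X <= k) equals alpha / 2
    upper = 1.0 if k == n else bisect_root(
        lambda p: binom_cdf(k, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper

lo, hi = exact_ci(69, 100)  # Region X: 69 satisfied out of 100
print(f"Exact 95% interval for Region X: [{lo:.4f}, {hi:.4f}]")
```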
3. Step-by-Step Analysis
Step 1: Variance Homogeneity Test
Objective: Test for homogeneity, i.e., whether

\text{Var}(\hat{p}_1) = \text{Var}(\hat{p}_2)

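Hypotheses:

H_0: \text{Var}(\hat{p}_1) = \text{Var}(\hat{p}_2) \quad \text{vs.} \quad H_1: \text{Var}(\hat{p}_1) \neq \text{Var}(\hat{p}_2)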
Test Statistic (Z-Test for Proportions)

Z = \frac{\hat{p}_1(1-\hat{p}_1) - \hat{p}_2(1-\hat{p}_2)}{\sqrt{ \frac{[\hat{p}_1(1-\hat{p}_1)]^2 + [\hat{p}_2(1-\hat{p}_2)]^2}{n} }}
Here n = n_1 = n_2 = 100 is the common sample size.
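Calculations:

Z = \frac{0.2139 - 0.2390}{\sqrt{ \frac{0.2139^2 + 0.2390^2}{100} }} \approx \frac{-0.0251}{0.0321} \approx -0.78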
Conclusion:
Since |Z| ≈ 0.78 < 1.96 (the two-sided critical value at α = 0.05), we fail to reject H₀.
Interpretation: Variances are not significantly different (homogeneity is not rejected).
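The Step 1 statistic can be recomputed from the raw inputs with a few lines of code. The sketch below implements the formula exactly as it is written in this article, for equal sample sizes; the function name `variance_homogeneity_z` is illustrative. Without intermediate rounding, |Z| comes out at about 0.78.

```python
from math import sqrt

def variance_homogeneity_z(p1, p2, n):
    """Z statistic as written in this article, for two samples of equal size n."""
    v1 = p1 * (1 - p1)  # p1(1 - p1) = 0.2139 for Region X
    v2 = p2 * (1 - p2)  # p2(1 - p2) = 0.238975 for Region Y
    return (v1 - v2) / sqrt((v1**2 + v2**2) / n)

z_var = variance_homogeneity_z(0.69, 0.605, 100)
reject = abs(z_var) > 1.96  # two-sided test at alpha = 0.05
print(f"Z = {z_var:.3f}, reject H0: {reject}")
```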
Step 2: Pooled Proportion Estimate
Since the homogeneity test did not reject equal variances, we compute a pooled proportion:

\hat{p}_{pooled} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{69 + 60.5}{200} = 0.6475
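The pooled estimate is a sample-size-weighted average of the two proportions, which matters once the samples differ in size. A minimal sketch follows; the function name `pooled_proportion` is illustrative, and the second call uses a hypothetical n₂ = 300 that is not part of the survey.

```python
def pooled_proportion(p1, n1, p2, n2):
    """Pooled proportion (X1 + X2) / (n1 + n2), with X_i = p_i * n_i."""
    x1 = p1 * n1  # implied count in sample 1
    x2 = p2 * n2  # implied count in sample 2 (60.5 here, as in the article)
    return (x1 + x2) / (n1 + n2)

p_pooled = pooled_proportion(0.69, 100, 0.605, 100)
print(f"Pooled proportion (article data): {p_pooled:.4f}")

# Hypothetical unequal sample sizes: the larger sample pulls the estimate toward p2
p_unequal = pooled_proportion(0.69, 100, 0.605, 300)
print(f"Pooled proportion (n2 = 300, hypothetical): {p_unequal:.5f}")
```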

Step 3: Two-Proportion Z-Test (Difference Testing)
Hypotheses:

H_0: p_1 = p_2 \quad \text{vs.} \quad H_1: p_1 \neq p_2
Test Statistic:

Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{ \hat{p}_{pooled}(1-\hat{p}_{pooled}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}
Calculations:

Z = \frac{0.69 - 0.605}{\sqrt{0.6475 \times 0.3525 \times 0.02}} \approx \frac{0.085}{0.0676} \approx 1.26
Conclusion:
|Z| = 1.26 < 1.96 → No significant difference (p-value > 0.05).
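For readers who want to recompute this step, the sketch below evaluates the pooled two-proportion Z statistic and its two-sided p-value using only the standard library. The function name `two_proportion_z_test` is illustrative. Evaluated without intermediate rounding, Z is about 1.26 and the p-value is about 0.21.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(p1, n1, p2, n2):
    """Pooled two-proportion Z statistic and its two-sided p-value."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = two_proportion_z_test(0.69, 100, 0.605, 100)
print(f"Z = {z:.3f}, two-sided p-value = {p_value:.3f}")
print("Reject H0 at alpha = 0.05:", p_value < 0.05)
```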

Step 4: Confidence Interval for Difference

(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2} }
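Calculations:

(0.69 - 0.605) \pm 1.96 \sqrt{ \frac{0.2139}{100} + \frac{0.2390}{100} } \approx 0.085 \pm 0.132 = [-0.047,\; 0.217]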
Interpretation:
CI includes 0 → No significant difference in satisfaction.
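The interval for the difference can be reproduced with the unpooled standard error from the Step 4 formula. This is a sketch under the same normal-approximation assumptions as above; `difference_ci` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def difference_ci(p1, n1, p2, n2, alpha=0.05):
    """Normal-approximation interval for p1 - p2 using the unpooled standard error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

lower, upper = difference_ci(0.69, 100, 0.605, 100)
contains_zero = lower <= 0 <= upper
print(f"95% CI for p1 - p2: [{lower:.4f}, {upper:.4f}]")
print("Interval contains 0:", contains_zero)
```

With the survey figures the interval is roughly [-0.047, 0.217], which straddles zero.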

4. Final Summary
Variance Homogeneity Test: No significant difference in variances (p-value > 0.05).
Proportion Comparison: No significant difference between regions (p-value > 0.05).
Business Implication: Data can be pooled for further analysis.
Next Steps:
If H_0 of the variance homogeneity test were rejected, use Welch's t-test for unequal variances.
For small samples, consider Fisher’s Exact Test.
Stay tuned, subscribe to Bayeslab, and let everyone master the wisdom of statistics at a low cost with the AI Agent Online tool.