Hypothesis Testing Calculator
Enter a sample mean, population mean, population standard deviation, sample size, and significance level to perform a z-test. Get the z-statistic, p-value, and a reject/fail-to-reject decision.
Hypothesis testing is the formal framework for making statistical inferences from sample data. The basic structure: state a null hypothesis (H₀) representing "no effect" or "no difference," compute a test statistic from your data, find the probability of seeing a statistic at least as extreme as yours if the null were true (the p-value), and decide whether to reject the null based on a pre-set significance level α.
This calculator performs a one-sample z-test: comparing a sample mean to a hypothesized population mean when the population standard deviation is known. It returns the z-statistic, p-value, critical value, and reject/fail-to-reject decision. For the more common case where population SD is estimated from the sample, use a t-test instead.
The whole hypothesis testing framework is a target of ongoing criticism in scientific practice: it is too binary, too easily abused (p-hacking, HARKing), and too often misinterpreted. Modern practice emphasizes effect sizes, confidence intervals, and replication alongside or instead of significance tests. Still, understanding the framework remains essential for reading scientific literature and conducting analyses correctly.
Inputs
Results
- Z-Statistic: 2.4000
- P-Value (Two-Tail): 0.016395
- P-Value (One-Tail): 0.008198
- Decision: Reject H₀ (p = 0.0164 < 0.05)
- Standard Error: 0.8333
- Critical Value (Two-Tail): ±1.9600
- Cohen's d (Effect Size): 0.4000
Formula
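The quantities reported in Results follow the standard one-sample z-test definitions. The block below is a sketch in the usual textbook notation, which is assumed here rather than taken from this page: x̄ is the sample mean, μ₀ the hypothesized mean, σ the known population standard deviation, n the sample size, and Φ the standard normal CDF.

```latex
\[
SE = \frac{\sigma}{\sqrt{n}}, \qquad
z = \frac{\bar{x} - \mu_0}{SE}, \qquad
p_{\text{two-tail}} = 2\,\bigl(1 - \Phi(|z|)\bigr), \qquad
d = \frac{\bar{x} - \mu_0}{\sigma}
\]
```

H₀ is rejected in a two-tailed test when p is below α, which is equivalent to |z| exceeding the critical value Φ⁻¹(1 - α/2).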
How to use this calculator
- Enter sample mean.
- Enter hypothesized population mean (under H₀).
- Enter population standard deviation.
- Enter sample size.
- Set significance level (α, typically 0.05).
- Calculator returns z-statistic, p-value, and decision.
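The steps above can be sketched in a few lines of Python using only the standard library. This is an illustrative reimplementation, not the calculator's actual code: the function name `one_sample_z_test` and the example inputs (mean 52, μ₀ 50, σ 5, n 36) are assumptions, chosen because they are consistent with the z-statistic, standard error, and Cohen's d shown in Results.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_sd, n, alpha=0.05):
    """One-sample z-test with known population SD (hypothetical helper)."""
    std_normal = NormalDist()  # standard normal: mean 0, SD 1
    standard_error = pop_sd / sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Upper-tail area beyond |z|; this is the one-tailed p-value when the
    # observed difference lies in the hypothesized direction.
    p_one_tail = 1 - std_normal.cdf(abs(z))
    # Two-tailed p-value counts both tails.
    p_two_tail = 2 * p_one_tail
    return {
        "z": z,
        "standard_error": standard_error,
        "p_two_tail": p_two_tail,
        "p_one_tail": p_one_tail,
        "critical_two_tail": std_normal.inv_cdf(1 - alpha / 2),
        "cohens_d": (sample_mean - pop_mean) / pop_sd,
        "reject_h0": p_two_tail < alpha,  # two-tailed decision
    }


# Assumed inputs for illustration: mean 52, mu0 50, sigma 5, n 36.
result = one_sample_z_test(52, 50, 5, 36)
for name, value in result.items():
    print(f"{name}: {value}")
```

With these inputs the sketch should reproduce the z-statistic, standard error, and p-values listed in Results, up to rounding.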
Worked examples
IQ test scoring
**Scenario:** An IQ test is designed for a population mean of 100 and SD of 15. A sample of 36 students has a mean of 105. Significance level α = 0.05. Is this group significantly different?

**Calculation:** z = (105 - 100) / (15 / √36) = 5 / 2.5 = 2.0. Two-tailed p = 0.046. Since p < α = 0.05, reject the null.

**Result:** Statistically significant evidence (p = 0.046) that this group's mean IQ differs from 100. Effect size: 5 points = 1/3 SD (d ≈ 0.33), small to moderate. Whether a 5-point difference matters in practice depends on context.
Production line quality check
**Scenario:** Bottling line spec: 500 mL ± 5 mL. Sample of 49 bottles: mean 502 mL, known σ = 4 mL. Is the line on target? α = 0.05.

**Calculation:** z = (502 - 500) / (4 / √49) = 2 / 0.571 = 3.5. Two-tailed p < 0.001. Reject the null.

**Result:** Highly significant evidence that the line is off target (filling about 2 mL high). The mean is still within the ±5 mL tolerance, but the process is off center, which leaves less margin before individual bottles exceed the upper limit. Consider recalibrating before the shift leads to specification violations.
Marketing campaign effect
**Scenario:** Pre-campaign average sales: $1000/day. After the campaign, a sample of 30 days has a mean of $1150, with known σ = $200. Did the campaign work? One-tailed test, α = 0.05.

**Calculation:** z = (1150 - 1000) / (200 / √30) = 150 / 36.5 = 4.1. One-tailed p < 0.001. Reject the null.

**Result:** Highly significant: mean daily sales after the campaign are above the pre-campaign average. Effect size: $150 / $200 = 0.75 SD, moderate to large. Practical significance: a $150/day increase is $54,750/year if sustained. Note that a before/after comparison alone cannot rule out other causes of the increase, such as seasonality.
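The arithmetic in the three worked examples can be checked with a short script. This is a minimal sketch using Python's standard library; the helper name `z_and_p` and the `tails` parameter are illustrative choices, not part of the calculator.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution


def z_and_p(sample_mean, pop_mean, pop_sd, n, tails=2):
    """Return (z, p) for a one-sample z-test (hypothetical helper).

    With tails=1 the p-value is the area beyond |z| in one tail, which is
    only appropriate when the observed difference is in the hypothesized
    direction, as in the campaign example.
    """
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    upper_tail = 1 - std_normal.cdf(abs(z))
    return z, tails * upper_tail


# (sample mean, hypothesized mean, sigma, n, tails) from the examples above
examples = {
    "IQ test scoring": (105, 100, 15, 36, 2),
    "Production line": (502, 500, 4, 49, 2),
    "Marketing campaign": (1150, 1000, 200, 30, 1),
}

for label, args in examples.items():
    z, p = z_and_p(*args)
    print(f"{label}: z = {z:.2f}, p = {p:.6f}")
```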
When to use this calculator
**Use hypothesis testing for:**
- **Comparing data to expected value**: one-sample test.
- **Comparing two groups**: two-sample test.
- **Before/after analysis**: paired t-test.
- **Categorical data analysis**: chi-square.
- **Multi-group comparison**: ANOVA.
- **Quality control**: process monitoring.
- **Scientific research**: standard inferential framework.
**Choosing the right test:**
| Question | Test |
|---|---|
| Mean equal to value? | One-sample t/z test |
| Two means different? | Two-sample t-test |
| Three+ means different? | ANOVA |
| Proportions different? | Z-test for proportions |
| Categorical association? | Chi-square independence |
| Non-normal small samples? | Non-parametric tests |
| Paired measurements? | Paired t-test |
| Multiple variables? | Multivariate methods |
**Reporting hypothesis test results:**
Include:
- Test name and type.
- Null and alternative hypotheses.
- Test statistic.
- Degrees of freedom (if applicable).
- p-value.
- Effect size.
- Confidence interval (if applicable).
- Decision and interpretation.
- Sample size.
Example: "A one-sample z-test was conducted comparing the sample mean (M = 52) to the hypothesized value (μ₀ = 50). The result was significant, z = 2.40, p = 0.016, indicating that the population mean differs from the hypothesized value."
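A report like this is more informative with a confidence interval alongside the p-value. The sketch below computes a z-based interval for the mean. The function name `z_confidence_interval` is illustrative, and σ = 5 and n = 36 are assumed values, chosen because they are consistent with the standard error of 0.8333 shown in Results.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, pop_sd, n, confidence=0.95):
    """Confidence interval for a mean with known population SD (hypothetical helper)."""
    # Critical value that leaves (1 - confidence) / 2 in each tail.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * pop_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Assumed inputs: M = 52 from the example report; sigma = 5 and n = 36 are assumptions.
low, high = z_confidence_interval(52, 5, 36)
print(f"95% CI: [{low:.2f}, {high:.2f}]")  # 95% CI: [50.37, 53.63]
```

The interval excludes μ₀ = 50, which agrees with the two-tailed rejection at α = 0.05.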
**Common testing errors:**
- **Multiple comparisons without correction**: inflates Type I error.
- **Stopping when significant**: p-hacking; should set sample size beforehand.
- **HARKing**: hypothesizing after results.
- **Reporting only significant**: publication bias.
- **Confusing significance with importance**: small p ≠ large effect.
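To see how quickly uncorrected multiple comparisons inflate the Type I error rate, consider m independent tests, each run at level α. The sketch below computes the chance of at least one false positive and the Bonferroni-adjusted per-test level. The function names are illustrative, and independence between tests is an assumption that real analyses often violate.

```python
def familywise_error_rate(alpha, num_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(alpha, num_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / num_tests


for m in (1, 5, 20):
    fwer = familywise_error_rate(0.05, m)
    print(f"{m} tests: familywise error = {fwer:.3f}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha(0.05, m):.4f}")
```

With 20 independent tests at α = 0.05, the chance of at least one false positive is roughly 64%.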
**The Big Picture:**
Hypothesis testing is a useful framework but should not be the only tool:
- **For exploration**: descriptive stats, visualization.
- **For confirmation**: pre-registered hypothesis tests.
- **For estimation**: confidence intervals.
- **For comparison**: effect sizes.
- **For prediction**: regression and validation.
**Power and sample size:**
Before testing, consider:
- Expected effect size (Cohen's d): small (0.2), medium (0.5), large (0.8).
- Target power: usually 80%.
- α: usually 0.05.

Together, these determine the required sample size.
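| Effect (d) | n for 80% power |
|---|---|
| 0.2 | ~196 |
| 0.5 | ~32 |
| 0.8 | ~13 |

These figures apply to a one-sample, two-tailed z-test at α = 0.05. Two-sample designs need more observations per group.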
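The table values can be reproduced from the normal-approximation formula n = ((z₁₋α/₂ + z_power) / d)². The sketch below assumes a one-sample, two-tailed z-test, and the function name `required_sample_size` is illustrative. It rounds up to the next whole observation, so d = 0.2 gives 197 rather than the table's approximate 196.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(effect_size_d, alpha=0.05, power=0.80):
    """Sample size for a one-sample, two-tailed z-test (hypothetical helper)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    # Round up: a fractional observation cannot be collected.
    return ceil(((z_alpha + z_power) / effect_size_d) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: n = {required_sample_size(d)}")
```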
**Common misinterpretations:**
❌ "P-value < 0.05 proves the alternative hypothesis." ✓ "The data are inconsistent with the null at this significance level."
❌ "Failure to reject null means null is true." ✓ "Insufficient evidence to detect an effect with current data."
❌ "Statistical significance = practical significance." ✓ "Statistical and practical significance are distinct; both matter."
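The last point is easy to demonstrate numerically. For a one-sample z-test, z = d·√n, so a negligible effect becomes "significant" with enough data, while a moderate effect can miss significance in a small sample. The sketch below is illustrative; the function name and the specific d and n values are assumptions chosen for the demonstration.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_p(effect_size_d, n):
    """Two-tailed p-value of a one-sample z-test, given Cohen's d and n."""
    z = effect_size_d * sqrt(n)  # z = (mean difference / sigma) * sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


tiny_effect_big_sample = two_tailed_p(0.02, 100_000)
moderate_effect_small_sample = two_tailed_p(0.5, 10)

print(f"d = 0.02, n = 100000: p = {tiny_effect_big_sample:.2e}")
print(f"d = 0.5,  n = 10:     p = {moderate_effect_small_sample:.3f}")
```

The first case is highly significant despite a trivial effect size; the second is not significant at α = 0.05 despite a medium one.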
**Resources:**
- Cohen, "Statistical Power Analysis for the Behavioral Sciences": classic on power and effect size.
- ASA Statement on p-values (2016): modern best practices.
- Pre-registration platforms: AsPredicted, OSF.
**Tools:**
- **Excel**: limited; use T.TEST, Z.TEST, CHISQ.TEST.
- **R**: comprehensive; t.test, prop.test, chisq.test, aov, etc.
- **Python (scipy.stats)**: ttest_1samp, ttest_ind, chi2_contingency.
- **SPSS**: menu-driven testing options.
- **JASP**: free, modern, Bayesian-friendly.
Common mistakes to avoid
- Treating statistical significance as practical importance. They're different.
- Multiple testing without correction. Inflates Type I error.
- P-hacking: stopping when significant or running many tests until one works.
- HARKing: hypothesizing after results are known.
- Equating non-significant with null hypothesis being true.
- Using the wrong test for the data type. Use a z-test for proportions or for means with known σ, a t-test for means with estimated σ, and chi-square for categorical data.
- Reporting only p-value without effect size and CI.