CalcMountain

Hypothesis Testing Calculator

Enter a sample mean, population mean, population standard deviation, sample size, and significance level to perform a z-test. Get the z-statistic, p-value, and a reject/fail-to-reject decision.

Hypothesis testing is the formal framework for making statistical inferences from sample data. The basic structure: state a null hypothesis (H₀) representing "no effect" or "no difference," compute a test statistic from your data, find the probability of seeing a statistic at least as extreme as the observed one if the null were true (the p-value), and decide whether to reject the null based on a pre-set significance level α.

This calculator performs a one-sample z-test: comparing a sample mean to a hypothesized population mean when the population standard deviation is known. It returns the z-statistic, p-value, critical value, and reject/fail-to-reject decision. For the more common case where population SD is estimated from the sample, use a t-test instead.

The whole hypothesis testing framework is a target of ongoing criticism in scientific practice — too binary, too easily abused (p-hacking, HARKing), too often misinterpreted. Modern practice emphasizes effect sizes, confidence intervals, and replication alongside or instead of significance tests. Still, understanding the framework remains essential for reading scientific literature and conducting analyses correctly.

Results

Z-Statistic

2.4000

P-Value (Two-Tail)

0.016395

P-Value (One-Tail)

0.008198

Decision

Reject H0 (p = 0.0164 < 0.05)

Standard Error

0.8333

Critical Value (Two-Tail)

±1.9600

Cohen's d (Effect Size)

0.4000

Formula

**One-sample z-test:**

z = (x̄ - μ₀) / (σ / √n)

Where:

- **x̄**: sample mean
- **μ₀**: hypothesized population mean (under H₀)
- **σ**: population standard deviation
- **n**: sample size

**Decision rule (two-tailed):**

- If |z| > z_critical → reject H₀
- If |z| ≤ z_critical → fail to reject H₀

For a one-tailed test, compare z with the one-tail critical value in the direction stated by H₁.

**Critical z-values:**

| α | One-tail (sign follows H₁) | Two-tail |
|---|---|---|
| 0.10 | 1.282 | ±1.645 |
| 0.05 | 1.645 | ±1.960 |
| 0.01 | 2.326 | ±2.576 |
| 0.001 | 3.090 | ±3.291 |

**Worked example: sample mean 52, hypothesized 50, σ = 5, n = 36, α = 0.05**

z = (52 - 50) / (5 / √36) = 2 / 0.833 = 2.40

Two-tailed p = 2 × (1 - Φ(2.40)) = 2 × 0.0082 = 0.016

Since p (0.016) < α (0.05): reject H₀.

Conclusion: significant evidence the population mean differs from 50.

**Steps for hypothesis testing:**

1. **State H₀ and H₁**: H₀ is "no effect"; H₁ is what you want to detect.
2. **Choose α** (significance level): commonly 0.05.
3. **Choose test**: z-test, t-test, chi-square, etc. based on data and hypotheses.
4. **Compute test statistic**.
5. **Find p-value** from the test statistic's distribution under H₀.
6. **Compare to α**: p < α → reject H₀.
7. **Interpret in context**: significance + effect size + practical importance.

**Type I and Type II errors:**

| Decision | H₀ true | H₀ false |
|---|---|---|
| Reject H₀ | Type I error (α) | Correct (power) |
| Fail to reject H₀ | Correct (1-α) | Type II error (β) |

- **α** (Type I): probability of rejecting H₀ when it's true (false positive).
- **β** (Type II): probability of failing to reject H₀ when it's false (false negative).
- **Power** = 1 - β: probability of correctly detecting an effect.
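As a minimal sketch of the one-sample z-test formula and worked example in this section, the calculation can be reproduced with Python's standard library. The function name `one_sample_z_test` and the keys of the returned dictionary are illustrative choices, not part of the calculator itself; the one-tail p-value here is taken in the direction of the observed difference.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """One-sample z-test with known population SD (illustrative helper)."""
    std_normal = NormalDist()  # standard normal: mean 0, SD 1
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Tail area beyond |z|; this is the one-tail p-value in the observed direction.
    p_one_tail = std_normal.cdf(-abs(z))
    p_two_tail = 2 * p_one_tail
    # Two-tailed critical value, e.g. about 1.96 for alpha = 0.05.
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    return {
        "z": z,
        "standard_error": standard_error,
        "p_two_tail": p_two_tail,
        "p_one_tail": p_one_tail,
        "z_critical": z_critical,
        "cohens_d": (sample_mean - mu0) / sigma,
        "reject_h0": p_two_tail < alpha,
    }


# Worked example from the text: x̄ = 52, μ₀ = 50, σ = 5, n = 36, α = 0.05
result = one_sample_z_test(sample_mean=52, mu0=50, sigma=5, n=36, alpha=0.05)
for name, value in result.items():
    print(f"{name}: {value}")
```

Comparing `p_two_tail` with α and comparing |z| with `z_critical` lead to the same decision; the sketch reports both so either rule can be checked.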
**Common hypothesis tests:**

| Test | Use case |
|---|---|
| One-sample z-test | Sample vs hypothesized value, known SD |
| One-sample t-test | Sample vs hypothesized value, SD estimated |
| Two-sample t-test | Compare two group means |
| Paired t-test | Before/after on same subjects |
| Chi-square | Categorical data, goodness of fit |
| ANOVA | Three or more group comparison |
| Mann-Whitney | Non-parametric two-group |
| Wilcoxon | Non-parametric paired |
| Fisher's exact | Small samples, categorical |
| Kruskal-Wallis | Non-parametric ANOVA |

**One-tailed vs two-tailed:**

- **One-tailed**: testing direction (e.g., H₁: μ > μ₀).
- **Two-tailed**: testing for any difference (H₁: μ ≠ μ₀).
- Two-tailed is more conservative (requires more extreme data to reject at the same α).
- One-tailed has higher power but only detects an effect in the specified direction.
- Default: two-tailed unless there is a strong prior reason to test one direction.

**Significance level (α):**

- **0.05**: standard convention, very common.
- **0.01**: more stringent; medical research, high-stakes.
- **0.001**: very stringent; key claims, replication.
- **0.10**: more lenient; exploratory analysis.

α is the maximum probability of Type I error you're willing to accept.

**Multiple testing problem:**

When running multiple tests:

- Each individual test uses α = 0.05.
- The family-wise error rate (the chance of at least one false positive across all tests) grows quickly with the number of tests.
- 10 independent tests at α = 0.05: ~40% chance of at least one false positive.
- 100 tests: ~99% chance.

Solutions:

- Bonferroni correction: test each hypothesis at α/k, where k is the number of tests.
- Benjamini-Hochberg: controls the false discovery rate.
- Pre-specify the analysis plan.

**Statistical vs practical significance:**

- **Statistical significance**: p < α; the effect is reliably detected.
- **Practical significance**: the effect is large enough to matter.
- A large sample can detect trivial effects (p < 0.05, but effect size tiny).
- A small sample may miss meaningful effects (p > 0.05, but effect size moderate).

**Modern best practices:**

1. **Pre-register hypotheses**: prevents changing hypotheses after seeing the data.
2. **Report effect size**: Cohen's d, Pearson r, odds ratio.
3. **Provide confidence intervals**: more informative than p alone.
4. **Note sample size**: affects interpretation.
5. **Distinguish exploratory from confirmatory**: different evidence standards.
6. **Replicate findings**: a single study is rarely conclusive.

**Power analysis:**

Before study:

- Determine target effect size.
- Choose α (usually 0.05).
- Target power (usually 80%).
- Calculate required sample size.

After study:

- Power = 1 - β = probability of detecting a given effect.
- Low power = high Type II error rate.
- Most studies are underpowered for small effects.

**Bayesian alternative:**

Bayesian hypothesis testing:

- **Prior**: belief about hypotheses before data.
- **Likelihood**: probability of data given hypotheses.
- **Posterior**: updated belief after data.
- **Bayes factor**: ratio of evidence for each hypothesis.

Avoids some misinterpretations of frequentist p-values.

**ASA statement on p-values (2016):**

1. P-values don't measure the probability of the null being true.
2. Significance doesn't imply causation or practical importance.
3. Don't draw conclusions from p < threshold alone.
4. Proper inference requires full reporting and transparency.
5. P-values don't measure effect size.
6. P-values alone are insufficient for decisions.
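The family-wise error figures quoted under the multiple testing problem (about 40% for 10 tests, about 99% for 100) follow from 1 - (1 - α)^k if the tests are assumed to be independent. The short sketch below, with hypothetical helper names, reproduces those figures and shows the Bonferroni-adjusted per-test threshold; independence is an assumption of the sketch, not something the text states.

```python
def family_wise_error_rate(alpha, k):
    """Chance of at least one false positive in k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(alpha, k):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / k


for k in (1, 10, 100):
    fwer = family_wise_error_rate(0.05, k)
    per_test = bonferroni_alpha(0.05, k)
    print(f"k={k:>3}  family-wise error={fwer:.3f}  Bonferroni per-test alpha={per_test:.5f}")
```

Bonferroni keeps the family-wise error rate at or below α even without independence, at the cost of lower power per test, which is one reason the Benjamini-Hochberg procedure is often preferred when many hypotheses are screened.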

How to use this calculator

  1. Enter sample mean.
  2. Enter hypothesized population mean (under H₀).
  3. Enter population standard deviation.
  4. Enter sample size.
  5. Set significance level (α, typically 0.05).
  6. Calculator returns z-statistic, p-value, and decision.

Worked examples

IQ test scoring

**Scenario:** An IQ test is designed for a population mean of 100 and SD of 15. A sample of 36 students has mean 105. Significance level α = 0.05. Is this group significantly different?

**Calculation:** z = (105 - 100) / (15 / √36) = 5 / 2.5 = 2.0. Two-tailed p = 0.046. Since p < α = 0.05: reject the null.

**Result:** Statistically significant evidence (p = 0.046) that this group's mean IQ differs from 100. Effect size: 5 points = 1/3 SD = small-moderate. Practical importance is moderate; consider context.
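Because the page recommends reporting confidence intervals alongside p-values, here is a hedged sketch of the 95% interval for the IQ example, using the known σ = 15. The helper name `z_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population SD is known."""
    # Critical value for the central `confidence` area, about 1.96 for 95%.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


low, high = z_confidence_interval(sample_mean=105, sigma=15, n=36)
print(f"95% CI: ({low:.1f}, {high:.1f})")  # prints "95% CI: (100.1, 109.9)"
```

The interval narrowly excludes 100, which matches the two-tailed p of 0.046 sitting just under α = 0.05 and makes visible how close to the threshold this result is.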

Production line quality check

**Scenario:** Bottling line spec: 500 mL ± 5 mL. Sample of 49 bottles: mean 502 mL, known σ = 4 mL. Is the line on target? α = 0.05.

**Calculation:** z = (502 - 500) / (4 / √49) = 2 / 0.571 = 3.5. Two-tailed p < 0.001. Reject the null.

**Result:** Highly significant evidence that the line is off target (2 mL high). The mean is still within the ±5 mL tolerance, but the shift is systematic. Consider recalibrating; with σ = 4 mL, a mean that sits 2 mL high pushes more individual bottles toward the upper specification limit.

Marketing campaign effect

**Scenario:** Pre-campaign average sales: $1000/day. After the campaign, a sample of 30 days has mean $1150, known σ = $200. Did the campaign work? One-tailed test (H₁: μ > 1000). α = 0.05.

**Calculation:** z = (1150 - 1000) / (200 / √30) = 150 / 36.5 = 4.1. One-tailed p < 0.001.

**Result:** Highly significant. Sales after the campaign were significantly higher than the pre-campaign average. Effect size: $150 / $200 = 0.75 SD, which is moderate-large. Practical significance: a $150/day increase = $54,750/year if sustained.
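For a directional hypothesis like this one, only the upper tail counts toward the p-value. A minimal sketch for the campaign numbers, with a hypothetical helper name:

```python
from math import sqrt
from statistics import NormalDist


def upper_tail_z_test(sample_mean, mu0, sigma, n):
    """One-tailed z-test for H1: mu > mu0, with known population SD."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # P(Z >= z) computed as cdf(-z) by symmetry, which stays accurate for large z.
    p_value = NormalDist().cdf(-z)
    return z, p_value


z, p = upper_tail_z_test(sample_mean=1150, mu0=1000, sigma=200, n=30)
print(f"z = {z:.2f}, one-tailed p = {p:.2e}")
```

Had the sample mean fallen below $1000, this upper-tail p-value would exceed 0.5, reflecting that a one-tailed test cannot flag an effect in the unspecified direction.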

When to use this calculator

**Use hypothesis testing for:**

- **Comparing data to expected value**: one-sample test.
- **Comparing two groups**: two-sample test.
- **Before/after analysis**: paired t-test.
- **Categorical data analysis**: chi-square.
- **Multi-group comparison**: ANOVA.
- **Quality control**: process monitoring.
- **Scientific research**: standard inferential framework.

**Choosing the right test:**

| Question | Test |
|---|---|
| Mean equal to value? | One-sample t/z test |
| Two means different? | Two-sample t-test |
| Three+ means different? | ANOVA |
| Proportions different? | Z-test for proportions |
| Categorical association? | Chi-square independence |
| Non-normal small samples? | Non-parametric tests |
| Paired measurements? | Paired t-test |
| Multiple variables? | Multivariate methods |

**Reporting hypothesis test results:**

Include:

- Test name and type.
- Null and alternative hypotheses.
- Test statistic.
- Degrees of freedom (if applicable).
- p-value.
- Effect size.
- Confidence interval (if applicable).
- Decision and interpretation.
- Sample size.

Example: "A one-sample z-test compared the sample mean (M = 52, n = 36) to the hypothesized value (μ₀ = 50), with known population σ = 5. The result was significant at α = 0.05, z = 2.40, p = 0.016 (two-tailed), Cohen's d = 0.40, indicating that the population mean differs from the hypothesized value."

**Common testing errors:**

- **Multiple comparisons without correction**: inflates Type I error.
- **Stopping when significant**: p-hacking; set the sample size beforehand.
- **HARKing**: hypothesizing after results are known.
- **Reporting only significant results**: publication bias.
- **Confusing significance with importance**: small p ≠ large effect.

**The Big Picture:**

Hypothesis testing is a useful framework but should not be the only tool:

- **For exploration**: descriptive stats, visualization.
- **For confirmation**: pre-registered hypothesis tests.
- **For estimation**: confidence intervals.
- **For comparison**: effect sizes.
- **For prediction**: regression and validation.

**Power and sample size:**

Before testing, consider:

- Expected effect size (Cohen's d): small (0.2), medium (0.5), large (0.8).
- Target power: usually 80%.
- α: usually 0.05.

Together these determine the required sample size. For a one-sample z-test with two-tailed α = 0.05:

| Effect (d) | n for 80% power |
|---|---|
| 0.2 | ~196 |
| 0.5 | ~32 |
| 0.8 | ~13 |
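The sample sizes in the table are consistent with the standard normal-approximation formula for a one-sample, two-tailed z-test, n = ((z₁₋α/₂ + z_power) / d)². The sketch below assumes that formula and uses a hypothetical function name; it rounds up to the next whole observation, so d = 0.2 gives 197 where the table shows the unrounded ~196.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(effect_size_d, alpha=0.05, power=0.80):
    """Approximate n for a one-sample, two-tailed z-test (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = std_normal.inv_cdf(power)          # about 0.84 for 80% power
    return ceil(((z_alpha + z_power) / effect_size_d) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: n = {required_sample_size(d)}")
```

Because n scales with 1/d², halving the target effect size roughly quadruples the required sample, which is why small effects are so often studied with too little power.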

**Common misinterpretations:**

❌ "P-value < 0.05 proves the alternative hypothesis." ✓ "The data are inconsistent with the null at this significance level."

❌ "Failure to reject the null means the null is true." ✓ "There is insufficient evidence to detect an effect with the current data."

❌ "Statistical significance = practical significance." ✓ "Statistical and practical significance are distinct questions; both matter."

**Resources:**

- Cohen, "Statistical Power Analysis for the Behavioral Sciences": classic on power and effect size.
- ASA Statement on p-values (2016): modern best practices.
- Pre-registration platforms: AsPredicted, OSF.

**Tools:**

- **Excel**: limited; use T.TEST, Z.TEST, CHISQ.TEST.
- **R**: comprehensive; t.test, prop.test, chisq.test, aov, etc.
- **Python (scipy.stats)**: ttest_1samp, ttest_ind, chi2_contingency.
- **SPSS**: menu-driven testing options.
- **JASP**: free, modern, Bayesian-friendly.

Common mistakes to avoid

  • Treating statistical significance as practical importance. They're different.
  • Multiple testing without correction. Inflates Type I error.
  • P-hacking: stopping when significant or running many tests until one works.
  • HARKing: hypothesizing after results are known.
  • Equating non-significant with null hypothesis being true.
  • Using the wrong test for the data type. Z-test for proportions or for means with known σ, t-test for means with σ estimated from the sample, chi-square for categorical data.
  • Reporting only p-value without effect size and CI.
