What is hypothesis testing?

Hypothesis testing is a statistical framework for making decisions about a population parameter based on sample data. You state a null hypothesis (H₀, "no effect"), compute a test statistic, find its p-value, and compare the p-value to the significance level α. If p < α, reject H₀ (evidence of an effect). If p ≥ α, fail to reject H₀ (insufficient evidence).

What is the difference between a z-test and t-test?

A z-test is used when the population standard deviation is known, or as an approximation when the sample is large (n > 30). A t-test is used when the population standard deviation is unknown and estimated from the sample. The t-test uses Student's t-distribution, which has heavier tails to account for the additional uncertainty in the estimated standard deviation.

What's a Type I error?

A Type I error is rejecting the null hypothesis when it is actually true (a false positive). Its probability is α, the significance level: α = 0.05 means a 5% chance of a false positive when H₀ is true. A Type II error is failing to reject the null when it is actually false (a false negative). Trade-off: for a fixed sample size, a smaller α reduces Type I errors but increases Type II errors.
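The claim that α is the false-positive rate can be checked by simulation. The sketch below is illustrative only: the function name, the population values (mean 50, SD 5, n = 36), the trial count, and the seed are all assumptions chosen for the demonstration. It draws many samples from a population where H₀ is true and counts how often a two-tailed z-test rejects.

```python
import random
from statistics import NormalDist, fmean


def type_i_error_rate(alpha=0.05, n=36, mu0=50.0, sigma=5.0,
                      trials=20_000, seed=0):
    """Fraction of two-tailed z-tests that reject H0 when H0 is true."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    se = sigma / n ** 0.5                          # standard error of the mean
    rejections = 0
    for _ in range(trials):
        # Every sample is drawn with true mean mu0, so any rejection
        # is a false positive by construction.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


rate = type_i_error_rate()
print(f"Observed false-positive rate: {rate:.3f}")
```

The observed rate should land close to the chosen α, up to simulation noise.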

What does statistical power mean?

Statistical power = 1 - β = probability of correctly rejecting a false null hypothesis. Higher power = lower chance of missing a true effect. Power depends on: effect size, sample size, significance level. Target 80% power for most research; lower power increases risk of missing real effects.
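How power depends on effect size, sample size, and α can be made concrete for the one-sample z-test. This is a minimal sketch, assuming a two-tailed test and a standardized effect size d = (μ - μ₀) / σ; the function name `z_test_power` is hypothetical.

```python
from statistics import NormalDist


def z_test_power(d, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test for standardized effect d."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = d * n ** 0.5   # mean of the z-statistic when the true effect is d
    # Probability that z falls above +z_crit or below -z_crit.
    return std.cdf(shift - z_crit) + std.cdf(-shift - z_crit)


for n in (10, 32, 100):
    print(f"d = 0.5, n = {n:>3}: power = {z_test_power(0.5, n):.3f}")
```

With d = 0 the function returns α itself, which is the Type I error rate: a test has no power beyond its false-positive rate when there is no effect to detect.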

How do I choose alpha (significance level)?

α = 0.05 is the most common standard. α = 0.01 for higher rigor (medical research, key claims). α = 0.001 for very stringent applications. α = 0.10 for exploratory analyses. Choose based on consequence of Type I error and field convention.

What is one-tailed vs two-tailed test?

Two-tailed: tests for any difference (H₁: μ ≠ μ₀); critical region in both tails. One-tailed: tests specific direction (H₁: μ > μ₀ or μ < μ₀); critical region in one tail. Two-tailed is more conservative. One-tailed has higher power but only detects in specified direction. Default: two-tailed.
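The difference between the tails comes down to which area of the standard normal curve counts as "extreme". A short sketch, with a hypothetical helper name, that returns the p-value for each alternative from a single z-statistic:

```python
from statistics import NormalDist


def z_p_values(z):
    """P-values of a z-statistic under each alternative hypothesis."""
    std = NormalDist()
    upper = 1 - std.cdf(z)            # H1: mu > mu0 (right tail only)
    lower = std.cdf(z)                # H1: mu < mu0 (left tail only)
    two_tailed = 2 * min(upper, lower)  # H1: mu != mu0 (both tails)
    return {"two_tailed": two_tailed, "upper": upper, "lower": lower}


p = z_p_values(2.4)
print({name: round(value, 6) for name, value in p.items()})
```

For z = 2.4 the upper-tail p-value is half the two-tailed one, while the lower-tail p-value is close to 1: a one-tailed test in the wrong direction cannot detect the effect.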

What if my sample size is small?

For small samples (n < 30): use the t-distribution instead of z. Check the normality assumption (Shapiro-Wilk test or visual inspection). For non-normal small samples, use non-parametric tests (Mann-Whitney, Wilcoxon). Small samples have low power; an effect must be large to be detected.

Hypothesis Testing Calculator

Enter a sample mean, population mean, population standard deviation, sample size, and significance level to perform a z-test. Get the z-statistic, p-value, and a reject/fail-to-reject decision.

Hypothesis testing is the formal framework for making statistical inferences from sample data. The basic structure: state a null hypothesis (H₀) representing "no effect" or "no difference," compute a test statistic from your data, find the probability of seeing such a statistic under the null (the p-value), and decide whether to reject the null based on a pre-set significance level α.

This calculator performs a one-sample z-test: comparing a sample mean to a hypothesized population mean when the population standard deviation is known. It returns the z-statistic, p-value, critical value, and reject/fail-to-reject decision. For the more common case where population SD is estimated from the sample, use a t-test instead.
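The computation described above can be sketched in a few lines of Python. This is not the calculator's actual source; the names `ZTestResult` and `one_sample_z_test` are hypothetical, and the one-tailed p-value is assumed to be taken in the direction of the observed difference.

```python
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class ZTestResult:
    z: float
    standard_error: float
    p_two_tailed: float
    p_one_tailed: float      # in the direction of the observed difference
    critical_value: float    # two-tailed
    cohens_d: float
    reject_h0: bool          # two-tailed decision


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """One-sample z-test with known population standard deviation."""
    if sigma <= 0 or n <= 0:
        raise ValueError("sigma and n must be positive")
    std = NormalDist()
    se = sigma / n ** 0.5
    z = (sample_mean - mu0) / se
    p_two = 2 * (1 - std.cdf(abs(z)))
    return ZTestResult(
        z=z,
        standard_error=se,
        p_two_tailed=p_two,
        p_one_tailed=p_two / 2,
        critical_value=std.inv_cdf(1 - alpha / 2),
        cohens_d=(sample_mean - mu0) / sigma,
        reject_h0=p_two < alpha,
    )


result = one_sample_z_test(sample_mean=52, mu0=50, sigma=5, n=36)
print(result)
```

The inputs in the usage line are the ones from the worked example on this page (sample mean 52, hypothesized mean 50, σ = 5, n = 36, α = 0.05).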

The whole hypothesis testing framework is a target of ongoing criticism in scientific practice — too binary, too easily abused (p-hacking, HARKing), too often misinterpreted. Modern practice emphasizes effect sizes, confidence intervals, and replication alongside or instead of significance tests. Still, understanding the framework remains essential for reading scientific literature and conducting analyses correctly.

Inputs

Sample Mean

Population Mean (H0)

Population Std Dev (sigma)

Sample Size (n)

Significance Level (alpha)

Results

| Output | Value |
|---|---|
| Z-Statistic | 2.4000 |
| P-Value (Two-Tail) | 0.016395 |
| P-Value (One-Tail) | 0.008198 |
| Decision | Reject H0 (p = 0.0164 < 0.05) |
| Standard Error | 0.8333 |
| Critical Value (Two-Tail) | +/- 1.9600 |
| Cohen's d (Effect Size) | 0.4000 |

Last updated: May 29, 2026

Formula

**One-sample z-test:**

z = (x̄ - μ₀) / (σ / √n)

Where:

- **x̄**: sample mean
- **μ₀**: hypothesized population mean (under H₀)
- **σ**: population standard deviation
- **n**: sample size

**Decision rule:**

- If |z| > z_critical → reject H₀
- If |z| ≤ z_critical → fail to reject H₀

**Critical z-values:**

| α | One-tail | Two-tail |
|---|---|---|
| 0.10 | ±1.282 | ±1.645 |
| 0.05 | ±1.645 | ±1.960 |
| 0.01 | ±2.326 | ±2.576 |
| 0.001 | ±3.090 | ±3.291 |

**Worked example: sample mean 52, hypothesized 50, σ = 5, n = 36, α = 0.05**

- z = (52 - 50) / (5 / √36) = 2 / 0.833 = 2.40
- Two-tailed p = 2 × (1 - Φ(2.40)) = 2 × 0.0082 = 0.016
- Since p (0.016) < α (0.05): reject H₀.
- Conclusion: significant evidence the population mean differs from 50.

**Steps for hypothesis testing:**

1. **State H₀ and H₁**: H₀ is "no effect"; H₁ is what you want to detect.
2. **Choose α** (significance level): commonly 0.05.
3. **Choose test**: z-test, t-test, chi-square, etc. based on data and hypotheses.
4. **Compute test statistic**.
5. **Find p-value** from distribution.
6. **Compare to α**: p < α → reject H₀.
7. **Interpret in context**: significance + effect size + practical importance.

**Type I and Type II errors:**

| Decision | H₀ true | H₀ false |
|---|---|---|
| Reject H₀ | Type I error (α) | Correct (power) |
| Fail to reject H₀ | Correct (1-α) | Type II error (β) |

- **α** (Type I): probability of rejecting H₀ when it's true (false positive).
- **β** (Type II): probability of failing to reject H₀ when it's false (false negative).
- **Power** = 1 - β: probability of correctly detecting an effect.
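The critical z-values tabulated in this Formula section are quantiles of the standard normal distribution, so they can be reproduced rather than looked up. A minimal sketch using Python's standard library (the helper name `critical_z` is an assumption):

```python
from statistics import NormalDist


def critical_z(alpha, tails=2):
    """Positive critical z-value for a one- or two-tailed test."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test splits alpha evenly between both tails.
    return NormalDist().inv_cdf(1 - alpha / tails)


print("alpha   one-tail  two-tail")
for alpha in (0.10, 0.05, 0.01, 0.001):
    print(f"{alpha:<7} {critical_z(alpha, 1):<9.3f} {critical_z(alpha, 2):.3f}")
```

Passing any other α gives the critical value for significance levels that are not in the table.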
**Common hypothesis tests:**

| Test | Use case |
|---|---|
| One-sample z-test | Sample vs hypothesized value, known SD |
| One-sample t-test | Sample vs hypothesized, SD estimated |
| Two-sample t-test | Compare two group means |
| Paired t-test | Before/after on same subjects |
| Chi-square | Categorical data, goodness of fit |
| ANOVA | Three or more group comparison |
| Mann-Whitney | Non-parametric two-group |
| Wilcoxon | Non-parametric paired |
| Fisher's exact | Small samples, categorical |
| Kruskal-Wallis | Non-parametric ANOVA |

**One-tailed vs two-tailed:**

- **One-tailed**: testing direction (e.g., H₁: μ > μ₀).
- **Two-tailed**: testing for any difference (H₁: μ ≠ μ₀).
- Two-tailed is more conservative (requires more extreme data).
- One-tailed has higher power but only detects in specified direction.
- Default: two-tailed unless strong prior reason.

**Significance level (α):**

- **0.05**: standard convention, very common.
- **0.01**: more stringent; medical research, high-stakes.
- **0.001**: very stringent; key claims, replication.
- **0.10**: more lenient; exploratory analysis.

α is the maximum probability of Type I error you're willing to accept.

**Multiple testing problem:**

When running multiple tests:

- Individual α = 0.05 each.
- Family-wise α increases dramatically.
- 10 tests at α = 0.05: ~40% chance of at least one false positive.
- 100 tests: ~99% chance.

Solutions:

- Bonferroni correction: α/k.
- Benjamini-Hochberg: controls false discovery rate.
- Pre-specify analysis plan.

**Statistical vs practical significance:**

- **Statistical significance**: p < α; effect is reliably detected.
- **Practical significance**: effect is large enough to matter.
- Large sample can detect trivial effects (p < 0.05, but effect size tiny).
- Small sample may miss meaningful effects (p > 0.05, but effect size moderate).

**Modern best practices:**

1. **Pre-register hypotheses**: prevent post-hoc analysis.
2. **Report effect size**: Cohen's d, Pearson r, odds ratio.
3. **Provide confidence intervals**: more informative than p alone.
4. **Note sample size**: affects interpretation.
5. **Distinguish exploratory from confirmatory**: different evidence standards.
6. **Replicate findings**: single study is rarely conclusive.

**Power analysis:**

Before study:

- Determine target effect size.
- Choose α (usually 0.05).
- Target power (usually 80%).
- Calculate required sample size.

After study:

- Power = 1 - β = probability of detecting given effect.
- Low power = high Type II error rate.
- Most studies are underpowered for small effects.

**Bayesian alternative:**

Bayesian hypothesis testing:

- **Prior**: belief about hypotheses before data.
- **Likelihood**: probability of data given hypotheses.
- **Posterior**: updated belief after data.
- **Bayes factor**: ratio of evidence for each hypothesis.

Avoids some misinterpretations of frequentist p-values.

**ASA statement on p-values (2016):**

1. P-values don't measure probability of null being true.
2. Significance doesn't imply causation or practical importance.
3. Don't draw conclusions from p < threshold alone.
4. Proper inference requires full reporting and transparency.
5. P-values don't measure effect size.
6. P-values alone are insufficient for decisions.
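The multiple testing figures quoted in this section (about 40% for 10 tests, about 99% for 100) assume independent tests, and the two corrections can be written in a few lines. The sketch below is illustrative: the function names and the five example p-values are made up for the demonstration, not taken from any study.

```python
def family_wise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_reject(p_values, alpha=0.05):
    """Reject only where p is below alpha divided by the number of tests."""
    k = len(p_values)
    return [p < alpha / k for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank r whose sorted p-value satisfies p <= (r / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= cutoff_rank
    return reject


p_values = [0.001, 0.012, 0.025, 0.041, 0.27]   # hypothetical p-values
print(f"FWER, 10 tests:  {family_wise_error_rate(10):.3f}")
print(f"FWER, 100 tests: {family_wise_error_rate(100):.3f}")
print("Bonferroni:        ", bonferroni_reject(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg_reject(p_values))
```

On these example values Bonferroni rejects fewer hypotheses than Benjamini-Hochberg, which reflects the different error rates they control: family-wise error versus false discovery rate.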

How to use this calculator

Enter sample mean.
Enter hypothesized population mean (under H₀).
Enter population standard deviation.
Enter sample size.
Set significance level (α, typically 0.05).
Calculator returns z-statistic, p-value, and decision.

Worked examples

IQ test scoring

**Scenario:** IQ test designed for population mean 100, SD 15. Sample of 36 students: mean 105. Significance level α = 0.05. Is this group significantly different?

**Calculation:** z = (105 - 100) / (15 / √36) = 5/2.5 = 2.0. Two-tailed p = 0.046. Since p < α = 0.05: reject null.

**Result:** Statistically significant evidence (p = 0.046) that this group's mean IQ differs from 100. Effect size: 5 points = 1/3 SD = small-moderate. Practical importance moderate; consider context.

Production line quality check

**Scenario:** Bottling line spec: 500 mL ± 5 mL. Sample of 49 bottles: mean 502 mL, σ known = 4 mL. Is the line on target? α = 0.05.

**Calculation:** z = (502 - 500) / (4 / √49) = 2 / 0.571 = 3.5. Two-tailed p < 0.001. Reject the null.

**Result:** Highly significant evidence that the line is off target (2 mL high). The mean is still within tolerance (±5 mL), but the fill level has shifted. Consider recalibrating; further drift could lead to specification violations.

Marketing campaign effect

**Scenario:** Pre-campaign average sales: $1000/day. After campaign, sample of 30 days: mean $1150, σ known = $200. Did campaign work? One-tailed test. α = 0.05.

**Calculation:** z = (1150 - 1000) / (200 / √30) = 150 / 36.5 = 4.1. One-tailed p < 0.001.

**Result:** Highly significant. Campaign produced significantly higher sales. Effect size: ($150/$200 = 0.75 SD) is moderate-large. Practical significance: $150/day increase = $54,750/year if sustained.
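Each of the three worked examples can also be reported as a confidence interval for the mean, which shows the size of the effect as well as its significance. The sketch below assumes a two-sided 95% interval with known σ; the helper name and the dictionary layout are hypothetical, and the numbers are the ones from the scenarios above.

```python
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sigma, n, level=0.95):
    """Two-sided confidence interval for a mean with known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sigma / n ** 0.5
    return sample_mean - margin, sample_mean + margin


# (sample mean, hypothesized mean, sigma, n) for each worked example
examples = {
    "IQ test scoring": (105, 100, 15, 36),
    "Production line": (502, 500, 4, 49),
    "Marketing campaign": (1150, 1000, 200, 30),
}

intervals = {}
for name, (mean, mu0, sigma, n) in examples.items():
    low, high = mean_confidence_interval(mean, sigma, n)
    intervals[name] = (low, high)
    excluded = not (low <= mu0 <= high)
    print(f"{name}: 95% CI [{low:.2f}, {high:.2f}], excludes {mu0}: {excluded}")
```

A two-sided 95% interval that excludes the hypothesized mean corresponds to rejecting H₀ in a two-tailed test at α = 0.05. The marketing example used a one-tailed test, so the two-sided interval here is a slightly stricter check.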

When to use this calculator

**Use hypothesis testing for:**

- **Comparing data to expected value**: one-sample test.
- **Comparing two groups**: two-sample test.
- **Before/after analysis**: paired t-test.
- **Categorical data analysis**: chi-square.
- **Multi-group comparison**: ANOVA.
- **Quality control**: process monitoring.
- **Scientific research**: standard inferential framework.

**Choosing the right test:**

| Question | Test |
|---|---|
| Mean equal to value? | One-sample t/z test |
| Two means different? | Two-sample t-test |
| Three+ means different? | ANOVA |
| Proportions different? | Z-test for proportions |
| Categorical association? | Chi-square independence |
| Non-normal small samples? | Non-parametric tests |
| Paired measurements? | Paired t-test |
| Multiple variables? | Multivariate methods |

**Reporting hypothesis test results:**

Include:

- Test name and type.
- Null and alternative hypotheses.
- Test statistic.
- Degrees of freedom (if applicable).
- p-value.
- Effect size.
- Confidence interval (if applicable).
- Decision and interpretation.
- Sample size.

Example: "A one-sample z-test was conducted comparing the sample mean (M = 52) to the hypothesized value (μ₀ = 50). The result was significant, z = 2.40, p = 0.016, indicating the sample differs from the hypothesized mean."

**Common testing errors:**

- **Multiple comparisons without correction**: inflates Type I error.
- **Stopping when significant**: p-hacking; should set sample size beforehand.
- **HARKing**: hypothesizing after results.
- **Reporting only significant**: publication bias.
- **Confusing significance with importance**: small p ≠ large effect.

**The Big Picture:**

Hypothesis testing is a useful framework but should not be the only tool:

- **For exploration**: descriptive stats, visualization.
- **For confirmation**: pre-registered hypothesis tests.
- **For estimation**: confidence intervals.
- **For comparison**: effect sizes.
- **For prediction**: regression and validation.

**Power and sample size:**

Before testing, consider:

- Expected effect size: small (0.2), medium (0.5), large (0.8).
- Target power: usually 80%.
- α: usually 0.05.
- This determines required sample size.

| Effect (d) | n for 80% power |
|---|---|
| 0.2 | ~196 |
| 0.5 | ~32 |
| 0.8 | ~13 |
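The sample sizes in this table are consistent with the standard approximation for a two-tailed one-sample z-test, n ≈ ((z₁₋α/₂ + z_power) / d)². That this is the formula behind the table is an assumption; the sketch below simply evaluates it, with a hypothetical function name.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(d, power=0.80, alpha=0.05):
    """Approximate n for a two-tailed one-sample z-test (not yet rounded)."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_power = std.inv_cdf(power)           # quantile matching the target power
    # Ignores the negligible chance of rejecting in the wrong tail.
    return ((z_alpha + z_power) / d) ** 2


for d in (0.2, 0.5, 0.8):
    n = required_sample_size(d)
    print(f"d = {d}: n = {n:.1f}, rounded up to {ceil(n)}")
```

Rounding up is the usual convention because a fractional observation cannot be collected. Designs that compare two groups or estimate the SD from the sample (t-tests) need larger samples than this one-sample figure.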

**Common misinterpretations:**

❌ "P-value < 0.05 proves the alternative hypothesis." ✓ "Data is inconsistent with the null at this significance level."

❌ "Failure to reject null means null is true." ✓ "Insufficient evidence to detect an effect with current data."

❌ "Statistical significance = practical significance." ✓ "Statistical and practical significance are independent; both matter."

**Resources:**

- Cohen "Statistical Power Analysis" - classic on power and effect size.
- ASA Statement on p-values (2016) - modern best practices.
- Pre-registration platforms: AsPredicted, OSF.

**Tools:**

- **Excel**: limited; use T.TEST, Z.TEST, CHISQ.TEST.
- **R**: comprehensive; t.test, prop.test, chisq.test, aov, etc.
- **Python (scipy.stats)**: ttest_1samp, ttest_ind, chi2_contingency.
- **SPSS**: menu-driven testing options.
- **JASP**: free, modern Bayesian-friendly.

Common mistakes to avoid

Treating statistical significance as practical importance. They're different.
Multiple testing without correction. Inflates Type I error.
P-hacking: stopping when significant or running many tests until one works.
HARKing: hypothesizing after results are known.
Equating non-significant with null hypothesis being true.
Using wrong test for data type. Z-test for proportions, t for means, chi-square for categorical.
Reporting only p-value without effect size and CI.
