
Scipy Significance Testing

Significance testing involves making an assumption (hypothesis) about the parameters of a population (random variable) or the form of its distribution, and then using sample information to judge how plausible that assumption is. The goal is to assess whether the true state of the population differs significantly from the original hypothesis. Put another way, significance testing determines whether the difference between the sample and the assumption we made about the population is due to random variation or to a real discrepancy between our assumption and the actual state of the population. The test accepts or rejects the hypothesis based on the principle that unlikely events are practically impossible in a single trial.

Significance testing is used to determine if there are differences between the experimental treatment group and the control group, or between two different treatments, and whether these differences are significant.

SciPy provides the scipy.stats module for performing significance tests.

Statistical Hypotheses

Statistical hypotheses are assumptions about the unknown distribution of one or more random variables. When the distribution form of the random variable is known, and the hypothesis only involves one or several unknown parameters in the distribution, it is called a parametric hypothesis. The process of testing statistical hypotheses is known as hypothesis testing, and the test for parametric hypotheses is called parametric testing.

Null Hypothesis

The null hypothesis, also known as the original hypothesis, is a statistical term referring to the assumption established before a statistical test is carried out. When the null hypothesis is true, the relevant test statistic follows a known probability distribution.

When the calculated value of the statistic falls into the rejection region, it indicates that an unlikely event has occurred, and the original hypothesis should be rejected.

The hypothesis to be tested is usually denoted H0 and called the null hypothesis; the hypothesis opposing H0 is denoted H1 and called the alternative hypothesis.

Typically, only the maximum probability of a Type I error α is limited, without considering the probability of a Type II error β. Such a hypothesis test is also known as a significance test, with the probability α referred to as the significance level.

The most commonly used α values are 0.01, 0.05, and 0.10. Depending on the research question, if rejecting a true hypothesis would have serious consequences, a smaller α is chosen; otherwise, a larger α is used.
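
As a rough illustration of how the significance level determines the rejection region, the critical values for a two-sided test with a standard normal test statistic can be computed with scipy.stats.norm.ppf(). This is a minimal sketch under that assumption, not a general recipe:

Example

from scipy.stats import norm

# Two-sided critical values for a standard normal test statistic.
# Illustrative only: the actual rejection region depends on the
# distribution of the statistic under the null hypothesis.
for alpha in (0.01, 0.05, 0.10):
    z = norm.ppf(1 - alpha / 2)  # upper critical value
    print(f"alpha={alpha:.2f}: reject H0 if |z| > {z:.3f}")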

Alternative Hypothesis

The alternative hypothesis is a fundamental concept in statistics, encompassing all propositions about the population distribution that would cause the null hypothesis to be rejected. It is also known as the opposing hypothesis or the alternate hypothesis.

If the null hypothesis is rejected, the alternative hypothesis is accepted in its place.

For example, in evaluating students, we would use:

"The student is below average" — as the null hypothesis

"The student is above average" — as the alternative hypothesis.

One-Sided Test

A one-sided test, also known as a one-tailed test, is a hypothesis-testing method in which the critical region is constructed from a single tail area under the density curve of the test statistic.

When our hypothesis tests only one side of the value, it is called a "one-tailed test".

Example:

For the null hypothesis:

"Mean equals k"

We can have the alternative hypothesis:

"Mean is less than k"
or
"Mean is greater than k"

Two-Sided Test

A two-sided test, also known as a two-tailed test, is a hypothesis-testing method in which the critical region is constructed from both tail areas under the density curve of the test statistic.

When our hypothesis tests both sides of the value, it is called a "two-tailed test".

Example:

For the null hypothesis:

"Mean equals k"

We can have the alternative hypothesis:

"Mean does not equal k"

In this case, the mean is either less than or greater than k, and both sides need to be checked.
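
A minimal sketch of the corresponding two-sided test, again using the ttest_1samp() function with an illustrative sample and k = 0:

Example

import numpy as np
from scipy.stats import ttest_1samp

v = np.random.normal(size=100)  # illustrative sample

# Two-sided test: H0 "mean equals 0" vs. H1 "mean does not equal 0".
# alternative='two-sided' is the default.
res = ttest_1samp(v, popmean=0)

print(res.pvalue)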

Alpha Value

The alpha value is the significance level.

The significance level is the probability of making an error when estimating that the population parameter falls within a certain interval, denoted as α.

It determines how extreme the data must be before the null hypothesis is rejected.

Commonly taken as 0.01, 0.05, or 0.1.

P-Value

Compare the p-value with the alpha value to determine the level of statistical significance.

If p-value <= alpha, we reject the null hypothesis and conclude that the result is statistically significant; otherwise, we fail to reject (accept) the null hypothesis.
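
As a minimal sketch of this decision rule (the samples and the alpha value of 0.05 are illustrative; ttest_ind() is covered in the next section):

Example

import numpy as np
from scipy.stats import ttest_ind

alpha = 0.05  # illustrative significance level

v1 = np.random.normal(size=100)
v2 = np.random.normal(size=100)

p = ttest_ind(v1, v2).pvalue

# Compare the p-value with alpha to decide on the null hypothesis.
if p <= alpha:
    print("Reject the null hypothesis (statistically significant)")
else:
    print("Fail to reject the null hypothesis")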

T-Test

The T-test is used to determine if there is a significant difference between the means of two variables and to judge whether they belong to the same distribution.

This is a two-tailed test.

The ttest_ind() function takes two independent samples and returns a tuple containing the t-statistic and the p-value.

To find if the given values v1 and v2 come from the same distribution:

Example

import numpy as np
from scipy.stats import ttest_ind

v1 = np.random.normal(size=100)
v2 = np.random.normal(size=100)

res = ttest_ind(v1, v2)

print(res)

Output (your values will differ, because the samples are random):

Ttest_indResult(statistic=0.40833510339674095, pvalue=0.68346891833752133)

If you only want to return the p-value, use the pvalue attribute:

Example

import numpy as np
from scipy.stats import ttest_ind

v1 = np.random.normal(size=100)
v2 = np.random.normal(size=100)

res = ttest_ind(v1, v2).pvalue

print(res)

Output:

0.68346891833752133

KS Test

The KS (Kolmogorov–Smirnov) test checks whether given values follow a given distribution. The kstest() function takes two parameters: the values to be tested and the CDF.

CDF stands for Cumulative Distribution Function, which can be a string or a callable function that returns probabilities.

It can be used as a one-tailed or two-tailed test; by default, it is two-tailed. We can pass the alternative parameter as one of the strings "two-sided", "less", or "greater", as shown in the sketch below.
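
For example, a one-sided KS test against the standard normal CDF might look like this (the sample and the choice of 'less' are illustrative):

Example

import numpy as np
from scipy.stats import kstest

v = np.random.normal(size=100)  # illustrative sample

# One-sided KS test: alternative can be 'two-sided' (default),
# 'less', or 'greater'.
res = kstest(v, 'norm', alternative='less')

print(res)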

To find if the given values follow a normal distribution:

Example

import numpy as np
from scipy.stats import kstest

v = np.random.normal(size=100)

res = kstest(v, 'norm')

print(res)

Output:

KstestResult(statistic=0.047798701221956841, pvalue=0.97630967161777515)

Statistical Description of Data

The describe() function can be used to view summary statistics for an array, including the following values:

number of observations (nobs)
minimum and maximum values (minmax)
mean
variance
skewness
kurtosis

To display statistical description information for an array:

Example

import numpy as np
from scipy.stats import describe

v = np.random.normal(size=100)
res = describe(v)

print(res)

Output:

DescribeResult(
    nobs=100,
    minmax=(-2.0991855456740121, 2.1304142707414964),
    mean=0.11503747689121079,
    variance=0.99418092655064605,
    skewness=0.013953400984243667,
    kurtosis=-0.671060517912661
)

Normality Test (Skewness and Kurtosis)

The test for determining whether the population follows a normal distribution based on observed data is called a normality test. It is a special type of goodness-of-fit hypothesis test in statistical decision-making.

The normality test is based on skewness and kurtosis.

The normaltest() function returns the test statistic and the p-value for the null hypothesis:

"x comes from a normal distribution"

Skewness

A measure of the symmetry of the data.

For a normal distribution, it is 0.

If negative, it means the data is skewed to the left.

If positive, it means the data is skewed to the right.

Kurtosis

A measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.

A positive kurtosis indicates heavy tails.

A negative kurtosis indicates light tails.

To find the skewness and kurtosis of the values in an array:

Example

import numpy as np
from scipy.stats import skew, kurtosis

v = np.random.normal(size=100)

print(skew(v))
print(kurtosis(v))

Output:

0.11168446328610283
-0.1879320563260931

To find if the data comes from a normal distribution:

Example

import numpy as np
from scipy.stats import normaltest

v = np.random.normal(size=100)

print(normaltest(v))

Output:

NormaltestResult(statistic=4.4783745697002848, pvalue=0.10654505998635538)