Statistical concepts we once learned can fade after long periods of disuse, and recalling them when they are suddenly needed again can be difficult.
In this article, we briefly review some important statistical concepts, distributions, and test methods to help you quickly regain the basics of statistical analysis.
First, we start with one distribution and three statistical concepts:
Normal distribution curve X~N(μ,σ²)
Significant difference in means
Confidence interval
Degrees of freedom
Next, we introduce distributions derived from the normal distribution:
Chi-Squared (X²) distribution
F-Distributions
Finally, we introduce a test method:
Introductory Overview of Single-Factor ANOVA
Through these contents, we hope to help you re-master the key points of statistical analysis.
In the next article, building on these concepts, we will work through a specific case: using "ordinary single-factor ANOVA" to analyze the effect of different soybean feed concentrations on iron-deficiency anemia in mice.
Main Text:
First, we will introduce some basic statistical distributions and concepts to help understand the results obtained through AI analysis.
Distribution 1: Normal Distribution Curve
We start with three normal distribution curves obtained from frequency analysis. These curves help us understand the data distribution, which is the basis for normality tests.

When the mean is the same, the larger the variance, the flatter the normal distribution curve; the smaller the variance, the steeper the curve.
When the variance is the same, the larger the mean, the more the entire normal distribution curve shifts to the right.
Note:
For how to obtain these curves, please refer to a previous analysis case I did with data files, Column Table — Frequency Analysis & Gaussian Fit, which includes the specific frequency analysis steps and curve-fitting process, along with Python source code.
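To make these two statements concrete, here is a minimal Python sketch (not the code from the earlier case; it assumes NumPy, SciPy, and Matplotlib are available) that plots both situations:

```python
# Plot the two effects described above: same mean with different
# variances (flatter vs. steeper), and same variance with different
# means (curve shifts right as the mean grows).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-10, 10, 500)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Same mean (0), different standard deviations: larger sigma -> flatter curve.
for sigma in (1, 2, 3):
    ax1.plot(x, norm.pdf(x, loc=0, scale=sigma), label=f"N(0, {sigma**2})")
ax1.set_title("Same mean, different variance")
ax1.legend()

# Same standard deviation (1), different means: larger mean -> shifted right.
for mu in (-2, 0, 2):
    ax2.plot(x, norm.pdf(x, loc=mu, scale=1), label=f"N({mu}, 1)")
ax2.set_title("Same variance, different mean")
ax2.legend()

plt.tight_layout()
plt.show()
```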

Concept 1: Significant Difference in Means
Assuming the data follow a normal distribution and the groups have equal variance, we test whether the means of two groups differ significantly after experimental treatment.
We will illustrate with two visual examples:
Case A: Large within-group variation, much overlap between groups.
Case B: Small within-group variation, little overlap between groups.

In both cases, the difference in sample group means is the same.
However, the difference is more clearly significant in Case B, as there is less overlap between groups, which means:
Even the smallest value randomly drawn from the higher-mean group would still rank high within the lower-mean group.
Conversely, in Case A, there is more overlap between groups, and a value drawn from the higher-mean group will not necessarily rank high within the lower-mean group. Therefore, the difference between the sample group means is not significant.
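The pattern can be reproduced with a short simulation; the group sizes, means, and spreads below are made-up values chosen only to contrast the two cases:

```python
# Cases A and B: identical difference in group means (2.0), but very
# different within-group variation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, mean_diff = 30, 2.0

# Case A: large within-group spread -> heavy overlap between the groups.
a1 = rng.normal(0.0, 6.0, n)
a2 = rng.normal(mean_diff, 6.0, n)

# Case B: small within-group spread -> little overlap between the groups.
b1 = rng.normal(0.0, 0.5, n)
b2 = rng.normal(mean_diff, 0.5, n)

print("Case A:", stats.ttest_ind(a1, a2))  # typically a large p-value: not significant
print("Case B:", stats.ttest_ind(b1, b2))  # tiny p-value: clearly significant
```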
Concept 2: Confidence Interval
At a given confidence level, we can judge whether the designed experimental treatment has a significant impact on the research objective, based on the "mean difference" and "standard deviation (SD)" results obtained through ANOVA.
(1) Confidence Interval: In statistical analysis, a confidence interval is a range used to estimate a population parameter (e.g., mean), within which the parameter is believed to fall with a certain probability (usually 95%).
(2) Confidence Level: A 95% confidence level means we are 95% confident that the calculated confidence interval contains the true population parameter.
(3) Significance Level (α): In hypothesis testing, α is the probability threshold for determining whether to reject the null hypothesis, commonly set at 0.05 or 0.01.
The relationship between α and the confidence level is simple: Confidence Level = 1 − α.
For example, α = 0.05 corresponds to a 95% confidence level.
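As a minimal sketch, such an interval can be computed with SciPy's t distribution; the sample here is illustrative (the same four values reused in the degrees-of-freedom example below):

```python
# Compute a 95% confidence interval for a sample mean.
import numpy as np
from scipy import stats

data = np.array([4, 7, 9, 10])
mean = data.mean()
sem = stats.sem(data)                    # standard error of the mean
low, high = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)

print(f"mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```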
There are three types of tests:
Two-Tailed Test: Used to test whether the sample mean is significantly different from the population mean, assuming differences on both sides, with α/2 distributed in both tails.
Right-Tailed Test: Used to test whether the sample mean is significantly greater than the population mean, assuming a difference in the right tail, where the test statistic falls in the right tail area.
Left-Tailed Test: Used to test whether the sample mean is significantly less than the population mean, assuming a difference in the left tail, where the test statistic falls in the left tail area.

With a 95% confidence level:
For a two-tailed test, the rejection region is 2.5% in each tail.
For a one-tailed test (left or right), the entire 5% rejection region lies in a single tail.
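A short sketch, assuming a standard normal test statistic for simplicity, shows where these boundaries fall at α = 0.05:

```python
# Rejection boundaries at alpha = 0.05 for the standard normal.
from scipy.stats import norm

alpha = 0.05

# Two-tailed: alpha/2 = 2.5% in each tail.
lower, upper = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)
print(f"two-tailed boundaries: {lower:.2f}, {upper:.2f}")   # -1.96, 1.96

# One-tailed: the whole 5% sits in a single tail.
print(f"right-tailed boundary: {norm.ppf(1 - alpha):.2f}")  # 1.64
print(f"left-tailed boundary:  {norm.ppf(alpha):.2f}")      # -1.64
```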
Concept 3: Degrees of Freedom
Assume you have the following sample data: (4, 7, 9, 10).
The sample size is 4. We need to calculate the sample mean and sample variance.
The sample mean is calculated as:
x̄ = (4 + 7 + 9 + 10) / 4 = 7.5
The concept of degrees of freedom is applied when calculating the sample variance.
The formula for sample variance is:
s² = Σ(xᵢ − x̄)² / (n − 1)
where n is the sample size.
To calculate the variance, the mean is subtracted from each data point, the result is squared, and the squares are summed. Because we used the sample mean (x̄ = 7.5) to compute this sum of squares, we have already consumed one degree of freedom.
Therefore, the remaining degrees of freedom are the sample size of 4 minus 1, which is 3.
In summary: in a dataset containing 4 values, the degrees of freedom for calculating the sample variance are 4 − 1 = 3, because computing the sample mean has already consumed one degree of freedom.
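This is exactly why NumPy's variance function needs ddof=1 to give the sample variance (dividing by n − 1 instead of n); a quick check with the data above:

```python
# Verify the degrees-of-freedom argument: divide by n - 1 = 3, not n = 4.
import numpy as np

data = np.array([4, 7, 9, 10])
mean = data.mean()                       # 7.5, consumes one degree of freedom
ss = ((data - mean) ** 2).sum()          # total sum of squares = 21.0

print(ss / (len(data) - 1))              # 7.0, manual sample variance
print(np.var(data, ddof=1))              # 7.0, same result via NumPy
```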
Distribution 2: Chi-Squared (X²) Distribution
If Z₁, Z₂, …, Zₙ are independent standard normal variables, then X = Z₁² + Z₂² + … + Zₙ² follows a Chi-Squared distribution with n degrees of freedom. Here n = df(n), the degrees of freedom, represents the number of independent variables that can freely vary in the calculation.
E(X)=n
D(X)=2n
When n→∞, X²→N(n,2n)
Additionally, the Chi-Squared distribution is additive. If X₁ ~X²₍𝑑𝑓₁₎ and X₂~X²₍𝑑𝑓₂₎, then (X₁+X₂)~X²₍𝑑𝑓₁₊𝑑𝑓₂₎.
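A small simulation (a sketch, not part of the article's case) confirms these properties by summing squared standard normal variables:

```python
# The sum of n squared standard normals matches X²(n): the sample mean
# is close to n and the sample variance close to 2n.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 100_000

z = rng.standard_normal((reps, n))
chi2_samples = (z ** 2).sum(axis=1)      # each row: sum of n squared normals

print(chi2_samples.mean())               # ~ n = 5     (E(X) = n)
print(chi2_samples.var())                # ~ 2n = 10   (D(X) = 2n)
```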

Distribution 3: F-Distributions
If X₁ ~ X²₍df₁₎ and X₂ ~ X²₍df₂₎, and X₁ and X₂ are independent, then the random variable F can be expressed as
F = (X₁ / df₁) / (X₂ / df₂)
The F distribution has two shape parameters, the degrees of freedom df₁ and df₂; each pair (df₁, df₂) determines one F distribution.
When df₁≤2, the F distribution density function is similar to a J-shape.
When df₁>2, the F distribution density function is positively skewed.
As df₁ and df₂ increase, the skewness of the curve decreases, but the F distribution does not have the normal distribution as its limiting form.
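A brief sketch with SciPy's f distribution illustrates how the pair (df₁, df₂) shapes the density, mirroring the J-shaped and positively skewed cases described above:

```python
# Plot F densities for several (df1, df2) pairs.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f

x = np.linspace(0.01, 4, 400)
for df1, df2 in [(1, 10), (2, 10), (5, 10), (20, 20)]:
    plt.plot(x, f.pdf(x, df1, df2), label=f"F({df1}, {df2})")

plt.title("F distribution densities")
plt.legend()
plt.show()
```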

Test Method: Introductory Overview of Single-Factor ANOVA
The basic idea of variance analysis comes from Cochran's decomposition theorem, which states that under normality, a total sum of squares can be decomposed into sums of squares that are independent and Chi-Squared distributed (i.e., the Chi-Squared distribution we introduced above). That is, the total sum of squares (SST) can be decomposed into:
The sum of squares between treatments (SSTreatment)
The sum of squares of errors (SSE)
The independence and Chi-Squared distribution properties of these sums of squares are important for hypothesis testing.
Thus, the F distribution is derived (i.e., the F-Distributions we introduced above). The F value is the ratio of the mean squares: F = (SSTr / df₁) / (SSE / df₂) = MSTr / MSE, where df₁ and df₂ are the between-group and within-group degrees of freedom.
In variance analysis:
The difference between sample means is called the between-group difference (SSTr).
The difference between observations within samples is called the within-group difference (SSE).
In other words, the basic logic of variance analysis is to decompose the total difference (SST) into between-group and within-group differences and then compare their relative sizes.
The larger the ratio of the between-group difference to the within-group difference, the more evident the difference between the group means.
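A minimal sketch with three made-up groups shows the decomposition SST = SSTr + SSE and the resulting F value, cross-checked against SciPy:

```python
# Decompose the total sum of squares and compute F = MSTr / MSE.
import numpy as np
from scipy import stats

groups = [np.array([5.1, 5.8, 6.0]),
          np.array([6.9, 7.2, 7.5]),
          np.array([8.0, 8.4, 8.8])]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()
k, n = len(groups), len(all_data)

sst = ((all_data - grand_mean) ** 2).sum()                    # total
sstr = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)        # within

f_value = (sstr / (k - 1)) / (sse / (n - k))                  # F = MSTr / MSE
print(f"SST={sst:.3f}, SSTr={sstr:.3f}, SSE={sse:.3f}, F={f_value:.3f}")

# Cross-check against SciPy's one-way ANOVA.
print(stats.f_oneway(*groups))
```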
Supplementary Concept: Factor & Level in ANOVA
Assume there are several conditions that affect the occurrence of event A when changed.
Suppose we now change condition a1 to observe its impact on event A.
Here, condition a1 is considered a factor.
If an experiment changes only one factor, it is a single-factor experiment (easy to perform, but lower in efficiency and ecological validity).
Changing two factors simultaneously gives a two-factor experiment (more efficient, but requiring stricter operational control).
If one factor is gender, with two possibilities (1 = male, 2 = female), each possibility "male/female" is called a level: two possibilities means two levels.
If daily study time is set to 1, 2, 3, 5, or 7 hours to observe its effect on the probability of passing the final exam, we control five possible study times, which gives five levels.
If gender and study time are changed simultaneously to observe the final study performance, this is called a two-factor experiment. Additionally, changing three or more experimental conditions simultaneously is called a multi-factor experiment.
For single-factor experiments, use single-factor ANOVA.
Typically, the data table is a Column table, with a single column variable grouping.
For two-factor experiments, use two-factor ANOVA.
The data table is a Grouped Table, typically with one column-variable grouping and one row-variable grouping, i.e., two factors (both table layouts are sketched below).
For multi-factor experiments, use multi-factor ANOVA.
Multi-factor ANOVA will be introduced later.
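Here is a small pandas sketch of the two table layouts just described; all values and names are made up for illustration:

```python
# Column table vs. Grouped table, built with pandas.
import pandas as pd

# Column table: a single factor with one column per level.
column_table = pd.DataFrame({
    "group_A": [5.1, 5.8, 6.0],
    "group_B": [6.9, 7.2, 7.5],
    "group_C": [8.0, 8.4, 8.8],
})

# Grouped table: gender (rows) x study time (columns), i.e. two factors,
# as in the example above.
grouped_table = pd.DataFrame(
    {"1 hour": [60, 55], "3 hours": [75, 70], "7 hours": [90, 85]},
    index=["male", "female"],
)

print(column_table, grouped_table, sep="\n\n")
```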
In single-factor, two-factor, and multi-factor ANOVA, the between-group sum of squares may include interaction terms between factors, so the decomposition method varies; ultimately, however, the F value is compared to determine whether the means differ significantly.
That concludes our introduction to the basic statistical principles of variance analysis.
In the next article, we will further learn through a specific case, combining the above content with AI-generated Python, to complete a conventional single-factor ANOVA case.
Case: Researchers conducted the following experiment to study the effect of soybean on iron-deficiency anemia.
Thirty-six mice, in which anemia had already been induced, were randomly divided into three groups of 12 and fed one of the following three feeds:
Conventional feed without soy
Feed containing 10% soy
Feed containing 15% soy
A week later, the number of red blood cells (x10⁶) in these mice was measured.
The results are shown in an Excel file “Soybean Feed vs. Iron-Deficiency Anemia.xlsx,” as illustrated in the figure below.
This is a typical Column table. Please see the next article for the practical case analysis, using the AI tool Bayeslab to generate Python and complete a conventional one-way ANOVA significance analysis and follow-up analysis.
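As a hedged preview of that workflow, the sketch below reads the Excel file and runs a one-way ANOVA with SciPy. The sheet layout assumed here (one column of red-blood-cell counts per feed group) is an assumption; adjust it to match the actual file:

```python
# Read the case data and run a one-way ANOVA across the feed groups.
import pandas as pd
from scipy import stats

df = pd.read_excel("Soybean Feed vs. Iron-Deficiency Anemia.xlsx")

# Assumed layout: one column per feed group, 12 counts each.
groups = [df[col].dropna() for col in df.columns]
f_value, p_value = stats.f_oneway(*groups)

print(f"F = {f_value:.3f}, p = {p_value:.4f}")  # p < 0.05 -> significant
```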

About Bayeslab
Bayeslab: Website
The AI First Data Workbench
X: @BayeslabAI
Documents:
https://bayeslab.gitbook.io/docs
Blogs:

Bayeslab is a powerful web-based AI code editor and data analysis assistant designed to cater to a diverse group of users, including:
👥 data analysts, 🧑🏼‍🔬 experimental scientists, 📊 statisticians, 👨🏿‍💻 business analysts, 👩‍🎓 university students, 🖍️ academic writers, 👩🏽‍🏫 scholars, and ⌨️ Python learners.