A Mini Lecture for the Introduction to Statistics
Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data. It provides tools for making informed decisions based on data. In this mini lecture, we will explore several key concepts in statistics that are foundational for understanding data analysis.
1. Descriptive Statistics
Descriptive statistics summarize and describe the main features of a data set. They provide simple summaries about the data. Key measures include:
Measures of Central Tendency
- Mean: The average of a data set, calculated by summing all values and dividing by the number of values.
- Median: The middle value when the data set is ordered. It is less affected by outliers than the mean. Outliers are observations that are unusually large or small relative to the majority.
- Mode: The most frequently occurring value in a data set.
Measures of Variability
- Range: The difference between the highest and lowest values in a data set.
- Variance: A measure of how much the values in a data set differ from the mean.
- Standard Deviation: The square root of the variance, providing a measure of spread in the same units as the data.
Example
Consider the data set: 4, 8, 6, 5, 3.
- Mean: (4 + 8 + 6 + 5 + 3) / 5 = 5.2
- Median: 5 (when ordered: 3, 4, 5, 6, 8)
- Mode: No mode (all values occur once)
- Range: 8 - 3 = 5
- Variance: using the sample variance formula
\[s^2=\frac{(x_1- \bar{x})^2+(x_2- \bar{x})^2+\cdots+(x_n- \bar{x})^2}{n-1}\]
\[s^2=\frac{(4- 5.2)^2+(8- 5.2)^2+(6- 5.2)^2+(5- 5.2)^2+(3- 5.2)^2}{5-1}=\frac{14.8}{4}=3.7\]
- Standard Deviation: the square root of the variance, \(\sqrt{3.7}\approx 1.92\).
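The hand calculations in this example can be checked with a few lines of Python. This is a minimal sketch that assumes Python 3.8 or later and uses only the standard `statistics` module; the variable names are illustrative.

```python
import statistics

data = [4, 8, 6, 5, 3]

mean = statistics.mean(data)          # sum of values divided by the count
median = statistics.median(data)      # middle value of the ordered data
modes = statistics.multimode(data)    # every value tied for most frequent
data_range = max(data) - min(data)    # highest value minus lowest value
variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # square root of the sample variance

print("mean:", mean)
print("median:", median)
print("most frequent values:", modes)
print("range:", data_range)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
```

Because every value occurs exactly once, `multimode` returns all five values, which matches the "no mode" conclusion above.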
2. Basic Probability
Probability is the measure of the likelihood that an event will occur. It ranges from 0 (impossible event) to 1 (certain event). Probability is the foundation of statistics.
To understand probability, we need to know some key concepts:
- Experiment: Doing something and observing outcomes
- Sample space (S): the set of all possible outcomes in an experiment
- Events: Any subset of the sample space. Use letters A, B, C, … to denote events.
- Definition of probability: Assume each outcome in the sample space is equally likely. The probability of an event \(A\) is the number of outcomes in \(A\) divided by the total number of possible outcomes in S.
Properties of probability:
- Probability is always between 0 and 1.
- The probability of an event occurring plus the probability of it not occurring equals 1.
- If two events \(A\) and \(B\) cannot happen at the same time, the probability that \(A\) or \(B\) happens is the sum of the two probabilities.
- Independent Events: The occurrence of one event does not affect the occurrence of another. If two events are independent, then the probability that they happen at the same time is the product of the probabilities.
Examples:
- If a fair die is rolled, the probability of rolling a 3 is 1/6. The probability of not rolling a 3 is 5/6.
- Toss a 6-sided die and flip a coin.
- The probability of getting heads is \(\frac{1}{2}\).
- The probability of getting an even number is \(\frac{3}{6}\).
- The probability of getting heads and an even number is \(\frac{1}{2}\cdot \frac{3}{6}=0.25\).
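- The probability of not getting both heads and an even number is \(1-0.25=0.75\). By contrast, the probability of getting neither heads nor an even number (that is, tails and an odd number) is \(\frac{1}{2}\cdot \frac{3}{6}=0.25\).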
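The die-and-coin example can also be checked by listing the whole sample space and counting outcomes, which is exactly the equally-likely definition of probability given earlier. The sketch below is illustrative; the helper `prob` and the variable names are not part of the lecture.

```python
from fractions import Fraction
from itertools import product

die = [1, 2, 3, 4, 5, 6]
coin = ["H", "T"]

# Sample space: all 12 equally likely (die, coin) outcomes.
sample_space = list(product(die, coin))

def prob(event):
    """Probability of an event = favorable outcomes / all outcomes."""
    favorable = [o for o in sample_space if event(o)]
    return Fraction(len(favorable), len(sample_space))

p_heads = prob(lambda o: o[1] == "H")
p_even = prob(lambda o: o[0] % 2 == 0)
p_both = prob(lambda o: o[1] == "H" and o[0] % 2 == 0)
p_not_both = 1 - p_both  # complement of "heads and even"
p_neither = prob(lambda o: o[1] != "H" and o[0] % 2 != 0)

print(p_heads, p_even, p_both, p_not_both, p_neither)
```

Counting confirms that the product rule for independent events gives the same answer as direct enumeration, and it shows that "not both" (3/4) and "neither" (1/4) are different events.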
3. Confidence Intervals
Motivation:
- Population: a population is the collection of objects or subjects of interest.
- Example 1: All fish in a lake
- Example 2: All computers in SCSU
- Example 3: All people who have flight experience
- Parameters: A parameter is a quantity that describes a population.
- Example 1: proportion of fish that are 10 inches or longer in a lake
- Example 2: proportion of computers that are 5+ years old in SCSU
- Example 3: average age of people who have flight experience
- Sample: a sample is a subset of a population.
- 20 fish caught in a lake
- 10 computers randomly selected from SCSU
- 5 friends who have flight experience
- Statistic: a statistic is any quantity (such as proportion or mean) calculated based on a sample.
- Example 1: 35% of the 20 fish caught in a lake are 10 inches or longer.
- Example 2: 20% of the 10 selected computers from SCSU are 5+ years old.
- Example 3: the average age of the 5 friends who have flight experience
- The best estimate of a population parameter is the corresponding sample statistic.
- a population mean is best estimated using a sample mean.
- a population proportion is best estimated using a sample proportion.
- However, we don’t know how accurate each estimate is; that is, we don’t know the estimation error.
The most often considered parameters are population proportion (denoted \(p\)) and population mean (denoted \(\mu\)). Estimating parameters and testing hypotheses about parameters are the major tasks of statistics.
A confidence interval (CI) provides a range of values that likely contains the population parameter. It is constructed using sample data and a specified confidence level (e.g., 95%).
We focus on the population proportion (\(p\)) parameter throughout the course.
Key Concepts
- Point Estimate (PE): A single value estimate of a population parameter. The variation of a point estimate can be described by its standard error. For \(p\), the point estimate is the sample proportion \(\hat{p}\). The (sample-to-sample) variation of \(\hat{p}\) can be described by its standard error (\(se\)) \[se = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
- Margin of Error (\(ME\)): The range above and below the point estimate. The confidence interval method provides a formula for calculating the margin of error. At 95% confidence level, \(ME=1.96\cdot se\); at 90% confidence level, \(ME=1.645\cdot se\).
- CI: PE \(\pm\) ME; that is, the CI is the interval from PE \(-\) ME to PE \(+\) ME.
Example
- Scenario: Suppose 30 out of a sample of 50 students have flight experience. Estimate, at 95% confidence, the range of plausible values for the population proportion of all students who have flight experience.
- Answer: The sample proportion is 30/50 or 0.6. The standard error is \(se=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=\sqrt{\frac{0.6(1-0.6)}{50}}=0.0693\). At 95% confidence, the margin of error is \(1.96\cdot 0.0693=0.1358\), so the interval estimate for the population proportion is from 0.60-0.1358 to 0.60+0.1358, that is, from 0.4642 to 0.7358.
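The same calculation can be sketched in Python. The function name `proportion_ci` and its parameters are illustrative choices; the multiplier `z` defaults to 1.96 for 95% confidence, and 1.645 can be passed for 90%.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Confidence interval for a population proportion.

    z is the multiplier for the confidence level:
    1.96 for 95%, 1.645 for 90%.
    """
    p_hat = successes / n                    # point estimate
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error
    me = z * se                              # margin of error
    return p_hat, se, me, (p_hat - me, p_hat + me)

# 30 of 50 sampled students have flight experience.
p_hat, se, me, (low, high) = proportion_ci(30, 50)
print(f"p_hat={p_hat:.2f}, se={se:.4f}, ME={me:.4f}")
print(f"95% CI: ({low:.4f}, {high:.4f})")
```

Rounded to four decimal places, the output agrees with the hand calculation above.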
4. Hypothesis Testing
Hypothesis testing is a statistical method used to make decisions about population parameters based on sample data. It involves formulating a null hypothesis (H0) and an alternative hypothesis (H1).
Steps in Hypothesis Testing
State the Hypotheses: Formulate the null and alternative hypotheses.
- Example: H0: The new teaching method has no effect on student performance. H1: The new teaching method improves student performance.
Choose a Significance Level (α): Commonly set at 0.05 or 0.01.
Collect Data: Gather sample data relevant to the hypotheses.
Calculate the Test Statistic: Use appropriate statistical tests (e.g., t-test, z-test).
Make a Decision: Compare the p-value to the significance level.
- If p-value < α, reject H0.
- If p-value ≥ α, fail to reject H0.
Example
A school wants to test if a new teaching method improves student test scores. They set H0: The mean score is 75 (no improvement) and H1: The mean score is greater than 75 (improvement). After conducting the test, they find a p-value of 0.03. Since 0.03 < 0.05, they reject the null hypothesis and conclude that the new method likely improves scores.
5. Simple Linear Regression
Simple linear regression is used to model the relationship between two quantitative variables. The model predicts the dependent variable (\(y\)) based on the independent variable (\(x\)) by establishing an equation \(y=a+bx\), called the (simple linear) regression equation.
Key Components
- Slope: Indicates the change in the dependent variable for a one-unit change in the independent variable.
- Intercept: The predicted value of the dependent variable when the independent variable is zero.
- Coefficient of Determination (R²): Measures the proportion of variance in the dependent variable that can be explained by the independent variable.
Example
In a simple linear regression analysis predicting test scores based on hours studied, the regression equation is \(score = 70+5\cdot time\).
- The positive slope indicates that more hours studied are associated with higher test scores.
- The value of slope 5 indicates that each extra hour of study brings 5 more points on average.
- If a student from the population is selected and studies 3 hours, their score is predicted to be \(70+5(3)=85\).
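To show where an equation such as \(score = 70+5\cdot time\) could come from, here is a minimal least-squares sketch. The four data points are made up for illustration (they are not from the lecture) and were chosen to lie exactly on that line; real data would scatter around it. The function name `fit_line` is also illustrative.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return a, b

# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4]
scores = [75, 80, 85, 90]

a, b = fit_line(hours, scores)
predicted = a + b * 3  # predicted score after 3 hours of study
print(f"score = {a:.0f} + {b:.0f} * time; prediction for 3 hours: {predicted:.0f}")
```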
6. Study Design and Ethics
Understanding study design is crucial for conducting valid research. Ethical considerations ensure the integrity of the research process.
Types of Study Designs
- Experimental Studies: Researchers manipulate one variable to determine its effect on another.
- Random assignment of treatments to groups
- Control of extraneous variables
- Some experimental studies are double-blinded: neither the researcher nor the participants know what treatment is received by each participant.
- Observational Studies: Researchers observe subjects without manipulation.
- Simple random sampling: Every element has equal probability of selection
- Scenario: Randomly select a certain number of students from SCSU and ask the question “Have you used AI tools to help you understand hard concepts?”
- To achieve a 95% confidence interval for the proportion of SCSU students who used AI tools for study, a sample size of about \(\frac{1}{E^2}\) is needed in order to control the margin of error at E. For example, if \(E=0.02\), the sample size is \(n=2500\).
- Convenience sampling: Selection of participants is based on their ease of access or availability to the researcher. This method can introduce bias into the research study.
- Scenario: A student conducting a survey for a class project stands outside their university cafeteria and asks the first 50 people who walk by to answer a questionnaire about study habits.
- Confounding is common in observational studies: Data show that as ice cream sales increase, so do drowning deaths. This relationship does not imply cause and effect, because hot weather can cause an increase in both ice cream sales and drowning deaths.
- In conclusion, controlled experiments can establish cause and effect, but observational studies can’t.
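The sample-size rule \(n \approx \frac{1}{E^2}\) from the simple random sampling scenario can be compared with a slightly more exact version. The sketch below assumes the worst case \(\hat{p}=0.5\), for which the 95% margin of error \(1.96\sqrt{\hat{p}(1-\hat{p})/n}\) is largest; solving for \(n\) gives \(n = 1.96^2 \cdot 0.25 / E^2\), which is a little under \(1/E^2\). The function names are illustrative.

```python
import math

def sample_size_rule_of_thumb(E):
    """Approximate sample size 1 / E^2 for a 95% CI on a proportion."""
    # Round first so floating-point noise cannot push ceil up by one.
    return math.ceil(round(1 / E**2, 6))

def sample_size_worst_case(E, z=1.96):
    """Sample size assuming p_hat = 0.5, the value that maximizes the margin of error."""
    return math.ceil(round(z**2 * 0.25 / E**2, 6))

n_rule = sample_size_rule_of_thumb(0.02)
n_worst = sample_size_worst_case(0.02)
print(n_rule, n_worst)
```

The rule of thumb rounds \(1.96^2 \cdot 0.25 \approx 0.96\) up to 1, so it asks for a slightly larger sample than strictly needed.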
Ethical Considerations
- Informed Consent: Participants must be fully informed about the study and agree to participate.
- Beneficence: Researchers must minimize harm and maximize benefits to participants.
7. Worked Exercises
- If you have the test scores of five students: 70, 80, 90, 85, and 75, the mean score is (70 + 80 + 90 + 85 + 75) / 5 = 80.
- In the data set 3, 7, 2, 9, 5, when ordered (2, 3, 5, 7, 9), the median is 5.
- In the data set 1, 2, 2, 3, 4, the mode is 2.
- In the data set 8, 12, 15, 20, 25, the range is 25 - 8 = 17.
- If the test scores are 70, 80, and 90, the variance measures how spread out these scores are from the mean (80).
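- A standard deviation of 5 in test scores indicates that scores typically differ from the mean by about 5 points; if the scores are roughly bell-shaped, about 68% of them fall within 5 points of the mean.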
- If a survey estimates that 60% of people prefer coffee with a margin of error of ±5%, the confidence interval is (55%, 65%).
- A regression equation between study time and score is \(score = 70 +5\cdot time\). Predict the score of a student who studies 4 hours. The answer is \(70+5(4)=90\).
- Have you cheated with AI on any college exam? In a random sample of 80 students, 10 said yes. To test the claim that more than 10% of students in the college have cheated with AI, set \(H_0: p=0.1\) and \(H_1: p>0.1\). After conducting the test, the p-value is found to be 0.228. Since 0.228 is not less than 0.05, the null hypothesis is not rejected, and we conclude that the data do not provide enough evidence that more than 10% of students in the college have cheated with AI.
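The p-value in the last exercise can be reproduced with a short sketch. It assumes the test is a one-sided z-test for a proportion based on the normal approximation, with the standard error computed under \(H_0\); the lecture does not name the test, but this choice gives the stated 0.228. The function name is illustrative.

```python
import math

def one_sided_proportion_z_test(successes, n, p0):
    """Upper-tail z-test of H0: p = p0 against H1: p > p0."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error assuming H0 is true
    z = (p_hat - p0) / se0              # test statistic
    # Upper-tail area of the standard normal distribution beyond z.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

z, p_value = one_sided_proportion_z_test(10, 80, 0.1)
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p-value = {p_value:.3f}, decision: {decision}")
```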