PPT for Chapter 1 Introduction to Statistical Data Analysis

Slide 1: Chapter Content

Objective: Learn fundamental statistical concepts and data visualization using R.
Topics:
- Types of Variables
- Measures of Central Tendency and Dispersion
- Data Visualization Techniques
- R Environment Setup
- Correlation and Z-Scores
- Practical Examples and Exercises
Let’s dive in!

Slide 2: Types of Variables

Quantitative (Numerical)

Discrete:
- Number of cars
- Students in class
Continuous:
- Height/Weight
- Temperature

Qualitative (Categorical)

Nominal:
- Eye color
- Country
Ordinal:
- Education level
- Satisfaction ratings
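
In R, these variable types map roughly onto vector classes. A minimal sketch, assuming made-up values and variable names that are not from the chapter:

```r
# Hypothetical example values; only the variable types matter here.
cars_owned <- c(1L, 2L, 0L, 3L)               # discrete: integer counts
height_cm  <- c(172.5, 160.2, 181.0, 168.4)   # continuous: real-valued measurements
eye_color  <- factor(c("brown", "blue", "green", "brown"))  # nominal: unordered categories
education  <- factor(
  c("High school", "Bachelor", "Master", "Bachelor"),
  levels  = c("High school", "Bachelor", "Master"),
  ordered = TRUE                               # ordinal: categories with a natural order
)

class(cars_owned)             # "integer"
class(height_cm)              # "numeric"
is.ordered(eye_color)         # FALSE
is.ordered(education)         # TRUE
education[1] < education[3]   # TRUE: "High school" ranks below "Master"
```

Comparisons such as `<` are only meaningful for the ordered factor, which mirrors the nominal/ordinal distinction above.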

Slide 3: Measures of Central Tendency

Data example 1: 4, 6, 8, 8, 10, 14, 20
Mean: 10
Median: 8
Mode: 8
Data example 2: 0, 3, 4, 4, 4, 5, 8, 9, 12, 20
Mean: 6.9
Median: 4.5
Mode: 4
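
The two examples can be checked in R. This is a sketch only: base R has no function for the statistical mode, so `stat_mode()` below is a hypothetical helper written for this illustration.

```r
# Data from the two examples on this slide
x1 <- c(4, 6, 8, 8, 10, 14, 20)
x2 <- c(0, 3, 4, 4, 4, 5, 8, 9, 12, 20)

# Hypothetical helper: returns the most frequent value(s).
# (R's built-in mode() reports the storage type, not the statistical mode.)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

mean(x1); median(x1); stat_mode(x1)   # 10, 8, 8
mean(x2); median(x2); stat_mode(x2)   # 6.9, 4.5, 4
```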

Slide 4: Measures of Dispersion (1.1.2)

Range: Difference between maximum and minimum values.
Variance: Average of squared deviations from the mean (the sample variance divides by n - 1 rather than n).
Standard Deviation: Square root of variance; typical deviation from mean.
Interquartile Range (IQR): Range between Q1 (25th percentile) and Q3 (75th percentile), capturing middle 50% of data.
- Percentile: Value below which a certain percentage of data falls (e.g., Q1 = 25%, Median = 50%, Q3 = 75%).
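
A short sketch of these measures in base R, reusing data example 1 from the central tendency slide. Note that `var()` and `sd()` compute the sample versions (dividing by n - 1), and that `quantile()` uses R's default method; other software may give slightly different quartiles.

```r
x <- c(4, 6, 8, 8, 10, 14, 20)   # data example 1 from the central tendency slide

diff(range(x))                     # range: max - min = 16
var(x)                             # sample variance: sum of squared deviations / (n - 1)
sd(x)                              # standard deviation: square root of var(x)
quantile(x, c(0.25, 0.50, 0.75))   # Q1, median, Q3 (R's default quantile method)
IQR(x)                             # Q3 - Q1
```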

Slide 5: Plots for Visualizing Data Distribution (1.1.3)

Numerical Data:
- Histogram
- Boxplot
- Density Curve
Categorical Data:
- Barplot
Visualization helps uncover patterns in data distribution.

Slide 6: How to Create a Histogram (1.1.4)

Requirements: Numeric data.
Steps:
1. Divide the data range into k bins of equal width (typically k = 5–25).
2. Determine frequencies (data points per bin).
3. Draw the plot:
  - X-axis: Bin boundaries (e.g., 72-76).
  - Y-axis: Frequencies.
  - Bars: Height corresponds to frequency.
  - Add a title.
We’ll use R to create histograms.
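
The steps above can be sketched by hand before handing the work to `hist()`. The bin boundaries (10 to 35 in steps of 5) are an assumption chosen to cover the mpg values in mtcars; they are not prescribed by the chapter.

```r
mpg <- mtcars$mpg                      # built-in dataset, 32 cars

# Step 1: bin boundaries (assumed: 10 to 35, width 5).
breaks <- seq(10, 35, by = 5)

# Step 2: frequencies per bin. Intervals are right-closed, e.g. (10,15],
# with the lowest boundary included.
bins <- cut(mpg, breaks = breaks, include.lowest = TRUE)
freq <- table(bins)
print(freq)                            # counts: 6 12 8 2 4

# Step 3: hist() does the same counting when given the same boundaries.
h <- hist(mpg, breaks = breaks, plot = FALSE)
all(as.vector(freq) == h$counts)       # TRUE
```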

Slide 7: How to Create a Boxplot (1.1.5)

Requirements: Numeric data.
Components (Five-Number Summary):
- Minimum: Smallest value.
- Q1 (25th percentile): 25% of data below.
- Median (Q2, 50th percentile): Middle value.
- Q3 (75th percentile): 75% of data below.
- Maximum: Largest value.
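Visuals: Box (Q1 to Q3 with median line), whiskers (to the most extreme values that are not outliers; to min/max when there are none), outliers as points. In R's default boxplot, points more than 1.5 × IQR beyond the box are drawn as outliers.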
We’ll use R for boxplots.
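
A minimal sketch of the numbers behind a boxplot, using data example 2 from the central tendency slide. `fivenum()` reports Tukey's hinges, which are close to (but not always identical to) Q1 and Q3 from `quantile()`.

```r
x <- c(0, 3, 4, 4, 4, 5, 8, 9, 12, 20)   # data example 2 from the central tendency slide

fivenum(x)        # min, lower hinge, median, upper hinge, max: 0 4 4.5 9 20

# boxplot.stats() returns what boxplot() draws. By default, points more than
# 1.5 * (box length) beyond the box are flagged as outliers.
bs <- boxplot.stats(x)
bs$stats          # lower whisker, box edges with median, upper whisker: 0 4 4.5 9 12
bs$out            # outlier drawn as a point: 20
```

Here the upper whisker stops at 12 rather than at the maximum, because 20 lies above the upper fence of 9 + 1.5 × 5 = 16.5.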

Slide 8: How to Create a Density Curve (1.1.6)

Definition: Smooth curve representing distribution of numeric data.
Key Points:
- Alternative to histogram with similar shape.
- Total area under curve = 1.
- Don’t worry if concept is unclear; focus on visualization.
We’ll create density curves using R.
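
As a rough check of the "area = 1" property, the sketch below estimates a density for mpg with base R's `density()` and approximates the area with the trapezoid rule. The result is only approximately 1, because the curve is evaluated on a finite grid.

```r
d <- density(mtcars$mpg)   # kernel density estimate on a grid of x values

# Approximate the area under the curve with the trapezoid rule.
dx   <- diff(d$x)
area <- sum(dx * (head(d$y, -1) + tail(d$y, -1)) / 2)
area                       # close to 1
```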

Slide 9: How to Create a Barplot (1.1.7)

Requirements: Categorical data with frequencies.
Example: Favorite fruits survey (Apples: 12, Bananas: 8, Oranges: 15, Grapes: 5).
Steps:
1. Axes: X-axis for categories, Y-axis for frequencies.
2. Bars: Height matches frequency, equal width, spaced apart.
3. Labels: Add title (e.g., “Favorite Fruits”), axis labels.
We’ll use R for barplots.
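
A minimal sketch of the fruit survey example in base R. The axis labels and bar color are arbitrary choices for illustration.

```r
# Frequencies from the survey example, stored as a named vector
fruits <- c(Apples = 12, Bananas = 8, Oranges = 15, Grapes = 5)

barplot(
  fruits,
  main = "Favorite Fruits",
  xlab = "Fruit",
  ylab = "Frequency",
  col  = "orange"     # color is an arbitrary choice
)
```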

Slide 10: Setting Up Your R Environment (1.2)

Steps to Start:
1. Visit https://posit.cloud/.
2. Sign up/in with any email.
3. Create new project → Choose “Quarto” → Click “OK”.
Posit Cloud Interface (4 Panes):
- Upper-Left: Source code editor (Quarto Document).
- Lower-Left: Console for direct commands.
- Upper-Right: Environment pane; can be minimized (not used here).
- Lower-Right: Files and project directory.
We’ll use built-in dataset mtcars for examples.

Slide 11: Exploring `mtcars` Dataset

Dataset Overview: Data on 32 cars.
Variables: Miles per gallon (mpg), horsepower (hp), weight (wt), transmission (am: 0=auto, 1=manual).
Basic Commands:
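
The command list for this slide is not present in this outline. The calls below are common base R functions for a first look at a data frame, offered as an assumption about what fits here rather than as the original slide content.

```r
head(mtcars)          # first six rows
dim(mtcars)           # number of rows and columns: 32 11
str(mtcars)           # column names, types, and sample values
summary(mtcars$mpg)   # min, quartiles, mean, and max of one variable
table(mtcars$am)      # counts per transmission type (0 = automatic, 1 = manual)
```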

Slide 12: Making Histograms in R (1.3.1)

Example: Histogram of mpg from mtcars.

hist(
  mtcars$mpg, 
  breaks = 5, 
  main = "Histogram of Miles per Gallon",
  xlab = "Miles per Gallon",
  ylab = "Number of Cars",
  col = "blue"
)

Explanation:

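hist(): Function for histograms.
breaks = 5: Suggested number of bins; R treats a single number as a hint and may pick a nearby count with round boundaries.
main, xlab, ylab: Title and x- and y-axis labels.
col = "blue": Bar color.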

Slide 13: Making Boxplots in R (1.3.2)

Example: Boxplot of mpg from mtcars.

boxplot(mtcars$mpg, main = "Boxplot of Miles per Gallon")

Explanation:

boxplot(): Function for creating boxplots.
Shows five-number summary and outliers.

Slide 14: Making Density Curves in R (1.3.3)

Example: Density curve for mpg from mtcars.

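hist(mtcars$mpg, freq = FALSE, main = "Density Curve of Miles per Gallon",
     xlab = "Miles per Gallon", ylab = "Density", col = "lightblue")
lines(density(mtcars$mpg), lwd = 2)  # overlay the smooth density curve

Explanation:

hist() with freq = FALSE: Converts the y-axis from frequency to density scale (total bar area = 1).
density(): Estimates the smooth density curve; lines() draws it on top of the histogram.
Y-axis represents density (area under curve = 1).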

Slide 15: Making Barplots in R (1.3.4)

Example: Barplot of cylinder counts in mtcars.

cylinder_table <- table(mtcars$cyl)
barplot(cylinder_table, main = "Bar Plot of Cylinder Counts", xlab = "Cylinders", ylab = "Frequency", col = "green")

Explanation:

table(): Summarizes counts per category.
barplot(): Creates the bar plot with labels and color.

Slide 16: Scatter Plots for Quantitative Variables (1.4)

Definition: Visualizes relationship between two numeric variables.
Strengths: Reveals patterns, clusters, outliers.
Weaknesses: Can be cluttered with large data.
Example: Miles per Gallon (mpg) vs. Weight (wt) from mtcars.

plot(mpg ~ wt, data = mtcars, main = "Scatterplot: MPG vs. Weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon", col = "red")

Interpretation:

As weight increases, MPG decreases.

Slide 17: Line Graphs for Trends (1.4)

Use: Best for time series or continuous sequences.
Example: Sales over 20 days.

sales <- c(120, 135, 150, 145, 160, 170, 165, 180, 190, 200, 195, 210, 220, 215, 230, 240, 235, 250, 260, 270)
plot(sales, type = "l", col = "blue", lwd = 2, main = "Sales Over 20 Days", xlab = "Day", ylab = "Sales ($)")

Interpretation:

Shows an increasing trend in sales.

Slide 18: Correlation Coefficient (1.5)

Definition: Measures strength and direction of linear relationship (-1 to 1).
Example: Math vs. English scores for 10 students.

math <- c(65, 72, 78, 80, 85, 88, 90, 92, 95, 98)
english <- c(55, 60, 62, 68, 70, 75, 78, 82, 85, 90)
cor_value <- cor(math, english)
cat("Correlation:", round(cor_value, 2))

Correlation: 0.98

Result:

Correlation of 0.98 indicates a strong positive linear relationship.
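
To see what `cor()` computes, the sketch below rebuilds the Pearson correlation from its definition for the same scores. The name `r_manual` is introduced here for illustration only.

```r
math    <- c(65, 72, 78, 80, 85, 88, 90, 92, 95, 98)
english <- c(55, 60, 62, 68, 70, 75, 78, 82, 85, 90)

# Pearson correlation from its definition:
# sum of products of deviations, divided by the square root of the
# product of the two sums of squared deviations.
dx <- math - mean(math)
dy <- english - mean(english)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

round(r_manual, 2)                        # 0.98
all.equal(r_manual, cor(math, english))   # TRUE
```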

Slide 19: Z-Score for Standardization (1.6)

Definition: Measures how many standard deviations a data point is from the mean.
Formula: \[z = \frac{x-\bar{x}}{s}\] where \(\bar{x}\) = mean, \(s\) = standard deviation.
Uses:

Standardize data (mean=0, sd=1).
Detect outliers (\(|z| > 3\)).
Compare across datasets.

Example: Test scores (70, 80, 85, 90, 95). Mean = 84, SD ≈ 9.62. Z-score for 70: \((70-84)/9.62 \approx -1.46\) (1.46 SDs below mean).
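
The test-score example can be reproduced in R. A minimal sketch, noting that `sd()` is the sample standard deviation (dividing by n - 1), which matches the SD of about 9.62 quoted above:

```r
scores <- c(70, 80, 85, 90, 95)

z <- (scores - mean(scores)) / sd(scores)   # z-score for every test score
round(z[1], 2)                               # -1.46 for the score of 70

# After standardizing, the mean is 0 and the SD is 1.
round(mean(z), 10)
sd(z)

# scale() is base R's built-in equivalent; it returns a one-column matrix.
all.equal(as.vector(scale(scores)), z)       # TRUE
```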

Slide 20: Selected Exercises (1.7)

Exercise 1: Calculate mean, median, mode, variance, SD for croissant sales (25, 22, 28, etc.).
Exercise 9: Z-score for math test (Score=85, Mean=75, SD=10). What does it mean?
Exercise 10: Compare student performance (Math=85, English=78) using Z-scores across class data.

Slide 21: Case Study - Customer Satisfaction (1.8)

Data Overview: Survey of 10 customers (Age, Gender, Purchase Amount, etc.).
Key Findings:

Avg. purchase: 130.
Satisfaction: Avg. 4.1/5.

Recommendations:

Target older customers (>45) with premium bundles.
Improve Home Goods quality.
Future Research: Analyze seasonal trends, repeat vs. first-time customers.

Slide 22: Conclusion

Summary:

Learned measures of central tendency & dispersion, visualization techniques, and relationships.
Practiced R for histograms, boxplots, barplots, and more.
Explored Z-scores and correlation.
Next Steps: Apply these concepts to real datasets, complete exercises, and deepen R skills.

Thank You! Questions?
