PPT for Chapter 1: Introduction to Statistical Data Analysis

Slide 1: Chapter Content

  • Objective: Learn fundamental statistical concepts and data visualization using R.
  • Topics:
    • Measures of Dispersion
    • Data Visualization Techniques
    • R Environment Setup
    • Practical Examples and Exercises
  • Let’s dive in!

Slide 2: Types of Variables

Quantitative (Numerical)

  • Discrete:
    • Number of cars
    • Students in class
  • Continuous:
    • Height/Weight
    • Temperature

Qualitative (Categorical)

  • Nominal:
    • Eye color
    • Country
  • Ordinal:
    • Education level
    • Satisfaction ratings

Slide 3: Measures of Central Tendency

  • Data example 1: 4, 6, 8, 8, 10, 14, 20
  • Mean: 10
  • Median: 8
  • Mode: 8
  • Data example 2: 0, 3, 4, 4, 4, 5, 8, 9, 12, 20
  • Mean: 6.9
  • Median: 4.5
  • Mode: 4
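
The values on this slide can be checked in R. The sketch below uses the base functions mean() and median(). Base R has no built-in function for the statistical mode, so stat_mode is a hypothetical helper written only for this illustration.

```r
# Data from the two examples on this slide
example1 <- c(4, 6, 8, 8, 10, 14, 20)
example2 <- c(0, 3, 4, 4, 4, 5, 8, 9, 12, 20)

# Hypothetical helper (not part of base R): returns the most frequent value(s)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

mean(example1); median(example1); stat_mode(example1)  # 10, 8, 8
mean(example2); median(example2); stat_mode(example2)  # 6.9, 4.5, 4
```

If several values tie for the highest count, this helper returns all of them.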

Slide 4: Measures of Dispersion (1.1.2)

  • Range: Difference between maximum and minimum values.
  • Variance: Average of squared deviations from the mean; the sample variance (what R computes with var()) divides by n - 1 rather than n.
  • Standard Deviation: Square root of variance; typical deviation from mean.
  • Interquartile Range (IQR): Range between Q1 (25th percentile) and Q3 (75th percentile), capturing middle 50% of data.
    • Percentile: Value below which a certain percentage of data falls (e.g., Q1 = 25%, Median = 50%, Q3 = 75%).
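
As a minimal sketch, the measures above can be computed in R for data example 1 from Slide 3. Note that var() and sd() use the sample formulas (dividing by n - 1), and that quantile() uses R's default interpolation method, so other software or textbook rules may give slightly different quartiles.

```r
x <- c(4, 6, 8, 8, 10, 14, 20)  # data example 1 from Slide 3

max(x) - min(x)                   # range as a single number: 16
var(x)                            # sample variance: 176 / 6, about 29.33
sd(x)                             # sample standard deviation: about 5.42
quantile(x, c(0.25, 0.50, 0.75))  # Q1, median, Q3: 7, 8, 12
IQR(x)                            # Q3 - Q1: 5
```

R's own range() function returns the minimum and maximum as a pair rather than their difference.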

Slide 5: Plots for Visualizing Data Distribution (1.1.3)

  • Numerical Data:
    • Histogram
    • Boxplot
    • Density Curve
  • Categorical Data:
    • Barplot
  • Visualization helps uncover patterns in data distribution.

Slide 6: How to Create a Histogram (1.1.4)

  • Requirements: Numeric data.
  • Steps:
    1. Divide the data range into k bins of equal width (typically k = 5–25).
    2. Determine frequencies (data points per bin).
    3. Draw the plot:
      • X-axis: Bin boundaries (e.g., 72-76).
      • Y-axis: Frequencies.
      • Bars: Height corresponds to frequency.
      • Add a title.
  • We’ll use R to create histograms.
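
The three steps above can be traced by hand in R before relying on hist() to do everything at once. This is a sketch only: the bin edges (10, 15, ..., 35) are an assumption chosen for illustration, and cut() and table() are used here just to make the counting step visible.

```r
mpg <- mtcars$mpg  # built-in dataset, 32 cars

# Step 1: equal-width bins; these edges are chosen for illustration
bin_edges <- seq(10, 35, by = 5)

# Step 2: assign each value to a bin and count the frequencies
bins <- cut(mpg, breaks = bin_edges)
bin_counts <- table(bins)
bin_counts

# Step 3: draw the bars using the same bin edges
hist(mpg, breaks = bin_edges, main = "Histogram of Miles per Gallon",
     xlab = "Miles per Gallon", ylab = "Number of Cars")
```

The counts printed in step 2 are the bar heights drawn in step 3.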

Slide 7: How to Create a Boxplot (1.1.5)

  • Requirements: Numeric data.
  • Components (Five-Number Summary):
    • Minimum: Smallest value.
    • Q1 (25th percentile): 25% of data below.
    • Median (Q2, 50th percentile): Middle value.
    • Q3 (75th percentile): 75% of data below.
    • Maximum: Largest value.
  • Visuals: Box (Q1 to Q3 with median line), whiskers (to the most extreme values that are not outliers; to min/max when there are no outliers), outliers as individual points.
  • We’ll use R for boxplots.
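
As a rough sketch, the numbers behind a boxplot can be inspected without drawing anything. fivenum() and boxplot.stats() are base R functions. fivenum() uses Tukey's hinges, which can differ slightly from the quartiles returned by quantile().

```r
mpg <- mtcars$mpg

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(mpg)
five

# What boxplot() would draw:
# $stats holds the whisker ends, the box edges and the median;
# $out holds any points more than 1.5 * IQR beyond the box (drawn individually)
bp <- boxplot.stats(mpg)
bp$stats
bp$out
```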

Slide 8: How to Create a Density Curve (1.1.6)

  • Definition: Smooth curve representing distribution of numeric data.
  • Key Points:
    • Alternative to histogram with similar shape.
    • Total area under curve = 1.
    • Don’t worry if concept is unclear; focus on visualization.
  • We’ll create density curves using R.
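
The claim that the total area under the curve equals 1 can be checked numerically. The sketch below assumes the base R function density() (a kernel density estimate) and approximates the area with the trapezoid rule; the result is close to 1 but not exact, because the curve is evaluated on a finite grid.

```r
d <- density(mtcars$mpg)  # returns a grid of x values and curve heights y

# Trapezoid rule: sum of (width * average height) over neighbouring grid points
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area  # close to 1

plot(d, main = "Density Curve of Miles per Gallon", xlab = "Miles per Gallon")
```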

Slide 9: How to Create a Barplot (1.1.7)

  • Requirements: Categorical data with frequencies.
  • Example: Favorite fruits survey (Apples: 12, Bananas: 8, Oranges: 15, Grapes: 5).
  • Steps:
    1. Axes: X-axis for categories, Y-axis for frequencies.
    2. Bars: Height matches frequency, equal width, spaced apart.
    3. Labels: Add title (e.g., “Favorite Fruits”), axis labels.
  • We’ll use R for barplots.
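
A minimal sketch of the fruit survey example in R, assuming the counts are typed in directly as a named vector (the axis label texts are illustrative choices):

```r
# Survey counts from the example on this slide
fruit_counts <- c(Apples = 12, Bananas = 8, Oranges = 15, Grapes = 5)

barplot(
  fruit_counts,
  main = "Favorite Fruits",
  xlab = "Fruit",
  ylab = "Number of Responses"
)
```

barplot() takes the bar labels from the names of the vector, so no separate category list is needed.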

Slide 10: Setting Up Your R Environment (1.2)

  • Steps to Start:
    1. Visit https://posit.cloud/.
    2. Sign up/in with any email.
    3. Create new project → Choose “Quarto” → Click “OK”.
  • Posit Cloud Interface (4 Panes):
    • Upper-Left: Source code editor (Quarto Document).
    • Lower-Left: Console for direct commands.
    • Upper-Right: Environment/History pane (not used here; can be minimized).
    • Lower-Right: Files and project directory.
  • We’ll use built-in dataset mtcars for examples.

Slide 11: Exploring mtcars Dataset

  • Dataset Overview: Data on 32 cars.
  • Variables: Miles per gallon (mpg), horsepower (hp), weight (wt), transmission (am: 0=auto, 1=manual).
  • Basic Commands:
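
The command list for this slide is not present in the source. As an assumption, the following base R commands are ones commonly used for a first look at a data frame; they are offered as a sketch, not as the original slide content.

```r
# Assumed examples of "basic commands"; the original list is not available
head(mtcars)         # first six rows
dim(mtcars)          # number of rows and columns: 32 11
str(mtcars)          # variable names and types
summary(mtcars$mpg)  # minimum, quartiles, mean and maximum of mpg
table(mtcars$am)     # cars per transmission type (0 = automatic, 1 = manual)
```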

Slide 12: Making Histograms in R (1.3.1)

Example: Histogram of mpg from mtcars.

hist(
  mtcars$mpg, 
  breaks = 5, 
  main = "Histogram of Miles per Gallon",
  xlab = "Miles per Gallon",
  ylab = "Number of Cars",
  col = "blue"
)

Explanation:

  • hist(): Function for histograms.
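  • breaks = 5: Suggested number of bins; R treats a single number as a hint and may pick nearby round break points.
  • main, xlab, ylab: Plot title and x- and y-axis labels.
  • col = "blue": Bar color.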

Slide 13: Making Boxplots in R (1.3.2)

Example: Boxplot of mpg from mtcars.

boxplot(mtcars$mpg, main = "Boxplot of Miles per Gallon")

Explanation:

  • boxplot(): Function for creating boxplots.
  • Shows five-number summary and outliers.

Slide 14: Making Density Curves in R (1.3.3)

Example: Density curve for mpg from mtcars.

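hist(mtcars$mpg, freq = FALSE, main = "Density Curve of Miles per Gallon",
     xlab = "Miles per Gallon", ylab = "Density", col = "lightblue")
# Overlay the smooth kernel density estimate on the density-scaled histogram
lines(density(mtcars$mpg), lwd = 2)

Explanation:

  • hist() with freq = FALSE: Rescales the y-axis from frequency to density, so the total area of the bars is 1.
  • density(): Computes a smooth (kernel) density estimate; lines() draws it on top of the histogram.
  • Y-axis represents density (area under curve = 1).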

Slide 15: Making Barplots in R (1.3.4)

Example: Barplot of cylinder counts in mtcars.

cylinder_table <- table(mtcars$cyl)
barplot(cylinder_table, main = "Bar Plot of Cylinder Counts", xlab = "Cylinders", ylab = "Frequency", col = "green")

Explanation:

  • table(): Summarizes counts per category.
  • barplot(): Creates the bar plot with labels and color.

Slide 16: Scatter Plots for Quantitative Variables (1.4)

  • Definition: Visualizes the relationship between two numeric variables.
  • Strengths: Reveals patterns, clusters, and outliers.
  • Weaknesses: Can become cluttered with large datasets.

Example: Weight (wt) vs. Miles per Gallon (mpg) from mtcars.

plot(wt ~ mpg, data = mtcars, main = "Scatterplot: Weight vs. MPG",
     xlab = "Miles per Gallon", ylab = "Weight (1000 lbs)", col = "red")

Interpretation:

  • As weight increases, MPG tends to decrease (a negative relationship).
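
The direction read off the scatter plot can also be checked numerically with the correlation coefficient, which is covered on the Correlation Coefficient slide. A minimal sketch using the base R function cor():

```r
# Correlation between weight and fuel economy in mtcars
r_wt_mpg <- cor(mtcars$wt, mtcars$mpg)
round(r_wt_mpg, 2)  # negative: heavier cars tend to have lower mpg
```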

Slide 18: Correlation Coefficient (1.5)

  • Definition: Measures the strength and direction of a linear relationship (from -1 to 1).

Example: Math vs. English scores for 10 students.

math <- c(65, 72, 78, 80, 85, 88, 90, 92, 95, 98)
english <- c(55, 60, 62, 68, 70, 75, 78, 82, 85, 90)
cor_value <- cor(math, english)
cat("Correlation:", round(cor_value, 2))
# Output: Correlation: 0.98

Result:

  • A correlation of 0.98 indicates a strong positive linear relationship: students with higher math scores also tend to have higher English scores.
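
To see what cor() computes by default, the Pearson correlation can be reproduced by hand from the deviations of each variable from its mean: \[r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}\] The sketch below uses the same scores as the example; the variable names dx, dy, and r_manual are introduced only for this illustration.

```r
math <- c(65, 72, 78, 80, 85, 88, 90, 92, 95, 98)
english <- c(55, 60, 62, 68, 70, 75, 78, 82, 85, 90)

# Deviations of each score from its own mean
dx <- math - mean(math)
dy <- english - mean(english)

# Pearson correlation: joint variation divided by the product of the spreads
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

round(r_manual, 2)                       # 0.98
all.equal(r_manual, cor(math, english))  # TRUE
```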

Slide 19: Z-Score for Standardization (1.6)

Definition: Measures how many standard deviations a data point is from the mean.

Formula: \[z = \frac{x-\bar{x}}{s}\] where \(\bar{x}\) = mean, \(s\) = standard deviation.

Uses:

  • Standardize data (mean = 0, SD = 1).
  • Detect potential outliers (a common rule of thumb is \(|z| > 3\)).
  • Compare values across datasets.

Example: Test scores (70, 80, 85, 90, 95). Mean = 84, SD ≈ 9.62. Z-score for 70: \((70-84)/9.62 \approx -1.46\) (1.46 SDs below the mean).
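
The test-score example can be reproduced in R as a minimal sketch. It assumes the sample standard deviation (sd() divides by n - 1), which matches the SD of about 9.62 quoted in the example.

```r
scores <- c(70, 80, 85, 90, 95)

# z = (x - mean) / sd, using the sample standard deviation
z <- (scores - mean(scores)) / sd(scores)
round(z, 2)  # the first value, for the score 70, is -1.46

# After standardizing, the mean is 0 and the standard deviation is 1
round(mean(z), 10)
sd(z)

# scale() performs the same standardization (it returns a one-column matrix)
all.equal(as.numeric(scale(scores)), z)  # TRUE
```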

Slide 20: Selected Exercises (1.7)

  • Exercise 1: Calculate the mean, median, mode, variance, and SD for croissant sales (25, 22, 28, etc.).
  • Exercise 9: Compute the Z-score for a math test (Score = 85, Mean = 75, SD = 10). What does it mean?
  • Exercise 10: Compare student performance (Math = 85, English = 78) using Z-scores across class data.

Slide 21: Case Study - Customer Satisfaction (1.8)

Data Overview: Survey of 10 customers (Age, Gender, Purchase Amount, etc.).

Key Findings:

  • Avg. purchase: 130.
  • Satisfaction: Avg. 4.1/5.

Recommendations:

  • Target older customers (>45) with premium bundles.
  • Improve Home Goods quality.
  • Future Research: Analyze seasonal trends, repeat vs. first-time customers.

Slide 22: Conclusion

Summary:

  • Learned measures of central tendency and dispersion, visualization techniques, and relationships between variables.
  • Practiced R for histograms, boxplots, barplots, and more.
  • Explored Z-scores and correlation.
  • Next Steps: Apply these concepts to real datasets, complete exercises, and deepen R skills.

Thank You! Questions?