Chapter 4 Study Design and Data Ethics in Basic Statistical Practice

Understanding study design is crucial for conducting valid research. Ethical considerations ensure the integrity of the research process.

4.1 Types of Study Designs

- Experimental Studies: Researchers manipulate one variable to determine its effect on another.
  - Random assignment of treatments to groups
  - Control of extraneous variables
  - Some experimental studies are double-blinded: neither the researchers nor the participants know which treatment each participant receives.
- Observational Studies: Researchers observe subjects without manipulation.
  - Simple random sampling: Every element has an equal probability of selection.
    - Scenario: Randomly select a certain number of students from SCSU and ask, “Have you used AI tools to help you understand hard concepts?”
  - Convenience sampling: Participants are selected based on their ease of access or availability to the researcher. This method introduces bias into the study.
    - Scenario: A student conducting a survey for a class project stands outside their university cafeteria and asks the first 50 people who walk by to answer a questionnaire about study habits.
  - Confounding is common in observational studies: Data show that as ice cream sales increase, so do drowning deaths. This relationship does not imply cause and effect, because hot weather can cause increases in both ice cream sales and drowning deaths.

In conclusion, controlled experiments can establish cause and effect, but observational studies cannot.

4.2 Core Ethical Principles

| Principle | Description |
|---|---|
| Privacy | Protecting participant identities and sensitive information |
| Informed Consent | Participants understand how their data will be used |
| Transparency | Clear communication about methods and limitations |
| Data Integrity | Accurate data collection and analysis without manipulation |
| Fairness | Avoiding biases that could harm groups or individuals |
| Accountability | Taking responsibility for statistical work and its consequences |
4.3 Common Ethical Challenges in Basic Statistics
4.3.1 Data Collection
Sampling Bias: Certain groups are disproportionately included or excluded without their knowledge or agreement, skewing results and potentially exploiting specific populations.
Informed Consent: Ensuring that participants are fully aware of the purpose, methods, risks, and benefits of the study before agreeing to provide data. Key issues include:
- Lack of transparency: Failing to clearly explain how data will be used, stored, or shared.
- Coercion: Pressuring individuals to participate, especially in vulnerable populations.
- Inadequate understanding: Participants may not fully comprehend technical details due to complex language or lack of education.
- Privacy concerns: Ensuring data anonymity and protecting personal information.
Examples:
- A researcher collects survey data on mental health from college students without clearly explaining that the data might be shared with third parties, leading participants to unknowingly disclose sensitive information. (Lack of Informed Consent)
- A company conducts a workplace wellness study and implies that employees must participate to be eligible for promotions, undermining voluntary participation and exploiting power dynamics. (Coercion or Pressure)
- A medical study collects data from a rural community using complex consent forms written in technical language, which participants with limited literacy cannot fully comprehend, invalidating their consent. (Inadequate Understanding)
- A mobile app collects location data from users for a traffic pattern study without explicitly informing them, storing identifiable information that could be linked back to individuals. (Privacy Violations)
- A marketing firm collects consumer behavior data by offering a “free” online quiz, not disclosing that responses will be used for targeted advertising, misleading participants about the study’s purpose. (Deceptive Practices)
- A public opinion poll on healthcare access only collects data from urban areas with reliable internet, excluding rural or low-income populations and skewing results to misrepresent broader needs. (Sampling Bias and Exclusion)
- A study on poverty collects detailed financial data from low-income families without adequate safeguards, exposing participants to potential stigmatization or exploitation if data is mishandled. (Harm to Vulnerable Populations)
- A researcher reuses data collected for a study on education to analyze unrelated social behaviors without obtaining new consent from participants, violating their original agreement. (Unauthorized Data Use)
4.3.2 Data Analysis
Ethical challenges in data analysis include:
- Misrepresentation of Results: Manipulating or selectively reporting data (e.g., cherry-picking results, p-hacking) to support a desired outcome, leading to misleading conclusions.
  - Example: A marketing analyst manipulates survey data by excluding negative feedback to show that 90% of customers are satisfied with a product, when the true figure is closer to 60%.
- Bias in Interpretation: Allowing personal beliefs, external pressures, or conflicts of interest to influence how results are interpreted or presented.
  - Example: A researcher, influenced by a pharmaceutical company’s funding, interprets ambiguous clinical trial data as evidence of a drug’s effectiveness, downplaying side effects to favor the sponsor.
- Overfitting or Model Misuse: Using inappropriate statistical models or overfitting data to produce results that seem significant but lack generalizability or validity.
  - Example: A data scientist builds a machine learning model to predict loan defaults but overfits it to the training data, producing a model that performs well in tests but fails to generalize to new applicants, leading to unfair rejections.
- Ignoring Assumptions: Applying statistical methods without verifying underlying assumptions (e.g., normality, independence), which can lead to invalid conclusions.
  - Example: An analyst applies a t-test to compare group means without checking for normality or equal variances, resulting in invalid conclusions about the effectiveness of a new teaching method.
- Data Fabrication or Falsification: Altering or inventing data to achieve desired results, undermining the integrity of the analysis.
  - Example: A researcher fabricates experimental data points to fill gaps in a study on dietary supplements, claiming significant health benefits that don’t exist.
- Lack of Reproducibility: Failing to document methods, code, or data sources adequately, making it impossible for others to verify or replicate results.
  - Example: A team analyzing economic trends publishes a report on GDP growth but fails to document their data cleaning process or share their code, making it impossible for others to replicate their findings.
- Privacy Violations: Mishandling sensitive data during analysis, risking breaches of confidentiality or unauthorized use of personal information.
  - Example: During analysis of healthcare data, an analyst inadvertently includes identifiable patient information in a shared dataset, breaching confidentiality and exposing sensitive details.
- Inequitable Impact: Ignoring how results may disproportionately affect certain groups, especially if the analysis informs policy or decision-making.
  - Example: A statistical model used for hiring decisions is found to favor male candidates due to biased historical data, but the analyst proceeds without adjusting for fairness, perpetuating gender discrimination.
Examples:

- P-hacking: trying multiple tests until one yields a significant result.
- Remedy: report all analyses attempted, not just the significant ones.
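The cost of trying multiple tests can be simulated directly. The sketch below (an illustration, not taken from the text) runs many purely null "studies" and counts how often at least one of several tests comes out significant by chance; the function name and numbers are invented for this example.

```python
import random

def false_positive_rate(num_tests, alpha=0.05, trials=10_000, seed=1):
    """Estimate P(at least one p-value < alpha) when every null hypothesis is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis, each p-value is uniform on [0, 1).
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials

print(false_positive_rate(1))   # close to 0.05
print(false_positive_rate(20))  # close to 1 - 0.95**20, roughly 0.64
```

With a single test the false-positive rate stays near the nominal 5%, but running 20 independent tests and reporting only the "significant" one inflates it to roughly 64%, which is why all attempted analyses must be reported.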
4.3.3 Reporting Results
Ethical challenges in reporting include:
- Selective Reporting: Omitting inconvenient data or results (e.g., non-significant findings) to present a more favorable outcome, leading to biased conclusions.
  - Example: A researcher studying a new drug reports only the trials where the drug showed positive effects, omitting trials with no effect or adverse outcomes, leading to an overly optimistic view of the drug’s efficacy.
- Exaggeration or Misinterpretation: Overstating the significance, impact, or generalizability of results, misleading stakeholders or the public.
  - Example: A company claims their product increases productivity by 50% based on a small, non-representative sample, ignoring that the effect was only observed in specific conditions and may not apply broadly.
- Lack of Transparency: Failing to disclose limitations, uncertainties, methodological details, or potential conflicts of interest, which obscures the reliability of findings.
  - Example: A study on educational interventions reports improved test scores but fails to mention that the control group was significantly disadvantaged or that the researchers were funded by the intervention’s developer.
- P-Hacking and Data Dredging: Manipulating analyses (e.g., tweaking variables or tests) to achieve statistically significant results, then reporting only those findings.
  - Example: A social scientist tests dozens of variables in a dataset about happiness, then reports only the one variable (e.g., coffee consumption) that showed a p-value below 0.05, falsely implying a causal link.
- Inaccessible or Unclear Communication: Presenting results in overly technical or vague language, making it difficult for non-experts to understand or evaluate.
  - Example: A public health report uses dense statistical jargon to describe vaccination rates, confusing policymakers and the public about the urgency of increasing coverage.
- Misleading Visualizations: Using graphs or charts that distort data (e.g., truncated axes, inappropriate scales) to exaggerate trends or differences.
  - Example: A graph showing a company’s revenue growth uses a y-axis that starts at $900,000 instead of $0, making a 1% increase look like a dramatic spike to investors.
- Failure to Report Reproducibility: Not providing access to data, code, or detailed methods, hindering verification or replication of results.
  - Example: A climate study reports alarming temperature trends but doesn’t share the raw data or code used for analysis, preventing other scientists from verifying the findings.
- Ignoring Ethical Implications: Neglecting to consider how reported results might harm or unfairly impact certain groups, especially in sensitive contexts like policy or healthcare.
  - Example: A predictive policing model is reported as highly accurate but fails to disclose that it disproportionately flags minority neighborhoods, potentially reinforcing systemic bias when used by law enforcement.
Example:

- Always report exact p-values rather than simply stating that a result is significant.
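The truncated-axis distortion described above can be quantified with a little arithmetic. The sketch below uses hypothetical revenue figures (invented for this illustration) to compare the true growth with the growth a bar chart *appears* to show when its y-axis starts above zero.

```python
def apparent_growth(old, new, axis_start):
    """Relative growth as it *looks* on a bar chart whose y-axis starts at axis_start."""
    return (new - axis_start) / (old - axis_start) - 1.0

# Hypothetical revenue figures: a ~1% true increase.
old_rev, new_rev = 990_000, 1_000_000

print(f"true growth:     {new_rev / old_rev - 1.0:.1%}")                     # about 1.0%
print(f"axis from $0:    {apparent_growth(old_rev, new_rev, 0):.1%}")        # about 1.0%
print(f"axis from $900k: {apparent_growth(old_rev, new_rev, 900_000):.1%}")  # about 11.1%
```

Starting the axis at $900,000 makes the bar appear roughly eleven times taller relative to its neighbor than the data justify, which is exactly the distortion a reader of the chart cannot detect without checking the axis.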
4.4 Practical Guidelines
- Anonymization Techniques:
  - Remove direct identifiers
  - Use codes instead of IDs
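A minimal sketch of these two techniques in Python (the record fields and the `anonymize` helper are invented for illustration): direct identifiers are removed and replaced with random codes, while the ID-to-code map is returned separately so it can be stored securely apart from the de-identified data.

```python
import secrets

def anonymize(records, id_field="name"):
    """Return de-identified records plus a separate code map (to be kept secure)."""
    code_map = {}
    cleaned = []
    for rec in records:
        identifier = rec[id_field]
        if identifier not in code_map:
            # Random code, not derived from the ID itself.
            code_map[identifier] = "P-" + secrets.token_hex(4)
        rec = dict(rec)                # copy so the original record is untouched
        del rec[id_field]              # remove the direct identifier
        rec["participant_code"] = code_map[identifier]
        cleaned.append(rec)
    return cleaned, code_map

records = [{"name": "Ana", "score": 71}, {"name": "Ben", "score": 64}]
deidentified, code_map = anonymize(records)
print(deidentified)  # names replaced by random codes such as "P-3f9a1c2b"
```

Because the same participant always maps to the same code, analyses on the de-identified file remain possible, while re-identification requires access to the separately stored code map.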
- Documentation Standards:
  - Methodology Sections: A study on vaccine efficacy vaguely describes its sampling as “convenience-based” without detailing recruitment or exclusion criteria, making it impossible to assess representativeness or bias.
    - Ethical Issue: Lack of transparency risks misinformed public health policies.
  - Data Dictionaries: A dataset on patient outcomes lacks a data dictionary, and the variable “treatment” is unclear (e.g., drug type or dosage?), leading analysts to misinterpret results.
    - Ethical Issue: Misinterpretation could lead to incorrect medical recommendations.
  - Analysis Code Visibility: A researcher publishes a study on economic inequality but refuses to share analysis code, claiming proprietary methods, preventing others from verifying inflated claims.
    - Ethical Issue: Lack of reproducibility undermines trust and could mislead economic policy.
Best Practices for Drafting the Methodology Section:
- Detail Study Design: Specify the type of study (e.g., experimental, observational), population, sampling methods, and inclusion/exclusion criteria.
  - Example: “A randomized controlled trial was conducted with 200 participants aged 18–65, selected via stratified random sampling to ensure gender balance.”
- Describe Data Collection: Outline tools, instruments, or surveys used, including their validity and reliability.
  - Example: “Data was collected using a validated 10-item Likert-scale questionnaire on job satisfaction (Cronbach’s α = 0.85).”
- Explain Analytical Methods: Detail statistical techniques, software used, and rationale for their selection.
  - Example: “A two-sample t-test was used to compare means, assuming normality verified by the Shapiro-Wilk test (p > 0.05).”
- Address Limitations: Acknowledge potential biases, confounding variables, or constraints.
  - Example: “Self-reported data may introduce recall bias, and the sample was limited to urban residents.”
- Ethical Considerations: Mention informed consent, privacy protections, and institutional review board (IRB) approval.
  - Example: “All participants provided written informed consent, and data was anonymized per IRB guidelines.”
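The analytical-methods example above, checking assumptions before running a two-sample t-test, can be sketched with SciPy. The data here are simulated (the group means and sizes are invented for illustration), so this shows the workflow, not a real analysis.

```python
import numpy as np
from scipy import stats

# Simulated scores for two groups (values are invented for this sketch).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=75, scale=8, size=30)  # e.g., new teaching method
group_b = rng.normal(loc=70, scale=8, size=30)  # e.g., standard method

# Step 1: Shapiro-Wilk normality check; p > 0.05 gives no significant
# evidence against normality.
for name, grp in [("A", group_a), ("B", group_b)]:
    _, p_norm = stats.shapiro(grp)
    print(f"Group {name}: Shapiro-Wilk p = {p_norm:.3f}")

# Step 2: Levene's test for equal variances guides the t-test variant:
# pooled-variance t-test if variances look equal, Welch's t-test otherwise.
_, p_var = stats.levene(group_a, group_b)

# Step 3: the two-sample t-test itself.
t, p_ttest = stats.ttest_ind(group_a, group_b, equal_var=p_var > 0.05)
print(f"t = {t:.2f}, p = {p_ttest:.4f}")
```

Reporting all three steps, not just the final p-value, is what makes the methodology section auditable by other researchers.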
Best Practices for Data Dictionaries:

- Include Key Elements:
  - Variable name (e.g., “age”).
  - Variable description (e.g., “Participant’s age in years”).
  - Data type (e.g., numeric, categorical).
  - Units (e.g., years).
  - Permissible values or range (e.g., 18–100).
  - Missing value codes (e.g., -99 for “not reported”).
  - Source or collection method (e.g., self-reported survey).
- Standardized Format: Use clear, machine-readable formats like CSV or JSON for accessibility.
Example:

| Field | Value |
|---|---|
| Variable | income |
| Description | Annual household income |
| Type | Numeric |
| Units | USD |
| Range | 0–1,000,000 |
| Missing | -99 |
| Source | Self-reported questionnaire |
- Update Regularly: Revise the dictionary if data is modified or new variables are added.
- Ensure Accessibility: Share the dictionary with datasets in repositories or publications, respecting privacy constraints.
- Protect Sensitive Data: Avoid including identifiable information or ensure encryption for sensitive datasets.
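The "income" entry above can be written as a machine-readable JSON data dictionary, as the Standardized Format practice suggests. The field names in this sketch are one reasonable choice, not a fixed standard.

```python
import json

# The "income" entry from the example above, as a JSON data dictionary.
data_dictionary = {
    "income": {
        "description": "Annual household income",
        "type": "numeric",
        "units": "USD",
        "range": [0, 1_000_000],
        "missing_code": -99,
        "source": "Self-reported questionnaire",
    }
}

with open("data_dictionary.json", "w") as f:
    json.dump(data_dictionary, f, indent=2)

# A downstream script can then validate values against the dictionary.
with open("data_dictionary.json") as f:
    entry = json.load(f)["income"]
low, high = entry["range"]
print(low <= 52_000 <= high)  # True
```

Storing the dictionary alongside the dataset lets any analyst, or an automated check, verify that values fall in the documented range and that the missing-value code is handled before analysis.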
Best Practices for Analysis Code Visibility:
- Share Code Publicly: Host code in repositories like GitHub, GitLab, or institutional platforms, unless restricted by proprietary or privacy concerns.
  - Example: A GitHub repository containing R scripts for a regression analysis, with a README explaining setup and dependencies.
- Use Clear, Commented Code: Include comments to explain each step, making it understandable to others.
- Specify Software and Versions: Document the programming language, software, and package versions used.
  - Example: “Analysis conducted in Python 3.9 with pandas 1.4.2 and statsmodels 0.13.2.”
- Organize for Reproducibility: Provide a clear workflow, including data cleaning, transformation, and analysis steps, ideally with a script that runs end-to-end.
- Handle Sensitive Data: If data cannot be shared due to privacy, provide synthetic datasets or detailed pseudocode to demonstrate the process.
- License Code: Use open-source licenses (e.g., MIT, GPL) to clarify how others can use or modify the code.
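A minimal end-to-end script of the kind the Organize for Reproducibility practice describes might look like the sketch below. The file name, column, and -99 missing code are invented for this illustration; the point is that each step (load, clean, analyze) is a named, commented function another analyst can rerun.

```python
import csv
import statistics

MISSING_CODE = -99  # missing-value code, as documented in the data dictionary

def load(path):
    """Step 1: read the raw survey data."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean(rows):
    """Step 2: convert income to numbers and drop missing-value codes."""
    incomes = [float(r["income"]) for r in rows]
    return [x for x in incomes if x != MISSING_CODE]

def analyze(values):
    """Step 3: compute summary statistics for the cleaned data."""
    return {"n": len(values), "mean": statistics.mean(values)}

# Small demo input so the script runs end-to-end on its own.
with open("survey.csv", "w", newline="") as f:
    f.write("income\n50000\n-99\n70000\n")

print(analyze(clean(load("survey.csv"))))  # {'n': 2, 'mean': 60000.0}
```

Because every transformation is in the script rather than done by hand, a reviewer can reproduce the reported mean exactly and see that the missing-value code was excluded rather than averaged in.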
4.5 Case Study: Ethical Dilemmas
Scenario: A clinical trial finds a statistically significant result (p = 0.04), but the drug improves outcomes by only 0.1%.

Ethical reporting would state: “While statistically significant (p = 0.04), the practical improvement was only 0.1%.”
Discussion Points:

- Marketing the drug as “effective” without qualifiers would be unethical: the negligible 0.1% improvement may mislead patients and providers.
- Ethical reporting requires disclosing the effect size, confidence intervals, study design, limitations, adverse effects, and comparisons to alternatives to ensure transparency and informed decision-making.
- Sample size influences statistical significance and precision, but large samples can overemphasize trivial effects. Ethical reporting must make this distinction clear to avoid misinterpretation.
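The last point, that large samples make trivial effects statistically significant, can be demonstrated with a simulation. The numbers below are invented for illustration: a 0.1-point true difference against a standard deviation of 5 is practically negligible, yet with a million observations per group it produces a tiny p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
effect, sd = 0.1, 5.0  # tiny true difference relative to the noise

for n in (100, 1_000_000):
    control = rng.normal(0.0, sd, size=n)
    treated = rng.normal(effect, sd, size=n)
    _, p = stats.ttest_ind(treated, control)
    # With n = 100 the effect is usually undetectable; with n = 1,000,000
    # the same negligible effect becomes highly "significant".
    print(f"n = {n:>9,}: p = {p:.2e}")
```

This is exactly the case-study situation: a small p-value by itself says nothing about practical importance, so the effect size must be reported alongside it.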