Coding 4 Conservation Week 2: Simple statistics

Welcome to week two! Today we will be using Rmarkdown to run some simple statistics. This module could be an entire class on its own, but we are going to briefly review the topics below in a day:

Overview:

    1. Basic helpful functions in R 
    2. Testing assumptions 
    3. Comparing means 
        3.1 t-tests
        3.2 Non-parametric tests
        3.3 ANOVA

What is Rmarkdown? Rmarkdown is a file format that allows users to save and execute their code and turn the results into shareable reports. Read more about Rmarkdown with this tutorial

Part 1: Basic helpful functions in R

R basics

variable <- function(argument)
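In other words, you call a function on one or more arguments and usually store the result in a variable with `<-`. A minimal sketch (the vector name and values below are made up purely for illustration):

# a toy vector of made-up weights (hypothetical values, just for illustration)
toy_weights <- c(38.6, 42.2, 57.0, 61.3)

# call a function on the vector and assign the result to a variable
avg_weight <- mean(toy_weights)
avg_weight
## [1] 49.775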

List of useful functions to create and explore data structures (a short example follows the lists below)

Basic mathematical functions for summary statistics

+ `sum()`

+ `mean()`

+ `min()`

+ `max()`

+ `sd()`

+ `median()`

+ `summary()`

Functions to create & explore data structure

+ `c()` - concatenate function
+ `:` - colon operator for sequence generation
+ `seq()` - sequence generation
+ `rep()` - replicate elements of vectors and lists
+ `View()` - invoke a data viewer
+ `length()` - length of an object
+ `class()` - object class
+ `head()` - return the first 6 rows of an object
+ `tail()` - return the last 6 rows of an object
+ `dim()` - returns the dimensions of an object 
+ `nrow()` - number of rows
+ `ncol()` - number of columns
+ `str()` - display the structure of each column of an object
+ `names()` - display the names of an object, typically the column names
+ `nlevels()` - number of levels of a factor (categorical variable)
+ `cbind()` - combine objects by columns
+ `rbind()` - combine objects by rows
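Here is a minimal sketch using a few of these functions on made-up vectors (the object names below are hypothetical and not part of the c4c data):

# build small vectors with rep() and seq()
sites <- rep(c("edge", "interior"), times = 3) # replicate elements
counts <- seq(2, 12, by = 2)                   # sequence from 2 to 12

length(counts) # 6
class(sites)   # "character"

# combine the vectors into a small data frame and explore it
site_counts <- data.frame(sites, counts)
head(site_counts)
dim(site_counts)   # 6 rows, 2 columns
str(site_counts)
names(site_counts)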

Exercises

Let’s learn about our dataframe using some of the functions we learned last week.

#view the top 6 rows 
head(c4c)
##     Id Sex Forearm Weight   Age       Date Prev Para   Size
## 1 fb_1   f   64.02  38.58 1.000 2015-01-11    0    0 Medium
## 2 fb_2   f   65.12  39.59 1.478 2015-01-12    1   38 Medium
## 3 fb_3   f   65.41  40.42 1.570 2015-01-13    0    0 Medium
## 4 fb_4   f   66.08  42.23 1.578 2015-01-14    0    0 Medium
## 5 fb_5   f   66.09  42.63 1.598 2015-01-15    0    0 Medium
## 6 fb_6   f   66.32  43.67 1.610 2015-01-16    0    0 Medium
#structure
str(c4c)
## 'data.frame':    1200 obs. of  9 variables:
##  $ Id     : chr  "fb_1" "fb_2" "fb_3" "fb_4" ...
##  $ Sex    : chr  "f" "f" "f" "f" ...
##  $ Forearm: num  64 65.1 65.4 66.1 66.1 ...
##  $ Weight : num  38.6 39.6 40.4 42.2 42.6 ...
##  $ Age    : num  1 1.48 1.57 1.58 1.6 ...
##  $ Date   : chr  "2015-01-11" "2015-01-12" "2015-01-13" "2015-01-14" ...
##  $ Prev   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Para   : int  0 38 0 0 0 0 0 0 0 0 ...
##  $ Size   : chr  "Medium" "Medium" "Medium" "Medium" ...
#let's correct the variable classes (review from last week)

c4c$Sex <- as.factor(c4c$Sex)
c4c$Size<- as.factor(c4c$Size)
c4c$Weight<- as.numeric(c4c$Weight)

It looks like there are NAs in the Weight column of our dataset. Let’s remove the NA values from that column. Any one of the three approaches below will do the job; they are shown together for comparison.

#use is.na() method
c4c <- c4c[!is.na(c4c$Weight),]

#use subset() method
c4c <- subset(c4c, !is.na(Weight))

#use tidyverse method (requires the tidyverse packages to be loaded)
library(tidyverse)

c4c <- c4c %>% 
  drop_na(Weight)
# basic stats - let's get basic summary statistics for the Weight variable

var(c4c$Weight)
## [1] 287.5528
mean(c4c$Weight)
## [1] 57.03703
max(c4c$Weight)
## [1] 87.52
min(c4c$Weight)
## [1] 14.2
sd(c4c$Weight)
## [1] 16.95738
median(c4c$Weight)
## [1] 58.27
range(c4c$Weight)
## [1] 14.20 87.52
quantile(c4c$Weight)
##    0%   25%   50%   75%  100% 
## 14.20 48.99 58.27 63.52 87.52
# or just run summary()
summary(c4c$Weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.20   48.99   58.27   57.04   63.52   87.52

Useful functions for plotting in base R. Here is a list of useful functions for making plots in R; a short example follows the list. By using the ? in R (e.g. ?hist) you can look up the details of each of these plotting functions.

  1. plot() : Plot x and y against each other
  2. hist() : Make a histogram of your data
  3. scatterplot() : Make a scatter plot of x and y (from the car package)
  4. abline() : Add a straight line to your current plot
  5. lines() : Add more lines/data to your current plot
  6. curve() : Plot a function over some range
  7. boxplot() : Make a box plot to compare different groups of your data
  8. densityPlot() : Fit a density kernel to your data (from the car package)
  9. pie() : Make a pie chart
  10. barplot() : Make a bar graph in R
  11. mosaicplot() : Make a mosaic plot
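For example, a quick sketch of two of these functions applied to the c4c data (only base R is used here):

# histogram of Weight
hist(c4c$Weight,
     xlab = "Weight",
     main = "Distribution of Weight")

# box plot comparing Weight between the two sexes
boxplot(Weight ~ Sex, data = c4c,
        ylab = "Weight",
        main = "Weight by Sex")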

Part 2: Testing assumptions

1. Violations of Assumptions

One of our critical assumptions for one-sample and two-sample t-tests is that the distribution of the sample means is normal. We have learned that if the distribution of our sample is normal we are guaranteed that the distribution of our sample means will be normal. Additionally, we have learned that the Central Limit Theorem tells us that if our sample size is big enough, the distribution of our sample means will be normal even if the distribution of our sample (or the population from which we drew this sample) is non-normal. However, it is often the case in ecology that our samples are very non-normal and the Central Limit Theorem does not apply. This may be because the data are strongly skewed or because there are outliers in the data. In this lab you will learn how to deal with non-normal data and data with extreme outliers.

2. Testing for Normality

We first have to check whether our data are normal using the following qualitative and quantitative methods.

  • Histogram: If the histogram is not “bell-shaped” this is your first indication that your data may be non-normal.

  • QQplot: If the points do not all stay within the confidence band, your data may be non-normal.

  • Shapiro-Wilk Test: A quantitative test that tells you whether your data are normal or not. \(H_0\): Data are normal, \(H_A\): Data are non-normal. If p < 0.05, reject your \(H_0\) and conclude that the data are not normal; if p > 0.05, “conclude” that your data are normal. Remember not to count on the results from this test alone, and always to visually examine a QQplot or histogram. A short sketch of all three checks is shown below.
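Here is a minimal sketch of the three checks on the Weight variable, using base R (the car package’s qqPlot() function will also draw a QQplot with a confidence band):

# 1. Histogram - is it roughly bell-shaped?
hist(c4c$Weight, main = "Histogram of Weight")

# 2. QQplot - do the points fall close to the straight line?
qqnorm(c4c$Weight)
qqline(c4c$Weight)

# 3. Shapiro-Wilk test - H0: data are normal
shapiro.test(c4c$Weight)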

Checking normality with residuals

This allows us to check normality of more complicated relationships rather than just checking one variable at a time.
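As a sketch, suppose we want to model Weight as a function of Forearm (both are columns in c4c; the model itself is just for illustration). We fit a simple linear model and check the normality of its residuals:

# fit a simple linear model and extract its residuals
wt_mod <- lm(Weight ~ Forearm, data = c4c)
wt_resid <- residuals(wt_mod)

# check normality of the residuals rather than the raw variable
hist(wt_resid, main = "Histogram of residuals")
qqnorm(wt_resid)
qqline(wt_resid)
shapiro.test(wt_resid)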

3. Checking for outliers

Outliers can drastically influence the results of t-tests and other parametric tests. It is always important to identify outliers before you begin your analysis. While there are formal methods for determining what constitutes an outlier, a simple method is to make a boxplot of your data. A boxplot will flag any points that lie beyond a given range above or below the box, which spans the first and third quartiles of your data. This range can vary depending on how you build your boxplots; in R, the default is 1.5 times the distance between the first and third quartiles (the interquartile range).
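A minimal sketch of flagging potential outliers in the Weight variable (boxplot.stats() is the base R helper that returns the points drawn beyond the whiskers):

# box plot of Weight; points beyond the whiskers are plotted individually
boxplot(c4c$Weight, ylab = "Weight")

# list the values that fall more than 1.5 * IQR beyond the box
boxplot.stats(c4c$Weight)$out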

While sometimes outliers can be erroneous values (i.e. you measured something incorrectly) they often tell you something important about the population you are measuring. Therefore, you cannot drop an outlier from your analysis without having good reason to do so. Good reasons could include measurement error or an underlying difference between that individual and the rest of the population. Dropping an outlier because it makes your life and the analysis easier is not a good reason…though we all wish it was. However, it is good practice to run your analysis with and without your outlier to determine if/how the outlier changes your results.

4. Transformations

When you transform your data you apply some function (often non-linear) to your data. Wait… why the heck can you do this? If you think about it, the scale that we use to measure things is completely arbitrary (think about Celsius vs Fahrenheit). We could just as easily measure things in a base 2 system and there would, in principle, be nothing wrong with that. If this helps, you can imagine that by transforming your data you are moving to an alien world where they measure everything on a different scale (e.g. on a log scale). Here are some of the reasons why we transform our data:

  • To make the data more normal
  • To make our variances more similar between groups
  • To reduce the influence of outliers
  • To make our data linear (we will see this in linear regression)

4.1 Choosing a transformation

Here are some common transformations and the scenarios in which you may use them. The variable Y represents your original data and Y’ represents your transformed data. A short R sketch applying a few of these follows the list.

  • Consider log transformation: \(Y' = \ln (Y)\) if…
    • The measurements are ratios or products of variables
    • The frequency distribution is right skewed
    • The group having the larger mean also has the higher standard deviation
    • The data span several orders of magnitude
    • Note: If you have 0’s in your data, you can do \(Y' = \log{(Y + 1)}\)
    • In R, you can use the expression log(Y) to transform the data Y (remember that log(Y) gives the natural logarithm of Y)
  • Consider Square-root transformation: \(Y' = \sqrt{Y}\) if…
    • The measurements are count data
    • The frequency distribution is right skewed
    • Note: if you have negative numbers, you can’t take the square root. You should add a constant to each number to make them all positive.
    • In R, you can use the expression sqrt(Y) to transform the data Y
  • Square transformations and antilog transformations: \(Y' = Y^2\) and \(Y' = \exp Y\)
    • Use when your data are left skewed
    • In R, you can use the expression Y^2 or exp(Y) to square transform or antilog transform the data Y, respectively.
  • Arcsine transformation: \(Y' = \arcsin \sqrt Y\)
    • Useful when your data are proportions
    • Note: Make sure your proportion data are represented as decimals between 0 and 1 (i.e. if they are percentages, divide by 100)
    • In R, you can use the expression asin(sqrt(Y)) to square-root arcsine transform the proportion data Y
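Here is a minimal sketch of these transformations in R. Para is the parasite count column in c4c; prop_infected is a hypothetical vector of proportions included only to illustrate the arcsine transform:

# log transformation (add 1 because Para contains zeros)
log_para <- log(c4c$Para + 1)

# square-root transformation for count data
sqrt_para <- sqrt(c4c$Para)

# arcsine square-root transformation of a hypothetical proportion (0-1 scale)
prop_infected <- c(0.10, 0.25, 0.40, 0.80)
asin_prop <- asin(sqrt(prop_infected))

# always re-check your assumptions after transforming
hist(log_para, main = "Histogram of log(Para + 1)")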

4.2 After you transform your data

Transformations are not magic. After you have transformed your data you must retest all of your assumptions. Moreover, transforming your data inherently changes your null hypothesis. For example, if I log transformed my number of flowers variable and then tested whether the means were different between the edge and the interior of a habitat patch, my \(H_0\) becomes that there is no difference in the log-mean.

Any conclusions you make about your data have to be in terms of the transformation. For example, you would conclude that the log number of flowers on the edge was significantly more than the log number of flowers in the interior.
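As a hedged sketch of the flower example, the data frame below is made up purely for illustration (n_flowers and location are hypothetical and not part of c4c):

# made-up flower counts at the edge and interior of a habitat patch
flowers <- data.frame(
  n_flowers = c(3, 5, 2, 40, 55, 61),
  location  = c("edge", "edge", "edge", "interior", "interior", "interior")
)

# the test is run on the log scale, so H0 is "no difference in log-means"
t.test(log(n_flowers + 1) ~ location, data = flowers)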

If a transformation does not fix your problem (e.g. make your data more normal, reduce the heterogeneity of variance, etc.), you cannot proceed with your parametric analysis. Fortunately, there are some non-parametric options that you can consider. Remember that non-parametric tests do not assume that the data follow any particular distribution.

Determine normality of data

To see what normal (Gaussian) data should look like, we will generate a random sample from a known normal distribution, along with an exponential sample for a non-normal comparison.

# Generate randomly sampled data to visualize
norm_samp <- rnorm(1000, mean = 21, sd = 3) # normal distribution
exp_samp <- rexp(1000)                      # exponential distribution

# Visualization with a Histogram
hist(norm_samp,
     breaks = 20, # 20 bins gives a good sense of the shape for a sample of 1000
     col = 'darkgray',
     ylab = "Frequency",
     main = "Histogram of Normally Distributed Data") 

hist(exp_samp,
     breaks = 20, # same binning for the non-normal (exponential) sample
     col = 'darkgray',
     ylab = "Frequency",
     main = "Histogram of Non-Normally Distributed Data")