Welcome to week two! Today we will use R Markdown to run some simple statistics. This module could fill an entire class, but we will briefly review the topics below in a day:
1. Basic helpful functions in R
2. Testing assumptions
3. Comparing means
    3.1 T-tests
    3.2 Non-parametric tests
    3.3 ANOVA
What is R Markdown? R Markdown is a file format that lets users save and execute their code in shareable reports. Read more about R Markdown with this tutorial
```r
variable <- function(argument)
```
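As a concrete sketch of that syntax (the vector values here are made up for illustration):

```r
# call the mean() function on a numeric vector and store the result
weights <- c(38.58, 39.59, 40.42)  # hypothetical forearm-weight values
avg_weight <- mean(weights)
avg_weight
## [1] 39.53
```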
Basic mathematical functions for summary statistics
+ `sum()`
+ `mean()`
+ `min()`
+ `max()`
+ `sd()`
+ `median()`
+ `summary()`
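A quick sketch of these functions on a small made-up vector:

```r
# summary statistics on a small example vector
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sum(x)      # total of all values
mean(x)     # arithmetic mean
min(x)      # smallest value
max(x)      # largest value
sd(x)       # sample standard deviation
median(x)   # middle value
summary(x)  # min, quartiles, mean, and max all at once
```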
Functions to create & explore data structure
+ `c()` - concatenate function
+ `:` - colon operator for sequence generation
+ `seq()` - sequence generation
+ `rep()` - replicate elements of vectors and lists
+ `View()` - invoke a data viewer
+ `length()` - length of an object
+ `class()` - object class
+ `head()` - return the first 6 rows of an object
+ `tail()` - return the last 6 rows of an object
+ `dim()` - returns the dimensions of an object
+ `nrow()` - number of rows
+ `ncol()` - number of columns
+ `str()` - display the structure of each column of an object
+ `names()` - display the names of an object, typically the column names
+ `nlevels()` - categorical levels within an object
+ `cbind()` - combine objects by columns
+ `rbind()` - combine objects by rows
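A small sketch putting several of these together (values are arbitrary):

```r
# build a vector with c(), :, seq(), and rep(), then inspect it
v <- c(1:3, seq(10, 20, by = 5), rep(0, 2))
length(v)   # how many elements
class(v)    # what kind of object

# combine columns into a small matrix and check its dimensions
m <- cbind(a = 1:3, b = 4:6)
dim(m)      # rows and columns
nrow(m)
ncol(m)
```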
Let’s learn about our dataframe using some of the functions we learned last week
```r
# view the top 6 rows
head(c4c)
```
```
##     Id Sex Forearm Weight   Age       Date Prev Para   Size
## 1 fb_1   f   64.02  38.58 1.000 2015-01-11    0    0 Medium
## 2 fb_2   f   65.12  39.59 1.478 2015-01-12    1   38 Medium
## 3 fb_3   f   65.41  40.42 1.570 2015-01-13    0    0 Medium
## 4 fb_4   f   66.08  42.23 1.578 2015-01-14    0    0 Medium
## 5 fb_5   f   66.09  42.63 1.598 2015-01-15    0    0 Medium
## 6 fb_6   f   66.32  43.67 1.610 2015-01-16    0    0 Medium
```
```r
# view the structure of each column
str(c4c)
```

```
## 'data.frame': 1200 obs. of 9 variables:
##  $ Id     : chr "fb_1" "fb_2" "fb_3" "fb_4" ...
##  $ Sex    : chr "f" "f" "f" "f" ...
##  $ Forearm: num 64 65.1 65.4 66.1 66.1 ...
##  $ Weight : num 38.6 39.6 40.4 42.2 42.6 ...
##  $ Age    : num 1 1.48 1.57 1.58 1.6 ...
##  $ Date   : chr "2015-01-11" "2015-01-12" "2015-01-13" "2015-01-14" ...
##  $ Prev   : int 0 1 0 0 0 0 0 0 0 0 ...
##  $ Para   : int 0 38 0 0 0 0 0 0 0 0 ...
##  $ Size   : chr "Medium" "Medium" "Medium" "Medium" ...
```
```r
# let's correct the variables (review from last week)
c4c$Sex    <- as.factor(c4c$Sex)
c4c$Size   <- as.factor(c4c$Size)
c4c$Weight <- as.numeric(c4c$Weight)
```
It looks like there are NA values in the Weight column of our dataset. Let’s remove the NA values from that column.
```r
# use is.na() method
c4c <- c4c[!is.na(c4c$Weight), ]

# use subset() method
c4c <- subset(c4c, !is.na(Weight))

# use tidyverse method
c4c <- c4c %>% drop_na(Weight)
```
```r
# basic stats - let's get basic stats for the weight variable
var(c4c$Weight)
```

```
## [1] 287.5528
```

```r
mean(c4c$Weight)
```

```
## [1] 57.03703
```

```r
max(c4c$Weight)
```

```
## [1] 87.52
```

```r
min(c4c$Weight)
```

```
## [1] 14.2
```

```r
sd(c4c$Weight)
```

```
## [1] 16.95738
```

```r
median(c4c$Weight)
```

```
## [1] 58.27
```

```r
range(c4c$Weight)
```

```
## [1] 14.20 87.52
```

```r
quantile(c4c$Weight)
```

```
##    0%   25%   50%   75%  100% 
## 14.20 48.99 58.27 63.52 87.52
```
```r
# or just run summary()
summary(c4c$Weight)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.20   48.99   58.27   57.04   63.52   87.52
```
Useful functions for plotting in base R. Here is a list of useful functions for making plots in R. By using `?` in R you can look up the details of these different plotting functions.
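As a hedged sketch of a few common base-R plotting functions (the data here are simulated, not from `c4c`):

```r
# a few common base-R plotting functions on simulated data
set.seed(10)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

hist(x, main = "Histogram")       # distribution of one variable
boxplot(x, main = "Boxplot")      # median, quartiles, and outliers
plot(x, y, main = "Scatterplot")  # relationship between two variables
abline(lm(y ~ x), col = "red")    # add a fitted regression line
```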
One of our critical assumptions for one-sample and two-sample t-tests is that the distribution of the sample means is normal. We have learned that if the distribution of our sample is normal we are guaranteed that the distribution of our sample means will be normal. Additionally, we have learned that the Central Limit Theorem tells us that if our sample size is big enough, the distribution of our sample means will be normal even if the distribution of our sample (or the population from which we drew this sample) is non-normal. However, it is often the case in ecology that our samples are very non-normal and the Central Limit Theorem does not apply. This may be because the data are strongly skewed or because there are outliers in the data. In this lab you will learn how to deal with non-normal data and data with extreme outliers.
We first have to check whether our data are normal using the following qualitative and quantitative methods.
Histogram: If the histogram is not “bell-shaped” this is your first indication that your data may be non-normal.
QQplot: If all data do not stay within the confidence band your data may be non-normal.
Shapiro-Wilk Test: A quantitative test that tells you whether your data are normal or not. \(H_0\): Data are normal, \(H_A\): Data are non-normal. If p < 0.05, reject your \(H_0\) and conclude that the data are not normal; if p > 0.05, “conclude” that your data are normal. Remember not to count on the results from this test alone, and always to visually examine a QQplot or histogram.
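The three checks above can be sketched together on simulated data (the variable names here are illustrative):

```r
# checking normality three ways on a simulated normal sample
set.seed(1)
x <- rnorm(100, mean = 50, sd = 5)

hist(x)                # qualitative: roughly bell-shaped?
qqnorm(x); qqline(x)   # qualitative: points should fall near the line
shapiro.test(x)        # quantitative: H0 = data are normal;
                       # a large p-value fails to reject H0
```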
Checking normality with residuals
This allows us to check normality of more complicated relationships rather than just checking one variable at a time.
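A minimal sketch of checking residual normality, assuming a simple simulated linear relationship (the model and data are made up for illustration):

```r
# check normality of residuals from a fitted model
set.seed(2)
dat <- data.frame(x = runif(100, 0, 10))
dat$y <- 3 + 2 * dat$x + rnorm(100)   # linear relationship plus noise

fit <- lm(y ~ x, data = dat)  # fit the model
res <- residuals(fit)         # extract residuals

hist(res)                 # residuals should look bell-shaped
qqnorm(res); qqline(res)  # and fall near the QQ line
shapiro.test(res)         # H0: residuals are normal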
Outliers can drastically influence the results of t-tests and other parametric tests. It is always important to identify outliers before you begin your analysis. While there are formal methods for determining what constitutes an outlier, a simple method is to make a boxplot of your data. A boxplot will identify any points that do not lie within a given range above or below the median of your data. This range can vary depending on how you build your boxplots, but typically the width of the whiskers is related to the first and third quartiles of your data. In R, the default range is 1.5 times the distance between the first and third quartiles (the interquartile range).
While sometimes outliers can be erroneous values (i.e. you measured something incorrectly) they often tell you something important about the population you are measuring. Therefore, you cannot drop an outlier from your analysis without having good reason to do so. Good reasons could include measurement error or an underlying difference between that individual and the rest of the population. Dropping an outlier because it makes your life and the analysis easier is not a good reason…though we all wish it was. However, it is good practice to run your analysis with and without your outlier to determine if/how the outlier changes your results.
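A sketch of the boxplot rule in practice, using simulated data with one extreme value added on purpose:

```r
# identify outliers with the default 1.5 * IQR boxplot rule
set.seed(4)
x <- c(rnorm(50, mean = 20, sd = 2), 45)  # one extreme value appended

out <- boxplot.stats(x)$out  # values beyond the whiskers
out                          # the extreme value is flagged
boxplot(x)                   # outliers are drawn as individual points
```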
When you transform your data you apply some function (often non-linear) to your data. Wait… why the heck can you do this? If you think about it, the scale that we use to measure things is completely arbitrary (think about Celsius vs Fahrenheit). We could just as easily measure things in a base 2 system and there would, in principle, be nothing wrong with that. If this helps, you can imagine that by transforming your data you are moving to an alien world where they measure everything on a different scale (e.g. on a log scale). Here are some of the reasons why we transform our data:
Here are some common transformations and the scenarios in which you may use them. The variable Y represents your original data and Y’ represents your transformed data.
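A sketch of a few widely used transformations in R (the data are made-up skewed counts; which transformation fits depends on your data):

```r
# common transformations applied to a right-skewed variable
y <- c(1, 2, 2, 3, 5, 8, 20, 60)  # hypothetical skewed counts

log_y  <- log(y + 1)        # log transform; add 1 when zeros are present
sqrt_y <- sqrt(y)           # square-root transform, common for count data
asin_p <- asin(sqrt(0.25))  # arcsine-square-root, used for proportions
```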
Transformations are not magic. After you have transformed your data you must retest all of your assumptions. Moreover, transforming your data inherently changes your null hypothesis. For example, if I log transformed my number of flowers variable and then tested whether the means were different between the edge and the interior of a habitat patch, my \(H_0\) becomes that there is no difference in the log-mean.
Any conclusions you make about your data have to be in terms of the transformation. For example, you would conclude that the log number of flowers on the edge was significantly more than the log number of flowers in the interior.
If a transformation does not fix your problem (e.g. make your data more normal, reduce the heterogeneity of variance, etc.) you cannot proceed with your parametric analysis. Fortunately, there are some non-parametric options that you can consider. Remember that non-parametric tests do not assume that the data follow any particular distribution.
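One such option is the Wilcoxon rank-sum (Mann-Whitney U) test, a non-parametric alternative to the two-sample t-test. A sketch on simulated skewed samples (the group names echo the flower example above but the data are made up):

```r
# non-parametric comparison of two skewed samples
set.seed(3)
edge     <- rexp(30, rate = 0.2)  # skewed, non-normal sample
interior <- rexp(30, rate = 0.1)

wilcox.test(edge, interior)  # compares ranks, not means;
                             # no normality assumption required
```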
To know what normal (gaussian) data should look like, we will generate a dataframe with a known normal distribution
```r
# Generate randomly sampled data to visualize
norm_samp <- rnorm(1000, mean = 21, sd = 3)  # normal distribution
exp_samp  <- rexp(1000)                      # exponential distribution

# Visualization with a histogram
# 20 bins is practical here: with a sample of 1000 you get a sense of how
# many values fall within bins of a sensible width. This normal variable
# could be "age", for example.
hist(norm_samp, breaks = 20, col = 'darkgray',
     ylab = "Frequency", main = "Histogram of Normally Distributed Data")

hist(exp_samp, breaks = 20, col = 'darkgray',
     ylab = "Frequency", main = "Histogram of Non-Normally Distributed Data")
```