projects/basicInferentialDataAnalysis.Rmd at master · hailuteju/projects · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
---
title: "Basic Inferential Data Analysis"
author: "Hailu Teju"
date: "September 16, 2017"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


## Basic Inferential Data Analysis

```{r, message=FALSE, warning=FALSE, echo=FALSE}
library(ggplot2)
library(datasets)
data(ToothGrowth)
```

In this second part of the project we analyze the ToothGrowth data.

```{r}
str(ToothGrowth)

```

Here we see that the data set has 60 observations and 3 variables, length(len), supplement type (supp), and dosage (dose).

```{r}
head(ToothGrowth)

table(ToothGrowth$supp)

table(ToothGrowth$dose)

sapply(with(ToothGrowth, split(len, supp)), summary)
```
The $sapply()$ function above gave us the six values summary of tooth growth (len) with respect to the "OJ" and "VC" supplements.

```{r, fig.width=12, fig.height=9}
g <- ggplot(ToothGrowth, aes(x=supp, y=len))
g <- g + geom_boxplot(aes(fill=supp))
g <- g + facet_grid(.~as.factor(dose))
g <- g + ggtitle("Tooth growth of guinea pigs by supplement type and dosage (mg)")
g <- g + xlab("Supplement type") + ylab("Tooth length")
g <- g + theme(plot.title=element_text(color="dodgerblue", size=16, hjust=0.5))
g <- g + theme(axis.title = element_text(size=14, color="dodgerblue"))
g <- g + theme(axis.text=element_text(size=14, color="black"))
g <- g + theme(strip.text = element_text(size = 16, color='darkblue'))
g <- g + theme(legend.position = 'right', legend.text=element_text(color='deeppink', size=12))
g

```
The figure above shows that there is a positive relationship between either supplement type and tooth growth. We can also observe that generally the "OJ" supplement has greater effect on toothgrowth than the "VC" supplement, particularly at dosage levels 0.5 and 1. When dosage level of the "OJ" supplement is increased from 1 to 2, the the increase in tooth growth rate becomes less. On the other had, there seems to be a linear relationship between the "VC" supplement dosage and toothgrowth rate. Furthermore, the figure above suggests that the effect on tooth growth of the "OJ" and "VC" supplements is about the same at dosage level 0.2.

## Hypothesis Testing

### Test 1 - Supplements "OJ" vs "VC"

Let's assume that there is no difference between the two supplements in their effect on tooth growth (null hypothesis). So here,

* (Null hypothesis) H0: no difference betweeen the two supplements in tooth growth
* (Alternative) Ha: tooth grows longer when the "OJ" supplement is used.
```{r}
# Split the tooth growth data set by the "OJ" and "VC" supplement types

OJlen <- ToothGrowth$len[ToothGrowth$supp=="OJ"]
VClen <- ToothGrowth$len[ToothGrowth$supp=="VC"]
```

Next we will perform a one-sided t-test with unequal variance on the above vectors of values of tooth length.

```{r}

t.test(OJlen, VClen, alternative = "greater", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
```
As the p-value (0.03032) is less than $\alpha = 0.05$ (the default value for tolerance of the error alpha), we reject the null hypothesis! In other words, it is very likely that tooth growth rate is higher when the "OJ" supplement is used instead of the "VC" supplement.

### Test 2 - Tooth growth vs dosage level

Let's assume that there is no difference in tooth growth between dosages (the null hypothesis). So here, our alternative hypotheis is that tooth grows at a greater rate as the dosage increases.

To this end, let's split the data by dosage.

```{r}
doseHalfLen <- ToothGrowth$len[ToothGrowth$dose==0.5]
doseOneLen <- ToothGrowth$len[ToothGrowth$dose==1]
doseTwoLen <- ToothGrowth$len[ToothGrowth$dose==2]
```

* One-sided independent t-test with unequal variance (doseHalfLen vs doseOneLen).

```{r}
t.test(doseHalfLen, doseOneLen, alternative = "less", paired = FALSE,
       var.equal = FALSE, conf.level = 0.95)
```
Here we get a p-value of $6.342\times 10^{-8},$ which is much much less than our tolerance level for error
$\alpha = 0.05.$ Thus, we reject the null hypothesis and accept the alternative. In other words, there is a very strong evidence that tooth grows more when the dosage level is 1 than when it is at 0.5.

* One-sided independent t-test with unequal variance (doseOneLen vs doseTwoLen).

```{r}
t.test(doseOneLen, doseTwoLen, alternative = "less", paired = FALSE,
       var.equal = FALSE, conf.level = 0.95)
```
Again, we get a p-value of $9.532\times 10^{-6},$ which is much much less than our tolerance level for error
$\alpha = 0.05.$ Thus, we reject the null hypothesis and accept the alternative. In other words, there is a very strong evidence that tooth grows more when the dosage level is at 2 than when it is at 1.

### Test 3 - Supplement "OJ" vs "VC" at dosage level 2

The figure above suggested that the effects of the "OJ" and "VC" supplements are about the same when the dosage level is at 2. So we here would like to know if this is indeed the case. To this end we make the assumption that there is no difference in tooth growth between the supplements "OJ" and "VC" when the dosage level is at 2 (null hypothesis). The alternative is that there is a difference.

First let's get the approprate subsets of the data set that reflect the effects of the two supplements at dosage level 2.

```{r}
OJdoseTwoLen <- ToothGrowth$len[ToothGrowth$supp=="OJ" & ToothGrowth$dose==2]
VCdoseTwoLen <- ToothGrowth$len[ToothGrowth$supp=="VC" & ToothGrowth$dose==2]
```

We perform a two-sided independent t-test with unequal variance.

```{r}
t.test(OJdoseTwoLen, VCdoseTwoLen, alternative = "two.sided", paired = FALSE,
       var.equal = FALSE, conf.level = 0.95)
```
In this case, the p-value 0.9639 is much much higher than our default error tolerance level ($\alpha = 0.05$,)
so, we don't reject the null hypothesis in this case, confirming what we suspected by observing the figure above - that the effect of the two supplements at dosage level 2 is about the same.