# Reading data files into R, formatting and printing { #data }

## Reading Microsoft Excel files into R

The following methods can be used to read an Excel file into R as an object:

(a)	The file can be stored as a *<span style="color:#BE99FF">.txt</span>*  or *<span style="color:#BE99FF">.csv</span>* file and then `read.table()`, `scan()` or `read.csv()` can be used to read the file into R.

(b)	Directly read the *<span style="color:#BE99FF">.xlsx</span>* file into R with the `readxl` package. List the sheet names with `excel_sheets()`. Specify a worksheet by name or number with a command like `objectname <- read_excel(xlsx_example, sheet = "Sheet1")`.

(c)	The *<span style="color:#BE99FF">.xlsx</span>* file can also be read into R with the `xlsx` package. The R functions `read.xlsx()` and `read.xlsx2()` can be used to read the contents of an Excel worksheet into an R data frame. The difference between the two functions is that `read.xlsx()` preserves the data type: it tries to guess the class of the variable corresponding to each column in the worksheet. Note that `read.xlsx()` is slow for large data sets (worksheets with more than 100 000 cells); `read.xlsx2()` is faster on big files. The commands have the following format: `objectname <- read.xlsx (file, sheetIndex, header = TRUE,  colClasses=NA)` and `objectname <- read.xlsx2 (file, sheetIndex, header = TRUE, colClasses="character")`.

(d)	Select the data in *<span style="color:#BE99FF">Excel</span>* (Data can also be selected in any other application such as *<span style="color:#BE99FF">Word</span>* or a text editor). Copy the selected range. In R: `objectname <- read.table (file = "clipboard")`. *Hint*: Be careful with empty cells in *<span style="color:#BE99FF">Excel</span>*: some preparation of the *<span style="color:#BE99FF">Excel</span>* file might be needed.

(e)	To avoid problems with end-of-file characters that can occur when using the method in (d), the package `clipr` can be used.

```{r, cliprExample, eval = FALSE}
library (clipr)
objectname <- read_clip_tbl (header = TRUE, row.names = 1)
```

<div style="margin-left: 25px; margin-right: 20px;">
The functions `clear_clip()` and `write_clip()` can also be very useful.
</div>
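
As a minimal sketch of method (a), the chunk below writes a small data frame to a temporary *.csv* file and reads it back with `read.csv()`. The data frame `marks` and the temporary file are made up for illustration; in practice the file would be saved from Excel.

```{r, csvRoundTripSketch}
# Hypothetical data standing in for a worksheet saved as .csv
marks <- data.frame(student = c("A", "B", "C"),
                    test1 = c(55, 72, 64),
                    test2 = c(61, 80, NA))

csv.file <- tempfile(fileext = ".csv")   # temporary file, removed below
write.csv(marks, file = csv.file, row.names = FALSE)

# read.csv() turns the entries written as NA back into missing values
marks.in <- read.csv(csv.file, header = TRUE)
str(marks.in)

unlink(csv.file)
```

The argument `row.names = FALSE` prevents `write.csv()` from adding a column of row numbers that would otherwise be read back as an extra variable.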

## Reading other data files into R

The R package `foreign` provides functions for reading data from other statistical software into R:

```{r, foreignExample}
library(foreign)
objects(name="package:foreign")
```

Study the help files of these functions for reading into R binary data, *<span style="color:#BE99FF">SAS</span>* XPORT format, *<span style="color:#BE99FF">Weka</span>* Attribute-Relation File Format, the Xbase family of database languages *<span style="color:#BE99FF">dBase</span>*, *<span style="color:#BE99FF">Clipper</span>* and *<span style="color:#BE99FF">FoxPro</span>*, *<span style="color:#BE99FF">Stata</span>*, *<span style="color:#BE99FF">Epi</span>* Info and Data files, *<span style="color:#BE99FF">Minitab</span>* portable worksheets, *<span style="color:#BE99FF">Octave</span>* text files, data.dump files that were produced in *<span style="color:#BE99FF">S</span>* version 3, *<span style="color:#BE99FF">SPSS</span>* save or export files, *<span style="color:#BE99FF">SAS</span>* permanent data sets in *<span style="color:#BE99FF">.ssd</span>* format^[This function requires SAS to be installed, since it creates and runs a SAS program that converts the data set to transport format and then uses `read.xport()` to obtain a dataframe.] and *<span style="color:#BE99FF">Systat</span>* files.

## Sending output to a file

The function `sink("filename")` can be used to divert output that normally appears in the console to a file. The option `options (echo = TRUE)` controls whether the R instructions are echoed as well; according to the help file of `options()`, it is only used in non-interactive mode. The instruction `sink()` makes output appear in the console again.
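
The following chunk is a small sketch of this mechanism. The file name is a temporary file created only for illustration, and the file is read back with `readLines()` to show what was diverted.

```{r, sinkSketch}
out.file <- tempfile(fileext = ".txt")   # hypothetical output file
sink(out.file)                           # divert console output to the file
print(summary(c(2, 4, 6, 8)))
cat("Output written while sink is active\n")
sink()                                   # output returns to the console

sunk.lines <- readLines(out.file)        # inspect what was written
sunk.lines
unlink(out.file)
```

Every call `sink("filename")` should be matched by a call `sink()`; otherwise later output keeps disappearing into the file.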

How do the functions `write(x)` and `sink("filename")` differ?   Study the arguments of `write()` thoroughly.

## Writing R objects for transport

The R function `save(..., file = )` writes an external representation of R objects to the specified file. The names of the objects to be saved should appear either as symbols (or character strings) in `...` or as a character vector in the argument `list`. These objects can be read back from the file using the function `load (file = )`. Study how these two functions work by consulting the help files. The functions `save()` and `load()` are very useful for transporting R objects between computers.

The functions `saveRDS (object = , file = )` and `object.name <- readRDS (file = )` write a single R object to a file, and restore it under the name `object.name`. Care has to be taken with the older text-based functions `dump()` and `source()`. If R objects were saved to a file using `dump()`, they should be restored to an R workspace with `source()`, not `load()`.
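
The three pairs of functions can be compared in a short sketch. The object names and the temporary files below are chosen only for illustration.

```{r, saveLoadSketch}
x.vec  <- c(1.5, 2.5, 3.5)
y.list <- list(a = 1:3, b = "text")

rdata.file <- tempfile(fileext = ".RData")
rds.file   <- tempfile(fileext = ".rds")
dump.file  <- tempfile(fileext = ".R")

save(x.vec, y.list, file = rdata.file)   # several objects, names are stored
saveRDS(y.list, file = rds.file)         # one object, name is not stored
dump("x.vec", file = dump.file)          # R source text that recreates x.vec

rm(x.vec, y.list)

load(rdata.file)                 # restores x.vec and y.list under their own names
new.list <- readRDS(rds.file)    # the name is chosen when reading
rm(x.vec)
source(dump.file, local = TRUE)  # a dumped object is restored with source()

identical(new.list, y.list)
x.vec
unlink(c(rdata.file, rds.file, dump.file))
```

Because `readRDS()` returns the object instead of restoring it under its old name, it cannot silently overwrite an existing object in the workspace, which `load()` can do.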

## The use of the file .Rhistory and the function `history()`

The file *<span style="color:#FF9696">.Rhistory</span>* is created in the same folder as the *<span style="color:#FF9696">.RData</span>* file. It can be inspected with any text editor or with *<span style="color:#BE99FF">MS Word</span>* and as such provides a record of the commands entered in the R console (commands window).

Study the help file of the function  `history()`.

## Command re-editing

(a)	Use the up and down arrow keys to recall previous commands, and the Delete, Backspace, Home and End keys to edit them.

(b)	Note the use of the script window to execute entire functions or selected instructions only.

## Customized printing

The basic tool for customized printing is the function `cat()`.   This function can be used to output messages to the console or to a file.  Note the different arguments that are available for `cat()`:

(i)	By default output is displayed on the screen; for output to be directed to a file, use argument `file = "file name including path"`.

(ii)	By default output directed to a file replaces previous contents of the file; use argument `append = TRUE` to append new output to previous contents.

(iii)	Use `sep = "xx"` to automatically insert characters between the unnamed arguments to `cat()` in the output.

(iv)	To automatically insert new lines in the output use `fill = TRUE`.

(v)	The `labels =` argument allows insertion of a character string at the beginning of each output line. If labels is a vector its values are used cyclically.
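
The arguments listed above can be tried out with the sketch below. The output file is a temporary file and the strings are arbitrary; neither comes from the original text.

```{r, catSketch}
cat.file <- tempfile(fileext = ".txt")     # hypothetical output file

cat("First line\n", file = cat.file)                    # creates or overwrites
cat("Second line\n", file = cat.file, append = TRUE)    # adds to the file
cat(1:5, sep = " | ", file = cat.file, append = TRUE)   # separator between items
cat("\n", file = cat.file, append = TRUE)

cat.lines <- readLines(cat.file)
cat.lines

# A numeric fill sets the line width at which the output is broken;
# the labels are used cyclically at the start of each line
cat("alpha", "beta", "gamma", "delta", "epsilon", "zeta",
    fill = 20, labels = c("{a}:", "{b}:"))
unlink(cat.file)
```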

Write today’s date as given by the function `date()` in the form `“The date today is:   Day of the week,  xx, month,  20xx.”` as a heading to a file. *Hint*: recall functions `cat()`, `match()`, `substring()`, `paste()`,  `replace()`.

##	Formatting numbers

(a)	Study how the functions `round()` and `signif()`  together with `cat()` can be used to set the number of decimals that are printed.

(b)	Study the use of `options(digits=xx)`.

(c)	Study how the function `format()` works. Note the use of `format()` together with `paste()` and `cat()`.

(d)	What does `print()` do?

(e)	Study the help file of `write.table()`.

(f)	The functions `prmatrix()` or `print()` can be used to output matrices to the console during execution of a function. This is very convenient for inspecting intermediate results.  Determine how the latter function differs from `cat()`.

(g)	Note the difference between the following statements:

```{r, formatExample}
colnames(state.x77)
format(colnames(state.x77))
```

(h)	Study the following example carefully:

```{r, summaryExample}
format.mns <- format (apply (state.x77, 2, mean))
format.names <- format (colnames (state.x77))
descrip.mns <- paste("Mean for variable", format.names, " = ", format.mns)
cat(descrip.mns, fill = max(nchar(descrip.mns)))
```
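
As a starting point for items (a) to (c), the sketch below applies `round()`, `signif()` and `format()` to a single arbitrary number, `num.val`, chosen only for illustration.

```{r, formatNumbersSketch}
num.val <- 1234.56789

round(num.val, 2)             # two decimal places
signif(num.val, 2)            # two significant digits
format(num.val, nsmall = 2)   # character string with at least two decimals
format(c(1, 10, 100))         # common width, padded with spaces
format(0.000012345, scientific = TRUE, digits = 3)

rounded <- round(num.val, 2)
cat("Rounded value:", format(rounded, nsmall = 2), "\n")
```

Note that `round()` and `signif()` return numbers, while `format()` returns character strings, which is why `format()` combines naturally with `paste()` and `cat()`.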

## Printing tables

Study the example below of how to represent the maximum and minimum values of the variables in the state.x77 data set in a table, together with the names of the states corresponding to these values.

```{r, min_max_table}
mins <- apply(state.x77, 2, min)
maxs <- apply(state.x77, 2, max)
min.name <- character(ncol(state.x77))
min.name
for(i in 1:8) min.name[i] <- rownames(state.x77)[state.x77[,i] == mins[i]][1]
max.name <- character(8)
for(i in 1:8) max.name[i] <- rownames(state.x77)[state.x77 [,i] == maxs[i]][1]
my.table <- data.frame(mins, min.name, maxs, max.name)
dimnames(my.table) <- list(names(mins),c("Minimum",
                                         "State with Min",
                                         "Maximum",
                                         "State with Max"))
colnames(my.table)[3] <- paste("     ", colnames(my.table)[3])
my.table
```

An alternative version of the above table could be obtained with the following instructions:

```{r, min_max_table_alt}
cat (paste (format (    c  (" ", "Statistic", " ", names(mins))),
            format ( paste ("  ", c("  ", "Minimum", " ", format(mins)))),
            format (    c  ("State having", "Minimum", " ", min.name)),
            format (paste  ("       ", c(" ", "Maximum", " ", format(maxs)))),
            format (    c  ("State having","Maximum", " ", max.name))),
              fill=TRUE)
```

Make the necessary changes in the above lines of code to improve the column spacing.

## Communicating with the operating system

Study how the function `system()` works using the instructions:  *“time”*,  *“date”* and *“dir”*.  *Hint*:  First study the help file of the R function `system()` and then the following instructions:

```{r, system, eval = FALSE}
system (paste (Sys.getenv ("COMSPEC"), "/c", "time \t"),
         show.output.on.console = TRUE, invisible = TRUE)
system (paste (Sys.getenv ("COMSPEC"), "/c", "date \t"),
         show.output.on.console = TRUE, invisible = TRUE)
system (paste (Sys.getenv ("COMSPEC"), "/c", "dir c:\\"),
         show.output.on.console = TRUE, invisible = TRUE)
```

The R function `system()` can also be used together with Notepad  to create a text file during an R session:

```{r, system2, eval = FALSE}
system (paste (Sys.getenv ("COMSPEC"), "/c",
               "notepad c:\\temp\\test.txt"),
        show.output.on.console = TRUE, invisible = TRUE)
```

(a)	Use `system()` to create a text file without terminating the R session.

(b)	Use `system()` to write a function `myfile.exists()` that checks whether any specified file exists.

##	Exercise

::: {style="color: #80CC99;"}

1.	Construct tables displaying the values of all variables in the state.x77 data set separately for each region as defined in the R object `state.region`.

2.	Print a table from the state.x77 data set such that for each variable, an asterisk is placed after the maximum value for that variable. The numbers must line up correctly.

:::

## Tidyverse

*<span style="color:#FF9966">Tidyverse</span>* is a collection or *ecosystem* of R packages that use the same data structures for data manipulation and exploration. With the command `library (tidyverse)`, the core packages listed in Table \@ref(tab:TidyverseCore) will also be loaded. A selection of other packages from the tidyverse collection is given in Table \@ref(tab:TidyverseOther).

Table: (\#tab:TidyverseCore) Additional core tidyverse packages.

| *<span style="color:#F7CE21">Package</span>* | *<span style="color:#F7CE21">Purpose</span>*  |
| ------ | --------------- |
| `dplyr`   |	Data manipulation |
| `tidyr`   |	Data tidying |
| `tibble`  |	Similar to data frames |
| `readr`   |	Data import |
| `ggplot2` |	Data visualisation (see Chapter 10) |
| `stringr` |	String manipulation |
| `forcats` |	Factor variable manipulation |
| `purrr`    |	Functional programming |

Table: (\#tab:TidyverseOther) Selection of packages from tidyverse.

| *<span style="color:#F7CE21">Package</span>* | *<span style="color:#F7CE21">Purpose</span>*  |
| ------ | --------------- |
| `hms`, `lubridate` |	Working with date/time vectors |
| `feather`          |	Sharing with *<span style="color:#BE99FF">Python</span>* and other languages |
| `haven`            |	Importing *<span style="color:#BE99FF">SPSS</span>*, *<span style="color:#BE99FF">SAS</span>* and *<span style="color:#BE99FF">Stata</span>* files |
| `httr`             |	Sharing with web interfaces |
| `jsonlite`         |	*<span style="color:#BE99FF">JavaScript</span>* Object Notation (JSON) |
| `rvest`            |	Web scraping |
| `readxl`           |	Reading *<span style="color:#BE99FF">.xls</span>* and *<span style="color:#BE99FF">.xlsx</span>* files |
| `xml2`             |	*<span style="color:#BE99FF">XML</span>* |
| `modelr`           |	Modelling within a pipeline |
| `broom`            |	Turning models into tidy data |


### Tibbles

A *<span style="color:#FF9966">tibble</span>* is a new version of a dataframe. Tibbles have an enhanced `print()` method which makes them easier to use with large datasets containing complex objects. To create a tibble from the dataframe iris, we use the commands:

```{r, tibble_iris, message = FALSE, warning = FALSE}
library ("tidyverse")
iris.tibble <- tibble(iris)
iris.tibble
```

Tibbles can also be formed from vectors, automatically creating a column from each vector.

```{r, tibble_vector}
tibble(x = fruit)   # data set fruit in package stringr
```

Matrices are also easily converted to tibbles.

```{r, tibble_matrix}
X <- matrix (1:12,ncol=3)
tibble(X)
```

Even lists can be converted to tibbles.

```{r, tibble_list}
my.list <- list(a = 1:10, beta = exp(-3:3),
                logic = c(TRUE,FALSE,FALSE,TRUE))

my.list
tibble (my.list)
```

To create a tibble from scratch we can use the command:

```{r, tibble_scratch}
my.dat <- tibble(x = 1:5, y = 1, z = y - x ^ 2)
my.dat
```

There are three major differences between tibbles and dataframes.

(a) As seen above, the print method for tibbles only shows the first 10 rows and uses fonts and colours for emphasis. It also only shows the columns that fit onto the screen and provides a summary of each column type. You can control the default print behaviour by setting options: `options(tibble.print_max = n, tibble.print_min = m)`. If there are more than $n$ rows, print only $m$ rows. Use `options(tibble.print_min = Inf)` to always show all rows and `options(tibble.width = Inf)` to always print all columns, regardless of the width of the screen.

(b) Tibbles are stricter with subsetting, always returning another tibble.

```{r, tibble_subset}
my.dat["y"]
```

<div style="margin-left: 25px; margin-right: 20px;">
To extract a column, there are three options:
</div>

```{r, tibble_extract_col}
my.dat$x
my.dat[["y"]]
my.dat[[3]]
```

<div style="margin-left: 25px; margin-right: 20px;">
Tibbles never do partial matching, and will return NULL with a warning if the column does not exist.
</div>

(c) Tibbles are also stricter with recycling, only allowing values of length one to be recycled. The first column with length different to one determines the number of rows in the tibble and conflicts will lead to an error. To create a tibble with zero rows, let the first column have length $0 \neq 1$, as in the command

```{r, tibble_zero}
tibble(a = integer(), b = 1)
```
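
Differences (b) and (c) can be seen side by side in the following sketch, which builds the same small data set as a dataframe and as a tibble. The objects `small.df` and `small.tbl` are made up for illustration.

```{r, tibbleVsDataFrameSketch, message = FALSE, warning = FALSE}
library(tibble)

small.df  <- data.frame(xyz = 1:4, w = letters[1:4])
small.tbl <- tibble(xyz = 1:4, w = letters[1:4])

class(small.df[, "xyz"])     # the dataframe drops to a vector
class(small.tbl[, "xyz"])    # the tibble stays a tibble

small.df$xy                      # partial matching finds the column xyz
suppressWarnings(small.tbl$xy)   # NULL: tibbles do no partial matching

# Recycling: a dataframe recycles the shorter column, a tibble refuses
nrow(data.frame(a = 1:4, b = 1:2))
recycle.try <- tryCatch(tibble(a = 1:4, b = 1:2),
                        error = function(e) "error: incompatible sizes")
recycle.try
```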

### Pipe operator

The pipe operator, `|>`, pipes an object forward into a function call, so that we write `x |> f()` rather than $f(x)$. In the example below, the first three commands create two intermediate objects, `car_data` and `cyl_means`; the same group means are obtained with the single piped call that follows.

```{r, pipe_example}
car_data <- mtcars[mtcars$hp > 100,]
cyl_means <- apply(car_data, 2, function(x, cyl)
                                  { tapply(x, cyl, mean)
                                  }, cyl=car_data$cyl)
cyl_means

mtcars |>
  filter(hp > 100) |>
  group_by(cyl) |>
  summarise(across(everything(), mean))
```

The first pipe operator `%>%` was created in the package `magrittr`. This package is automatically loaded when `tidyverse` is attached. The following call will therefore have a similar outcome:

```{r, pipe_old}
mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(across(everything(), mean))
```

From R version 4.1.0 the pipe operator `|>` is directly built into R and can therefore be used at any time without having to attach another package.

In the piped calls above, the dataframe (or tibble) `mtcars` is piped forward to the function `filter()`, which tells R that the variable `hp` belongs to `mtcars`; the sub-tibble containing only the rows with `hp > 100` is then piped forward to the function `group_by()`.
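
Since `|>` is part of base R, the idea can also be sketched without any tidyverse functions. The vector `vals` is an arbitrary example, and R version 4.1.0 or later is assumed.

```{r, pipeBaseSketch}
vals <- c(1, 4, 9, 16)

nested <- round(mean(sqrt(vals)), 1)             # read from the inside out
piped  <- vals |> sqrt() |> mean() |> round(1)   # read from left to right

nested
identical(nested, piped)
```

The object on the left of `|>` becomes the first argument of the call on the right, so `vals |> round(1)` is evaluated as `round(vals, 1)`.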

### Tidy data

Tidy data is data where every column represents a single variable, every row is a single observation and in every cell is a single value. The terms ‘variable’ and ‘observation’ are important – a variable contains all values that measure the same feature across units; an observation contains all values on a single unit, across features. For creating a tidy data set there are five main types of operations:

####	Pivoting

The functions `pivot_longer()` and `pivot_wider()` are used to convert data into long or wide format respectively. Consider the long data set Rabbit in package `MASS`.

```{r, Rabbit}
library (MASS)
tibble (Rabbit)
```

The command below pivots the tibble into a wide format.

```{r, pivot_wider}
rabbit <- Rabbit |>
  pivot_wider(names_from = c(Animal, Treatment, Run), values_from = BPchange)
rabbit
```

For the converse, the command below pivots the wide tibble, `rabbit`, to long format.

```{r, pivot_longer}
rabbit |> pivot_longer(cols = -Dose, names_to = "Treat.comb",
                       values_to = "BPchange")
```

Note that the column headings now form a single variable. To separate the combination of variables into different columns, we need the following command:

```{r, pivot_longer2}
rabbit |>
  pivot_longer(cols = -Dose,
               names_to = c("animal","treatment","run"),
               names_pattern ="(.*)_(.*)_(.*)",
               values_to = "BPchange")
```
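
When the parts of the column headings are separated by a single fixed character, the argument `names_sep` is a simpler alternative to `names_pattern`. The sketch below uses a small made-up tibble, `wide.dat`, rather than the Rabbit data, and also checks that pivoting back recovers the wide layout.

```{r, pivotRoundTripSketch, message = FALSE}
library(tidyr)
library(tibble)

# Hypothetical wide data: one column per treatment-by-run combination
wide.dat <- tibble(dose = c(10, 20),
                   control_1 = c(0.5, 1.0),
                   control_2 = c(0.7, 1.2),
                   drug_1 = c(0.4, 2.0),
                   drug_2 = c(0.6, 2.4))

long.dat <- wide.dat |>
  pivot_longer(cols = -dose,
               names_to = c("treatment", "run"),
               names_sep = "_",
               values_to = "response")
long.dat

# Pivoting back recovers the original wide layout
back.wide <- long.dat |>
  pivot_wider(names_from = c(treatment, run), values_from = response)
all.equal(as.data.frame(wide.dat), as.data.frame(back.wide))
```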

#### Rectangling {#rectangling}

Rectangling is used to place nested lists in a tidy, rectangular format. Consider the list below:

```{r, Toothless_Dory_example}
df <- tibble(
  character = c("Toothless", "Dory"),
  metadata = list(
    list(
      species = "dragon",
      color = "black",
      films = c(
        "How to Train Your Dragon",
        "How to Train Your Dragon 2",
        "How to Train Your Dragon: The Hidden World"
      )
    ),
    list(
      species = "blue tang",
      color = "blue",
      films = c("Finding Nemo", "Finding Dory")
    )
  )
)
df
```

The following command places the two list items of metadata in a tibble with two rows, one for Toothless and one for Dory. Each of the three components – species, color and films – forms a column in the new tibble.

```{r, unnest}
df |> unnest_auto(metadata)
```

In addition to the function `unnest_auto()`, the functions `unnest_wider()` and `unnest_longer()` place the list components into columns or rows respectively. The function `unnest_auto()` selects the most appropriate of `unnest_wider()` or `unnest_longer()`. In the first line of the output above, the `unnest_auto()` function states Using `'unnest_wider(metadata)'`, indicating that the wider application was used for this list.

The function `hoist()` can be used to reach down multiple layers.

```{r, hoist}
df |> hoist(metadata, "species",
            first_film = list("films", 1L),
            third_film = list("films", 3L))
```

Note that `hoist()` also allows us to extract only certain components.

####	Nesting

In nesting, a tibble containing list-columns is created. In the example below, we create a tibble with three rows – one for each species – and two columns, where each element in the second column is a $50 \times 4$ tibble of the four variables measured on the $50$ samples from that particular species.

```{r, nest}
iris |> nest(data = !Species)
```

We can also create tibbles with three columns where the data is grouped by ‘Petal’ and ‘Sepal’ in the first instance and by ‘width’ and ‘length’ in the second.

```{r, nest2}
iris |> nest(petal = starts_with("Petal"), sepal = starts_with("Sepal"))

iris |> nest(width = contains("Width"), length = contains("Length"))
```

The function `unnest()` is similar to the functions discussed in \@ref(rectangling), and can be used to expand several list-columns of a tibble simultaneously.

```{r, unnest2}
df <- tibble(x = 1:3,
             y = list(NULL,
                      tibble(a = 1, b = 2),
                      tibble(a = 1:3, b = 3:1)))
df

df |> unnest(y)

df %>% unnest(y, keep_empty = TRUE)

df <- tibble(a = list(c("a", "b"), "c"),
             b = list(1:2, 3),
             c = c(11, 22))
df

df |> unnest(c(a, b))

df |> unnest(a) %>% unnest(b)
```

#### Splitting and combining

We use the functions `separate()` and `extract()` for separating columns and `unite()` to combine columns into a single column. The function `separate()` divides the data, while `extract()` picks out a part of the data.

```{r, split_combin}
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df

df |> separate(x, c("A", "B"))

df |> separate(x, c(NA, "B"))

df |> extract(x, "A")

df |> extract(x, c("A", "B"),"([[:alnum:]]+).([[:alnum:]]+)")

df <- expand_grid(x = c("a", NA), y = c("b", NA))
df

df |> unite("z", x:y, remove = FALSE)

df |> unite("z", x:y, na.rm = TRUE, remove = FALSE)
```
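
In recent versions of `tidyr` the help files describe `separate()` and `extract()` as superseded by the `separate_wider_*()` family of functions. The sketch below assumes `tidyr` version 1.3.0 or later and repeats the first split with `separate_wider_delim()`; the object name `df.sep` is chosen only for illustration.

```{r, separateWiderSketch, message = FALSE}
library(tidyr)

df.sep <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))

# Assumes tidyr >= 1.3.0; the delimiter is matched literally, not as a regex
df.wide <- df.sep |>
  separate_wider_delim(x, delim = ".", names = c("A", "B"))
df.wide
```

Because the delimiter is treated as a fixed string here, the full stop does not need to be escaped, in contrast to the regular expression used with `extract()` above.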

#### Dealing with missing values

The functions `complete()`, `drop_na()`, `fill()` and `replace_na()` are the most important for treatment of missing values.

```{r, missingValues}
df <- tibble(group = c(1:2, 1),
             item_id = c(1:2, 2),
             item_name = c("a", "b", "b"),
             value1 = 1:3,
             value2 = 4:6)
df

df |> complete(group, nesting(item_id, item_name))

df |> complete(group, nesting(item_id, item_name),
                 fill = list(value1 = 0))

df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df

df |> replace_na(list(x = 0, y = "unknown"))

df |> drop_na()

df |> drop_na(x)
```
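
The function `fill()` is not used in the chunk above. A minimal sketch, based on a made-up tibble `sales` in which the year is recorded only in the first row of each block, is:

```{r, fillSketch, message = FALSE}
library(tidyr)
library(tibble)

# Hypothetical sales data where the year is recorded only once per block
sales <- tibble(year = c(2000, NA, NA, 2001, NA),
                quarter = c("Q1", "Q2", "Q3", "Q1", "Q2"),
                amount = c(10, 12, NA, 15, 11))

sales.filled <- sales |> fill(year)   # carry the last observed value downwards
sales.filled

sales |> fill(amount, .direction = "up")   # fill from the next observed value
```

Only the columns named in the call are filled, so the missing value in `amount` is left untouched in `sales.filled`.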

### Package `dplyr` {#dplyr}

The main data manipulation functions are found in the package `dplyr`. The functions are referred to as “verbs”, since each performs a particular operation of data manipulation. The verbs are grouped in Table \@ref(tab:dplyr) according to operations on columns, rows or groups of rows.

Table: (\#tab:dplyr) Verbs for data manipulation in dplyr.

| *<span style="color:#F7CE21">Verb</span>* | *<span style="color:#F7CE21">Operates on</span>*  |
| ------ | --------------- |
| `select()`   |	Columns |
| `rename()`   |	Columns |
| `mutate()`   |	Columns |
| `relocate()` |	Columns |
| `filter()`   |	Rows |
| `arrange()`  |	Rows |
| `slice()`    |	Rows |
| `group_by()` |	Rows |
| `summarise()`|	Group of rows |

The functioning of the verbs will be illustrated with `UScereal` in the package `MASS`.

```{r, cereal}
library (MASS)
cereal <- tibble (UScereal)
cereal
```

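The function `select()` allows for extracting one or more columns from a data set. The columns can be named or referred to by index. Using the function `everything()` in conjunction with `select()` is useful to sort or reorder the columns of a data set. The prefix `dplyr::` is used in the calls below because package `MASS`, attached after `tidyverse`, also contains a function `select()` that masks the `dplyr` version.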

```{r, select}
dplyr::select (cereal, calories)        # select only column calories

dplyr::select (cereal, calories, fat)   # select two columns

dplyr::select (cereal, c(5,7:8))        # select by index

dplyr::select (cereal, -c(1,9,11))      # select columns to exclude

                   # reorder with calories first, followed by fibre
dplyr::select (cereal, calories, fibre, everything())
```

The `rename()` function changes one or more column names. The companion function `rename_with()` can be used to apply a function to column headings, such as `tolower()` and `toupper()` to change the case of column headings.

```{r, rename}
rename (cereal, Manufacturer=mfr)

rename_with (cereal, toupper, starts_with("F"))
```

New variables can be added or created from existing columns with the function `mutate()`. The newly formed variables are immediately available for creating more variables. Variables can be removed by transforming them to `NULL` or using the `.keep` argument.

```{r, mutate}
mutate (cereal, fat.vs.pr = fat/protein, mfr=NULL) |>
     dplyr::select (fat.vs.pr, everything())

mutate (cereal, fat.vs.pr = fat/protein,
                 comb.var = sodium + fat.vs.pr,
                 new.var=1:nrow(cereal), .keep="used")
```

Why is it useful to pipe the mutated tibble above to select? In comparison, `relocate()` makes it easy to move blocks of columns.

```{r, relocate}
relocate (cereal, shelf)

relocate (cereal, cal=calories, .before = fat)

relocate(cereal, where(is.factor), .after=last_col())
```

The `filter()` function selects rows from a tibble, based on any expression that evaluates to a vector of `TRUE` / `FALSE` values with length equal to the number of rows.

```{r, filter}
filter (cereal, fat<1)

filter (cereal, fat<1, mfr=="K")

filter (cereal, fat<1 | mfr=="K")

filter(cereal, between(sugars, 10, 20))
```

The verb `arrange()` refers to sorting the rows according to the values in one or more columns.

```{r, arrange}
arrange (cereal, fibre)

arrange (cereal, -fibre)

arrange (cereal, fat, desc(mfr))
```

The function `slice()` also allows for the selection of rows and works with a few helper functions: `slice_head()`, `slice_tail()`, `slice_sample()`, `slice_min()` and `slice_max()` to select the first few, last few, random sample, rows with lowest values or rows with highest values, respectively.

```{r, slice}
slice (cereal, 10:20)

slice (cereal, -(10:20))

slice_tail (cereal, n=3)

slice_sample (cereal, n=8)

slice_max (cereal, sodium, n=4)
```

A grouped object can be formed with the `group_by()` function. At first glance, it appears similar to the ungrouped tibble, but grouping will prove useful in further data manipulations.

```{r, groupby}
cereal.mfr <- group_by(cereal, mfr)

cereal.mfr          # looks no different

class(cereal)

class(cereal.mfr)   # but it is a grouped object
```

The `summarise()` function allows for the computation of descriptive statistics. Operating on an ungrouped object, the overall statistic is computed, while the grouped object will provide the required statistics by group.

```{r, summarise}
summarise(cereal.mfr, mean.cal = mean(calories),
          median.carbo = median(carbo))

group_by(cereal, mfr, shelf) |>
    summarise(mean.cal = mean(calories))

summarise(cereal, mean.cal = mean(calories), max.fat = max(fat),
          median.carbo = median(carbo), sum.sugar = tibble(fivenum(sugars)))
```

Since the function `fivenum()` does not return a scalar value, but a vector, the output appears as a tibble above. Alternatively, the function `reframe()` can be used.

```{r, reframe}
reframe(cereal, mean.cal = mean(calories), max.fat = max(fat),
          median.carbo = median(carbo), sum.sugar = fivenum(sugars))
```
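
The verbs are typically combined in a single pipeline. The sketch below uses the `mtcars` data set from base R instead of `cereal`; the new column `wt.kg` and the summary names are invented for illustration.

```{r, dplyrPipelineSketch, message = FALSE}
library(dplyr)

car.summary <- mtcars |>
  filter(hp > 100) |>                       # rows: keep the more powerful cars
  mutate(wt.kg = wt * 1000 * 0.4536) |>     # columns: wt is given in 1000 lbs
  group_by(cyl) |>                          # groups of rows
  summarise(n.cars = n(),
            mean.mpg = mean(mpg),
            mean.wt.kg = mean(wt.kg)) |>
  arrange(desc(mean.mpg))                   # rows: sort the summary
car.summary
```

Each verb receives the result of the previous step as its first argument, so no intermediate objects have to be named.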

##	Exercise

::: {style="color: #80CC99;"}

1.	Convert the `fish_encounters` data set in package `tidyr` into a wide format with fish IDs as the row variable and a column for each station. The entries in the cells should be '1' for a fish encounter and '0' otherwise.

2.	The `billboard` data set in package `tidyr` contains song rankings for the Billboard top 100 in the year 2000, with columns artist, track, date.entered and wk1 to wk76, which contain the ranking of the song in each week after it entered the charts.

    (a)	Create a long data set listing the columns wk1 to wk76 below each other in a single column called week and the associated rank position in a column called rank. Note that not all songs stayed on the charts for the entire 76 weeks. *Hint*: use `values_drop_na = TRUE`.

    (b)	Use the command `nest()` to create a tibble with one row for each artist-track combination and a rank.hist variable where each cell contains a tibble with 76 rows (one for each week) and a column for each of date.entered, week and rank.

3.	Another form of mutation is to join together two separate data sets. Study the working of the functions `inner_join()`, `left_join()`, `right_join()` and `full_join()` together with the output of the commands:

```{r, bands, eval = FALSE}
band_members %>% inner_join(band_instruments)
band_members %>% left_join(band_instruments)
band_members %>% right_join(band_instruments)
band_members %>% full_join(band_instruments)
band_members %>% full_join(band_instruments2,
                              by = c("name" = "artist"))
```

4.	Use `state.x77` in package `datasets` to create a tibble called `USA.states` with the names of the states in the first column. *Hint*: first convert the matrix to a dataframe to get neater column names.

    (a)	Add the column `state.region`, also from package `datasets`, to USA.states in the second position.

    (b)	Select only the columns State, Region, Population, Income, Illiteracy, Life Exp and Area, then use the pipe operator to reorder the columns such that Area appears between Region and Population.

    (c)	Add a column `Pop.Density` for the population density in persons per square mile. Note that the population values in `state.x77` represent 1000's of persons. This column should appear between Population and Income.

    (d)	In a single command, using the pipe operator, create a tibble called `USA.groups` where you:
    * select only states with an area < 500 000 square miles;
    * order the rows according to decreasing population density;
    * group by Region.

    (e)	Compute the mean income and median life expectancy per region.

:::