dplyr package
library(tidyverse)
Examine the iris data set. How many observations and variables are in the data set?
data("iris")
class(iris) # data frame
## [1] "data.frame"
as_tibble(iris) # convert to tibble; tbl_df() is deprecated in favor of as_tibble()
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
There are a total of 150 observations and 5 variables in the iris dataset.
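As a quick cross-check on the counts read from the str() output, the dimensions can also be queried directly. This is a minimal sketch using base R's dim(), nrow(), and ncol(); the names n_obs and n_vars are only illustrative.

```r
data("iris")

# dim() returns the number of rows followed by the number of columns
dim(iris)
## [1] 150   5

n_obs  <- nrow(iris)  # observations (rows)
n_vars <- ncol(iris)  # variables (columns)
n_obs
n_vars
```

The same two calls work on any of the later data frames, so they are a convenient way to answer the "how many observations and variables" questions without reading through the full str() listing.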
Create a data frame iris1 that contains only the
species virginica and versicolor with sepal lengths longer than 6 cm and
sepal widths greater than 2.5 cm. How many observations and variables are
in the data set?
iris1 <- filter(iris, Species == "virginica" | Species == "versicolor", Sepal.Length > 6, Sepal.Width > 2.5)
str(iris1)
## 'data.frame': 56 obs. of 5 variables:
## $ Sepal.Length: num 7 6.4 6.9 6.5 6.3 6.6 6.1 6.7 6.1 6.1 ...
## $ Sepal.Width : num 3.2 3.2 3.1 2.8 3.3 2.9 2.9 3.1 2.8 2.8 ...
## $ Petal.Length: num 4.7 4.5 4.9 4.6 4.7 4.6 4.7 4.4 4 4.7 ...
## $ Petal.Width : num 1.4 1.5 1.5 1.5 1.6 1.3 1.4 1.4 1.3 1.2 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
There are now 56 observations in the new iris1 data frame, and there are still 5 variables.
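The species condition can also be written with %in%, which some readers find easier to scan than a chain of | comparisons. The sketch below assumes only dplyr and the built-in iris data; iris1_or and iris1_in are illustrative names. It also shows why str() still reports 3 factor levels after filtering: filter() keeps unused levels unless they are dropped explicitly.

```r
library(dplyr)

data("iris")

# Filter with an explicit OR between the two species
iris1_or <- filter(iris,
                   Species == "virginica" | Species == "versicolor",
                   Sepal.Length > 6, Sepal.Width > 2.5)

# Same conditions, with the species test written using %in%
iris1_in <- filter(iris,
                   Species %in% c("virginica", "versicolor"),
                   Sepal.Length > 6, Sepal.Width > 2.5)

# Both forms should select the same rows
all.equal(iris1_or, iris1_in)

# filter() keeps the unused "setosa" level; droplevels() removes it
nlevels(iris1_in$Species)
nlevels(droplevels(iris1_in)$Species)
```

Dropping unused levels is optional here, but it can matter later if the data are plotted or tabulated by Species.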
Create an iris2 data frame from iris1 that contains only the columns for Species,
Sepal.Length, and Sepal.Width. How many observations and variables are
in the data set?
iris2 <- select(iris1, c(Species, Sepal.Length, Sepal.Width))
str(iris2)
## 'data.frame': 56 obs. of 3 variables:
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Sepal.Length: num 7 6.4 6.9 6.5 6.3 6.6 6.1 6.7 6.1 6.1 ...
## $ Sepal.Width : num 3.2 3.2 3.1 2.8 3.3 2.9 2.9 3.1 2.8 2.8 ...
In the iris2 dataset, there are 56 observations and 3 variables.
Create an iris3 data frame from iris2
that orders the observations from largest to smallest sepal length. Show
the first 6 rows of this data set.
iris3 <- arrange(iris2, desc(Sepal.Length))
head(iris3)
## Species Sepal.Length Sepal.Width
## 1 virginica 7.9 3.8
## 2 virginica 7.7 3.8
## 3 virginica 7.7 2.6
## 4 virginica 7.7 2.8
## 5 virginica 7.7 3.0
## 6 virginica 7.6 3.0
Create an iris4 data frame from iris3
that adds a column with a sepal area (length * width) value for each
observation. How many observations and variables are in the data set?
iris4 <- mutate(iris3, Sepal.Area = Sepal.Length * Sepal.Width)
head(iris4)
## Species Sepal.Length Sepal.Width Sepal.Area
## 1 virginica 7.9 3.8 30.02
## 2 virginica 7.7 3.8 29.26
## 3 virginica 7.7 2.6 20.02
## 4 virginica 7.7 2.8 21.56
## 5 virginica 7.7 3.0 23.10
## 6 virginica 7.6 3.0 22.80
str(iris4)
## 'data.frame': 56 obs. of 4 variables:
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Sepal.Length: num 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 ...
## $ Sepal.Width : num 3.8 3.8 2.6 2.8 3 3 2.8 2.9 3.6 3.2 ...
## $ Sepal.Area : num 30 29.3 20 21.6 23.1 ...
In the iris4 dataset, there are still 56 observations, but
now there are 4 variables (Sepal.Area was added as the last column).
Create iris5, which calculates the average sepal
length, the average sepal width, and the sample size of the entire
iris4 data frame, and print iris5.
iris5 <- summarize(iris4,
                   MeanSepal.Length = mean(Sepal.Length),
                   MeanSepal.Width = mean(Sepal.Width),
                   number = n())
print(iris5)
## MeanSepal.Length MeanSepal.Width number
## 1 6.698214 3.041071 56
Create iris6, which calculates the average
sepal length, the average sepal width, and the sample size for each
species in the iris4 data frame, and print iris6.
iris6 <- iris4 %>%
  group_by(Species) %>%
  summarize(MeanSepal.Length = mean(Sepal.Length),
            MeanSepal.Width = mean(Sepal.Width),
            number = n())
print(iris6)
## # A tibble: 2 × 4
## Species MeanSepal.Length MeanSepal.Width number
## <fct> <dbl> <dbl> <int>
## 1 versicolor 6.48 2.99 17
## 2 virginica 6.79 3.06 39
The steps above successively created the data frames iris1, iris2, iris3, iris4, iris5, and iris6. At each stage, the
output data frame from one operation serves as the input for the next. A
more concise way to do this is to use the pipe operator %>%, which comes from
the magrittr package and is available once the tidyverse is loaded. See if you can rework all of your previous statements
(except for iris5) into an extended piping operation that
uses iris as the input and generates irisFinal as the output.
irisFinal <- iris %>%
  filter(Species == "virginica" | Species == "versicolor", Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(c(Species, Sepal.Length, Sepal.Width)) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  group_by(Species) %>%
  summarize(MeanSepal.Length = mean(Sepal.Length),
            MeanSepal.Width = mean(Sepal.Width),
            number = n())
irisFinal
## # A tibble: 2 × 4
## Species MeanSepal.Length MeanSepal.Width number
## <fct> <dbl> <dbl> <int>
## 1 versicolor 6.48 2.99 17
## 2 virginica 6.79 3.06 39
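One way to gain confidence in a long pipe is to compare its result against the same operations run one at a time. The following is a rough sketch of that check, assuming only dplyr and the built-in iris data; the names step1 through step4, stepwise, and piped are hypothetical and exist only for this comparison.

```r
library(dplyr)

data("iris")

# Step-by-step version with intermediate data frames
step1 <- filter(iris, Species %in% c("virginica", "versicolor"),
                Sepal.Length > 6, Sepal.Width > 2.5)
step2 <- select(step1, Species, Sepal.Length, Sepal.Width)
step3 <- arrange(step2, desc(Sepal.Length))
step4 <- mutate(step3, Sepal.Area = Sepal.Length * Sepal.Width)
stepwise <- summarize(group_by(step4, Species),
                      MeanSepal.Length = mean(Sepal.Length),
                      MeanSepal.Width = mean(Sepal.Width),
                      number = n())

# Piped version of the same operations
piped <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  group_by(Species) %>%
  summarize(MeanSepal.Length = mean(Sepal.Length),
            MeanSepal.Width = mean(Sepal.Width),
            number = n())

# The two results should agree
all.equal(stepwise, piped)
```

Note that the arrange() and mutate() steps do not change the grouped means or counts; they are kept in the pipe only because the exercise asks for every earlier statement to be included.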
iris_Longer <- iris %>%
  pivot_longer(cols = Sepal.Length:Petal.Width,
               names_to = "Measure",
               values_to = "Value")
head(iris_Longer)
## # A tibble: 6 × 3
## Species Measure Value
## <fct> <chr> <dbl>
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Width 3.5
## 3 setosa Petal.Length 1.4
## 4 setosa Petal.Width 0.2
## 5 setosa Sepal.Length 4.9
## 6 setosa Sepal.Width 3
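The long format stacks the four measurement columns into one, so each of the 150 flowers contributes 4 rows. The sketch below checks that arithmetic and shows one thing the long layout makes easy, a grouped mean per species and measurement. It assumes dplyr and tidyr; iris_long and measure_means are illustrative names, not part of the original exercise.

```r
library(dplyr)
library(tidyr)

data("iris")

iris_long <- iris %>%
  pivot_longer(cols = Sepal.Length:Petal.Width,
               names_to = "Measure",
               values_to = "Value")

# 150 flowers x 4 measurements = 600 rows; Species, Measure, Value = 3 columns
dim(iris_long)
## [1] 600   3

# Mean of each measurement within each species (3 species x 4 measures)
measure_means <- iris_long %>%
  group_by(Species, Measure) %>%
  summarize(MeanValue = mean(Value), .groups = "drop")
measure_means
```

With the data in this shape, a single group_by() covers all four measurements, instead of one summarize() argument per column.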