Data manipulations using the `dplyr` package

library(tidyverse)

1. Examine the structure of the `iris` data set. How many observations and variables are in the data set?

data("iris")
class(iris) # data frame

## [1] "data.frame"

as_tibble(iris) # convert to tibble (tbl_df() is deprecated in current dplyr)

## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

There are a total of 150 observations and 5 variables in the iris dataset.
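As a cross-check on the `str()` output, the dimensions can also be read off directly. This is a minimal sketch using base R's `dim()`, `nrow()`, and `ncol()`, plus `glimpse()`, which dplyr makes available; the object names `iris_dims`, `n_obs`, and `n_vars` are only illustrative.

```r
library(dplyr)

data("iris")

# dim() returns c(rows, columns); nrow() and ncol() return each value on its own
iris_dims <- dim(iris)
n_obs <- nrow(iris)
n_vars <- ncol(iris)

# glimpse() is a tibble-style alternative to str()
glimpse(iris)
```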

2. Create a new data frame `iris1` that contains only the species virginica and versicolor with sepal lengths greater than 6 cm and sepal widths greater than 2.5 cm. How many observations and variables are in the data set?

iris1 <- filter(iris, Species == "virginica" | Species == "versicolor", Sepal.Length > 6, Sepal.Width > 2.5)
str(iris1)

## 'data.frame':    56 obs. of  5 variables:
##  $ Sepal.Length: num  7 6.4 6.9 6.5 6.3 6.6 6.1 6.7 6.1 6.1 ...
##  $ Sepal.Width : num  3.2 3.2 3.1 2.8 3.3 2.9 2.9 3.1 2.8 2.8 ...
##  $ Petal.Length: num  4.7 4.5 4.9 4.6 4.7 4.6 4.7 4.4 4 4.7 ...
##  $ Petal.Width : num  1.4 1.5 1.5 1.5 1.6 1.3 1.4 1.4 1.3 1.2 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...

The new `iris1` data frame has 56 observations and still has 5 variables.
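The two species comparisons joined by `|` could also be written with `%in%`. The sketch below is one possible alternative, not the only way to do it; `iris1_alt` is a hypothetical name used here so that `iris1` is left untouched.

```r
library(dplyr)

# %in% tests membership in a set, replacing the chained `|` comparisons
iris1_alt <- filter(
  iris,
  Species %in% c("virginica", "versicolor"),
  Sepal.Length > 6,
  Sepal.Width > 2.5
)

# Comma-separated conditions inside filter() are combined with AND
dim(iris1_alt)

# filter() keeps unused factor levels, so "setosa" is still a level of Species
nlevels(iris1_alt$Species)
```

This also explains why `str()` still reports a factor with 3 levels even though no setosa rows remain.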

3. Now, create an `iris2` data frame from `iris1` that contains only the columns for Species, Sepal.Length, and Sepal.Width. How many observations and variables are in the data set?

iris2 <- select(iris1, c(Species, Sepal.Length, Sepal.Width))
str(iris2)

## 'data.frame':    56 obs. of  3 variables:
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Sepal.Length: num  7 6.4 6.9 6.5 6.3 6.6 6.1 6.7 6.1 6.1 ...
##  $ Sepal.Width : num  3.2 3.2 3.1 2.8 3.3 2.9 2.9 3.1 2.8 2.8 ...

In the iris2 dataset, there are 56 observations and 3 variables.
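Because both measurement columns share the `Sepal` prefix, a selection helper could pick them without naming each one. This is a sketch under that assumption; `iris2_alt` is a hypothetical name, and `iris1` is rebuilt here so the block runs on its own.

```r
library(dplyr)

iris1 <- filter(iris, Species %in% c("virginica", "versicolor"),
                Sepal.Length > 6, Sepal.Width > 2.5)

# starts_with() selects every column whose name begins with "Sepal"
iris2_alt <- select(iris1, Species, starts_with("Sepal"))

names(iris2_alt)
```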

4. Create an `iris3` data frame from `iris2` that orders the observations from largest to smallest sepal length. Show the first 6 rows of this data set.

iris3 <- arrange(iris2, desc(Sepal.Length))
head(iris3)

##     Species Sepal.Length Sepal.Width
## 1 virginica          7.9         3.8
## 2 virginica          7.7         3.8
## 3 virginica          7.7         2.6
## 4 virginica          7.7         2.8
## 5 virginica          7.7         3.0
## 6 virginica          7.6         3.0
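Rows 2 through 5 above all have a sepal length of 7.7, so their relative order comes from the input rather than from the sort key. If a defined order among ties is wanted, a second key can be passed to `arrange()`. The sketch below assumes sepal width (descending) as the tie-breaker, which is a choice made here and not part of the exercise; `iris3_ties` is a hypothetical name.

```r
library(dplyr)

# Rebuild iris2 so the block runs on its own
iris2 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width)

# Sort by sepal length (descending), then by sepal width (descending) to order ties
iris3_ties <- arrange(iris2, desc(Sepal.Length), desc(Sepal.Width))
head(iris3_ties)
```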

5. Create an `iris4` data frame from `iris3` that creates a column with a sepal area (length * width) value for each observation. How many observations and variables are in the data set?

iris4 <- mutate(iris3, Sepal.Area = Sepal.Length * Sepal.Width)
head(iris4)

##     Species Sepal.Length Sepal.Width Sepal.Area
## 1 virginica          7.9         3.8      30.02
## 2 virginica          7.7         3.8      29.26
## 3 virginica          7.7         2.6      20.02
## 4 virginica          7.7         2.8      21.56
## 5 virginica          7.7         3.0      23.10
## 6 virginica          7.6         3.0      22.80

str(iris4)

## 'data.frame':    56 obs. of  4 variables:
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Sepal.Length: num  7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 ...
##  $ Sepal.Width : num  3.8 3.8 2.6 2.8 3 3 2.8 2.9 3.6 3.2 ...
##  $ Sepal.Area  : num  30 29.3 20 21.6 23.1 ...

The `iris4` data frame still has 56 observations but now has 4 variables, with `Sepal.Area` added as the last column.
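By default `mutate()` appends new columns at the end. If a different position is preferred, the `.after` (or `.before`) argument available in dplyr 1.0.0 and later can place the column elsewhere. The sketch below uses a small stand-in data frame, `iris3_demo`, and the name `iris4_demo`; both are hypothetical and exist only for illustration.

```r
library(dplyr)

# Small stand-in for iris3: the three rows of iris with the largest sepal lengths
iris3_demo <- iris %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)

# .after = Species places the new column right after Species instead of last
iris4_demo <- mutate(iris3_demo,
                     Sepal.Area = Sepal.Length * Sepal.Width,
                     .after = Species)

names(iris4_demo)
```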

6. Create `iris5` that calculates the average sepal length, the average sepal width, and the sample size of the entire `iris4` data frame and print `iris5`.

iris5 <- summarize(iris4, MeanSepal.Length = mean(Sepal.Length), MeanSepal.Width = mean(Sepal.Width), number = n())
print(iris5)

##   MeanSepal.Length MeanSepal.Width number
## 1         6.698214        3.041071     56
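When the same summary function is applied to several columns, `across()` can avoid repeating the call. This is a sketch of that alternative, assuming dplyr 1.0.0 or later; `iris5_alt` is a hypothetical name, and `iris4` is rebuilt so the block runs on its own.

```r
library(dplyr)

iris4 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)

# across() applies mean() to both columns; .names builds "MeanSepal.Length" etc.
iris5_alt <- summarize(
  iris4,
  across(c(Sepal.Length, Sepal.Width), mean, .names = "Mean{.col}"),
  number = n()
)

print(iris5_alt)
```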

7. Finally, create `iris6` that calculates the average sepal length, the average sepal width, and the sample size for each species in the `iris4` data frame and print `iris6`.

iris6 <- iris4 %>%
  group_by(Species) %>%
  summarize(MeanSepal.Length = mean(Sepal.Length), MeanSepal.Width = mean(Sepal.Width), number = n())
print(iris6)

## # A tibble: 2 × 4
##   Species    MeanSepal.Length MeanSepal.Width number
##   <fct>                 <dbl>           <dbl>  <int>
## 1 versicolor             6.48            2.99     17
## 2 virginica              6.79            3.06     39
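Two small points about the grouped result may be worth illustrating. First, `count()` gives the per-species sample sizes on its own. Second, the grouped summary is a tibble, which prints fewer digits than the plain data frame in `iris5`; converting it with `as.data.frame()` shows more of the stored precision. The sketch below rebuilds `iris4` so it runs on its own, and `species_n` is a hypothetical name.

```r
library(dplyr)

iris4 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)

# count() is shorthand for group_by() followed by summarize(n = n())
species_n <- count(iris4, Species, name = "number")
species_n

iris6 <- iris4 %>%
  group_by(Species) %>%
  summarize(MeanSepal.Length = mean(Sepal.Length),
            MeanSepal.Width = mean(Sepal.Width),
            number = n())

# Print as a plain data frame to see more digits than the tibble display shows
as.data.frame(iris6)
```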

8. In these exercises, you have successively modified different versions of the data frame: `iris1`, `iris2`, `iris3`, `iris4`, `iris5`, and `iris6`. At each stage, the output data frame from one operation serves as the input for the next. A more concise way to do this is to chain the steps with the pipe operator `%>%`, which comes from the magrittr package and is available once the tidyverse is loaded. See if you can rework all of your previous statements (except for `iris5`) into an extended piping operation that uses `iris` as the input and generates `irisFinal` as the output.

irisFinal <- iris %>%
  filter(Species == "virginica" | Species == "versicolor", Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(c(Species, Sepal.Length, Sepal.Width)) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  group_by(Species) %>%
  summarize(MeanSepal.Length = mean(Sepal.Length), MeanSepal.Width = mean(Sepal.Width), number = n())
irisFinal

## # A tibble: 2 × 4
##   Species    MeanSepal.Length MeanSepal.Width number
##   <fct>                 <dbl>           <dbl>  <int>
## 1 versicolor             6.48            2.99     17
## 2 virginica              6.79            3.06     39
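The piped result matches `iris6`, as expected. It may also be worth noting that the `select()`, `arrange()`, and `mutate()` steps do not affect these particular summaries, since the means and counts depend only on which rows survive the filter. The sketch below checks that claim; `irisShort` is a hypothetical name, and the comparison uses `all.equal()` because re-ordering rows can change floating-point sums in the last digits.

```r
library(dplyr)

irisFinal <- iris %>%
  filter(Species == "virginica" | Species == "versicolor",
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(c(Species, Sepal.Length, Sepal.Width)) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  group_by(Species) %>%
  summarize(MeanSepal.Length = mean(Sepal.Length),
            MeanSepal.Width = mean(Sepal.Width),
            number = n())

# Same summary without the intermediate select/arrange/mutate steps
irisShort <- iris %>%
  filter(Species == "virginica" | Species == "versicolor",
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  group_by(Species) %>%
  summarize(MeanSepal.Length = mean(Sepal.Length),
            MeanSepal.Width = mean(Sepal.Width),
            number = n())

# Compare with a numeric tolerance rather than exact equality
same_result <- isTRUE(all.equal(as.data.frame(irisFinal), as.data.frame(irisShort)))
same_result
```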

9. Create a ‘longer’ data frame using the original iris data set with three columns named “Species”, “Measure”, “Value”. The column “Species” will retain the species names of the data set. The column “Measure” will include whether the value corresponds to Sepal.Length, Sepal.Width, Petal.Length, or Petal.Width and the column “Value” will include the numerical values of those measurements.

iris_Longer <- iris %>% pivot_longer(cols = Sepal.Length:Petal.Width,
                      names_to = "Measure",
                      values_to = "Value")
head(iris_Longer)

## # A tibble: 6 × 3
##   Species Measure      Value
##   <fct>   <chr>        <dbl>
## 1 setosa  Sepal.Length   5.1
## 2 setosa  Sepal.Width    3.5
## 3 setosa  Petal.Length   1.4
## 4 setosa  Petal.Width    0.2
## 5 setosa  Sepal.Length   4.9
## 6 setosa  Sepal.Width    3
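Each of the 150 flowers contributes one row per measurement, so the long data frame should have 150 * 4 = 600 rows. The sketch below checks that and shows one thing the long layout makes easy: a mean for every species and measure in a single grouped call. The name `measure_means` and the column `MeanValue` are hypothetical, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

iris_Longer <- iris %>%
  pivot_longer(cols = Sepal.Length:Petal.Width,
               names_to = "Measure",
               values_to = "Value")

# 150 observations x 4 measurement columns
dim(iris_Longer)

# One row per Species x Measure combination
measure_means <- iris_Longer %>%
  group_by(Species, Measure) %>%
  summarize(MeanValue = mean(Value), .groups = "drop")

measure_means
```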

Homework_08

Madelynn Edwards

2024-03-20
