BSTA 511/611, OHSU
2023-10-04
Tuesday, October 10, 2023
12 am - 11:59 pm UTC (5pm 10/9 to 4:59 pm 10/10 here)
Are you frustrated that Slack sends a message when you press Enter? You can change that!
dim(), nrow(), ncol(), names(), str(), summary(), head(), tail(), $
mean(), median(), sd(), quantile()
A good analogy for R packages is that they are like apps you can download onto a mobile phone:
knitr, tidyverse, rstatix, janitor, ggridges, devtools, oi_biostat_data
oibiostat package
The oibiostat package requires first installing the devtools package.
devtools::install_github() tells R to use the command install_github() from the devtools package without loading the entire package and all of its commands (which library(devtools) would do).
Put a # in front of the installation commands so that RStudio doesn’t evaluate them when rendering.
Load the oibiostat package with the library() command
Tip: at the top of your Qmd file, create a chunk that loads all of the R packages you want to use in that file.
Use the library() command to load each required package.
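A setup chunk following this advice might look like the sketch below. The package names come from the list above; the GitHub repository path passed to install_github() is an assumption and should be checked against the course instructions.

```r
# One-time installation. The leading # keeps these lines from running
# every time the file is rendered; remove it only when installing.
# install.packages(c("knitr", "tidyverse", "rstatix", "janitor",
#                    "ggridges", "devtools"))
# devtools::install_github("OI-Biostat/oi_biostat_data")  # repository path is an assumption

# Load every package the file uses. These lines do run on each render.
library(knitr)
library(tidyverse)
library(rstatix)
library(janitor)
library(ggridges)
library(oibiostat)
```

Keeping all library() calls in one chunk at the top makes it easy to see which packages a file depends on.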
library() commands to load needed packages must be in the Qmd file.
In the US, individuals with developmental disabilities typically receive services and support from state governments.
Dataset dds.discr
Previous research
Result: an allegation of ethnic discrimination was brought against the California DDS.
Question: Are the data sufficient evidence of ethnic discrimination?
See Section 1.7.1 in the textbook for more details
dds.discr dataset from oibiostat package
The textbook’s datasets are in the R package oibiostat.
Make sure the oibiostat package is installed before running the code below.
Load the oibiostat package and the dataset dds.discr.
The code below needs to be run every time you restart R or render a Qmd file.
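The loading step is presumably a short chunk along these lines. This is a sketch that assumes oibiostat is already installed; the dim() call is an added check, not part of the original slide.

```r
library(oibiostat)   # makes the textbook's datasets available
data("dds.discr")    # loads dds.discr into the current R session

dim(dds.discr)       # number of rows and number of columns
```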
After loading dds.discr using data("dds.discr"), you will see dds.discr in the Data list of the Environment window.
str(): structure
Use str() to get information about variable types in a dataset.
Note that dds.discr is a tibble instead of a data.frame.
tibble [1,000 × 6] (S3: tbl_df/tbl/data.frame)
$ id : int [1:1000] 10210 10409 10486 10538 10568 10690 10711 10778 10820 10823 ...
$ age.cohort : Factor w/ 6 levels "0-5","6-12","13-17",..: 3 5 1 4 3 3 3 3 3 3 ...
$ age : int [1:1000] 17 37 3 19 13 15 13 17 14 13 ...
$ gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 1 1 2 1 2 ...
$ expenditures: int [1:1000] 2113 41924 1454 6400 4412 4566 3915 3873 5021 2887 ...
$ ethnicity : Factor w/ 8 levels "American Indian",..: 8 8 4 4 8 4 8 3 8 4 ...
- attr(*, "spec")=
.. cols(
.. ID = col_integer(),
.. `Age Cohort` = col_character(),
.. Age = col_integer(),
.. Gender = col_character(),
.. Expenditures = col_integer(),
.. Ethnicity = col_character()
.. )
New: glimpse()
Use glimpse() from the tidyverse package (technically it’s from the dplyr package) to get information about variable types.
glimpse() tends to have nicer output for tibbles than str().
Rows: 1,000
Columns: 6
$ id <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ gender <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ ethnicity <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…
summary()
Use summary() to get summary information about variables.
id age.cohort age gender expenditures
Min. :10210 0-5 : 82 Min. : 0.0 Female:503 Min. : 222
1st Qu.:31809 6-12 :175 1st Qu.:12.0 Male :497 1st Qu.: 2899
Median :55384 13-17:212 Median :18.0 Median : 7026
Mean :54663 18-21:199 Mean :22.8 Mean :18066
3rd Qu.:76135 22-50:226 3rd Qu.:26.0 3rd Qu.:37713
Max. :99898 51+ :106 Max. :95.0 Max. :75098
ethnicity
White not Hispanic:401
Hispanic :376
Asian :129
Black : 59
Multi Race : 26
American Indian : 4
(Other) : 5
tbl_summary(): summary table
ggplot
What is being measured on the vertical axes?
No outliers:
With outliers:
Can you determine the following using boxplots?
Response vs. explanatory variables (Section 1.2.3)
Describe the association between the variables
Two variables \(x\) and \(y\) are
positively associated if \(y\) increases as \(x\) increases.
negatively associated if \(y\) decreases as \(x\) increases.
If there is no association between the variables, then we say they are uncorrelated or independent.
\(r = -1\) indicates a perfect negative linear relationship: as one variable increases, the other decreases, and the points fall exactly on a straight line.
\(r = 0\) indicates no linear relationship: the values of the two variables do not go up or down together along a line.
\(r = 1\) indicates a perfect positive linear relationship: as one variable increases, the other increases as well, and the points fall exactly on a straight line.
The closer \(r\) is to ±1, the stronger the linear association.
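The (Pearson) correlation coefficient of variables \(x\) and \(y\) can be computed using the formula \[r = \frac{1}{n-1}\sum_{i=1}^{n}\Big(\frac{x_i - \bar{x}}{s_x}\Big)\Big(\frac{y_i - \bar{y}}{s_y}\Big)\] where \(n\) is the number of paired observations, \(\bar{x}\) and \(\bar{y}\) are the sample means, and \(s_x\) and \(s_y\) are the sample standard deviations of \(x\) and \(y\).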
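One way to check the formula is to compute it by hand in R and compare the result with the built-in cor() function. The sketch below uses two short made-up vectors, x and y, which are hypothetical and not taken from the course data.

```r
# Hypothetical paired observations, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Standardize each variable: subtract the mean, divide by the sample SD
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Sum the products of the standardized values and divide by n - 1
r_formula <- sum(z_x * z_y) / (n - 1)

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

all.equal(r_formula, r_builtin)
```

Because sd() also divides by n - 1, the hand computation and cor() agree up to floating-point rounding.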
Rossman & Chance’s applet
Tracks performance of guess vs. actual, error vs. actual, and error vs. trial
Or, for the Atari-like experience
Describe the association between the variables
count()
count is from the dplyr package
# A tibble: 35 × 3
ethnicity age.cohort n
<fct> <fct> <int>
1 American Indian 13-17 1
2 American Indian 22-50 1
3 American Indian 51+ 2
4 Asian 0-5 8
5 Asian 6-12 18
6 Asian 13-17 20
7 Asian 18-21 41
8 Asian 22-50 29
9 Asian 51+ 13
10 Black 0-5 3
# ℹ 25 more rows
The pipe operator %>% strings together commands to be performed sequentially
# A tibble: 3 × 6
id age.cohort age gender expenditures ethnicity
<int> <fct> <int> <fct> <int> <fct>
1 10210 13-17 17 Female 2113 White not Hispanic
2 10409 22-50 37 Male 41924 White not Hispanic
3 10486 0-5 3 Male 1454 Hispanic
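To see what "performed sequentially" means, it can help to write the same steps once as nested calls and once with the pipe. The sketch below uses the built-in mtcars data rather than dds.discr so that it runs without oibiostat; the filter on cyl is only an illustrative step.

```r
library(dplyr)  # provides %>% along with filter()

# Nested form: read from the inside out
nested <- head(filter(mtcars, cyl == 4), n = 3)

# Piped form: read left to right; each result becomes the
# first argument of the next command
piped <- mtcars %>%
  filter(cyl == 4) %>%
  head(n = 3)

# Both forms make the same function calls
identical(nested, piped)
```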
janitor package’s tabyl function
adorn_ your table!
A relative frequency table shows proportions (or percentages) instead of counts
To the right I removed (deselected) the counts column (n) to create a relative frequency table
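Deselecting the counts column can be done with select(-n) from dplyr. The sketch below uses the built-in mtcars data rather than dds.discr so that it runs without oibiostat; the names freq and rel_freq are made up for illustration.

```r
library(dplyr)    # %>% and select()
library(janitor)  # tabyl()

# One-way frequency table: columns are cyl, n (counts), percent (proportions)
freq <- mtcars %>%
  tabyl(cyl)

# Drop the counts column to keep only the relative frequencies
rel_freq <- freq %>%
  select(-n)

rel_freq
```

Note that the column tabyl() calls percent holds proportions between 0 and 1; adorn_pct_formatting() can convert them to percentages for display.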
dds.discr %>%
tabyl(ethnicity, age.cohort) %>%
adorn_totals(c("row")) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits=0) %>%
adorn_ns()
ethnicity 0-5 6-12 13-17 18-21 22-50 51+
American Indian 0% (0) 0% (0) 25% (1) 0% (0) 25% (1) 50% (2)
Asian 6% (8) 14% (18) 16% (20) 32% (41) 22% (29) 10% (13)
Black 5% (3) 19% (11) 20% (12) 15% (9) 29% (17) 12% (7)
Hispanic 12% (44) 24% (91) 27% (103) 21% (78) 11% (43) 5% (17)
Multi Race 27% (7) 35% (9) 27% (7) 8% (2) 4% (1) 0% (0)
Native Hawaiian 0% (0) 0% (0) 0% (0) 0% (0) 67% (2) 33% (1)
Other 0% (0) 0% (0) 100% (2) 0% (0) 0% (0) 0% (0)
White not Hispanic 5% (20) 11% (46) 17% (67) 17% (69) 33% (133) 16% (66)
Total 8% (82) 18% (175) 21% (212) 20% (199) 23% (226) 11% (106)
dds.discr %>%
group_by(ethnicity) %>%
summarize(
ave = mean(expenditures),
SD = sd(expenditures),
med = median(expenditures))
# A tibble: 8 × 4
ethnicity ave SD med
<fct> <dbl> <dbl> <dbl>
1 American Indian 36438. 25694. 41818.
2 Asian 18392. 19209. 9369
3 Black 20885. 20549. 8687
4 Hispanic 11066. 15630. 3952
5 Multi Race 4457. 7332. 2622
6 Native Hawaiian 42782. 6576. 40727
7 Other 3316. 1836. 3316.
8 White not Hispanic 24698. 20604. 15718
get_summary_stats() from rstatix package
# A tibble: 3 × 13
variable n min max median q1 q3 iqr mad mean sd
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 id 1000 10210 99898 55384. 31809. 76135. 44326 3.27e4 5.47e4 2.56e4
2 age 1000 0 95 18 12 26 14 1.04e1 2.28e1 1.85e1
3 expenditures 1000 222 75098 7026 2899. 37713. 34814 7.76e3 1.81e4 1.95e4
# ℹ 2 more variables: se <dbl>, ci <dbl>
# A tibble: 8 × 11
ethnicity variable n min max median iqr mean sd se ci
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 American… expendi… 4 3726 58392 41818. 34085. 36438. 25694. 12847. 40885.
2 Asian expendi… 129 374 75098 9369 30892 18392. 19209. 1691. 3346.
3 Black expendi… 59 240 60808 8687 37987 20885. 20549. 2675. 5355.
4 Hispanic expendi… 376 222 65581 3952 7961. 11066. 15630. 806. 1585.
5 Multi Ra… expendi… 26 669 38619 2622 2060. 4457. 7332. 1438. 2962.
6 Native H… expendi… 3 37479 50141 40727 6331 42782. 6576. 3797. 16337.
7 Other expendi… 2 2018 4615 3316. 1298. 3316. 1836. 1298. 16499.
8 White no… expendi… 401 340 68890 15718 39157 24698. 20604. 1029. 2023.
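The two outputs above appear consistent with an ungrouped call and a grouped call using type = "common". The sketch below shows both patterns on the built-in mtcars data rather than dds.discr, so the variables mpg, hp, and cyl are stand-ins for the course variables.

```r
library(dplyr)    # %>% and group_by()
library(rstatix)  # get_summary_stats()

# All available statistics for the selected numeric variables
overall <- mtcars %>%
  get_summary_stats(mpg, hp)

# A shorter set of statistics, computed separately within each group
by_group <- mtcars %>%
  group_by(cyl) %>%
  get_summary_stats(mpg, type = "common")

by_group
```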
Use kable() from the knitr package.
variable | n | min | max | median | q1 | q3 | iqr | mad | mean | sd | se | ci |
---|---|---|---|---|---|---|---|---|---|---|---|---|
id | 1000 | 10210 | 99898 | 55384.5 | 31808.75 | 76134.75 | 44326 | 32734.325 | 54662.85 | 25643.673 | 810.924 | 1591.310 |
age | 1000 | 0 | 95 | 18.0 | 12.00 | 26.00 | 14 | 10.378 | 22.80 | 18.462 | 0.584 | 1.146 |
expenditures | 1000 | 222 | 75098 | 7026.0 | 2898.75 | 37712.75 | 34814 | 7760.670 | 18065.79 | 19542.831 | 617.999 | 1212.724 |
knitr (2/2)
Use kable() from the knitr package.
ethnicity | variable | n | min | max | median | iqr | mean | sd | se | ci |
---|---|---|---|---|---|---|---|---|---|---|
American Indian | expenditures | 4 | 3726 | 58392 | 41817.5 | 34085.25 | 36438.250 | 25693.912 | 12846.956 | 40884.748 |
Asian | expenditures | 129 | 374 | 75098 | 9369.0 | 30892.00 | 18392.372 | 19209.225 | 1691.278 | 3346.482 |
Black | expenditures | 59 | 240 | 60808 | 8687.0 | 37987.00 | 20884.593 | 20549.274 | 2675.288 | 5355.170 |
Hispanic | expenditures | 376 | 222 | 65581 | 3952.0 | 7961.25 | 11065.569 | 15629.847 | 806.048 | 1584.940 |
Multi Race | expenditures | 26 | 669 | 38619 | 2622.0 | 2059.75 | 4456.731 | 7332.135 | 1437.950 | 2961.514 |
Native Hawaiian | expenditures | 3 | 37479 | 50141 | 40727.0 | 6331.00 | 42782.333 | 6576.462 | 3796.922 | 16336.838 |
Other | expenditures | 2 | 2018 | 4615 | 3316.5 | 1298.50 | 3316.500 | 1836.356 | 1298.500 | 16499.007 |
White not Hispanic | expenditures | 401 | 340 | 68890 | 15718.0 | 39157.00 | 24697.549 | 20604.376 | 1028.933 | 2022.793 |
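Any data frame of summary statistics can be passed to kable() to get a formatted table when the file is rendered. A minimal sketch with a small made-up data frame (the group labels and values are hypothetical):

```r
library(knitr)

# Hypothetical summary table, made up for illustration
stats <- data.frame(
  group = c("A", "B"),
  n     = c(12L, 15L),
  mean  = c(3.456, 7.891)
)

# digits controls rounding of numeric columns in the printed table
tbl <- kable(stats, digits = 1)
tbl
```

In practice the data frame would usually come from piping a summary such as get_summary_stats() into kable().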