From 2ae7f201c21288a71be09d04b0af15e7585ecd43 Mon Sep 17 00:00:00 2001 From: jhudsl-robot Date: Tue, 10 Oct 2023 17:35:55 +0000 Subject: [PATCH] Render toc-less --- docs/no_toc/02-lesson2.md | 8 ++++---- docs/no_toc/About.md | 2 +- docs/no_toc/about-the-authors.html | 2 +- docs/no_toc/search_index.json | 2 +- docs/no_toc/working-with-data-structures.html | 8 ++++---- 5 files changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/no_toc/02-lesson2.md b/docs/no_toc/02-lesson2.md index 647f73d..42eeea2 100644 --- a/docs/no_toc/02-lesson2.md +++ b/docs/no_toc/02-lesson2.md @@ -368,8 +368,8 @@ Let's convert this into code! ```r metadata_filtered = filter(metadata, OncotreeLineage == "Breast") -brca_metadata = select(metadata_filtered, ModelID, Age, Sex) -head(brca_metadata) +breast_metadata = select(metadata_filtered, ModelID, Age, Sex) +head(breast_metadata) ``` ``` @@ -392,7 +392,7 @@ Let's carefully a look what how the R Console is interpreting the `filter()` fun - The first argument of `filter()` is a dataframe, which we give `metadata`. -- The second argument is strange: the expression we give it looks like a logical indexing vector built from a comparison operator, but the variable `OncotreeLineage` does not exist in our environment! Rather, `OncotreeLineage` is a column from `metadata`, and we are referring to it as a **data variable** in the context of the dataframe `metadata`. So, we make a comparsion operation on the column `OncotreeLineage` from `metadata` and its resulting logical indexing vector is the input to the second argument. +- The second argument is strange: the expression we give it looks like a logical indexing vector built from a comparison operator, but the variable `OncotreeLineage` does not exist in our environment! Rather, `OncotreeLineage` is a column from `metadata`, and we are referring to it as a **data variable** in the context of the dataframe `metadata`. 
So, we perform a comparison operation on the column `OncotreeLineage` from `metadata`, and the resulting logical indexing vector is the input to the second argument.

About the Authors

## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2023-10-06 +## date 2023-10-10 ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date lib source diff --git a/docs/no_toc/search_index.json b/docs/no_toc/search_index.json index c4a12f4..3fa79e7 100644 --- a/docs/no_toc/search_index.json +++ b/docs/no_toc/search_index.json @@ -1 +1 @@ -[["index.html", "Introduction to R, Season 1 Chapter 1 About this Course 1.1 Curriculum 1.2 Target Audience", " Introduction to R, Season 1 October, 2023 Chapter 1 About this Course 1.1 Curriculum The course covers fundamentals of R, a high-level programming language, and use it to wrangle data for analysis and visualization. 1.2 Target Audience The course is intended for researchers who want to learn coding for the first time with a data science application, or have explored programming and want to focus on fundamentals. "],["intro-to-computing.html", "Chapter 2 Intro to Computing 2.1 Goals of the course 2.2 What is a computer program? 2.3 A programming language has following elements: 2.4 What is R and why should I use it? 2.5 R vs. Python as a first language 2.6 Posit Cloud Setup 2.7 Using Quarto for your work 2.8 Grammar Structure 1: Evaluation of Expressions 2.9 Grammar Structure 2: Storing data types in the global environment 2.10 Grammar Structure 3: Evaluation of Functions 2.11 Tips on Exercises / Debugging", " Chapter 2 Intro to Computing 2.1 Goals of the course Fundamental concepts in high-level programming languages (R, Python, Julia, WDL, etc.) that is transferable: How do programs run, and how do we solve problems using functions and data structures? Beginning of data science fundamentals: How do you translate your scientific question to a data wrangling problem and answer it? Data science workflow Find a nice balance between the two throughout the course: we will try to reproduce a figure from a scientific publication using new data. 
2.2 What is a computer program? A sequence of instructions to manipulate data for the computer to execute. A series of translations: English <-> Programming Code for Interpreter <-> Machine Code for Central Processing Unit (CPU) We will focus on English <-> Programming Code for R Interpreter in this class. More importantly: How we organize ideas <-> Instructing a computer to do something. 2.3 A programming language has following elements: Grammar structure (simple building blocks) Means of combination to analyze and create content (examples around genomics provided, and your scientific creativity is strongly encouraged!) Means of abstraction for modular and reusable content (data structures, functions) Culture (emphasis on open-source, collaborative, reproducible code) Requires a lot of practice to be fluent! 2.4 What is R and why should I use it? It is a: Dynamic programming interpreter Highly used for data science, visualization, statistics, bioinformatics Open-source and free; easy to create and distribute your content; quirky culture 2.5 R vs. Python as a first language In terms of our goals, recall: Fundamental concepts in high-level programming languages Beginning of data science fundamentals There are a lot of nuances and debates, but I argue that Python is a better learning environment for the former and R is better for the latter. Ultimately, either should be okay! Perhaps more importantly, consider what your research group and collaborator are more comfortable with. 2.6 Posit Cloud Setup Posit Cloud/RStudio is an Integrated Development Environment (IDE). Think about it as Microsoft Word to a plain text editor. It provides extra bells and whistles to using R that is easier for the user. Today, we will pay close attention to: Script editor: where sequence of instructions are typed and saved as a text document as a R program. To run the program, the console will execute every single line of code in the document. 
Console (interpreter): Instead of giving a entire program in a text file, you could interact with the R Console line by line. You give it one line of instruction, and the console executes that single line. It is what R looks like without RStudio. Environment: Often, code will store information in memory, and it is shown in the environment. More on this later. 2.7 Using Quarto for your work Why should we use Quarto for data science work? Encourages reproducible workflows Code, output from code, and prose combined together Extendability to Python, Julia, and more. More options and guides can be found in Introduction to Quarto . 2.8 Grammar Structure 1: Evaluation of Expressions Expressions are be built out of operations or functions. Operations and functions combine data types to return another data type. We can combine multiple expressions together to form more complex expressions: an expression can have other expressions nested inside it. For instance, consider the following expressions entered to the R Console: 18 + 21 ## [1] 39 max(18, 21) ## [1] 21 max(18 + 21, 65) ## [1] 65 18 + (21 + 65) ## [1] 104 nchar("ATCG") ## [1] 4 Here, our input data types to the operation are numeric in lines 1-4 and our input data type to the function is character in line 5. Operations are just functions in hiding. We could have written: sum(18, 21) ## [1] 39 sum(18, sum(21, 65)) ## [1] 104 Remember the function machine from algebra class? We will use this schema to think about expressions. Function machine from algebra class. If an expression is made out of multiple, nested operations, what is the proper way of the R Console interpreting it? Being able to read nested operations and nested functions as a programmer is very important. 
3 * 4 + 2 ## [1] 14 3 * (4 + 2) ## [1] 18 Lastly, a note on the use of functions: a programmer should not need to know how the function is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language. 2.8.1 Data types Here are some data types that we will be using in this course: Numeric: 18, 21, 65, 1.25 Character: “ATCG”, “Whatever”, “948-293-0000” Logical: TRUE, FALSE 2.9 Grammar Structure 2: Storing data types in the global environment To build up a computer program, we need to store our returned data type from our expression somewhere for downstream use. We can assign a variable to it as follows: x = 18 + 21 If you enter this in the Console, you will see that in the Environment, the variable x has a value of 39. 2.9.1 Execution rule for variable assignment Evaluate the expression to the right of =. Bind variable to the left of = to the resulting value. The variable is stored in the environment. <- is okay too! The environment is where all the variables are stored, and can be used for an expression anytime once it is defined. Only one unique variable name can be defined. The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, 32 Gigabytes of RAM. When we work with large datasets, if you assign a variable to a data type larger than the available RAM, it will not work. More on this later. Look, now x can be reused downstream: x - 2 ## [1] 37 y = x * 2 2.10 Grammar Structure 3: Evaluation of Functions A function has a function name, arguments, and returns a data type. 2.10.1 Execution rule for functions: Evaluate the function by its arguments, and if the arguments are functions or contains operations, evaluate those functions or operations first. The output of functions is called the returned value. 
sqrt(nchar("hello")) ## [1] 2.236068 (nchar("hello") + 4) * 2 ## [1] 18 2.11 Tips on Exercises / Debugging Common errors: Syntax error. It did something else than I expected! The function or operation does not accept the input data type. Changing a variable without realizing you did so. Solutions: Where is the problem? What kind of problem is it? Explain your problem to someone! "],["working-with-data-structures.html", "Chapter 3 Working with data structures 3.1 Vectors 3.2 Dataframes", " Chapter 3 Working with data structures 3.1 Vectors In the first exercise, you started to explore data structures, which store information about data types. You played around with vectors, which is a ordered collection of a data type. Each element of a vector contains a data type, and there is no limit on how big a vector can be, as long the memory use of it is within the computer’s memory (RAM). We can now store a vast amount of information in a vector, and assign it to a single variable. We can now use operations and functions on a vector, modifying many elements within the vector at once! This fits with the theme of abstraction and modular organization described in the first lesson! We often create vectors using the combine function, c() : staff = c("chris", "shasta", "jeff") chrNum = c(2, 3, 1) If we try to create a vector with mixed data types, R will try to make them be the same data type, or give an error: staff = c("chris", "shasta", 123) staff ## [1] "chris" "shasta" "123" Our numeric got converted to character so that the entire vector is all characters. 3.1.1 Using operations on vectors Recall from the first class: Expressions are be built out of operations or functions. Operations and functions combine data types to return another data type. Now that we are working with data structures, the same principle applies: Operations and functions combine data structures to return another data structure (or data type!). 
What happens if we use some familiar operations we used for numerics on a numerical vector? If we multiply a numerical vector by a numeric, what do we get? chrNum = chrNum * 3 chrNum ## [1] 6 9 3 All of chrNum’s elements tripled! Our multiplication operation, when used on a numeric vector with a numeric, has a new meaning: it multiplied all the elements by 3. Multiplication is an operation that can be used for multiple data types or data structures: we call this property operator overloading. Here’s another example: numeric vector multiplied by another numeric vector: chrNum * c(2, 2, 0) ## [1] 12 18 0 but there are also limits: a numeric vector added to a character vector creates an error: #chrNum + staff When we work with operations and functions, we must be mindful what inputs the operation or function takes in, and what outputs it gives, no matter how “intuitive” the operation or function name is. 3.1.2 Subsetting vectors explicitly In the exercise this past week, you looked at a new operation to subset elements of a vector using brackets. Inside the bracket is either a single numeric value or an a numerical indexing vector containing numerical values. They dictate which elements of the vector to return. staff[2] ## [1] "shasta" staff[c(1, 2)] ## [1] "chris" "shasta" small_staff = staff[c(1, 2)] In the last line, we created a new vector small_staff that is a subset of the staff given the indexing vector c(1, 2). We have three vectors referenced in one line of code. This is tricky and we need to always refer to our rules step-by-step: evaluate the expression right of the =, which contains a vector bracket. Follow the rule of the vector bracket. Then store the returning value to the variable left of =. Alternatively, instead of using numerical indexing vectors, we can use a logical indexing vector. The logical indexing vector must be the same length as the vector to be subsetted, with TRUE indicating an element to keep, and FALSE indicating an element to drop. 
The following block of code gives the same value as before: staff[c(TRUE, FALSE, FALSE)] ## [1] "chris" staff[c(TRUE, TRUE, FALSE)] ## [1] "chris" "shasta" small_staff = staff[c(TRUE, TRUE, FALSE)] 3.1.3 Subsetting vectors implicitly Here are two applications of subsetting on vectors that need distinction to write the correct code: Explicit subsetting: Suppose someone approaches you a 100-length vector of people’s ages, and say that they want to subset to the first 10 elements. Implicit subsetting: Suppose someone approaches you a 100-length vector of people’s ages, and say that they want to subset to elements < 18 age. We already know how to explicitly subset: set.seed(123) #don't worry about this function age = round(runif(100, 1, 100)) #don't worry about these functions first_ten_age = age[1:10] For implicit subsetting, we don’t know which elements to select off the top of our head! If we know which elements have less than 18, then we can give the elements for an explicit subset. Therefore, we need to create a logical indexing vector using a comparison operator: indexing_vector = age < 18 indexing_vector ## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE ## [13] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE ## [25] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE ## [37] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE ## [49] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE ## [61] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [73] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE ## [85] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE ## [97] FALSE TRUE FALSE FALSE The comparison operator < compared the numeric value of age to see which elements of age is less than 18, and then returned a logical vector that has TRUE if age is less than 18 at that element and FALSE otherwise. 
Then, age_young = age[indexing_vector] age_young ## [1] 6 11 5 16 3 15 16 15 6 13 14 10 1 12 11 14 10 We could have done this all in one line without storing the indexing vector as a variable in the environment: age_young = age[age < 18] We have the following comparison operators in R: < less than <= less or equal than == equal to != not equal to > greater than >= greater than or equal to You can also put these comparison operators together to form more complex statements, which you will explore in this week’s exercise. Another example: age_90 = age[age == 90] age_90 ## [1] 90 90 90 age_not_90 = age[age != 90] age_not_90 ## [1] 29 79 41 88 94 6 53 89 56 46 96 46 68 58 11 25 5 33 95 89 70 64 99 66 71 ## [26] 55 60 30 16 96 69 80 3 48 76 22 32 24 15 42 42 38 16 15 24 47 27 86 6 45 ## [51] 80 13 57 21 14 76 38 67 10 39 28 82 45 81 81 80 45 76 63 71 1 48 23 39 62 ## [76] 36 12 25 67 42 79 11 44 99 89 89 18 14 66 35 66 33 20 78 10 47 52 For most of our subsetting tasks on vectors (and dataframes below), we will be encouraging implicit subsetting. The power of implicit subsetting is that you don’t need to know what your vector contains to do something with it! This technique is related to abstraction in programming mentioned in the first lesson: by using expressions to find the specific value you are interested instead of hard-coding the value explicitly, it generalizes your code to handle a wider variety of situations. 3.2 Dataframes Before we dive into dataframes, check that the tidyverse package is properly installed by loading it in your R Console: library(tidyverse) ## Warning: package 'tidyverse' was built under R version 4.0.3 ## Warning: package 'purrr' was built under R version 4.0.5 ## Warning: package 'stringr' was built under R version 4.0.3 Here is the data structure you have been waiting for: the dataframe. A dataframe is a spreadsheet such that each column must have the same data type. 
Think of a bunch of vectors organized as columns, and you get a dataframe. For the most part, we load in dataframes from a file path (although they are sometimes created by combining several vectors of the same length, but we won’t be covering that here): load(url("https://github.com/fhdsl/S1_Intro_to_R/raw/main/classroom_data/CCLE.RData")) 3.2.1 Using functions and operations on dataframes We can run some useful functions on dataframes to get some useful properties, similar to how we used length() for vectors: nrow(metadata) ## [1] 1864 ncol(metadata) ## [1] 30 dim(metadata) ## [1] 1864 30 colnames(metadata) ## [1] "ModelID" "PatientID" "CellLineName" ## [4] "StrippedCellLineName" "Age" "SourceType" ## [7] "SangerModelID" "RRID" "DepmapModelType" ## [10] "AgeCategory" "GrowthPattern" "LegacyMolecularSubtype" ## [13] "PrimaryOrMetastasis" "SampleCollectionSite" "Sex" ## [16] "SourceDetail" "LegacySubSubtype" "CatalogNumber" ## [19] "CCLEName" "COSMICID" "PublicComments" ## [22] "WTSIMasterCellID" "EngineeredModel" "TreatmentStatus" ## [25] "OnboardedMedia" "PlateCoating" "OncotreeCode" ## [28] "OncotreeSubtype" "OncotreePrimaryDisease" "OncotreeLineage" The last function, colnames() returns a character vector of the column names of the dataframe. This is an important property of dataframes that we will make use of to subset on it. We introduce an operation for dataframes: the dataframe$column_name operation selects for a column by its column name and returns the column as a vector. 
For instance: metadata$OncotreeLineage[1:5] ## [1] "Ovary/Fallopian Tube" "Myeloid" "Bowel" ## [4] "Myeloid" "Myeloid" metadata$Age[1:5] ## [1] 60 36 72 30 30 We treat the resulting value as a vector, so we can perform implicit subsetting: metadata$OncotreeLineage[metadata$OncotreeLineage == "Myeloid"] ## [1] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [8] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [15] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [22] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [29] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [36] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [43] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [50] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [57] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [64] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [71] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" Lastly, try running View(metadata) in RStudio Console…whew, a nice way to examine your dataframe like a spreadsheet program! 3.2.2 “What do you want to do with this dataframe”? Before diving into the technical part of subsetting dataframes, we will use different mindset to think about what we want to do with this dataframe as scientists. Remember that a major theme of the course is about: How we organize ideas <-> Instructing a computer to do something. Until now, we haven’t focused too much on how we organize our scientific ideas to interact with what we can do with code. Let’s write our code driven by our scientific curiosity. Here’s a starting prompt: In the dataframe you have here, which rows would you filter for and columns would you select that relate to a scientific question? Use the implicit subsetting mindset here: ie. 
“I want to filter for rows (cell lines) that are breast cancer and look at the Age and Sex.” and not “I want to filter for rows 20-50 and select columns 2 and 8”. Notice that when we filter for rows in an implicitly way, we often formulate criteria about the columns. (This is because we are guaranteed to have column names in dataframes. Some dataframes have row names, but because the data types are not guranteed to have the same data type, it makes describing by row properties difficult.) Let’s convert this into code! metadata_filtered = filter(metadata, OncotreeLineage == "Breast") brca_metadata = select(metadata_filtered, ModelID, Age, Sex) head(brca_metadata) ## ModelID Age Sex ## 1 ACH-000017 43 Female ## 2 ACH-000019 69 Female ## 3 ACH-000028 69 Female ## 4 ACH-000044 47 Female ## 5 ACH-000097 63 Female ## 6 ACH-000111 41 Female Here, filter() and select() are functions from the tidyverse package. 3.2.3 Filter rows Let’s carefully a look what how the R Console is interpreting the filter() function: We evaluate the expression right of =. The first argument of filter() is a dataframe, which we give metadata. The second argument is strange: the expression we give it looks like a logical indexing vector built from a comparison operator, but the variable OncotreeLineage does not exist in our environment! Rather, OncotreeLineage is a column from metadata, and we are referring to it as a data variable in the context of the dataframe metadata. So, we make a comparsion operation on the column OncotreeLineage from metadata and its resulting logical indexing vector is the input to the second argument. How do we know when a variable being used is a variable from the environment, or a data variable from a dataframe? It’s not clear cut, but here’s a rule of thumb: most functions from the tidyverse package allows you to use data variables to refer to columns of a dataframe. We refer to documentation when we are not sure. 
This encourages more readable code at the expense of consistency of referring to variables in the environment. The authors of this package describes this trade-off. Putting it together, filter() takes in a dataframe, and an logical indexing vector described by data variables as arguments, and returns a data frame with rows that match condition described by the logical indexing vector. Store this in metadata_filtered variable. 3.2.4 Select columns Let’s carefully a look what how the R Console is interpreting the select() function: We evaluate the expression right of =. The first argument of filter() is a dataframe, which we give metadata. The second and third arguments are data variables referring the columns of metadata. For certain functions like filter(), there is no limit on the number of arguments you provide. You can keep adding data variables to select for more column names. Putting it together, select() takes in a dataframe, and as many data variables you like to select columns, and returns a dataframe with the columns you described by data variables. Store this in brca_metadata variable. "],["about-the-authors.html", "About the Authors", " About the Authors These credits are based on our course contributors table guidelines.     
Credits Names Pedagogy Lead Content Instructor(s) Chris Lo Lecturer Chris Lo Content Author(s) (include chapter name/link in parentheses if only for specific chapters) - make new line if more than one chapter involved If any other authors besides lead instructor Content Contributor(s) (include section name/link in parentheses) - make new line if more than one section involved Wrote less than a chapter Content Editor(s)/Reviewer(s) Checked your content Content Director(s) Helped guide the content direction Content Consultants (include chapter name/link in parentheses or word “General”) - make new line if more than one chapter involved Gave high level advice on content Acknowledgments Gave small assistance to content but not to the level of consulting Production Content Publisher(s) Helped with publishing platform Content Publishing Reviewer(s) Reviewed overall content and aesthetics on publishing platform Technical Course Publishing Engineer(s) Helped with the code for the technical aspects related to the specific course generation Template Publishing Engineers Candace Savonen, Carrie Wright, Ava Hoffman Publishing Maintenance Engineer Candace Savonen Technical Publishing Stylists Carrie Wright, Ava Hoffman, Candace Savonen Package Developers (ottrpal) Candace Savonen, John Muschelli, Carrie Wright Art and Design Illustrator(s) Created graphics for the course Figure Artist(s) Created figures/plots for course Videographer(s) Filmed videos Videography Editor(s) Edited film Audiographer(s) Recorded audio Audiography Editor(s) Edited audio recordings Funding Funder(s) Institution/individual who funded course including grant number Funding Staff Staff members who help with funding   ## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R version 4.0.2 (2020-06-22) ## os Ubuntu 20.04.5 LTS ## system x86_64, linux-gnu ## ui X11 ## language (EN) ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC ## date 
2023-10-06 ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date lib source ## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.5) ## bookdown 0.24 2023-03-28 [1] Github (rstudio/bookdown@88bc4ea) ## bslib 0.4.2 2022-12-16 [1] CRAN (R 4.0.2) ## cachem 1.0.7 2023-02-24 [1] CRAN (R 4.0.2) ## callr 3.5.0 2020-10-08 [1] RSPM (R 4.0.2) ## cli 3.6.1 2023-03-23 [1] CRAN (R 4.0.2) ## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0) ## desc 1.2.0 2018-05-01 [1] RSPM (R 4.0.3) ## devtools 2.3.2 2020-09-18 [1] RSPM (R 4.0.3) ## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0) ## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.3) ## evaluate 0.20 2023-01-17 [1] CRAN (R 4.0.2) ## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0) ## fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.0.2) ## fs 1.5.0 2020-07-31 [1] RSPM (R 4.0.3) ## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.5) ## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0) ## htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.0.2) ## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.0.2) ## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2) ## knitr 1.33 2023-03-28 [1] Github (yihui/knitr@a1052d1) ## lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.0.2) ## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.0.2) ## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.0.2) ## ottrpal 1.0.1 2023-03-28 [1] Github (jhudsl/ottrpal@151e412) ## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.0.2) ## pkgbuild 1.1.0 2020-07-13 [1] RSPM (R 4.0.2) ## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3) ## pkgload 1.1.0 2020-05-29 [1] RSPM (R 4.0.3) ## prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.0.3) ## processx 3.4.4 2020-09-03 [1] RSPM (R 4.0.2) ## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2) ## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0) ## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2) ## remotes 2.2.0 2020-07-21 [1] RSPM (R 4.0.3) ## rlang 1.1.0 2023-03-14 [1] CRAN (R 4.0.2) ## rmarkdown 2.10 2023-03-28 [1] Github (rstudio/rmarkdown@02d3c25) ## rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.0.2) ## sass 0.4.5 2023-01-24 [1] 
CRAN (R 4.0.2) ## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3) ## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.3) ## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3) ## testthat 3.0.1 2023-03-28 [1] Github (R-lib/testthat@e99155a) ## tibble 3.2.1 2023-03-20 [1] CRAN (R 4.0.2) ## usethis 1.6.3 2020-09-17 [1] RSPM (R 4.0.2) ## utf8 1.1.4 2018-05-24 [1] RSPM (R 4.0.3) ## vctrs 0.6.1 2023-03-22 [1] CRAN (R 4.0.2) ## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2) ## xfun 0.26 2023-03-28 [1] Github (yihui/xfun@74c2a66) ## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3) ## ## [1] /usr/local/lib/R/site-library ## [2] /usr/local/lib/R/library "],["references.html", "Chapter 4 References", " Chapter 4 References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]] +[["index.html", "Introduction to R, Season 1 Chapter 1 About this Course 1.1 Curriculum 1.2 Target Audience", " Introduction to R, Season 1 October, 2023 Chapter 1 About this Course 1.1 Curriculum The course covers fundamentals of R, a high-level programming language, and use it to wrangle data for analysis and visualization. 1.2 Target Audience The course is intended for researchers who want to learn coding for the first time with a data science application, or have explored programming and want to focus on fundamentals. "],["intro-to-computing.html", "Chapter 2 Intro to Computing 2.1 Goals of the course 2.2 What is a computer program? 2.3 A programming language has following elements: 2.4 What is R and why should I use it? 2.5 R vs. 
Python as a first language 2.6 Posit Cloud Setup 2.7 Using Quarto for your work 2.8 Grammar Structure 1: Evaluation of Expressions 2.9 Grammar Structure 2: Storing data types in the global environment 2.10 Grammar Structure 3: Evaluation of Functions 2.11 Tips on Exercises / Debugging", " Chapter 2 Intro to Computing 2.1 Goals of the course Fundamental concepts in high-level programming languages (R, Python, Julia, WDL, etc.) that are transferable: How do programs run, and how do we solve problems using functions and data structures? Beginning of data science fundamentals: How do you translate your scientific question to a data wrangling problem and answer it? Data science workflow Find a nice balance between the two throughout the course: we will try to reproduce a figure from a scientific publication using new data. 2.2 What is a computer program? A sequence of instructions to manipulate data for the computer to execute. A series of translations: English <-> Programming Code for Interpreter <-> Machine Code for Central Processing Unit (CPU) We will focus on English <-> Programming Code for R Interpreter in this class. More importantly: How we organize ideas <-> Instructing a computer to do something. 2.3 A programming language has the following elements: Grammar structure (simple building blocks) Means of combination to analyze and create content (examples around genomics provided, and your scientific creativity is strongly encouraged!) Means of abstraction for modular and reusable content (data structures, functions) Culture (emphasis on open-source, collaborative, reproducible code) Requires a lot of practice to be fluent! 2.4 What is R and why should I use it? It is a: Dynamic programming interpreter Highly used for data science, visualization, statistics, bioinformatics Open-source and free; easy to create and distribute your content; quirky culture 2.5 R vs. 
Python as a first language In terms of our goals, recall: Fundamental concepts in high-level programming languages Beginning of data science fundamentals There are a lot of nuances and debates, but I argue that Python is a better learning environment for the former and R is better for the latter. Ultimately, either should be okay! Perhaps more importantly, consider what your research group and collaborators are more comfortable with. 2.6 Posit Cloud Setup Posit Cloud/RStudio is an Integrated Development Environment (IDE). Think about it as Microsoft Word to a plain text editor. It provides extra bells and whistles that make using R easier for the user. Today, we will pay close attention to: Script editor: where a sequence of instructions is typed and saved as a text document as an R program. To run the program, the console will execute every single line of code in the document. Console (interpreter): Instead of giving an entire program in a text file, you could interact with the R Console line by line. You give it one line of instruction, and the console executes that single line. It is what R looks like without RStudio. Environment: Often, code will store information in memory, and it is shown in the environment. More on this later. 2.7 Using Quarto for your work Why should we use Quarto for data science work? Encourages reproducible workflows Code, output from code, and prose combined together Extendability to Python, Julia, and more. More options and guides can be found in Introduction to Quarto . 2.8 Grammar Structure 1: Evaluation of Expressions Expressions are built out of operations or functions. Operations and functions combine data types to return another data type. We can combine multiple expressions together to form more complex expressions: an expression can have other expressions nested inside it. 
For instance, consider the following expressions entered into the R Console: 18 + 21 ## [1] 39 max(18, 21) ## [1] 21 max(18 + 21, 65) ## [1] 65 18 + (21 + 65) ## [1] 104 nchar("ATCG") ## [1] 4 Here, our input data types to the operation are numeric in lines 1-4 and our input data type to the function is character in line 5. Operations are just functions in hiding. We could have written: sum(18, 21) ## [1] 39 sum(18, sum(21, 65)) ## [1] 104 Remember the function machine from algebra class? We will use this schema to think about expressions. Function machine from algebra class. If an expression is made out of multiple, nested operations, how should the R Console interpret it? Being able to read nested operations and nested functions as a programmer is very important. 3 * 4 + 2 ## [1] 14 3 * (4 + 2) ## [1] 18 Lastly, a note on the use of functions: a programmer should not need to know how the function is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language. 2.8.1 Data types Here are some data types that we will be using in this course: Numeric: 18, 21, 65, 1.25 Character: “ATCG”, “Whatever”, “948-293-0000” Logical: TRUE, FALSE 2.9 Grammar Structure 2: Storing data types in the global environment To build up a computer program, we need to store our returned data type from our expression somewhere for downstream use. We can assign a variable to it as follows: x = 18 + 21 If you enter this in the Console, you will see that in the Environment, the variable x has a value of 39. 2.9.1 Execution rule for variable assignment Evaluate the expression to the right of =. Bind the variable to the left of = to the resulting value. The variable is stored in the environment. <- is okay too! The environment is where all the variables are stored, and any variable can be used in an expression once it is defined. Each variable name in the environment is unique; assigning to an existing name overwrites its value. 
The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically, a personal computer has 8, 16, or 32 gigabytes of RAM. When we work with large datasets, if you assign a variable to a data type larger than the available RAM, it will not work. More on this later. Look, now x can be reused downstream: x - 2 ## [1] 37 y = x * 2 2.10 Grammar Structure 3: Evaluation of Functions A function has a function name, arguments, and returns a data type. 2.10.1 Execution rule for functions: Evaluate the function by its arguments, and if the arguments are functions or contain operations, evaluate those functions or operations first. The output of functions is called the returned value. sqrt(nchar("hello")) ## [1] 2.236068 (nchar("hello") + 4) * 2 ## [1] 18 2.11 Tips on Exercises / Debugging Common errors: Syntax error. It did something other than what I expected! The function or operation does not accept the input data type. Changing a variable without realizing you did so. Solutions: Where is the problem? What kind of problem is it? Explain your problem to someone! "],["working-with-data-structures.html", "Chapter 3 Working with data structures 3.1 Vectors 3.2 Dataframes", " Chapter 3 Working with data structures 3.1 Vectors In the first exercise, you started to explore data structures, which store information about data types. You played around with vectors, which are ordered collections of a data type. Each element of a vector contains a data type, and there is no limit on how big a vector can be, as long as its memory use is within the computer’s memory (RAM). We can now store a vast amount of information in a vector, and assign it to a single variable. We can now use operations and functions on a vector, modifying many elements within the vector at once! This fits with the theme of abstraction and modular organization described in the first lesson! 
We often create vectors using the combine function, c() : staff = c("chris", "shasta", "jeff") chrNum = c(2, 3, 1) If we try to create a vector with mixed data types, R will try to make them the same data type, or give an error: staff = c("chris", "shasta", 123) staff ## [1] "chris" "shasta" "123" Our numeric got converted to character so that the entire vector is all characters. 3.1.1 Using operations on vectors Recall from the first class: Expressions are built out of operations or functions. Operations and functions combine data types to return another data type. Now that we are working with data structures, the same principle applies: Operations and functions combine data structures to return another data structure (or data type!). What happens if we use some familiar operations we used for numerics on a numerical vector? If we multiply a numerical vector by a numeric, what do we get? chrNum = chrNum * 3 chrNum ## [1] 6 9 3 All of chrNum’s elements tripled! Our multiplication operation, when used on a numeric vector with a numeric, has a new meaning: it multiplied all the elements by 3. Multiplication is an operation that can be used for multiple data types or data structures: we call this property operator overloading. Here’s another example: numeric vector multiplied by another numeric vector: chrNum * c(2, 2, 0) ## [1] 12 18 0 but there are also limits: a numeric vector added to a character vector creates an error: #chrNum + staff When we work with operations and functions, we must be mindful of what inputs the operation or function takes in, and what outputs it gives, no matter how “intuitive” the operation or function name is. 3.1.2 Subsetting vectors explicitly In the exercise this past week, you looked at a new operation to subset elements of a vector using brackets. Inside the bracket is either a single numeric value or a numerical indexing vector containing numerical values. They dictate which elements of the vector to return. 
staff[2] ## [1] "shasta" staff[c(1, 2)] ## [1] "chris" "shasta" small_staff = staff[c(1, 2)] In the last line, we created a new vector small_staff that is a subset of staff given the indexing vector c(1, 2). We have three vectors referenced in one line of code. This is tricky, and we always need to refer to our rules step-by-step: evaluate the expression right of the =, which contains a vector bracket. Follow the rule of the vector bracket. Then store the returned value in the variable left of =. Alternatively, instead of using numerical indexing vectors, we can use a logical indexing vector. The logical indexing vector must be the same length as the vector to be subsetted, with TRUE indicating an element to keep, and FALSE indicating an element to drop. The following block of code gives the same value as before: staff[c(TRUE, FALSE, FALSE)] ## [1] "chris" staff[c(TRUE, TRUE, FALSE)] ## [1] "chris" "shasta" small_staff = staff[c(TRUE, TRUE, FALSE)] 3.1.3 Subsetting vectors implicitly Here are two applications of subsetting on vectors that we need to distinguish in order to write the correct code: Explicit subsetting: Suppose someone approaches you with a 100-length vector of people’s ages, and says that they want to subset it to the first 10 elements. Implicit subsetting: Suppose someone approaches you with a 100-length vector of people’s ages, and says that they want to subset it to the elements with age < 18. We already know how to explicitly subset: set.seed(123) #don't worry about this function age = round(runif(100, 1, 100)) #don't worry about these functions first_ten_age = age[1:10] For implicit subsetting, we don’t know which elements to select off the top of our head! If we knew which elements have an age less than 18, then we could give those elements for an explicit subset. 
Therefore, we need to create a logical indexing vector using a comparison operator: indexing_vector = age < 18 indexing_vector ## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE ## [13] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE ## [25] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE ## [37] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE ## [49] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE ## [61] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [73] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE ## [85] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE ## [97] FALSE TRUE FALSE FALSE The comparison operator < compared the numeric value of age to see which elements of age are less than 18, and then returned a logical vector that has TRUE if age is less than 18 at that element and FALSE otherwise. Then, age_young = age[indexing_vector] age_young ## [1] 6 11 5 16 3 15 16 15 6 13 14 10 1 12 11 14 10 We could have done this all in one line without storing the indexing vector as a variable in the environment: age_young = age[age < 18] We have the following comparison operators in R: < less than <= less than or equal to == equal to != not equal to > greater than >= greater than or equal to You can also put these comparison operators together to form more complex statements, which you will explore in this week’s exercise. 
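As a preview of combining comparison operators, the logical operators `&` (and) and `|` (or) merge two logical indexing vectors element-by-element. This is a sketch, not part of the lesson: the short `example_age` vector below is made up for illustration (the chapter's `age` vector has 100 elements).

```r
# A made-up vector, just to show combined conditions on something small
example_age = c(6, 29, 79, 41, 17, 90)

# TRUE where BOTH conditions hold: at least 18 AND less than 65
example_age[example_age >= 18 & example_age < 65]
## [1] 29 41

# TRUE where EITHER condition holds: under 18 OR exactly 90
example_age[example_age < 18 | example_age == 90]
## [1]  6 17 90
```

Each side of `&` or `|` is itself a logical indexing vector, so the combined expression is still a logical indexing vector of the same length, and the bracket rule applies unchanged.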
Another example: age_90 = age[age == 90] age_90 ## [1] 90 90 90 age_not_90 = age[age != 90] age_not_90 ## [1] 29 79 41 88 94 6 53 89 56 46 96 46 68 58 11 25 5 33 95 89 70 64 99 66 71 ## [26] 55 60 30 16 96 69 80 3 48 76 22 32 24 15 42 42 38 16 15 24 47 27 86 6 45 ## [51] 80 13 57 21 14 76 38 67 10 39 28 82 45 81 81 80 45 76 63 71 1 48 23 39 62 ## [76] 36 12 25 67 42 79 11 44 99 89 89 18 14 66 35 66 33 20 78 10 47 52 For most of our subsetting tasks on vectors (and dataframes below), we will be encouraging implicit subsetting. The power of implicit subsetting is that you don’t need to know what your vector contains to do something with it! This technique is related to abstraction in programming mentioned in the first lesson: by using expressions to find the specific value you are interested in instead of hard-coding the value explicitly, it generalizes your code to handle a wider variety of situations. 3.2 Dataframes Before we dive into dataframes, check that the tidyverse package is properly installed by loading it in your R Console: library(tidyverse) ## Warning: package 'tidyverse' was built under R version 4.0.3 ## Warning: package 'purrr' was built under R version 4.0.5 ## Warning: package 'stringr' was built under R version 4.0.3 Here is the data structure you have been waiting for: the dataframe. A dataframe is a spreadsheet such that each column must have the same data type. Think of a bunch of vectors organized as columns, and you get a dataframe. 
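The "vectors organized as columns" idea can be sketched by building a tiny dataframe by hand with `data.frame()`. This is only an illustration (in the course, dataframes are loaded from files, and the `example_df` name is ours): it reuses the `staff` and `chrNum` vectors from earlier in the chapter.

```r
# Two equal-length vectors, as created earlier in the chapter
staff = c("chris", "shasta", "jeff")
chrNum = c(2, 3, 1)

# Organize them as columns, and you get a dataframe
example_df = data.frame(staff, chrNum)
example_df
##    staff chrNum
## 1  chris      2
## 2 shasta      3
## 3   jeff      1
```

Each column keeps its own data type (character for `staff`, numeric for `chrNum`), which is exactly the constraint that distinguishes a dataframe from a free-form spreadsheet.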
For the most part, we load in dataframes from a file path (although they are sometimes created by combining several vectors of the same length, but we won’t be covering that here): load(url("https://github.com/fhdsl/S1_Intro_to_R/raw/main/classroom_data/CCLE.RData")) 3.2.1 Using functions and operations on dataframes We can run some functions on dataframes to get useful properties, similar to how we used length() for vectors: nrow(metadata) ## [1] 1864 ncol(metadata) ## [1] 30 dim(metadata) ## [1] 1864 30 colnames(metadata) ## [1] "ModelID" "PatientID" "CellLineName" ## [4] "StrippedCellLineName" "Age" "SourceType" ## [7] "SangerModelID" "RRID" "DepmapModelType" ## [10] "AgeCategory" "GrowthPattern" "LegacyMolecularSubtype" ## [13] "PrimaryOrMetastasis" "SampleCollectionSite" "Sex" ## [16] "SourceDetail" "LegacySubSubtype" "CatalogNumber" ## [19] "CCLEName" "COSMICID" "PublicComments" ## [22] "WTSIMasterCellID" "EngineeredModel" "TreatmentStatus" ## [25] "OnboardedMedia" "PlateCoating" "OncotreeCode" ## [28] "OncotreeSubtype" "OncotreePrimaryDisease" "OncotreeLineage" The last function, colnames(), returns a character vector of the column names of the dataframe. This is an important property of dataframes that we will make use of for subsetting. We introduce an operation for dataframes: the dataframe$column_name operation selects for a column by its column name and returns the column as a vector. 
For instance: metadata$OncotreeLineage[1:5] ## [1] "Ovary/Fallopian Tube" "Myeloid" "Bowel" ## [4] "Myeloid" "Myeloid" metadata$Age[1:5] ## [1] 60 36 72 30 30 We treat the resulting value as a vector, so we can perform implicit subsetting: metadata$OncotreeLineage[metadata$OncotreeLineage == "Myeloid"] ## [1] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [8] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [15] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [22] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [29] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [36] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [43] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [50] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [57] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [64] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" ## [71] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" Lastly, try running View(metadata) in the RStudio Console…whew, a nice way to examine your dataframe like a spreadsheet program! 3.2.2 “What do you want to do with this dataframe”? Before diving into the technical part of subsetting dataframes, we will use a different mindset to think about what we want to do with this dataframe as scientists. Remember that a major theme of the course is about: How we organize ideas <-> Instructing a computer to do something. Until now, we haven’t focused too much on how we organize our scientific ideas to interact with what we can do with code. Let’s write our code driven by our scientific curiosity. Here’s a starting prompt: In the dataframe you have here, which rows would you filter for and which columns would you select that relate to a scientific question? Use the implicit subsetting mindset here: i.e. 
“I want to filter for rows (cell lines) that are breast cancer and look at the Age and Sex.” and not “I want to filter for rows 20-50 and select columns 2 and 8”. Notice that when we filter for rows in an implicit way, we often formulate criteria about the columns. (This is because we are guaranteed to have column names in dataframes. Some dataframes have row names, but because rows are not guaranteed to have the same data type, describing rows by their properties is difficult.) Let’s convert this into code! metadata_filtered = filter(metadata, OncotreeLineage == "Breast") breast_metadata = select(metadata_filtered, ModelID, Age, Sex) head(breast_metadata) ## ModelID Age Sex ## 1 ACH-000017 43 Female ## 2 ACH-000019 69 Female ## 3 ACH-000028 69 Female ## 4 ACH-000044 47 Female ## 5 ACH-000097 63 Female ## 6 ACH-000111 41 Female Here, filter() and select() are functions from the tidyverse package. 3.2.3 Filter rows Let’s take a careful look at how the R Console interprets the filter() function: We evaluate the expression right of =. The first argument of filter() is a dataframe, which we give metadata. The second argument is strange: the expression we give it looks like a logical indexing vector built from a comparison operator, but the variable OncotreeLineage does not exist in our environment! Rather, OncotreeLineage is a column from metadata, and we are referring to it as a data variable in the context of the dataframe metadata. So, we make a comparison operation on the column OncotreeLineage from metadata and its resulting logical indexing vector is the input to the second argument. How do we know when a variable being used is a variable from the environment, or a data variable from a dataframe? It’s not clear cut, but here’s a rule of thumb: most functions from the tidyverse package allow you to use data variables to refer to columns of a dataframe. We refer to documentation when we are not sure. 
This encourages more readable code at the expense of consistency of referring to variables in the environment. The authors of this package describe this trade-off. Putting it together, filter() takes in a dataframe and a logical indexing vector described by data variables as arguments, and returns a data frame with rows that match the condition described by the logical indexing vector. Store this in the metadata_filtered variable. 3.2.4 Select columns Let’s take a careful look at how the R Console interprets the select() function: We evaluate the expression right of =. The first argument of select() is a dataframe, which we give metadata. The second and third arguments are data variables referring to the columns of metadata. For certain functions like select(), there is no limit on the number of arguments you provide. You can keep adding data variables to select for more column names. Putting it together, select() takes in a dataframe, and as many data variables as you like to select columns, and returns a dataframe with the columns you described by data variables. Store this in the breast_metadata variable. "],["about-the-authors.html", "About the Authors", " About the Authors These credits are based on our course contributors table guidelines.     
Credits Names Pedagogy Lead Content Instructor(s) Chris Lo Lecturer Chris Lo Content Author(s) (include chapter name/link in parentheses if only for specific chapters) - make new line if more than one chapter involved If any other authors besides lead instructor Content Contributor(s) (include section name/link in parentheses) - make new line if more than one section involved Wrote less than a chapter Content Editor(s)/Reviewer(s) Checked your content Content Director(s) Helped guide the content direction Content Consultants (include chapter name/link in parentheses or word “General”) - make new line if more than one chapter involved Gave high level advice on content Acknowledgments Gave small assistance to content but not to the level of consulting Production Content Publisher(s) Helped with publishing platform Content Publishing Reviewer(s) Reviewed overall content and aesthetics on publishing platform Technical Course Publishing Engineer(s) Helped with the code for the technical aspects related to the specific course generation Template Publishing Engineers Candace Savonen, Carrie Wright, Ava Hoffman Publishing Maintenance Engineer Candace Savonen Technical Publishing Stylists Carrie Wright, Ava Hoffman, Candace Savonen Package Developers (ottrpal) Candace Savonen, John Muschelli, Carrie Wright Art and Design Illustrator(s) Created graphics for the course Figure Artist(s) Created figures/plots for course Videographer(s) Filmed videos Videography Editor(s) Edited film Audiographer(s) Recorded audio Audiography Editor(s) Edited audio recordings Funding Funder(s) Institution/individual who funded course including grant number Funding Staff Staff members who help with funding   ## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R version 4.0.2 (2020-06-22) ## os Ubuntu 20.04.5 LTS ## system x86_64, linux-gnu ## ui X11 ## language (EN) ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC ## date 
2023-10-10 ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date lib source ## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.5) ## bookdown 0.24 2023-03-28 [1] Github (rstudio/bookdown@88bc4ea) ## bslib 0.4.2 2022-12-16 [1] CRAN (R 4.0.2) ## cachem 1.0.7 2023-02-24 [1] CRAN (R 4.0.2) ## callr 3.5.0 2020-10-08 [1] RSPM (R 4.0.2) ## cli 3.6.1 2023-03-23 [1] CRAN (R 4.0.2) ## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0) ## desc 1.2.0 2018-05-01 [1] RSPM (R 4.0.3) ## devtools 2.3.2 2020-09-18 [1] RSPM (R 4.0.3) ## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0) ## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.3) ## evaluate 0.20 2023-01-17 [1] CRAN (R 4.0.2) ## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0) ## fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.0.2) ## fs 1.5.0 2020-07-31 [1] RSPM (R 4.0.3) ## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.5) ## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0) ## htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.0.2) ## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.0.2) ## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2) ## knitr 1.33 2023-03-28 [1] Github (yihui/knitr@a1052d1) ## lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.0.2) ## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.0.2) ## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.0.2) ## ottrpal 1.0.1 2023-03-28 [1] Github (jhudsl/ottrpal@151e412) ## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.0.2) ## pkgbuild 1.1.0 2020-07-13 [1] RSPM (R 4.0.2) ## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3) ## pkgload 1.1.0 2020-05-29 [1] RSPM (R 4.0.3) ## prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.0.3) ## processx 3.4.4 2020-09-03 [1] RSPM (R 4.0.2) ## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2) ## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0) ## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2) ## remotes 2.2.0 2020-07-21 [1] RSPM (R 4.0.3) ## rlang 1.1.0 2023-03-14 [1] CRAN (R 4.0.2) ## rmarkdown 2.10 2023-03-28 [1] Github (rstudio/rmarkdown@02d3c25) ## rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.0.2) ## sass 0.4.5 2023-01-24 [1] 
CRAN (R 4.0.2) ## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3) ## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.3) ## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3) ## testthat 3.0.1 2023-03-28 [1] Github (R-lib/testthat@e99155a) ## tibble 3.2.1 2023-03-20 [1] CRAN (R 4.0.2) ## usethis 1.6.3 2020-09-17 [1] RSPM (R 4.0.2) ## utf8 1.1.4 2018-05-24 [1] RSPM (R 4.0.3) ## vctrs 0.6.1 2023-03-22 [1] CRAN (R 4.0.2) ## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2) ## xfun 0.26 2023-03-28 [1] Github (yihui/xfun@74c2a66) ## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3) ## ## [1] /usr/local/lib/R/site-library ## [2] /usr/local/lib/R/library "],["references.html", "Chapter 4 References", " Chapter 4 References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]] diff --git a/docs/no_toc/working-with-data-structures.html b/docs/no_toc/working-with-data-structures.html index 2b390b3..beb77dd 100644 --- a/docs/no_toc/working-with-data-structures.html +++ b/docs/no_toc/working-with-data-structures.html @@ -384,8 +384,8 @@

3.2.2 “What do you want to do w

(This is because we are guaranteed to have column names in dataframes. Some dataframes have row names, but because the data types are not guranteed to have the same data type, it makes describing by row properties difficult.)

Let’s convert this into code!

metadata_filtered = filter(metadata, OncotreeLineage == "Breast")
-brca_metadata = select(metadata_filtered, ModelID, Age, Sex)
-head(brca_metadata)
+breast_metadata = select(metadata_filtered, ModelID, Age, Sex)
+head(breast_metadata)
##      ModelID Age    Sex
 ## 1 ACH-000017  43 Female
 ## 2 ACH-000019  69 Female
@@ -401,7 +401,7 @@ 

3.2.3 Filter rows

  • We evaluate the expression right of =.

  • The first argument of filter() is a dataframe, which we give metadata.

  • -
  • The second argument is strange: the expression we give it looks like a logical indexing vector built from a comparison operator, but the variable OncotreeLineage does not exist in our environment! Rather, OncotreeLineage is a column from metadata, and we are referring to it as a data variable in the context of the dataframe metadata. So, we make a comparsion operation on the column OncotreeLineage from metadata and its resulting logical indexing vector is the input to the second argument.

    +
  • The second argument is strange: the expression we give it looks like a logical indexing vector built from a comparison operator, but the variable OncotreeLineage does not exist in our environment! Rather, OncotreeLineage is a column from metadata, and we are referring to it as a data variable in the context of the dataframe metadata. So, we make a comparison operation on the column OncotreeLineage from metadata and its resulting logical indexing vector is the input to the second argument.

    • How do we know when a variable being used is a variable from the environment, or a data variable from a dataframe? It’s not clear cut, but here’s a rule of thumb: most functions from the tidyverse package allows you to use data variables to refer to columns of a dataframe. We refer to documentation when we are not sure.

    • This encourages more readable code at the expense of consistency of referring to variables in the environment. The authors of this package describes this trade-off.

    • @@ -421,7 +421,7 @@

      3.2.4 Select columns

    • For certain functions like filter(), there is no limit on the number of arguments you provide. You can keep adding data variables to select for more column names.
  • Putting it together, select() takes in a dataframe, and as many data variables you like to select columns, and returns a dataframe with the columns you described by data variables.

  • -
  • Store this in brca_metadata variable.

  • +
  • Store this in breast_metadata variable.