From Python to R

26 Feb 2018

today i begin learning R LOL

no choice, gotta pick it up for work. i guess it’s always good to learn more languages!

i begin by searching for a python/r cheatsheet - this is pretty good for a quick peek into the differences in the syntax:

https://www.analyticsvidhya.com/blog/2015/09/full-cheatsheet-machine-learning-algorithms/

this is more in-depth (the basic R one in particular gives an interesting overview): http://www.datasciencefree.com/cheatsheets.html

[http://www.datasciencefree.com/basicR.pdf](http://www.datasciencefree.com/basicR.pdf)

for tidyverse:

http://datacamp-community.s3.amazonaws.com/e63a8f6b-2aa3-4006-89e0-badc294b179c


but for now…i need some real (basic) practice. datacamp, here i come!


most important takeaways so far:

R BEGINS WITH INDEX 1 INSTEAD OF 0!!! faints.

vector <- c(x, y, z) creates a vector, and assigns it to vector. 1:9 is a shortcut for creating a vector from 1 to 9.

names(vector) names the parts of the vector! interesting.

selection of x[1:5] INCLUDES both 1 and 5
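a minimal sketch of the vector notes above. the values and the names (scores, amy, ben, cat) are made up, just for illustration:

```r
# made-up example values
scores <- c(90, 85, 77)                   # c() combines values into a vector
names(scores) <- c("amy", "ben", "cat")   # names() labels each element

first <- scores[1]        # index 1 is the FIRST element (not the second)
ben <- scores[["ben"]]    # elements can also be selected by name

digits <- 1:9             # shortcut for the vector 1, 2, ..., 9
first_five <- digits[1:5] # both ends included: 1 2 3 4 5
```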

matrix(1:9, byrow = TRUE, nrow = 3) constructs a 3 row matrix from the vector 1:9, filled in by rows (instead of by columns)

colnames(matrix), rownames(matrix) can be used to name the matrix

cbind(matrix, vector) joins matrices and/or vectors together by column (each one gets added as new columns).

rbind(matrix, matrix) does the same by row (stacks them on top of each other), e.g. for two matrices
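here's a small sketch of the matrix functions above. the row/column names and the totals vector are just made up for the example:

```r
m <- matrix(1:9, byrow = TRUE, nrow = 3)  # rows are 1 2 3 / 4 5 6 / 7 8 9
rownames(m) <- c("r1", "r2", "r3")
colnames(m) <- c("a", "b", "c")

totals <- rowSums(m)        # one total per row
wide <- cbind(m, totals)    # adds totals as a 4th column
tall <- rbind(m, m)         # stacks the rows: 6 rows, 3 columns

dim(wide)                   # 3 4
dim(tall)                   # 6 3
```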

factor(vector) changes it to a factor, and by changing levels(factor), you can change all the names (e.g. M,F to Male, Female)!!! homg

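summary(factor) gives the count of each level (closer to pandas .value_counts()). summary(vector) on a numeric vector gives min, quartiles, mean and max (closer to .describe()); on a character vector it just gives length, class and mode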

factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c('slow', 'medium', 'fast')) this creates an ordinal factor, levels is the correct order.
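a quick sketch of the factor notes, with made-up vectors (sex_vector and the values in speed_vector are assumptions for the example). note that the new level names have to be given in the same order as the existing levels:

```r
sex_vector <- c("M", "F", "F", "M", "M")
sex_factor <- factor(sex_vector)
levels(sex_factor)                          # "F" "M" (sorted by default)
levels(sex_factor) <- c("Female", "Male")   # rename, same order as above
counts <- summary(sex_factor)               # counts per level

speed_vector <- c("medium", "slow", "slow", "fast")
factor_speed_vector <- factor(speed_vector, ordered = TRUE,
                              levels = c("slow", "medium", "fast"))
# ordered factors can be compared: medium > slow
is_faster <- factor_speed_vector[1] > factor_speed_vector[2]
```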

str(dataframe) gives the structure of the df

head(dataframe) is equivalent to pandas df.head()

data.frame(x, y, z) creates a df

dataframe$col selects the col

subset(dataframe, subset = x < 1) selects the rows where x<1

use position <- order(df$col) to get the row positions sorted by col, then df[position, ] to get the sorted df
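the data frame notes above, as a small runnable sketch. the columns and numbers are invented for the example:

```r
df <- data.frame(name = c("a", "b", "c", "d"),
                 x = c(0.5, 2.0, -1.0, 0.9),
                 y = c(10, 40, 30, 20))

str(df)                               # column types plus a preview
head(df, 2)                           # first 2 rows
df$x                                  # one column, as a vector
small <- subset(df, subset = x < 1)   # keeps only the rows where x < 1

position <- order(df$y)               # row positions that sort y ascending
sorted <- df[position, ]              # mind the comma: reorder rows, keep all columns
```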

list(x, y, z) to create lists

list[[1]] to select first element from list

list[['actors']][2] selects the second item of the element named actors
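a tiny sketch of the list selection above. the movie list and its contents are made up:

```r
movie <- list(title = "some movie",
              actors = c("actor a", "actor b", "actor c"),
              year = 2018)

first_element <- movie[[1]]             # "some movie"
second_actor <- movie[["actors"]][2]    # "actor b"
movie_year <- movie$year                # $ also works for named elements
```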


RStudio is like, the equivalent of Jupyter Notebook (I think!)

Download R, and then download RStudio.

i’m following this tutorial: https://www.computerworld.com/article/2497143/business-intelligence/business-intelligence-beginner-s-guide-to-r-introduction.html?page=2

as a start to actually get coding, and then trying stuff out on the Titanic Dataset.

https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic

install.packages('packagename') is like the equivalent of pip install


if you come across the ‘tar: failed to set default locale’ error, use the solution here: https://stackoverflow.com/questions/3907719/how-to-fix-tar-failed-to-set-default-locale-error


so i’m trying to do a df.isnull().sum() in R…

after a bit of a search, sapply(df, function(x) sum(is.na(x))) or map(df, ~sum(is.na(.))) seems the cleanest - the former requires no dependencies, the latter needs tidyverse. also i like the former’s format a little better…

here’s a summary: https://sebastiansauer.github.io/sum-isna/

also, if NA for your string is represented by ‘’, remember to add na.strings = '' when reading it in!
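here's a minimal sketch of the base R version, using a made-up inline csv (read.csv's text argument) so there's no file needed. the columns and values are just for illustration:

```r
# made-up csv with one empty field in each column
csv_text <- "name,age,city
amy,31,
ben,,london
,25,paris"

# na.strings = "" turns empty strings into NA while reading
df <- read.csv(text = csv_text, na.strings = "")

# count the NAs per column, like df.isnull().sum() in pandas
na_counts <- sapply(df, function(x) sum(is.na(x)))
na_counts
```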


seq() is the equivalent of range(), except that the end value is included, e.g. seq(0, 10, by = 2) goes all the way up to 10

rep(vector, times = 2) repeats the vector by the number of times, entirely. changing times to each repeats each element in the vector several times first, before going to the next element

is.*() checks for type

as.*() converts it to a type, e.g. as.numeric(). as.data.table() works too, but you need to load library(data.table) first

rev() reverses the order
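a short sketch of the helpers above, base R only. the variable names are just for the example:

```r
evens <- seq(0, 10, by = 2)               # 0 2 4 6 8 10 (end value included)
whole <- rep(c(1, 2, 3), times = 2)       # 1 2 3 1 2 3
each_one <- rep(c(1, 2, 3), each = 2)     # 1 1 2 2 3 3
backwards <- rev(1:3)                     # 3 2 1

text_number <- "3.5"
is.character(text_number)                 # TRUE
real_number <- as.numeric(text_number)    # converts the string to a number
is.numeric(real_number)                   # TRUE
```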

grepl(pattern = regex, x = string) searches for the regex in the string, returns a logical

grep() returns indices instead

sub() replaces the first match

gsub() replaces all

Sys.Date() returns today’s date as a Date object

Sys.time() returns the current time

as.Date() converts to a Date (e.g. from a string)
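a small sketch of the regex and date functions above. the emails vector and the patterns are made up for the example:

```r
emails <- c("john@example.com", "not an email", "amy@example.org")

is_com <- grepl(pattern = "@.*\\.com$", x = emails)   # TRUE FALSE FALSE
has_at <- grep(pattern = "@", x = emails)             # indices: 1 3
first_only <- sub("a", "A", "banana")                 # "bAnana"
all_of_them <- gsub("a", "A", "banana")               # "bAnAnA"

today <- Sys.Date()                  # a Date object
d <- as.Date("2018-02-26")           # string -> Date
next_week <- d + 7                   # dates support arithmetic: 2018-03-05
```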


learning dplyr: select, mutate, filter, arrange, summarize

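names(df) is the equivalent of df.columns

n() gives the number of rows in the current group (it only works inside dplyr verbs like summarize() or mutate()); for the number of rows of the whole df, use nrow(df)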

df_less <- df %>%
    select(1:5, -2) # select columns 1 to 5, excluding 2. df is already piped in as the first argument

df_lesser <- df %>%
    select(year, country, lifeExp:pop) # select columns by name

helper functions in select:

starts_with("X"); ends_with("X"); contains("X");
matches("X"), where "X" can be a regular expression;
num_range("x", 1:5) gives the variables named x1, x2, x3, x4 and x5 (add width = 2 to match x01, x02, ... instead);
one_of(x): every name that appears in x, which should be a character vector.
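a quick sketch of a few of these helpers on a made-up data frame (the column names are invented so that each helper has something to match; this assumes dplyr is installed):

```r
library(dplyr)

# made-up columns: an id, three numbered x columns, and one ending in x
df <- data.frame(id = 1:2, x1 = c(1, 2), x2 = c(3, 4),
                 x3 = c(5, 6), name_x = c("a", "b"))

starts <- names(select(df, starts_with("x")))        # "x1" "x2" "x3"
ends <- names(select(df, ends_with("x")))            # "name_x"
numbered <- names(select(df, num_range("x", 1:2)))   # "x1" "x2"
by_position <- names(select(df, 1:3, -2))            # "id" "x2"
```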

library(dplyr)
df %>% # pipes this result into the first argument in the next step
    filter(year == 1990, country == "United States") %>% # to filter rows
    arrange(desc(year)) %>% # to arrange
    mutate(lifeExpMonths = lifeExp * 12) # to create a new column

by_year_continent <- df %>% # assign the result to a variable
    filter(year <= 1990) %>%
    group_by(continent, year) %>% # group by
    summarize(meanLifeExp = mean(lifeExp), # to summarize
              totalPop = sum(pop))
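to try the pipeline without downloading anything, here's a sketch on a tiny made-up data frame. only the column names mirror the notes above, the numbers are invented, and .groups = "drop" assumes dplyr 1.0 or newer:

```r
library(dplyr)

# toy data with made-up numbers
df <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe", "Europe"),
  year      = c(1985, 1985, 1985, 1990, 1995),
  lifeExp   = c(60, 70, 72, 74, 76),
  pop       = c(100, 200, 50, 60, 70)
)

by_year_continent <- df %>%
  filter(year <= 1990) %>%                  # drops the 1995 row
  group_by(continent, year) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(pop),
            n_rows = n(),                   # n() counts rows per group
            .groups = "drop") %>%           # assumption: dplyr >= 1.0
  arrange(desc(year))                       # latest year first
```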


learning ggplot2

library(ggplot2)

# scatterplots
ggplot(df, aes(x = pop, y = gdpPercap, color = continent, size = pop)) + # color and size
  geom_point() + # geom_point is scatterplot
  scale_x_log10() + # puts x on log scale. if y, just use scale_y_log10()
  expand_limits(y = 0) # makes the y axis include 0 (0 can't be shown on a log-scaled axis, so don't use it on x here)

# subplots
ggplot(df, aes(x = pop, y = gdpPercap, color = continent, size = pop)) + # color and size
  geom_point() + 
  facet_wrap(~ continent) # creates subplots by continent

to plot a line plot, change geom_point() to geom_line().

bar plot: geom_col().

histogram: geom_histogram(binwidth = 5)

boxplot: geom_boxplot()

to add title: + ggtitle("Title")


connecting to mysql

Set up a connection to the mysql database:

my_db <- src_mysql(dbname = "dplyr", host = "courses.xxx.us-east-1.rds.amazonaws.com", port = 3306, user = "student", password = "xxx")

Reference a table within that source:

df <- tbl(my_db, "dplyr")


linear regression

model <- lm(y ~ x, data = df)
summary(model)
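a minimal sketch of fitting and using a model, on made-up data that roughly follows y = 2x + 1 (the numbers and the small noise values are invented for the example):

```r
# made-up data: y = 2x + 1 plus a bit of noise
df <- data.frame(x = 1:10)
noise <- c(0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05, 0.1, -0.1, 0.0)
df$y <- 2 * df$x + 1 + noise

model <- lm(y ~ x, data = df)
coefs <- coef(model)                          # intercept and slope
r2 <- summary(model)$r.squared                # summary() also holds the fit statistics
prediction <- predict(model, newdata = data.frame(x = 11))  # predict for an unseen x
```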

tidying data up

library(broom) # to tidy up the results from summary, turns model into a df
tidy(model)
bind_rows(tidy(model1), tidy(model2)) # combine results into one big df (bind_rows comes from dplyr)
library(tidyr)
nested <- by_year_country %>%
    nest(-country) %>% # nest all except country column, into a data column (is a list). there is a tibble (df) for each country
    unnest(data) # brings it back up to the same level 

to retrieve one of the nested data, use indexing: nested$data[[7]]

library(purrr)
by_year_country %>%
    nest(-country) %>%
    mutate(models = map(data, ~ lm(percent_yes ~ year, .))) %>% # map applies function to all items in the list. in this case the function is a linear model
    mutate(tidied = map(models,tidy)) %>% # tidies the models
    unnest(tidied) # unnest the tidied models