Instructions

Solutions to the exercises of this homework 3 should, just as for HW1-HW2, be written in an R-Markdown document with output: github_document. Both the R-Markdown document (.Rmd-file) and the compiled Markdown document (.md file), as well as any figures needed for properly rendering the Markdown file on GitHub, should be uploaded to your Homework repository as part of a HW3 folder. Code should be written clearly in a consistent style, see in particular Hadley Wickham’s tidyverse style guide. As an example, code should be easily readable and avoid unnecessary repetition of variable names.

Deadline

Deadline for the homework is 2021-11-21 at 23.59. Submission occurrs as usual by creating a new issue with the title “HW3 ready for grading!” in your repository. Your peer review will be assigned on 2021-11-22 and is due 2021-11-24 at 12:00.

Exercise 1: Marketing, conversions and conversion lags

In the file HW_data/seo_marketing_data.sqlite we have an example of a database often seen in search engine marketing. It contains data on who buys on that site (conversions). The data has been simplified a lot, but you can imagine it being from a site that rents out rooms for hotels. The database represents customers coming from a search engine (e.g. google), pressing an ad (a “click”) and whether or not they bought something (if a conversion occurred). There are two tables, “Clicks” and “Conversion_value”. Each id represent a unique customer. Use the following code snippet to read tables from the database.

library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "HW_data/seo_marketing_data.sqlite")
# List all available tables
dbListTables(con)
# Tables are "Conversion value" and "Clicks"
# Pull all data in a table to R.
df_clicks <- dbReadTable(con, "Clicks")

Tasks

  1. Visualize the number of clicks made on each day (date).
  2. Given that there is a cost to each click, which day of the week has been most costly? Hint: there are many ways to solve to problem, but have a look at lubridate::wday.
  3. Make a histogram of the conversion value customers provided.
  4. The “Conversion lag” (CL) of a customer is the number of days from the customer clicked on an ad to the customer bought something/converted. An example is that the customer clicked an add on the first on January and rented a house on the third of January. The CL is then equal to 2. Plot its distribution in R, interpret and comment on the results. Hint: Google on how to take differences of dates. If you want to write a full SQL solution you can use julianday.
  5. Present two tables on the following format
Date of Click CL - 25 day CL - 26 days CL - 27 days
2020-.. a number number ….
2020-.. a number number ….
2020-.. more numbers ….

The two tables should contain the sum of clicks and average conversion value. Display the last five rows with knitr::kable. The tables will be sparse (many zeros). Hint: Use a suitable function to make the data in wide format.

Exercise 2: SL lines

The file HW_data/sl-api.sqlite is a SQLite database containing all SL’s current stops, lines, and the stops on each line. The data are obtained from a call to TrafikLab’s SL Hållplatser och Linjer 2 API on 2019-11-17 using the httr package. Note: Trafiklab has further documentation of the API and the variables.

Tasks

  1. Figure out and describe how the tables relate to eachother. Note: Answer the question by both studying the descriptive information on the linked Webseite and, to a smaller degree, by manually looking at the data from the NoSQL-Database file from the HW_data folder. The later could, e.g., be useful to confirm what is written in the description. At the end, your answers is supposed to consists purely of text and possibly a figure, but no code.

  2. Pull all data from the SQLite database into R. Note: The SiteId, StopPointNumber and StopAreaNumber variables are all of the type integer according to the data model, but have not been properly converted as part of the API call. As a consequence, convert them manually to numerical values after importing them into R.

  3. Present a table of the number of active unique rail traffic stops (i.e. train, tram or metro stops in each ticket zone (ZoneShortName in stopAreas/stopPoints). By “active” we mean stops that are part of the journey pattern of a line.

  4. Choose a line, and plot the stops as points on a map with the name of each stop as a label. Write the code in such a way that it is easily reusable if you want to plot another line. In order to produce a map use the latitude and longitude coordinates and generate the plot using leaflet package. As an example:

library(leaflet)

kräftriket <-  data.frame(lon = 18.055, lat = 59.357, name = "Kräftriket")

m <- leaflet(kräftriket) %>% 
    addTiles() %>%  # Add default OpenStreetMap map tiles
    addMarkers(lng=~lon, lat=~lat, popup = ~name) 
m

For documents with output: html_output or when working with the code in interactive mode you can then just print m to see the map. This might, however, not work with .md files (i.e. github_document) though! Instead, you can store the Leaflet input into a file and then read and display this again with knitr::includegraphics:

#Store leaflet widget as a PNG, see https://stackoverflow.com/questions/31336898/how-to-save-leaflet-in-r-map-as-png-or-jpg-file
library(mapview)
mapshot(m, file = "Leaflet-plot.png")
knitr::include_graphics("Leaflet-plot.png")

  1. Consider the stopAreas and stopPoints tables. What do the two tables have in common, e.g., are there any (unecessary) redudancies/columns? If so, which? Suggest another way to structure the two tables so that we dont loose any information but reduces the size of the database. Hint: Try using two different tables and then use joins to get back to the structure of stopAreas and stopPoints. A very operational way to check memory consumption for storing a variable in R is to use the object.size function or the pryr::object_size function.

Peer review

After deadline has passed, you will be given access to another students repository on GitHub. You should provide summary feedback by responding to the “HW3 ready for grading!” issue. Copy the following checklist and use it in your review:

* Is the homework complete, e.g. are all questions in the homework answered?

* Is any code showing? If yes, is there any text about it?

* Do the Figures have proper axis names, do you understand them?

* When did the most clicks occur, Sunday or Monday? 

* Does the table show the last 5 days of the conversion lag and 5 rows? 

* What type of join was used in 2 e?