Instructions

Solutions to the exercises of this homework 4 should, just as for HW1-HW3, be written in an R-Markdown document with output: github_document. Both the R-Markdown document (.Rmd-file) and the compiled Markdown document (.md file), as well as any figures needed for properly rendering the Markdown file on GitHub, should be uploaded to your Homework repository as part of a HW3 folder. Code should be written clearly in a consistent style, see in particular Hadley Wickham’s tidyverse style guide. As an example, code should be easily readable and avoid unnecessary repetition of variable names.

Deadline

Deadline for the homework is 2021-11-28 at 23.59. Submission occurrs as usual by creating a new issue with the title “HW4 ready for grading!” in your repository. Please also add a link from your repository’s README.md file to HW4/HW4.md. Your peer review will be assigned on 2021-11-29 and is due 2021-12-01 at 12:00.

Exercise 1: SQL, SQl, Sql, sql!

We are going to work with HW_data/chinook.db. The database represents a digital media store. It contains information on tracks, artists, composers and a lot more. There is a specification on the database and its tables in the file HW_data/chinookDB.pdf. In this pdf you can see how the tables relate to eachother and what columns exist. All questions should be answered with SQL code, except for those who asks for plots. If we ask for a mean, use SQL to compute it.

  1. Extract all data from the tracks table using dbReadTable and plot the UnitPrice in a histogram. Add the mean as a vertical line. Comment on your result.
  2. Which genre has the least amount of tracks?
  3. Which genre has the most amount of tracks in a playlist?
  4. Which Composer (which is not NA) has most tracks in a playlist and what is that playlists name?

Exercise 2: Skoleverket’s information about 6th graders

Data from this exercise originate from Skolverket and contain the grades of 6th graders from all elementary schools in Sweden. The data are freely available when aggregated at the municipality level and consists of a CSV file HW_data/exp_betyg_ak6_kommun_2018_19.csv containing the results of the year 2018/2019. Skolverket provides some additional information about the content of the data, if you click on “Analysstöd” at Skolverket’s data download page.

Tasks

  1. Read the data into R and restrict your attention to the results of “Samtliga” schools for each municipality. In what follows we shall analyse the average grade points 23 subjects (Ämnen) contained in the data. Read the link called “Analysstöd”. Your solution should contain a description of how you deal with “.”, “..”, “-” and “~100” results when importing the data. If you get any warnings, describe what they are and why they appear. You may thereafter continue.

  2. Make a plot which shows the difference of average grades between Boys and Girls in each county. Take into account that there is a difference in the numbers of girls and boys in each municipality and subject. Weight according to their respective counts. When you choose how you visualize your result, make the plot as easy as possible to interpret. Explain your choice of plot.

  3. Make a map of Sweden where the municipalities are colored according to the event that the mean grade is higher in “Engelska” compared to “Idrott och hälsa” or not. In the csv file kommun-karta.csv we provide the borders of the municipalities in Sweden in polygon format. You can use the following example code as a basis for your solution.

kommun_karta <- read_csv("HW_data/kommun_karta.csv")
ggplot(kommun_karta, aes(x = long, y = lat, group = group, fill = as.numeric(id))) +
    geom_polygon() +
    coord_fixed() +
    #theme_void()
    theme_minimal()

which produce the following figure. What conclusions can be drawn from looking at the map you produce?

  1. For each subject, compute the overall mean in Sweden. Do your conclusion in exercise c) still hold?

Peer review

After deadline has passed, you will be given access to another students repository on GitHub. You should provide summary feedback by responding to the “HW4 ready for grading!” issue. Copy the following checklist and use it in your review:

* Is the homework complete, e.g. are all steps in the homework done?

* Is there a working link from the main repository `README.md` to `HW4.md`?

* Is any code showing? If yes, is there any text about it?

* Was the genre with most least amount of tracks "Opera" and most amount of tracks in a playlist "Rock"? 

* How did the student apprach 2b? Explain the steps they took to compute the weighted average and the figure presented in words. Did you plot it differently?