Complete course Intro to SQL for Data Science and chapters Importing data from databases Part 1 and 2 DataCamp. Read the RSQLite vignette.
There are essentially three ways to send SQL-queries directly from RStudio to a database mybase
, where we assume mybase
is an SQLite database stored in file mybase.sqlite
.
dbGetQuery
after establishing a conection using dbConnect
as incon <- dbConnect(SQLite(), "mybase.sqlite")
dbGetQuery(con, "SELECT * FROM table")
```{r}
con <- dbConnect(SQLite(), "mybase.sqlite")
```
```{sql, connection = con}
SELECT * FROM table
```
We recommend using the first option within an R script file.
Practise SQL joins at w3schools.
The SQLite database Class_files/sthlm_metro.sqlite
contains data on the stops of the Stockholm metro at the platform level, which was generated for the 2018 class by Class_files/sthlm_metro.R
and originates from Trafiklab.se.
Connect to the database, list the tables and figure out how they relate to eachother.
Query for the LineName
from table Line
where LineNumber
is 18.
Query for the number of platforms at each station in descending order.
Query for the five most southern StationName
, where position is measured as the average Latitude
of its PlatformNumber
.
Query for the number of stations on LineNumber
18.
Query for all StationName
on LineNumber
18 in alphabetical order.
The SQLite database Class_files/pokedex.sqlite
contains the full set of tables from https://github.com/veekun/pokedex/tree/master/pokedex/data/csv generated by the file Class_files/pokedexDB.R
.
Close any existing connections with dbDisconnect
and connect to the database.
height
and average weight
of pokemon (pokemon
table).weight
.ghost
-type pokemon.pokemon_id
, pokemon_name
and six additional columns giving the pokemons base_stat
-value in each of categories hp
, attack
, defense
, special-attack
, special-defense
and speed
(see tables stats
and pokemon_stats
). You may use SQL to ask for the table in long format and convert it to the final wide format using spread
in R.Ensembl hosts a public genomic database with a MySQL server. You can connect to the database holding the human genome using (note the use of MySQL
rather than SQLite
)
library(RMySQL)
## Loading required package: DBI
con <- dbConnect(MySQL(), host = "ensembldb.ensembl.org",
user = "anonymous", password = "",
port = 3306)
we are now connected to a whole set of databases
databases <- dbGetQuery(con, "SHOW DATABASES")
head(databases)
## Database
## 1 information_schema
## 2 PERCONA_SCHEMA
## 3 acanthochromis_polyacanthus_core_100_1
## 4 acanthochromis_polyacanthus_core_101_1
## 5 acanthochromis_polyacanthus_core_102_1
## 6 acanthochromis_polyacanthus_core_103_1
nrow(databases)
## [1] 15416
Yes, plenty of databases. The latest version of the human genome is in homo_sapiens_core_94_38
, we choose this by
dbSendQuery(con, "USE homo_sapiens_core_94_38")
## <MySQLResult:12,0,1>
dbListTables(con)
## [1] "alt_allele"
## [2] "alt_allele_attrib"
## [3] "alt_allele_group"
## [4] "analysis"
## [5] "analysis_description"
## [6] "assembly"
## [7] "assembly_exception"
## [8] "associated_group"
## [9] "associated_xref"
## [10] "attrib_type"
## [11] "biotype"
## [12] "coord_system"
## [13] "data_file"
## [14] "density_feature"
## [15] "density_type"
## [16] "dependent_xref"
## [17] "ditag"
## [18] "ditag_feature"
## [19] "dna"
## [20] "dna_align_feature"
## [21] "dna_align_feature_attrib"
## [22] "exon"
## [23] "exon_transcript"
## [24] "external_db"
## [25] "external_synonym"
## [26] "gene"
## [27] "gene_archive"
## [28] "gene_attrib"
## [29] "genome_statistics"
## [30] "identity_xref"
## [31] "interpro"
## [32] "intron_supporting_evidence"
## [33] "karyotype"
## [34] "map"
## [35] "mapping_session"
## [36] "mapping_set"
## [37] "marker"
## [38] "marker_feature"
## [39] "marker_map_location"
## [40] "marker_synonym"
## [41] "meta"
## [42] "meta_coord"
## [43] "misc_attrib"
## [44] "misc_feature"
## [45] "misc_feature_misc_set"
## [46] "misc_set"
## [47] "object_xref"
## [48] "ontology_xref"
## [49] "operon"
## [50] "operon_transcript"
## [51] "operon_transcript_gene"
## [52] "peptide_archive"
## [53] "prediction_exon"
## [54] "prediction_transcript"
## [55] "protein_align_feature"
## [56] "protein_feature"
## [57] "repeat_consensus"
## [58] "repeat_feature"
## [59] "seq_region"
## [60] "seq_region_attrib"
## [61] "seq_region_mapping"
## [62] "seq_region_synonym"
## [63] "simple_feature"
## [64] "stable_id_event"
## [65] "supporting_feature"
## [66] "transcript"
## [67] "transcript_attrib"
## [68] "transcript_intron_supporting_evidence"
## [69] "transcript_supporting_feature"
## [70] "translation"
## [71] "translation_attrib"
## [72] "unmapped_object"
## [73] "unmapped_reason"
## [74] "xref"
Check the assembly table if you wonder how the tables relate to eachother… We may take a closer look at a table by DESCRIBE
dbGetQuery(con, "DESCRIBE gene")
## Field Type Null Key Default Extra
## 1 gene_id int(10) unsigned NO PRI <NA> auto_increment
## 2 biotype varchar(40) NO <NA>
## 3 analysis_id smallint(5) unsigned NO MUL <NA>
## 4 seq_region_id int(10) unsigned NO MUL <NA>
## 5 seq_region_start int(10) unsigned NO <NA>
## 6 seq_region_end int(10) unsigned NO <NA>
## 7 seq_region_strand tinyint(2) NO <NA>
## 8 display_xref_id int(10) unsigned YES MUL <NA>
## 9 source varchar(40) NO <NA>
## 10 description text YES <NA>
## 11 is_current tinyint(1) NO 1
## 12 canonical_transcript_id int(10) unsigned NO MUL <NA>
## 13 stable_id varchar(128) YES MUL <NA>
## 14 version smallint(5) unsigned YES <NA>
## 15 created_date datetime YES <NA>
## 16 modified_date datetime YES <NA>
gene
-table?biotype
in the gene
-table).dna
table.