Databases using SQL

Overview

Questions
  • What is a relational database and why should I use it?

  • What is SQL?

Objectives
  • Describe why relational databases are useful.

  • Create and populate a database from a csv file.

  • Define SQLite data types.

  • Select, group, add to, and analyze subsets of data.

  • Combine data across multiple tables.

Setup

Note: this should have been done by participants before the start of the workshop.

We use DB Browser for SQLite and the Portal Project dataset throughout this lesson. See Setup for instructions on how to download the data, and also how to install DB Browser for SQLite.

What is SQL?

SQL stands for Structured Query Language. SQL allows us to interact with relational databases through queries.

Dataset Description

The data we will be using is a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots.

This is a real dataset that has been used in over 100 publications. We’ve simplified it just a little bit for the workshop, but you can download the full dataset and work with it using exactly the same tools we’ll learn about today.

Questions

First, let’s download and look at some of the cleaned spreadsheets from the Portal Project dataset.
We’ll need the following three files:

Challenge

Open each of these csv files and explore them. What information is contained in each file? What would you need to answer these questions?:

  • How has the hindfoot length and weight of Dipodomys species changed over time?
  • What is the average weight of each species, per year?
  • What information can I learn about Dipodomys species in the 2000s, over time?

Which files have the data you need? What operations would you need to perform if you were doing these analyses by hand?

Goals

In order to answer the questions described above, we’ll need to do the following basic data operations:

But we don’t want to do this manually. Instead of searching for the right pieces of data ourselves, or clicking between spreadsheets, or manually sorting columns, we want to make the computer do the work.

In particular, we want to use a tool where it’s easy to repeat our analysis in case our data changes. We also want to do all this searching without actually modifying our source data.

Putting our data into a relational database and using SQL will help us achieve these goals.

Definition: Relational Database

A relational database stores data in relations made up of records with fields. The relations are usually represented as tables; each record is usually shown as a row, and the fields as columns. In most cases, each record will have a unique identifier, called a key, which is stored as one of its fields. Records may also contain keys that refer to records in other tables, which enables us to combine information from two or more sources.

Databases

Why use relational databases

Using a relational database serves several purposes.

Database Management Systems

There are a number of different database management systems for working with relational data. We’re going to use SQLite today, but basically everything we teach you will apply to the other database systems as well (e.g. MySQL, PostgreSQL, MS Access, MS SQL Server, Oracle Database and Filemaker Pro). The only things that will differ are the details of exactly how to import and export data and the details of data types.

Database Design

Relational databases

Let’s look at a pre-existing database, the portal_mammals.sqlite file from the Portal Project dataset that we downloaded during Setup. Click on the “Open Database” button, select the portal_mammals.sqlite file, and click “Open” to open the database.

You can see the tables in the database by looking at the left hand side of the screen under Database Structure tab. Here you will see a list under “Tables.” Each item listed here corresponds to one of the csv files we were exploring earlier. To see the contents of any table, click on it, and then click the “Browse Data” tab next to the “Database Structure” tab. This will give us a view that we’re used to - just a copy of the table. Hopefully this helps to show that a database is, in some sense, just a collection of tables, where there’s some value in the tables that allows them to be connected to each other (the “related” part of “relational database”).

The “Database Structure” tab also provides some metadata about each table. If you click on the down arrow next to a table name, you will see information about the columns, which in databases are referred to as “fields,” and their assigned data types.
(The rows of a database table are called records.) Each field contains one variety or type of data, often numbers or text. You can see in the surveys table that most fields contain numbers (BIGINT, or big integer, and FLOAT, or floating point numbers/decimals) while the species table is entirely made up of text fields.

The “Execute SQL” tab is blank now - this is where we’ll be typing our queries to retrieve information from the database tables.

To summarize:

Import

Before we get started with writing our own queries, we’ll create our own database. We’ll be creating this database from the three csv files we downloaded earlier. Close the currently open database (File > Close Database) and then follow these instructions:

  1. Start a New Database
    • Click on the New Database icon or select File » New Database
    • Assign a name to the new database, choose the folder where you’d like to save it, and click Save. This creates the database in the selected folder.
  2. We will be importing tables and not creating tables from scratch, so click Cancel to edit out of the next pop-up window.
  3. Select File > Import > Table from CSV file… Choose surveys.csv from the data folder we downloaded and click Open.
  4. Give the table a name that matches the file name or use the default.
  5. If the first row has column headings, be sure to check the box next to “Column names in first line.”
  6. Be sure the field separator and quotation options are correct. If you’re not sure which options are correct, test some of the options and until the preview at the bottom of the window looks right.
  7. Click OK
  8. Back on the Database Structure tab, you should now see the table listed. Right click on the table name and choose Modify Table, or click on the Modify Table just under the tabs and above the table.
  9. In the center panel of the windown you’ll see, set the data types for each field using the suggestions in the table below (this includes fields from plots and species tables also):
Field Data Type Motivation Table(s)
day INTEGER Having data as numeric allows for meaningful arithmetic and comparisons surveys
genus TEXT Field contains text data species
hindfoot_length REAL Field contains measured numeric data surveys
month INTEGER Having data as numeric allows for meaningful arithmetic and comparisons surveys
plot_id INTEGER Field contains numeric data plots, surveys
plot_type TEXT Field contains text data plots
record_id INTEGER Field contains numeric data surveys
sex TEXT Field contains text data surveys
species_id TEXT Field contains text data species, surveys
species TEXT Field contains text data species
taxa TEXT Field contains text data species
weight REAL Field contains measured numerical data surveys
year INTEGER Allows for meaningful arithmetic and comparisons surveys

Finally, click OK one more time to confirm the operation.

Challenge

  • Import the plots and species tables

You can also use this same approach to append new fields to an existing table.

Data types

Data type Description
CHARACTER(n) Character string. Fixed-length n
VARCHAR(n) or CHARACTER VARYING(n) Character string. Variable length. Maximum length n
BINARY(n) Binary string. Fixed-length n
BOOLEAN Stores TRUE or FALSE values
VARBINARY(n) or BINARY VARYING(n) Binary string. Variable length. Maximum length n
INTEGER(p) Integer numerical (no decimal).
SMALLINT Integer numerical (no decimal).
INTEGER Integer numerical (no decimal).
BIGINT Integer numerical (no decimal).
DECIMAL(p,s) Exact numerical, precision p, scale s.
NUMERIC(p,s) Exact numerical, precision p, scale s. (Same as DECIMAL)
FLOAT(p) Approximate numerical, mantissa precision p. A floating number in base 10 exponential notation.
REAL Approximate numerical
FLOAT Approximate numerical
DOUBLE PRECISION Approximate numerical
DATE Stores year, month, and day values
TIME Stores hour, minute, and second values
TIMESTAMP Stores year, month, day, hour, minute, and second values
INTERVAL Composed of a number of integer fields, representing a period of time, depending on the type of interval
ARRAY A set-length and ordered collection of elements
MULTISET A variable-length and unordered collection of elements
XML Stores XML data

SQL Data Type Quick Reference

Different databases offer different choices for the data type definition.

The following table shows some of the common names of data types between the various database platforms:

Data type Access SQLServer Oracle MySQL PostgreSQL
boolean Yes/No Bit Byte N/A Boolean
integer Number (integer) Int Number Int / Integer Int / Integer
float Number (single) Float / Real Number Float Numeric
currency Currency Money N/A N/A Money
string (fixed) N/A Char Char Char Char
string (variable) Text (<256) / Memo (65k+) Varchar Varchar2 Varchar Varchar
binary object OLE Object Memo Binary (fixed up to 8K) Varbinary (<8K) Image (<2GB) Long Raw Blob Text Binary Varbinary

Key Points

  • SQL allows us to select and group subsets of data, do math and other calculations, and combine data.

  • A relational database is made up of tables which are related to each other by shared keys.

  • Different database management systems (DBMS) use slightly different vocabulary, but they are all based on the same ideas.