Joins
Overview
Teaching: 15 min
Exercises: 10 minQuestions
How do I bring data together from separate tables?
Objectives
Employ joins to combine data from two tables.
Apply functions to manipulate individual values.
Employ aliases to assign new names to tables and columns in a query.
Joins
To combine data from two tables we use the SQL JOIN
command, which comes after
the FROM
command.
The JOIN
command on its own will result in a cross product, where each row in
the first table is paired with each row in the second table. Usually this is not
what is desired when combining two tables with data that is related in some way.
For that, we need to tell the computer which columns provide the link between the two
tables using the word ON
. What we want is to join the data with the same
species id.
SELECT *
FROM surveys
JOIN species
ON surveys.species_id = species.species_id;
ON
is like WHERE
, it filters things out according to a test condition. We use
the table.colname
format to tell the manager what column in which table we are
referring to.
The output of the JOIN
command will have columns from the first table plus the
columns from the second table. For the above command, the output will be a table
that has the following column names:
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | species_id | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
… | |||||||||||||
96 | 8 | 20 | 1997 | 12 | DM | M | 36 | 41 | DM | Dipodomys | merriami | Rodent | |
… |
Alternatively, we can use the word USING
, as a short-hand. USING
only
works on columns which share the same name. In this case we are
telling the manager that we want to combine surveys
with species
and that
the common column is species_id
.
SELECT *
FROM surveys
JOIN species
USING (species_id);
The output will only have one species_id column
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
… | ||||||||||||
96 | 8 | 20 | 1997 | 12 | DM | M | 36 | 41 | Dipodomys | merriami | Rodent | |
… |
We often won’t want all of the fields from both tables, so anywhere we would
have used a field name in a non-join query, we can use table.colname
.
For example, what if we wanted information on when individuals of each species were captured, but instead of their species ID we wanted their actual species names.
SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
FROM surveys
JOIN species
ON surveys.species_id = species.species_id;
year | month | day | genus | species | |
---|---|---|---|---|---|
… | |||||
1977 | 7 | 16 | Neotoma | albigula | |
1977 | 7 | 16 | Dipodomys | merriami | |
… |
Many databases, including SQLite, also support a join through the WHERE
clause of a query.
For example, you may see the query above written without an explicit JOIN.
SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
FROM surveys, species
WHERE surveys.species_id = species.species_id;
For the remainder of this lesson, we’ll stick with the explicit use of the JOIN
keyword for
joining tables in SQL.
Challenge:
- Write a query that returns the genus, the species name, and the weight of every individual captured at the site
Solution
SELECT species.genus, species.species, surveys.weight FROM surveys JOIN species ON surveys.species_id = species.species_id;
Different join types
We can count the number of records returned by our original join query.
SELECT COUNT(*)
FROM surveys
JOIN species
USING (species_id);
Notice that this number is smaller than the number of records present in the survey data.
SELECT COUNT(*) FROM surveys;
This is because, by default, SQL only returns records where the joining value
is present in the joined columns of both tables (i.e. it takes the intersection
of the two join columns). This joining behaviour is known as an INNER JOIN
.
In fact the JOIN
command is simply shorthand for INNER JOIN
and the two
terms can be used interchangably as they will produce the same result.
We can also tell the computer that we wish to keep all the records in the first
table by using the command LEFT OUTER JOIN
, or LEFT JOIN
for short.
Challenge:
- Re-write the original query to keep all the entries present in the
surveys
table. How many records are returned by this query?Solution
SELECT * FROM surveys LEFT JOIN species USING (species_id);
Challenge:
- Count the number of records in the
surveys
table that have aNULL
value in thespecies_id
column.Solution
SELECT COUNT(*) FROM surveys WHERE species_id IS NULL;
Remember: In SQL a NULL
value in one table can never be joined to a NULL
value in a
second table because NULL
is not equal to anything, not even itself.
Combining joins with sorting and aggregation
Joins can be combined with sorting, filtering, and aggregation. So, if we wanted average mass of the individuals on each different type of treatment, we could do something like
SELECT plots.plot_type, AVG(surveys.weight)
FROM surveys
JOIN plots
ON surveys.plot_id = plots.plot_id
GROUP BY plots.plot_type;
Challenge:
- Write a query that returns the number of animals caught of each genus in each plot. Order the results by plot number (ascending) and by descending number of individuals in each plot.
Solution
SELECT surveys.plot_id, species.genus, COUNT(*) AS number_indiv FROM surveys JOIN species ON surveys.species_id = species.species_id GROUP BY species.genus, surveys.plot_id ORDER BY surveys.plot_id ASC, number_indiv DESC;
Challenge:
- Write a query that finds the average weight of each rodent species (i.e., only include species with Rodent in the taxa field).
Solution
SELECT surveys.species_id, AVG(surveys.weight) FROM surveys JOIN species ON surveys.species_id = species.species_id WHERE species.taxa = 'Rodent' GROUP BY surveys.species_id;
Functions IFNULL
and NULLIF
and more
SQL includes numerous functions for manipulating data. You’ve already seen some
of these being used for aggregation (SUM
and COUNT
) but there are functions
that operate on individual values as well. Probably the most important of these
are IFNULL
and NULLIF
. IFNULL
allows us to specify a value to use in
place of NULL
.
We can represent unknown sexes with 'U'
instead of NULL
:
SELECT species_id, sex, IFNULL(sex, 'U')
FROM surveys;
The lone “sex” column is only included in the query above to illustrate where
IFNULL
has changed values; this isn’t a usage requirement.
Challenge:
- Write a query that returns 30 instead of
NULL
for values in thehindfoot_length
column.Solution
SELECT hindfoot_length, IFNULL(hindfoot_length, 30) FROM surveys;
Challenge:
- Write a query that calculates the average hind-foot length of each species, assuming that unknown lengths are 30 (as above).
Solution
SELECT species_id, AVG(IFNULL(hindfoot_length,30)) FROM surveys GROUP BY species_id;
IFNULL
can be particularly useful in JOIN
. When joining the species
and
surveys
tables earlier, some results were excluded because the species_id
was NULL
in the surveys table. We can use IFNULL
to include them again, re-writing the NULL
to
a valid joining value:
SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
FROM surveys
JOIN species
ON IFNULL(surveys.species_id,'AB') = species.species_id;
Challenge:
- Write a query that returns the number of animals caught of each genus in each plot, using
IFNULL
to assume that unknown species are all of the genus “Rodent”.Solution
SELECT plot_id, IFNULL(genus, 'Rodent') AS genus2, COUNT(*) FROM surveys LEFT JOIN species ON surveys.species_id=species.species_id GROUP BY plot_id, genus2;
The inverse of IFNULL
is NULLIF
. This returns NULL
if the first argument
is equal to the second argument. If the two are not equal, the first argument
is returned. This is useful for “nulling out” specific values.
We can “null out” plot 7:
SELECT species_id, plot_id, NULLIF(plot_id, 7)
FROM surveys;
Some more functions which are common to SQL databases are listed in the table below:
Function | Description |
---|---|
ABS(n) |
Returns the absolute (positive) value of the numeric expression n |
LENGTH(s) |
Returns the length of the string expression s |
LOWER(s) |
Returns the string expression s converted to lowercase |
NULLIF(x, y) |
Returns NULL if x is equal to y, otherwise returns x |
ROUND(n) or ROUND(n, x) |
Returns the numeric expression n rounded to x digits after the decimal point (0 by default) |
TRIM(s) |
Returns the string expression s without leading and trailing whitespace characters |
UPPER(s) |
Returns the string expression s converted to uppercase |
Finally, some useful functions which are particular to SQLite are listed in the table below:
Function | Description |
---|---|
IFNULL(x, y) |
Returns x if it is non-NULL, otherwise returns y |
RANDOM() |
Returns a random integer between -9223372036854775808 and +9223372036854775807. |
REPLACE(s, f, r) |
Returns the string expression s in which every occurrence of f has been replaced with r |
SUBSTR(s, x, y) or SUBSTR(s, x) |
Returns the portion of the string expression s starting at the character position x (leftmost position is 1), y characters long (or to the end of s if y is omitted) |
Challenge:
Write a query that returns genus names (no repeats), sorted from longest genus name down to shortest.
Solution
SELECT DISTINCT genus FROM species ORDER BY LENGTH(genus) DESC;
As we saw before, aliases make things clearer, and are especially useful when joining tables.
SELECT surv.year AS yr, surv.month AS mo, surv.day AS day, sp.genus AS gen, sp.species AS sp
FROM surveys AS surv
JOIN species AS sp
ON surv.species_id = sp.species_id;
To practice we have some optional challenges for you.
Challenge (optional):
SQL queries help us ask specific questions which we want to answer about our data. The real skill with SQL is to know how to translate our scientific questions into a sensible SQL query (and subsequently visualize and interpret our results).
Have a look at the following questions; these questions are written in plain English. Can you translate them to SQL queries and give a suitable answer?
How many plots from each type are there?
How many specimens are of each sex are there for each year?
How many specimens of each species were captured in each type of plot?
What is the average weight of each taxa?
What are the minimum, maximum and average weight for each species of Rodent?
What is the average hindfoot length for male and female rodent of each species? Is there a Male / Female difference?
What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?
Proposed solutions:
- Solution:
SELECT plot_type, COUNT(*) AS num_plots FROM plots GROUP BY plot_type;
- Solution:
SELECT year, sex, COUNT(*) AS num_animal FROM surveys WHERE sex IS NOT NULL GROUP BY sex, year;
- Solution:
SELECT species_id, plot_type, COUNT(*) FROM surveys JOIN plots USING(plot_id) WHERE species_id IS NOT NULL GROUP BY species_id, plot_type;
- Solution:
SELECT taxa, AVG(weight) FROM surveys JOIN species ON species.species_id = surveys.species_id GROUP BY taxa;
- Solution:
SELECT surveys.species_id, MIN(weight), MAX(weight), AVG(weight) FROM surveys JOIN species ON surveys.species_id = species.species_id WHERE taxa = 'Rodent' GROUP BY surveys.species_id;
- Solution:
SELECT surveys.species_id, sex, AVG(hindfoot_length) FROM surveys JOIN species ON surveys.species_id = species.species_id WHERE (taxa = 'Rodent') AND (sex IS NOT NULL) GROUP BY surveys.species_id, sex;
- Solution:
SELECT surveys.species_id, year, AVG(weight) as mean_weight FROM surveys JOIN species ON surveys.species_id = species.species_id WHERE taxa = 'Rodent' GROUP BY surveys.species_id, year;
Key Points
Use the
JOIN
command to combine data from two tables—theON
orUSING
keywords specify which columns link the tables.Regular
JOIN
returns only matching rows. Other join commands provide different behavior, e.g.,LEFT JOIN
retains all rows of the table on the left side of the command.
IFNULL
allows you to specify a value to use in place ofNULL
, which can help in joins
NULLIF
can be used to replace certain values withNULL
in resultsMany other functions like
IFNULL
andNULLIF
can operate on individual values.