SQL Aggregation and aliases
Overview
Questions
How can I summarize my data by aggregating, filtering, or ordering query results?
How can I make sure column names from my queries make sense and aren’t too long?
Objectives
Apply aggregation to group records in SQL.
Filter and order results of a query based on aggregate functions.
Employ aliases to assign new names to items in a query.
COUNT and GROUP BY
Aggregation allows us to combine results by grouping records based on value, also it is useful for calculating combined values in groups.
Let’s go to the surveys table and find out how many individuals there are. Using the wildcard * simply counts the number of records (rows):
SELECT COUNT(*)
FROM surveys;
There are many other aggregate functions included in SQL including SUM
, MAX
, MIN
, and AVG
. We can find out the average of weight for all of those individuals by using the function AVG
.
SELECT COUNT(*), AVG(weight)
FROM surveys;
Something a little more useful might be finding the number and average weight of a particular species in our survey, like DM.
SELECT COUNT(*), AVG(weight)
FROM surveys
WHERE species_id="DM";
Challenge
Write a query that returns: the total weight, average weight, minimum and maximum weights for all animals caught over the duration of the survey. Can you modify it so that it outputs these values only for weights between 5 and 10?
Solution
-- All animals SELECT SUM(weight), AVG(weight), MIN(weight), MAX(weight) FROM surveys; -- Only weights between 5 and 10 SELECT SUM(weight), AVG(weight), MIN(weight), MAX(weight) FROM surveys WHERE (weight > 5) AND (weight < 10);
Now, let’s see how many individuals were counted in each species. We do this
using a GROUP BY
clause
SELECT species_id, COUNT(*)
FROM surveys
GROUP BY species_id;
GROUP BY
tells SQL what field or fields we want to use to aggregate the data.
If we want to group by multiple fields, we give GROUP BY
a comma separated list.
Challenge
- Identify how many animals were counted in each year in total
- Identify how many animals were counted each year per species
- Determine the average weight of each species in each year
Now try to combine the queries in 2 & 3 to list how many and the average weight for each species in each year.
Solution of 1
SELECT year, COUNT(*) FROM surveys GROUP BY year;
Solution of 2
SELECT year, species_id, COUNT(species_id) FROM surveys GROUP BY year, species_id;
Solution of 3
SELECT year, species_id, AVG(weight) FROM surveys GROUP BY year, species_id;
Solution of 2 and 3
SELECT year, species_id, COUNT(*), AVG(weight) FROM surveys GROUP BY year, species_id;
Ordering Aggregated Results
We can order the results of our aggregation by a specific column, including the aggregated column. Let’s count the number of individuals of each species captured, ordered by the count, then by species_id:
SELECT species_id, COUNT(*)
FROM surveys
GROUP BY species_id
ORDER BY COUNT(*), species_id;
Aliases
As queries get more complex names can get long and unwieldy. To help make things clearer we can use aliases to assign new names to things in the query.
We can use aliases in column names or table names using AS
:
SELECT MAX(year) AS last_surveyed_year
FROM surveys;
The AS
isn’t technically required, so you could do
SELECT MAX(year) last_surveyed_year
FROM surveys;
but using AS
is much clearer so it is good style to include it.
The HAVING
keyword
In the previous episode, we have seen the keyword WHERE
, allowing to
filter the results according to some criteria. SQL offers a mechanism to
filter the results based on aggregate functions, through the HAVING
keyword.
For example, we can request to only return information about species with a count higher than 10:
SELECT species_id, COUNT(species_id)
FROM surveys
GROUP BY species_id
HAVING COUNT(species_id) > 10;
The HAVING
keyword works exactly like the WHERE
keyword, but uses
aggregate functions instead of database fields to filter.
If you use AS
in your query to rename a column, HAVING
you can use this
information to make the query more readable. For example, in the above
query, we can call the COUNT(species_id)
by another name, like
occurrences
. This can be written this way:
SELECT species_id, COUNT(species_id) AS occurrences
FROM surveys
GROUP BY species_id
HAVING occurrences > 10;
Note that in both queries, HAVING
comes after GROUP BY
. One way to
think about this is: the data are retrieved (SELECT
), which can be filtered
(WHERE
), then joined in groups (GROUP BY
); finally, we can filter again based on some
of these groups (HAVING
).
Challenge
Write a query that returns, from the
species
table, the number ofspecies
in eachtaxa
, only for thetaxa
with more than 10species
.Solution
SELECT taxa, COUNT(*) AS n FROM species GROUP BY taxa HAVING n > 10;
Key Points
Use the
GROUP BY
keyword to aggregate data.Functions like
MIN
,MAX
,AVERAGE
,SUM
,COUNT
, etc. operate on aggregated data.Aliases can help shorten long queries. To write clear and readible queries, use the
AS
keyword when creating aliases.Use the
HAVING
keyword to filter on aggregate properties.