Lecture 13 - Advanced SQL#

Announcements#

  • Homework 3 to be posted later today or by tomorrow the latest, due next week on thursday

  • Homework 4 to be released tomorrow

  • Lecture Exercise 12 to be posted at 2pm today

Today’s lecture#

  • Review of basic SQL

  • Advanced SQL

load_ext sql
%config SqlMagic.displaylimit = None
%sql --section baking
displaylimit: Value None will be treated as 0 (no limit)
Connecting to 'baking'

SELECT … FROM WHERE GROUP BY HAVING …
UNION/EXCEPT/INTERSECT
SELECT … FROM WHERE GROUP BY HAVING …

%%sql

select 
    b.baker
    , r.episodeid
    , r.result
from
    bakers b join
    results r on b.baker = r.baker
    and r.result = 'star baker'
order by
    r.episodeid;
Running query in 'baking'
9 rows affected.
baker episodeid result
Manon 1 star baker
Rahul 2 star baker
Rahul 3 star baker
Dan 4 star baker
Kim-Joy 5 star baker
Briony 6 star baker
Kim-Joy 7 star baker
Ruby 8 star baker
Ruby 9 star baker
%%sql

select 
    b.baker
    , count(*) as numepisodes
from
    bakers b join
    results r on b.baker = r.baker
    and r.result = 'star baker'
group by 
    b.baker 
order by
    numepisodes desc
    ;
    
Running query in 'baking'
6 rows affected.
baker numepisodes
Kim-Joy 2
Rahul 2
Ruby 2
Briony 1
Dan 1
Manon 1
%%sql

select 
    b.baker
    , r.baker
    , r.episodeid
from
    bakers b left join
    results r on b.baker = r.baker
    and r.result = 'star baker'
order by
    b.baker asc;
Running query in 'baking'
15 rows affected.
baker baker_1 episodeid
Antony None None
Briony Briony 6
Dan Dan 4
Imelda None None
Jon None None
Karen None None
Kim-Joy Kim-Joy 5
Kim-Joy Kim-Joy 7
Luke None None
Manon Manon 1
Rahul Rahul 3
Rahul Rahul 2
Ruby Ruby 8
Ruby Ruby 9
Terry None None
%%sql

-- for each baker, return the number of times they won star baker

select 
    b.baker
    , count(r.episodeid) as numepisodes
from
    bakers b left join
    results r on b.baker = r.baker
    and r.result = 'star baker'
group by
    b.baker
order by
    b.baker
    ;
Running query in 'baking'
12 rows affected.
baker numepisodes
Antony 0
Briony 1
Dan 1
Imelda 0
Jon 0
Karen 0
Kim-Joy 2
Luke 0
Manon 1
Rahul 2
Ruby 2
Terry 0
%%sql

-- for each baker, return the number of times they won star baker
-- the number of times they were eliminated

select 
    b.baker
    , r.episodeid as ep1, r.result as result1
    , r2.episodeid as ep2, r2.result as result2
from
    bakers b left join results r on b.baker = r.baker
           and r.result = 'star baker'
       left join results r2 on b.baker = r2.baker
           and r2.result = 'eliminated'
order by
    b.baker
    ;
Running query in 'baking'
15 rows affected.
baker ep1 result1 ep2 result2
Antony None None 3 eliminated
Briony 6 star baker 9 eliminated
Dan 4 star baker 6 eliminated
Imelda None None 1 eliminated
Jon None None 7 eliminated
Karen None None 5 eliminated
Kim-Joy 5 star baker None None
Kim-Joy 7 star baker None None
Luke None None 2 eliminated
Manon 1 star baker 8 eliminated
Rahul 3 star baker None None
Rahul 2 star baker None None
Ruby 8 star baker None None
Ruby 9 star baker None None
Terry None None 5 eliminated
%%sql

-- for each baker, return the number of times they won star baker
-- the number of times they were eliminated

select 
    b.baker
    , count(distinct r.episodeid) as numwins
    , count(distinct r2.episodeid) as numeliminated
from
    bakers b left join results r on b.baker = r.baker
           and r.result = 'star baker'
       left join results r2 on b.baker = r2.baker
           and r2.result = 'eliminated'
group by
    b.baker
order by
    b.baker asc
    ;
Running query in 'baking'
12 rows affected.
baker numwins numeliminated
Antony 0 1
Briony 1 1
Dan 1 1
Imelda 0 1
Jon 0 1
Karen 0 1
Kim-Joy 2 0
Luke 0 1
Manon 1 1
Rahul 2 0
Ruby 2 0
Terry 0 1
%%sql

-- people who were never eliminated

select
   b.baker
from
   bakers b left join results r 
       on b.baker = r.baker and r.result='eliminated'
where 
   r.baker is null ;
Running query in 'baking'
3 rows affected.
baker
Kim-Joy
Ruby
Rahul

Scalar queries#

  • Queries that return a single value can be treated as a number in any part of a query

%%sql

select avg(viewers7day)
from episodes ;
Running query in 'baking'
1 rows affected.
avg
9.369
%%sql

-- scalar query
select avg(age)
from bakers ;
Running query in 'baking'
1 rows affected.
avg
36.4166666666666667
%%sql

-- not scalar query
select avg(age), min(age), max(age)
from bakers ;
Running query in 'baking'
1 rows affected.
avg min max
36.4166666666666667 26 60
%%sql

-- find bakers who are older than average age

select *
from bakers
where age > (select avg(b2.age) from bakers b2);
Running query in 'baking'
3 rows affected.
baker fullname age occupation hometown
Jon Jon Jenkins 47 Blood courier Newport
Karen Karen Wright 60 In-store sampling assistant Wakefield
Terry Terry Hartill 56 Retired air steward West Midlands
%%sql

-- for each bakers, return how old they are and how much older
-- they are from average age

select baker
       , age
       , (select avg(age)::numeric(5,2) from bakers) as avgage
       , greatest(0, age - (select avg(b.age)::numeric(5,2) from bakers b))
from bakers
order by baker;
Running query in 'baking'
12 rows affected.
baker age avgage greatest
Antony 30 36.42 0
Briony 33 36.42 0
Dan 36 36.42 0
Imelda 33 36.42 0
Jon 47 36.42 10.58
Karen 60 36.42 23.58
Kim-Joy 27 36.42 0
Luke 30 36.42 0
Manon 26 36.42 0
Rahul 30 36.42 0
Ruby 29 36.42 0
Terry 56 36.42 19.58

Queries that return a single value, but a set of tuples#

  • We can treat them as set of values and do set queries

value IN (SET – results of a query)
value NOT IN (SET – results of a query)
value >= ALL (SET – results of a query)

value < ALL (SET – results of a query)
value >= ANY (SET – results of a query)
value = ANY (SET – results of a query)
value <> ANY (SET – results of a query)

  • Also you can check if the set is empty or non-empty

EXISTS (SET – results of a query)

– true if there are tuples returned by the query

NOT EXISTS (SET – results of a query)

– true if there are no tuples returned by the inner query

%%sql 

select *
from bakers
where age > (select age from bakers where baker ='Rahul');
Running query in 'baking'
6 rows affected.
baker fullname age occupation hometown
Briony Briony Williams 33 Full-time parent Bristol
Dan Dan Beasley-Harling 36 Full-time parent London
Imelda Imelda McCarron 33 Countryside recreation officer County Tyrone
Jon Jon Jenkins 47 Blood courier Newport
Karen Karen Wright 60 In-store sampling assistant Wakefield
Terry Terry Hartill 56 Retired air steward West Midlands
%%sql 

select age from bakers where hometown='London';
Running query in 'baking'
4 rows affected.
age
30
36
26
29
%%sql 

select baker, age , hometown
from bakers 
where age > ALL 
    (select age from bakers where hometown='London');
Running query in 'baking'
3 rows affected.
baker age hometown
Jon 47 Newport
Karen 60 Wakefield
Terry 56 West Midlands
%%sql 

select baker, age , hometown
from bakers 
where age > 
    (select max(age) from bakers where hometown='London');
Running query in 'baking'
3 rows affected.
baker age hometown
Jon 47 Newport
Karen 60 Wakefield
Terry 56 West Midlands
%%sql
-- find bakers who were never eliminated

select * from bakers
-- where baker not in (--people who are eliminated)
where baker not in (select baker from results where result='eliminated');
Running query in 'baking'
3 rows affected.
baker fullname age occupation hometown
Kim-Joy Kim-Joy Hewlett 27 Mental health specialist Leeds
Rahul Rahul Mandal 30 Research scientist Rotherham
Ruby Ruby Bhogal 29 Project manager London
%%sql
-- find bakers who were never star baker

select * from bakers
-- where baker not in (--people who are eliminated)
where baker not in 
     (select baker from results where result='star baker');
Running query in 'baking'
6 rows affected.
baker fullname age occupation hometown
Antony Antony Amourdoux 30 Banker London
Imelda Imelda McCarron 33 Countryside recreation officer County Tyrone
Jon Jon Jenkins 47 Blood courier Newport
Karen Karen Wright 60 In-store sampling assistant Wakefield
Luke Luke Thompson 30 Civil servant/house and techno DJ Sheffield
Terry Terry Hartill 56 Retired air steward West Midlands
%%sql
-- find bakers who were never eliminated


select baker, '1' from bakers
where baker not in (select baker from results where result='eliminated');

select
   b.baker, '2'
from
   bakers b left join results r 
       on b.baker = r.baker and r.result='eliminated'
where 
   r.baker is null ;
    
select baker, '3'
from bakers
except 
select baker, '3'
from results
where result = 'eliminated';
Running query in 'baking'
3 rows affected.
3 rows affected.
3 rows affected.
baker ?column?
Rahul 3
Ruby 3
Kim-Joy 3
%%sql

explain select baker, '1' from bakers
where baker not in (select baker from results where result='eliminated');
Running query in 'baking'
5 rows affected.
QUERY PLAN
Seq Scan on bakers (cost=13.25..24.63 rows=55 width=70)
  Filter: (NOT (ANY ((baker)::text = ((hashed SubPlan 1).col1)::text)))
  SubPlan 1
    -> Seq Scan on results (cost=0.00..13.25 rows=1 width=218)
          Filter: ((result)::text = 'eliminated'::text)
%%sql

-- For each baker, return the episodes in which 
-- they participated in but were not a favorite

select
    t.baker, t.episodeid
from    
    technicals  t
where
    not exists (
          select * 
          from favorites f
          where f.baker = t.baker and f.episodeid = t.episodeid);
Running query in 'baking'
63 rows affected.
baker episodeid
Kim-Joy 2
Manon 7
Terry 5
Jon 1
Ruby 5
Briony 3
Karen 2
Kim-Joy 10
Briony 7
Rahul 8
Antony 3
Briony 9
Manon 3
Briony 4
Jon 7
Antony 1
Manon 1
Dan 1
Jon 3
Manon 4
Dan 4
Luke 1
Manon 6
Dan 6
Rahul 1
Briony 6
Terry 2
Kim-Joy 5
Rahul 9
Ruby 10
Karen 5
Jon 6
Ruby 2
Rahul 3
Manon 8
Kim-Joy 6
Ruby 1
Jon 5
Terry 1
Ruby 4
Ruby 9
Manon 5
Dan 5
Rahul 10
Ruby 7
Rahul 2
Ruby 3
Briony 5
Terry 3
Kim-Joy 8
Kim-Joy 7
Antony 2
Manon 2
Briony 2
Karen 3
Ruby 8
Luke 2
Kim-Joy 3
Karen 1
Kim-Joy 1
Imelda 1
Kim-Joy 4
Karen 4
%%sql

-- For each baker, return the episodes in which 
-- they participated in but were not a favorite

select
    baker, episodeid
from    
    technicals 
where
    (baker, episodeid) not in
          (select baker, episodeid from favorites);
Running query in 'baking'
63 rows affected.
baker episodeid
Antony 1
Dan 1
Imelda 1
Jon 1
Karen 1
Kim-Joy 1
Luke 1
Manon 1
Rahul 1
Ruby 1
Terry 1
Antony 2
Briony 2
Karen 2
Kim-Joy 2
Luke 2
Manon 2
Rahul 2
Ruby 2
Terry 2
Antony 3
Briony 3
Jon 3
Karen 3
Kim-Joy 3
Manon 3
Rahul 3
Ruby 3
Terry 3
Briony 4
Dan 4
Karen 4
Kim-Joy 4
Manon 4
Ruby 4
Briony 5
Dan 5
Jon 5
Karen 5
Kim-Joy 5
Manon 5
Ruby 5
Terry 5
Briony 6
Dan 6
Jon 6
Kim-Joy 6
Manon 6
Briony 7
Jon 7
Kim-Joy 7
Manon 7
Ruby 7
Kim-Joy 8
Manon 8
Rahul 8
Ruby 8
Briony 9
Rahul 9
Ruby 9
Kim-Joy 10
Rahul 10
Ruby 10
%%sql

select b.baker
from bakers b
where not exists (select * from results r where r.baker = b.baker
                  and r.result = 'eliminated');
Running query in 'baking'
3 rows affected.
baker
Kim-Joy
Ruby
Rahul
%%sql
-- Find bakers who competed in every episode!

-- Find bakers, for whom there does not exists an episode
-- for the show, that they did not compete in
-- (there does not exist a tuple in technicals for this episode
-- and this baker)

select
    b.baker
from
    bakers b
where
    not exists (
        select * from episodes e
        where not exists (
             select * from technicals t
            where t.episodeid = e.id and t.baker = b.baker
        )
    )    ;    
Running query in 'baking'
3 rows affected.
baker
Kim-Joy
Rahul
Ruby
%%sql

explain select
    b.baker
from
    bakers b
where
    not exists (
        select * from episodes e
        where not exists (
             select * from technicals t
            where t.episodeid = e.id and t.baker = b.baker
        )
    )    ;    
Running query in 'baking'
7 rows affected.
QUERY PLAN
Nested Loop Anti Join (cost=0.00..9659.32 rows=55 width=38)
  Join Filter: (NOT (ANY ((e.id = (hashed SubPlan 2).col1) AND ((b.baker)::text = (hashed SubPlan 2).col2))))
  -> Seq Scan on bakers b (cost=0.00..11.10 rows=110 width=38)
  -> Materialize (cost=0.00..11.20 rows=80 width=4)
        -> Seq Scan on episodes e (cost=0.00..10.80 rows=80 width=4)
  SubPlan 2
    -> Seq Scan on technicals t (cost=0.00..1.75 rows=75 width=36)
%%sql
select 
    baker
from 
    technicals
group by 
    baker
having 
    count(*) = (select count(*) from episodes);
Running query in 'baking'
3 rows affected.
baker
Rahul
Ruby
Kim-Joy
%%sql

explain 
select 
    baker
from 
    technicals
group by 
    baker
having 
    count(*) = (select count(*) from episodes);
Running query in 'baking'
7 rows affected.
QUERY PLAN
HashAggregate (cost=13.13..13.29 rows=1 width=5)
  Group Key: technicals.baker
  Filter: (count(*) = (InitPlan 1).col1)
  InitPlan 1
    -> Aggregate (cost=11.00..11.01 rows=1 width=8)
          -> Seq Scan on episodes (cost=0.00..10.80 rows=80 width=0)
  -> Seq Scan on technicals (cost=0.00..1.75 rows=75 width=5)

Final notes:

  • Learn to write queries in many different ways

  • Using subqueries don’t often save cost, but may be necessary for some queries.

  • Using subqueries may not significant increase query cost as long as the subquery is uncorrelated, i.e. runs independent of the outer query. In this case, the query can be executed once and results can be saved/cached.

  • Correlated subqueries may significantly increase cost, depending on the complexity.

  • You can see the cost of a query in Postgres using the explain keyword.

  • If you are using IN or EXISTS with a subquery, often that query can be rewritten without a subquery involving a join (though not always, especially if you are counting things!).

  • Don’t forget, count and left join are very powerful constructs and can help you solve many queries.