SQL - Part 2: Advanced Features

  • In this lecture, we will learn more advanced features of SQL.

  • Examples database to be used in this lecture is given in SQL here:

    See example database to be used.

Overview

  • Remember that while SQL is a standard, there are still differences in implementations of it.
    • Writing queries that do not rely on specific features results in portable applications.
    • However, you cannot deny that some constructs may simplify your queries and performance. So, it is important to decide when to use a specific method to write a query.
  • Remember: a query is not an algorithm. It is for the most part a logical statement of what you are interested in.
    • Often, there are multiple algorithms to implement it.
    • Most DBMSs feature state of the art query optimizers (QOPT) that choose the lowest cost algorithm for a given query and database.
    • QOPT engines are very sophisticated, often operate better than even expert human judgment.
    • So, instead of trying to optimize your queries, you can try to make your queries easy to optimize: simple queries are better.
  • Once you become sophisticated in a specific DBMS, you may learn specific weaknesses and you can develop strategies to adopt for that. We will discuss some.
  • Finally, you should still follow some very simple guidelines:
    • Do not join with relations unnecessarily.
    • Do not sort (order by) or remove duplicates (distinct) unless it is needed.

Outer Join

  • A INNER JOIN B: inner join selects tuples that satisfy a join condition, eliminates all tuples that do not satisfy the join condition. A is called the left operand and B is the right operand of the join operation.
  • A LEFT OUTER JOIN B returns all tuples in the inner join as well as the tuples in A that do not join with any tuples in in B.
  • A RIGHT OUTER JOIN B returns all tuples in the inner join as well as the tuples in B that do not join with any tuples in in A.
  • A FULL OUTER JOIN B returns all tuples in the inner join as well as the tuples from A and B that do not participate in the inner join.
    • You can also use terms: JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN

Inner vs. outer join.

  • Given R(A,B) and S(B,C) with the following contents:
A B
a1 b1
a2 b2
B C
b1 c1
b3 c3
  • We get the following results:

    SELECT R.A, R.B, S.B, S.C
    FROM R JOIN S ON R.B=S.B;
    
    A B B C
    a1 b1 b1 c1
    SELECT R.A, R.B, S.B, S.C
    FROM R LEFT JOIN S ON R.B=S.B;
    
    A B B C
    a1 b1 b1 c1
    a2 b2 null null
    SELECT R.A, R.B, S.B, S.C
    FROM R RIGHT JOIN S ON R.B=S.B;
    
    A B B C
    a1 b1 b1 c1
    null null b3 c3
    SELECT R.A, R.B, S.B, S.C
    FROM R FULL JOIN S ON R.B=S.B;
    
    A B B C
    a1 b1 b1 c1
    a2 b2 null null
    null null b3 c3
  • We can use the fact that tuples that do not match have null values for the join.

Outer join examples:

  • Find for each student, total number of classes

    SELECT
       s.id
       , s.name
       , count(t.student_id) as numclasses
    FROM
       students s
       left join transcript t
       on s.id = t.student_id
    GROUP BY
       s.id
       , s.name;
    

    As count(attr) counts the total number of values, students with no classes will have zero classes.

    If we used inner join, we would have elimited all students with zero classes as they would not join with transcript.

  • Find faculty who did not teach any classes

    SELECT
       f.id
       , f.name
    FROM
       faculty f
       left join classes c
       on f.id = c.instructor_id
    WHERE
       c.instructor_id is null;
    
  • Find classes with no students. Return semester, year, section and course name.

    SELECT
       co.crsname
       , c.course_id
       , c.semester
       , c.year
       , c.section
    FROM
       (classes c
       left join transcript t
       on c.course_id = t.course_id
          and c.semester = t.semester
          and c.year = t.year
          and c.section = t.section),
       courses co
    WHERE
       t.course_id is null
       and co.id = c.course_id;
    

    This works because for each class, there is a course tuple. Otherwise, you would need to do a left join with the next table as well.

Anonymous relations

  • A query can be treated like a relation in the from clause

    It is treated like a virtual relation:

    SELECT
       f.id
       , f.name
       , f2class.numclasses
       , count(*) as numstudents
    FROM
       ( SELECT
             instructor_id AS id
             , count(*) as numclasses
         FROM
             classes
         GROUP BY
             instructor_id
         HAVING
             count(DISTINCT year) >= 2
       ) as f2class
       , classes c
       , transcript t
       , faculty f
    WHERE
       f2class.id = c.instructor_id
       and t.course_id = c.course_id
       and t.semester = c.semester
       and t.year = c.year
       and t.section = c.section
       and f.id = c.instructor_id
    GROUP BY
       f.id
       , f.name
       , f2class.numclasses;
    

    The inner query allows us to find faculty who taught in at least two years and total number of classes they taught. We can then use this information in the main query as if it was a real relation.

    This query would be very hard to write without an anonymous relation as we cannot count for different types of things with a single group by.

  • Be careful: Do not use any anonymous relations to make it simpler to write/read the query.

    SELECT
       S.d
    FROM
       (SELECT a.* FROM R WHERE b>5) as newR
       , S
    WHERE
       S.c = newR.c;
    

    Anonymous relation is not really necessary here. The same query can be written with a simple join:

    SELECT S.d FROM R,S WHERE R.b>5 and S.c=R.c;
    

    When using an anonymous view, query optimizer may miss certain optimizations, especially in older DBMS.

Scalar Queries

  • Any query that returns a single number with an aggregate function is called a scalar query.

    You can use a scalar query as if it was a number.

    SELECT count(*) FROM classes WHERE instructor_id = 4;
    
    count
    -------
    5
    (1 row)
    
    
    SELECT
       instructor_id
       , count(*)
    FROM
       classes c
    GROUP BY
       c.instructor_id
    HAVING
       count(*) >= 5;
    

    Let’s rewrite this query by substituting the above query for number 5 in the having clause.

    SELECT
      instructor_id
      , count(*)
    FROM
      classes c
    GROUP BY
      c.instructor_id
    HAVING
      count(*) >= (SELECT count(*) FROM classes WHERE instructor_id = 4);
    

Comparisons involving sets/bags

  • Many expressions in the WHERE clause (or HAVING) can compare a value against a SET

    WHERE grade IN ('A', 'B')
    WHERE year NOT IN (2015, 2016)
    
  • Substitute a query for the set: Find all faculty who never taught a class.

    SELECT
       f.id
       , f.name
    FROM
       faculty f
    WHERE
       f.id NOT IN (SELECT instructor_id FROM classes) ;
    
  • You can write equivalent queries using EXCEPT and LEFT JOIN.

Set Comparison Operators

  • There are many set comparison operators that can be used in queries. The inner query must return a single column for this to work.

    Some useful operations:

    value IN (QUERY)
    value NOT IN (QUERY)
    value > ANY (QUERY)
    value >= ALL (QUERY)
    value > ALL (QUERY)
    value = ANY (QUERY)   --> same as IN
    value <> ALL (QUERY)  --> same as NOT IN
    
  • You can also write expressions that check whether a query returns any tuples at all:

    EXISTS (QUERY) => True if Query returns at least one tuple
    NOT EXISTS (QUERY) => True if Query returns no tuples
    
  • Examples:

    5 IN (1,2,3,4)       FALSE
    5 NOT IN (1,2,3,4)   TRUE
    2 IN (1,2,3,4)       TRUE
    EXISTS (1,2,3,4)     TRUE
    NOT EXISTS (1,2,3,4) FALSE
    NOT EXISTS ()        TRUE
    5 <ALL (1,2,3,4)     FALSE
    5 >ALL (1,2,3,4)     TRUE
    
  • Example:

    SELECT
        *
    FROM
        students
    WHERE
        EXISTS (SELECT 1
                FROM classes
                WHERE semester='Spring' and year=2016);
    

    This is a kind of stupid query: if there is any class offered in Spring 2016, return all students. Otherwise, return no students.

  • Since it does not matter what we return in EXISTS/NOT EXISTS conditions (we only care whether a tuple is returned or not), we can return something simple like an integer, instead of a relation column.

Correlated Subqueries

  • More interesting queries involve correlated subqueries.

  • Find faculty with no classes:

    SELECT
        f.id
        , f.name
    FROM
        faculty f
    WHERE
        NOT EXISTS
        ( SELECT 1 FROM classes c WHERE c.instructor_id = f.id ) ;
    

    For each faculty tuple f, execute the inner subquery to find all tuples for this faculty. If there is no such tuple, then return tuple f.

    • Outer query: faculty f
    • Inner query: classes c
  • Scope of variables:

    • The inner query can reference any tuple value in the outer query (from the FROM clause), treating these values as constants for the inner query.
    • The outer query cannot access the variables of the inner query.
    • If the iner query rewrites an alias from the outer query, then the closest definition is used.

Common mistake when using a nested subquery

  • Rewrite your own alias in the inner subquery

    SELECT
       f.id
       , f.name
    FROM
       faculty f
    WHERE
       NOT EXISTS
       ( SELECT
            1
         FROM
            classes c
            , faculty f
         WHERE
            c.instructor_id = f.id ) ;
    

    This query will return a very different result than the correct query as inner faculty f overwrites the outer faculty f.

    For each faculty tuple, execute the inner query independently. If it returns no tuples, return the faculty tuple. Hence: this will return no faculty tuples.

Examples

  • Find faculty teaching at most 3 courses:

    SELECT
       f.id
       , f.name
    FROM
       faculty f
    WHERE
       3 >= (SELECT count(*) FROM classes c WHERE f.id=c.instructor_id);
    

    Note that while this is a correct query, it is likely more efficient to use a group by statement for the same purpose:

    SELECT
       f.id
       , f.name
    FROM
       faculty f
       LEFT JOIN classes c
       ON f.id = c.instructor_id
    GROUP BY
       f.id
       , f.name
    HAVING
       count(c.instructor_id) <= 3;
    
Be careful: the query would not be correct without a left join. Why?
  • Scalar queries can also be correlated (though use this as a last resort as well):

    SELECT
       s.id
       , s.name
       , (SELECT count(*)
          FROM transcript t
          WHERE t.student_id=s.id AND t.grade='A') as numAs
    FROM
       students s ;
    

    Remember: this is not likely an efficient way to write this query.

Examples

  • We will finish section with a few complex queries.

    A problem in using the transcript relation is that a student might take a class more than once. However, their grade from the last time they took the class is the one that counts.

    When computing credits completed, we need to return a single tuple for each course that corresponds to the last valid grade for that class.

    SELECT
       s.id
       , s.name
       , count(t.course_id) as coursescompleted
    FROM
       students s
       LEFT JOIN transcript t
       ON s.id = t.student_id
          AND t.grade IS NOT NULL
          AND t.grade <> 'I'
    WHERE
       NOT EXISTS
          ( SELECT 1
            FROM
               transcript t2
            WHERE
               t2.student_id = t.student_id
               AND t2.course_id = t.course_id
               AND (t2.year < t.year OR
                    (t2.year = t.year
                     AND t2.semester = 'Spring'
                     AND t.semester='Fall'))
           )
    GROUP BY
       s.id
       , s.name ;
    

FOR ALL Queries

  • What is we wanted to find students who have taken a class from all the professors in the database.

  • This is a complex query: we want to check that the set of professors for a student is equivalent to the set of all professors.

    In relational algebra, this query would need two set subtractions.

    We can represent this query logically as follows:

    Find students who took a class from ALL professors in the database.
    
    Find students s such that
         there does not exist a professor f such that
              s did not take a class from f
              (or there does not exists a tuple in transcript
               for s and f)
    
  • SQL query will also require two subqueries:

    SELECT
       s.id
       , s.name
    FROM
       students s
    WHERE
       NOT EXISTS
          (SELECT 1
           FROM faculty f
           WHERE NOT EXISTS
                 (SELECT 1
                  FROM transcript t, classes c
                  WHERE
                      t.student_id = s.id
                      AND c.instructor_id = f.id
                      AND c.course_id = t.course_id
                      AND c.semester = t.semester
                      AND c.year = t.year
                      AND c.section = t.section));
    
  • Do we really need this level of complexity? Can we do this using a count?

    Return each student if the number of different
    faculty they took classes with is equal to the number
    of different faculty in the database.
    

    Let’s write this expression:

    SELECT
       s.id
       , s.name
    FROM
       students s
       , transcript t
       , classes c
    WHERE
       s.id = t.student_id
       AND t.course_id = c.course_id
       AND t.semester = c.semester
       AND t.year = c.year
       AND t.section = c.section
    GROUP BY
       s.id
       , s.name
    HAVING
       count(DISTINCT c.instructor_id) =
              (SELECT count(*) FROM faculty) ;
    
  • Not only this query is simpler to write, it is likely much more efficient given it has no correlated subqueries.

WITH Statement (newer form of anonymous relations)

  • Postgresql implements the WITH statement, part of SQL standard. In its simplest form, WITH acts like anonymous relations. But in reality it can do a lot more.

  • The following is the identical query from above written using WITH clause:

    WITH f2class AS (
         SELECT
             instructor_id AS id
             , count(*) as numclasses
         FROM
             classes
         GROUP BY
             instructor_id
         HAVING
             count(DISTINCT year) >= 2
         )
    SELECT
       f.id
       , f.name
       , f2class.numclasses
       , count(*) as numstudents
    FROM
       classes c
       , transcript t
       , faculty f
       , f2class
    WHERE
       f2class.id = c.instructor_id
       and t.course_id = c.course_id
       and t.semester = c.semester
       and t.year = c.year
       and t.section = c.section
       and f.id = c.instructor_id
    GROUP BY
       f.id
       , f.name
       , f2class.numclasses;
    
  • However, anonymous relations can only be used in FROM while relations generated using WITH can be used in any SQL statement.

    WITH fsummary AS (
         SELECT
             student_id
             , count(*) as numcourses
         FROM
             transcript t
         WHERE
             grade <> 'I'
         GROUP BY
             student_id
         ),
         advanced_students AS (SELECT
              s.id
              , s.name
              , major
         FROM
              fsummary f
              , students s
         WHERE
              s.id = f.student_id
              AND f.numcourses > 2
         )
     SELECT DISTINCT
         M.name
     FROM
         majors M
     WHERE M.abrv in (select major from advanced_students);
    
  • Note that you are doing two things above that you cannot do in anonymous relations: advanced_students referring to the relation fsummary just defined above and the query refers to the defined relation advanced_relations in the WHERE statement, while this cannot be done in anonymous relations.

  • While WITH statement is quite powerful as a construct, be very careful to use it only if is helps you write a query that is cumbersome to write using regular SQL. The above query can be easily written with a simple block of SQL:

    SELECT DISTINCT
        m.name
    FROM
        transcript t
        , students s
        , majors m
    WHERE
        s.id = t.student_id
        AND t.grade <> 'I'
        AND m.abrv = s.major
    GROUP BY
        s.id
        , m.name
    HAVING count(*) > 2;
    
  • We will reexamine WITH when we look at advanced SQL features.

Summary

  • Most queries that use IN or EXISTS can be rewritten using simple joins. Joins are much easier to optimize.

  • Set subtraction usually can be expressed using NOT IN or NOT EXISTS.

  • Using anonymous relations in the from clause may cause the optimizer to miss some optimizations. Simpler the query, the better it is.

  • There is a subtle difference on the syntax of the two statements:

    Attribute NOT IN (select statement)
    NOT EXISTS (select statement)
    
  • For all queries usually require two NOT EXISTS.

  • SQL aggregates and outer joins are powerful constructs for formulating complex queries, even those involving some sort of negation.