SQL - Part 2: Advanced Features¶

In this lecture, we will learn more advanced features of SQL.
Examples database to be used in this lecture is given in SQL here:

See example database to be used.

Overview¶

Remember that while SQL is a standard, there are still differences in implementations of it.
- Writing queries that do not rely on specific features results in portable applications.
- However, you cannot deny that some constructs may simplify your queries and performance. So, it is important to decide when to use a specific method to write a query.
Remember: a query is not an algorithm. It is for the most part a logical statement of what you are interested in.
- Often, there are multiple algorithms to implement it.
- Most DBMSs feature state of the art query optimizers (QOPT) that choose the lowest cost algorithm for a given query and database.
- QOPT engines are very sophisticated, often operate better than even expert human judgment.
- So, instead of trying to optimize your queries, you can try to make your queries easy to optimize: simple queries are better.
Once you become sophisticated in a specific DBMS, you may learn specific weaknesses and you can develop strategies to adopt for that. We will discuss some.
Finally, you should still follow some very simple guidelines:
- Do not join with relations unnecessarily.
- Do not sort (order by) or remove duplicates (distinct) unless it is needed.

Outer Join¶

A INNER JOIN B: inner join selects tuples that satisfy a join condition, eliminates all tuples that do not satisfy the join condition. A is called the left operand and B is the right operand of the join operation.
A LEFT OUTER JOIN B returns all tuples in the inner join as well as the tuples in A that do not join with any tuples in in B.
A RIGHT OUTER JOIN B returns all tuples in the inner join as well as the tuples in B that do not join with any tuples in in A.
A FULL OUTER JOIN B returns all tuples in the inner join as well as the tuples from A and B that do not participate in the inner join.
- You can also use terms: JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN

Inner vs. outer join.¶

Given R(A,B) and S(B,C) with the following contents:

A	B
a1	b1
a2	b2

B	C
b1	c1
b3	c3

We get the following results:

SELECT R.A, R.B, S.B, S.C
FROM R JOIN S ON R.B=S.B;

A	B	B	C
a1	b1	b1	c1

SELECT R.A, R.B, S.B, S.C
FROM R LEFT JOIN S ON R.B=S.B;

A	B	B	C
a1	b1	b1	c1
a2	b2	null	null

SELECT R.A, R.B, S.B, S.C
FROM R RIGHT JOIN S ON R.B=S.B;

A	B	B	C
a1	b1	b1	c1
null	null	b3	c3

SELECT R.A, R.B, S.B, S.C
FROM R FULL JOIN S ON R.B=S.B;

A	B	B	C
a1	b1	b1	c1
a2	b2	null	null
null	null	b3	c3

We can use the fact that tuples that do not match have null values for the join.

Outer join examples:¶

Find for each student, total number of classes
```
SELECT
   s.id
   , s.name
   , count(t.student_id) as numclasses
FROM
   students s
   left join transcript t
   on s.id = t.student_id
GROUP BY
   s.id
   , s.name;
```
As count(attr) counts the total number of values, students with no classes will have zero classes.

If we used inner join, we would have elimited all students with zero classes as they would not join with transcript.

Find faculty who did not teach any classes

SELECT
   f.id
   , f.name
FROM
   faculty f
   left join classes c
   on f.id = c.instructor_id
WHERE
   c.instructor_id is null;

Find classes with no students. Return semester, year, section and course name.

SELECT
   co.crsname
   , c.course_id
   , c.semester
   , c.year
   , c.section
FROM
   (classes c
   left join transcript t
   on c.course_id = t.course_id
      and c.semester = t.semester
      and c.year = t.year
      and c.section = t.section),
   courses co
WHERE
   t.course_id is null
   and co.id = c.course_id;

This works because for each class, there is a course tuple. Otherwise, you would need to do a left join with the next table as well.

Anonymous relations¶

A query can be treated like a relation in the from clause

It is treated like a virtual relation:

SELECT
   f.id
   , f.name
   , f2class.numclasses
   , count(*) as numstudents
FROM
   ( SELECT
         instructor_id AS id
         , count(*) as numclasses
     FROM
         classes
     GROUP BY
         instructor_id
     HAVING
         count(DISTINCT year) >= 2
   ) as f2class
   , classes c
   , transcript t
   , faculty f
WHERE
   f2class.id = c.instructor_id
   and t.course_id = c.course_id
   and t.semester = c.semester
   and t.year = c.year
   and t.section = c.section
   and f.id = c.instructor_id
GROUP BY
   f.id
   , f.name
   , f2class.numclasses;

The inner query allows us to find faculty who taught in at least two years and total number of classes they taught. We can then use this information in the main query as if it was a real relation.

This query would be very hard to write without an anonymous relation as we cannot count for different types of things with a single group by.

Be careful: Do not use any anonymous relations to make it simpler to write/read the query.
```
SELECT
   S.d
FROM
   (SELECT a.* FROM R WHERE b>5) as newR
   , S
WHERE
   S.c = newR.c;
```
Anonymous relation is not really necessary here. The same query can be written with a simple join:
```
SELECT S.d FROM R,S WHERE R.b>5 and S.c=R.c;
```
When using an anonymous view, query optimizer may miss certain optimizations, especially in older DBMS.

Scalar Queries¶

Any query that returns a single number with an aggregate function is called a scalar query.

You can use a scalar query as if it was a number.

SELECT count(*) FROM classes WHERE instructor_id = 4;

count
-------
5
(1 row)


SELECT
   instructor_id
   , count(*)
FROM
   classes c
GROUP BY
   c.instructor_id
HAVING
   count(*) >= 5;

Let’s rewrite this query by substituting the above query for number 5 in the having clause.

SELECT
  instructor_id
  , count(*)
FROM
  classes c
GROUP BY
  c.instructor_id
HAVING
  count(*) >= (SELECT count(*) FROM classes WHERE instructor_id = 4);

Comparisons involving sets/bags¶

Many expressions in the WHERE clause (or HAVING) can compare a value against a SET
```
WHERE grade IN ('A', 'B')
WHERE year NOT IN (2015, 2016)
```

Substitute a query for the set: Find all faculty who never taught a class.

SELECT
   f.id
   , f.name
FROM
   faculty f
WHERE
   f.id NOT IN (SELECT instructor_id FROM classes) ;

You can write equivalent queries using EXCEPT and LEFT JOIN.

Set Comparison Operators¶

There are many set comparison operators that can be used in queries. The inner query must return a single column for this to work.

Some useful operations:

value IN (QUERY)
value NOT IN (QUERY)
value > ANY (QUERY)
value >= ALL (QUERY)
value > ALL (QUERY)
value = ANY (QUERY)   --> same as IN
value <> ALL (QUERY)  --> same as NOT IN

You can also write expressions that check whether a query returns any tuples at all:

EXISTS (QUERY) => True if Query returns at least one tuple
NOT EXISTS (QUERY) => True if Query returns no tuples

Examples:

5 IN (1,2,3,4)       FALSE
5 NOT IN (1,2,3,4)   TRUE
2 IN (1,2,3,4)       TRUE
EXISTS (1,2,3,4)     TRUE
NOT EXISTS (1,2,3,4) FALSE
NOT EXISTS ()        TRUE
5 <ALL (1,2,3,4)     FALSE
5 >ALL (1,2,3,4)     TRUE

Example:

SELECT
    *
FROM
    students
WHERE
    EXISTS (SELECT 1
            FROM classes
            WHERE semester='Spring' and year=2016);

This is a kind of stupid query: if there is any class offered in Spring 2016, return all students. Otherwise, return no students.

Since it does not matter what we return in EXISTS/NOT EXISTS conditions (we only care whether a tuple is returned or not), we can return something simple like an integer, instead of a relation column.

Correlated Subqueries¶

More interesting queries involve correlated subqueries.
Find faculty with no classes:
```
SELECT
    f.id
    , f.name
FROM
    faculty f
WHERE
    NOT EXISTS
    ( SELECT 1 FROM classes c WHERE c.instructor_id = f.id ) ;
```
For each faculty tuple f, execute the inner subquery to find all tuples for this faculty. If there is no such tuple, then return tuple f.
- Outer query: faculty f
- Inner query: classes c
Scope of variables:
- The inner query can reference any tuple value in the outer query (from the FROM clause), treating these values as constants for the inner query.
- The outer query cannot access the variables of the inner query.
- If the iner query rewrites an alias from the outer query, then the closest definition is used.

Common mistake when using a nested subquery¶

Rewrite your own alias in the inner subquery
```
SELECT
   f.id
   , f.name
FROM
   faculty f
WHERE
   NOT EXISTS
   ( SELECT
        1
     FROM
        classes c
        , faculty f
     WHERE
        c.instructor_id = f.id ) ;
```
This query will return a very different result than the correct query as inner faculty f overwrites the outer faculty f.

For each faculty tuple, execute the inner query independently. If it returns no tuples, return the faculty tuple. Hence: this will return no faculty tuples.

Examples¶

Find faculty teaching at most 3 courses:

SELECT
   f.id
   , f.name
FROM
   faculty f
WHERE
   3 >= (SELECT count(*) FROM classes c WHERE f.id=c.instructor_id);

Note that while this is a correct query, it is likely more efficient to use a group by statement for the same purpose:

SELECT
   f.id
   , f.name
FROM
   faculty f
   LEFT JOIN classes c
   ON f.id = c.instructor_id
GROUP BY
   f.id
   , f.name
HAVING
   count(c.instructor_id) <= 3;

Be careful: the query would not be correct without a left join. Why?

Scalar queries can also be correlated (though use this as a last resort as well):

SELECT
   s.id
   , s.name
   , (SELECT count(*)
      FROM transcript t
      WHERE t.student_id=s.id AND t.grade='A') as numAs
FROM
   students s ;

Remember: this is not likely an efficient way to write this query.

Examples¶

We will finish section with a few complex queries.

A problem in using the transcript relation is that a student might take a class more than once. However, their grade from the last time they took the class is the one that counts.

When computing credits completed, we need to return a single tuple for each course that corresponds to the last valid grade for that class.

SELECT
   s.id
   , s.name
   , count(t.course_id) as coursescompleted
FROM
   students s
   LEFT JOIN transcript t
   ON s.id = t.student_id
      AND t.grade IS NOT NULL
      AND t.grade <> 'I'
WHERE
   NOT EXISTS
      ( SELECT 1
        FROM
           transcript t2
        WHERE
           t2.student_id = t.student_id
           AND t2.course_id = t.course_id
           AND (t2.year < t.year OR
                (t2.year = t.year
                 AND t2.semester = 'Spring'
                 AND t.semester='Fall'))
       )
GROUP BY
   s.id
   , s.name ;

FOR ALL Queries¶

What is we wanted to find students who have taken a class from all the professors in the database.

This is a complex query: we want to check that the set of professors for a student is equivalent to the set of all professors.

In relational algebra, this query would need two set subtractions.

We can represent this query logically as follows:

Find students who took a class from ALL professors in the database.

Find students s such that
     there does not exist a professor f such that
          s did not take a class from f
          (or there does not exists a tuple in transcript
           for s and f)

SQL query will also require two subqueries:

SELECT
   s.id
   , s.name
FROM
   students s
WHERE
   NOT EXISTS
      (SELECT 1
       FROM faculty f
       WHERE NOT EXISTS
             (SELECT 1
              FROM transcript t, classes c
              WHERE
                  t.student_id = s.id
                  AND c.instructor_id = f.id
                  AND c.course_id = t.course_id
                  AND c.semester = t.semester
                  AND c.year = t.year
                  AND c.section = t.section));

Do we really need this level of complexity? Can we do this using a count?

Return each student if the number of different
faculty they took classes with is equal to the number
of different faculty in the database.

Let’s write this expression:

SELECT
   s.id
   , s.name
FROM
   students s
   , transcript t
   , classes c
WHERE
   s.id = t.student_id
   AND t.course_id = c.course_id
   AND t.semester = c.semester
   AND t.year = c.year
   AND t.section = c.section
GROUP BY
   s.id
   , s.name
HAVING
   count(DISTINCT c.instructor_id) =
          (SELECT count(*) FROM faculty) ;

Not only this query is simpler to write, it is likely much more efficient given it has no correlated subqueries.

Summary¶

Most queries that use IN or EXISTS can be rewritten using simple joins. Joins are much easier to optimize.
Set subtraction usually can be expressed using NOT IN or NOT EXISTS.
Using anonymous relations in the from clause may cause the optimizer to miss some optimizations. Simpler the query, the better it is.

There is a subtle difference on the syntax of the two statements:

Attribute NOT IN (select statement)
NOT EXISTS (select statement)

For all queries usually require two NOT EXISTS.
SQL aggregates and outer joins are powerful constructs for formulating complex queries, even those involving some sort of negation.