# Lecture 15 — Sets

## Overview

• Example: finding all individuals listed in the Internet Movie Database (IMDB)
• A solution based on lists
• Sets and set operations
• A solution based on sets
• Efficiency and set representation

Reading is Section 11.1 of Practical Programming.

## Finding All Persons in the IMDB file

• We are given a file extracted from the Internet Movie Database (IMDB) called imdb_data.txt containing, on each line, a person’s name, a movie name, and a year. For example,

Kishiro, Yukito   | Battle Angel    | 2016

• Goal:

• Find all persons named in the file
• Count the number of different persons named.
• Ask if a particular person is named in the file
• The challenge in doing this is that many names appear multiple times.

• First solution: store names in a list. We’ll start from the following code, posted on Piazza as lec15_find_names_start.py, which is part of the Lecture 15 zip file:

imdb_file = input("Enter the name of the IMDB file ==> ").strip()
name_list = []
for line in open(imdb_file, encoding="ISO-8859-1"):
    words = line.strip().split('|')
    name = words[0].strip()   # the name is the first field on the line


and complete the code in class.

• The challenge is that we need to check that a name is not already in the list before adding it.

• You may access the data files and the starting code .py file from the Resources page of the Piazza site.
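
One way the list-based version might be completed is sketched below. This is an assumption about the in-class solution, not the posted code. To stay self-contained it reads from a short in-memory list, `sample_lines`, rather than from imdb_data.txt; the "Doe, Jane" entries are placeholders, and the function name `unique_names_list` is hypothetical.

```python
def unique_names_list(lines):
    """Return the distinct names in the order they first appear (list-based)."""
    name_list = []
    for line in lines:
        words = line.strip().split('|')
        name = words[0].strip()
        # 'in' scans the list from the front, so this check gets slower
        # as the list grows.
        if name not in name_list:
            name_list.append(name)
    return name_list


# Placeholder lines in the same "name | movie | year" layout as the file.
sample_lines = [
    "Kishiro, Yukito   | Battle Angel    | 2016",
    "Doe, Jane         | Example Movie A | 2011",
    "Doe, Jane         | Example Movie B | 2012",
]

names = unique_names_list(sample_lines)
print(len(names), "different persons")
print("Doe, Jane" in names)
```

The same function answers all three goals: the returned list holds every person, its length is the count, and `in` checks for a particular person.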

## How To Test?

• The file imdb_data.txt has about 260K entries. How will we know our results are correct?
• Even if we restrict it to movies released in 2010-2012 (the file imdb_2010-12.txt), we still have 25K entries!
• We need to generate a smaller file with results we can test by hand
• I have generated hanks.txt for you and will use it to test our program before testing on the larger files.
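
The notes do not say how hanks.txt was produced. As an illustration only, a small test file could be built by keeping the lines whose name field contains a chosen substring. The function name `keep_matching` and the sample entries below are hypothetical placeholders.

```python
def keep_matching(lines, keyword):
    """Return only the lines whose name field contains keyword."""
    kept = []
    for line in lines:
        name = line.split('|')[0]
        if keyword in name:
            kept.append(line)
    return kept


# Placeholder data; real use would read the lines from the full file instead.
sample_lines = [
    "Hanks, Example   | Example Movie A | 2010",
    "Doe, Jane        | Example Movie B | 2011",
    "Hanks, Example   | Example Movie C | 2012",
]

small = keep_matching(sample_lines, "Hanks")
print(len(small), "lines kept")
```

A file this small can be checked by hand: count the distinct names yourself and compare with what the program reports.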

## What Happens?

• Very slow on the large files because we need to scan through the list to see if a name is already there.
• We’ll write a faster implementation based on Python sets.
• We’ll start with the basics of sets.

## Sets

• A Python set is an implementation of the mathematical notion of a set:
• No order to the values (and therefore no indexing)
• Contains no duplicates
• Contains whatever type of values we wish; including values of different types.
• Python set methods are exactly what you would expect.
• Each has a function call syntax and many have operator syntax in addition.
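
The properties listed above can be checked directly. The short sketch below uses made-up values to show that duplicates collapse, that values of different types can share a set, and that indexing is not supported.

```python
values = [4, 8, 4, 'hello', 32, 'hello']
s = set(values)

# Duplicates collapse: 6 list entries become 4 set members.
print(len(values), len(s))

# Mixed types are allowed, and membership is tested with 'in'.
print('hello' in s, 5 in s)

# Sets have no positions, so indexing raises a TypeError.
try:
    s[0]
    indexing_works = True
except TypeError:
    indexing_works = False
print(indexing_works)
```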

## Set Methods

• A set can be initialized from a list, a range (or any other iterable), or as an empty set with set():

>>> s1 = set()
>>> s1
set()
>>> s2 = set(range(0,11,2))
>>> s2
{0, 2, 4, 6, 8, 10}
>>> v = [4, 8, 4, 'hello', 32, 64, 'spam', 32, 256]
>>> s3 = set(v)
>>> s3
{32, 64, 4, 'spam', 8, 256, 'hello'}

• The actual methods are

• s.add(x) — add an element if it is not already there

• s.clear() — clear out the set, making it empty

• s1.difference(s2) — create a new set with the values from s1 that are not in s2.

• Python also has an “operator syntax” for this:
s1 - s2

• s1.intersection(s2) — create a new set that contains only the values that are in both sets. Operator syntax:

s1 & s2

• s1.union(s2) — create a new set that contains values that are in either set. Operator syntax:

s1 | s2

• s1.issubset(s2) — are all elements of s1 also in s2? Operator syntax:

s1 <= s2

• s1.issuperset(s2) — are all elements of s2 also in s1? Operator syntax:

s1 >= s2

• s1.symmetric_difference(s2) — create a new set that contains values that are in s1 or s2 but not in both. Operator syntax:

s1 ^ s2

• x in s - evaluates to True if the value associated with x is in set s.
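
The method and operator forms listed above give the same result when both operands are sets. One difference worth knowing is that the methods accept any iterable, while the operators require a set on both sides. A small sketch with made-up values:

```python
s1 = {1, 2, 3}
s2 = {3, 4}

# Method form and operator form give equal results for two sets.
print(s1.union(s2) == (s1 | s2))
print(s1.difference(s2) == (s1 - s2))

# The method accepts any iterable, such as a list...
from_list = s1.union([3, 4])

# ...but the operator requires a set on both sides.
try:
    s1 | [3, 4]
    operator_accepts_list = True
except TypeError:
    operator_accepts_list = False

# add() changes the set in place; adding an existing value has no effect.
s1.add(3)
s1.add(10)
print(sorted(s1))
```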

• We will explore the intuitions behind these set operations by considering

• s1 to be the set of actors in comedies,
• s2 to be the set of actors in action movies

and then consider who is in the sets

s1 - s2

s1 & s2

s1 | s2

s1 ^ s2
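
To make the four expressions concrete, here is a sketch with placeholder actor names (they are made up and not taken from the IMDB data).

```python
comedies = {"Ana", "Ben", "Cy"}   # s1: actors in comedies
action = {"Cy", "Dee"}            # s2: actors in action movies

only_comedy = comedies - action   # in comedies but not in action movies
both = comedies & action          # in both a comedy and an action movie
either = comedies | action        # in a comedy, an action movie, or both
exactly_one = comedies ^ action   # in one kind of movie but not the other

print(sorted(only_comedy))
print(sorted(both))
print(sorted(either))
print(sorted(exactly_one))
```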


## Exercises

1. Sets should be relatively intuitive, so rather than demo them in class, we’ll work through these as an exercise:

>>> s1 = set(range(0,10))
>>> s1

>>> s2 = set(range(4,20,2))
>>> s2

>>> s1 - s2

>>> s1 & s2

>>> s1 | s2

>>> s1 <= s2

>>> s3 = set(range(4,20,4))
>>> s3 <= s2


## Back to Our Problem

• We’ll modify our code to find the actors in the IMDB. The code is actually very simple and only requires a few set operations.
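
A minimal sketch of what the set-based version might look like follows. It is an assumption, not the code written in class. It reads from an in-memory list of placeholder lines so that it runs without the data file, and the function name `unique_names_set` is hypothetical.

```python
def unique_names_set(lines):
    """Return the set of distinct names found in the given lines."""
    names = set()
    for line in lines:
        words = line.strip().split('|')
        # add() ignores a name that is already present, so no explicit
        # "is it already there?" check is needed.
        names.add(words[0].strip())
    return names


# Placeholder lines in the same "name | movie | year" layout as the file.
sample_lines = [
    "Kishiro, Yukito   | Battle Angel    | 2016",
    "Doe, Jane         | Example Movie A | 2011",
    "Doe, Jane         | Example Movie B | 2012",
]

names = unique_names_set(sample_lines)
print(len(names), "different persons")
print("Kishiro, Yukito" in names)
```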

## Side-by-Side Comparison of the Two Solutions

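• Neither the set nor the list is in sorted order: the list holds names in the order they were first seen, and the set has no order at all. We can fix this at the end by sorting.
• The list can be sorted directly.
• The set cannot be sorted in place, but the function sorted builds a new sorted list from it for us.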
• What about speed? The set version is MUCH FASTER — to the point that the list version is essentially useless on a large data set.
• We’ll use some timings to demonstrate this quantitatively
• We’ll then explore why in the rest of this lecture.
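
The sorting step described above can be sketched as follows, using placeholder names. The list's sort method works in place, while sorted accepts the set and returns a new list.

```python
name_list = ["Doe, Jane", "Kishiro, Yukito", "Adams, Example"]
name_set = set(name_list)

name_list.sort()                 # sorts the list in place and returns None
sorted_names = sorted(name_set)  # builds a new sorted list from the set

print(sorted_names == name_list)
print(type(sorted_names).__name__)
```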

## Comparison of Running Times for Our Two Solutions

• List-based solution:
• Each time before a name is added, the code uses the in operator, which scans through the list to decide whether the name is already there.
• Thus, the work done for each name is proportional to the size of the list.
• The overall running time is therefore roughly proportional to the square of the number of entries in the list (and the file).
• Letting the mathematical variable $$N$$ represent the length of the list, we write this more formally as $$O(N^2)$$, or “the order of N squared”
• Set-based code
• For sets, Python uses a technique called hashing so that the running time of the add method is, on average, independent of the size of the set.
• The details of hashing are covered in CSCI 1200, Data Structures.
• The overall running time is therefore roughly proportional to the size of the set (and the number of entries in the file).
• We write this as $$O(N)$$.
• We will discuss this type of analysis more later in the semester.
• It is covered in much greater detail in Data Structures and again in Intro. to Algorithms.
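
A rough way to see the difference is to time both approaches on the same input. The sketch below uses synthetic names rather than the IMDB file, and the helper names (`count_unique_list`, `count_unique_set`, `time_it`) are hypothetical. The value of `n` is kept small so that it finishes quickly.

```python
import time


def count_unique_list(names):
    """Count distinct names using a list and a linear 'in' scan."""
    seen = []
    for name in names:
        if name not in seen:
            seen.append(name)
    return len(seen)


def count_unique_set(names):
    """Count distinct names using a set."""
    seen = set()
    for name in names:
        seen.add(name)
    return len(seen)


def time_it(func, names):
    """Return (result, elapsed seconds) for one call of func(names)."""
    start = time.perf_counter()
    result = func(names)
    return result, time.perf_counter() - start


# Synthetic names, each appearing twice.
n = 2000
names = ["person%d" % (i % n) for i in range(2 * n)]

list_count, list_seconds = time_it(count_unique_list, names)
set_count, set_seconds = time_it(count_unique_set, names)

print("list:", list_count, "names in", list_seconds, "seconds")
print("set: ", set_count, "names in", set_seconds, "seconds")
```

Exact timings depend on the machine, but if the analysis above holds, doubling n should roughly quadruple the list time while only about doubling the set time.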

## Discussion

• Python largely hides the details of the containers — set and list in this case — and therefore it is hard to know which is more efficient and why.
• For programs applied to small problems involving small data sets, efficiency rarely matters.
• For longer programs and programs that work on larger data sets, efficiency does matter, sometimes tremendously. What do we do?
• In some cases, we still use Python and choose the containers and operations that make the code most efficient.
• In others, we must switch to programming languages, such as C++, that generate and use compiled code.

## Summary

• Sets in Python realize the notion of a mathematical set, with all the associated operations.
• Operations can be used as method calls or, in many cases, operators.
• The combined core operations of finding if a value is in a set and adding it to the set are much faster when using a set than the corresponding operations using a list.
• We will continue to see examples of programming with sets when we work with dictionaries.

## Extra Practice Problems

1. Write Python code that implements the following set functions using a combination of loops, the in operator, and the add method. In each case, s1 and s2 are sets and the function call should return a set.
1. union(s1,s2)
2. intersection(s1,s2)
3. symmetric_difference(s1,s2)