Lecture 15 — Sets¶

Overview¶

Example: finding all individuals listed in the Internet Movie Database (IMDB)
A solution based on lists
Sets and set operations
A solution based on sets.
Efficiency and set representation

Reading is Section 11.1 of Practical Programming.

Finding All Persons in the IMDB file¶

We are given a file extracted from the Internet Movie Database (IMDB) called imdb_data.txt containing, on each line, a person’s name, a movie name, and a year. For example,
```
Kishiro, Yukito   | Battle Angel    | 2016
```
Goal:
- Find all persons named in the file
- Count the number of different persons named.
- Ask if a particular person is named in the file
The challenge in doing this is that many names appear multiple times.

First solution: store names in a list. We’ll start from the following code, posted on the Piazza in lec15_find_names_start.py, which is part of a Lecture 15 zip file.

imdb_file = input("Enter the name of the IMDB file ==> ").strip()
name_list = []
for line in open(imdb_file, encoding = "ISO-8859-1"):
    words = line.strip().split('|')
    name = words[0].strip()

and complete the code in class.

The challenge is that we need to check that a name is not already in the list before adding it.
You may access the data files and the starting code .py file from the Resources page of the Piazza site.

How To Test?¶

The file imdb_data.txt has about 260K entries. How will we know our results are correct?
Even if we restrict it to movies released in 2010-2012 (the file imdb_2010-12.txt), we still have 25K entries!
We need to generate a smaller file with results we can test by hand
- I have generated hanks.txt for you and will use it to test our program before testing on the larger files.

What Happens?¶

Very slow on the large files because we need to scan through the list to see if a name is already there.
We’ll write a faster implementation based on Python sets.
We’ll start with the basics of sets.

Sets¶

A Python set is an implementation of the mathematical notion of a set:
- No order to the values (and therefore no indexing)
- Contains no duplicates
- Contains whatever type of values we wish; including values of different types.
Python set methods are exactly what you would expect.
- Each has a function call syntax and many have operator syntax in addition.

Set Methods¶

Initialization comes from a list, a range, or from just set():

>>> s1 = set()
>>> s1
set()
>>> s2 = set(range(0,11,2))
>>> s2
{0, 2, 4, 6, 8, 10}
>>> v = [4, 8, 4, 'hello', 32, 64, 'spam', 32, 256]
>>> s3 = set(v)
>>> s3
{32, 64, 4, 'spam', 8, 256, 'hello'}

The actual methods are
- s.add(x) — add an element if it is not already there
- s.clear() — clear out the set, making it empty
- s1.difference(s2) — create a new set with the values from s1 that are not in s2.
  - Python also has an “operator syntax” for this:
```
s1 - s2
```
- s1.intersection(s2) — create a new set that contains only the values that are in both sets. Operator syntax:
```
s1 & s2
```
- s1.union(s2) — create a new set that contains values that are in either set. Operator syntax:
```
s1 | s2
```
- s1.issubset(s2) —- are all elements of s1 also in s2? Operator syntax:
```
s1 <= s2
```
- s1.issuperset(s2) — are all elements of s2 also in s1? Operator syntax:
```
s1 >= s2
```
- s1.symmetric_difference(s2) — create a new set that contains values that are in s1 or s2 but not in both.
```
s1 ^ s2
```
- x in s - evaluates to True if the value associated with x is in set s.
We will explore the intuitions behind these set operations by considering
- s1 to be the set of actors in comedies,
- s2 to be the set of actors in action movies
and then consider who is in the sets
```
s1 - s2

s1 & s2

s1 | s2

s1 ^ s2
```

Exercises¶

Sets should be relatively intuitive, so rather than demo them in class, we’ll work through these as an exercise:

>>> s1 = set(range(0,10))
>>> s1

>>> s1.add(6)
>>> s1.add(10)

>>> s2 = set(range(4,20,2))
>>> s2

>>> s1 - s2

>>> s1 & s2

>>> s1 | s2

>>> s1 <= s2

>>> s3 = set(range(4,20,4))
>>> s3 <= s2

Back to Our Problem¶

We’ll modify our code to find the actors in the IMDB. The code is actually very simple and only requires a few set operations.

Side-by-Side Comparison of the Two Solutions¶

Neither the set nor the list is ordered. We can fix this at the end by sorting.
- The list can be sorted directly.
- The set must be converted to a list first. The function sorted does this for us.
What about speed? The set version is MUCH FASTER — to the point that the list version is essentially useless on a large data set.
- We’ll use some timings to demonstrate this quantitatively
- We’ll then explore why in the rest of this lecture.

Comparison of Running Times for Our Two Solutions¶

List-based solution:
- Each time before a name is added, the code — through the method in — scans through the entire list to decide if it is there.
- Thus, the work done is proportional to the size of the list.
- The overall running time is therefore roughly proportional to the square of the number of entries in the list (and the file).
- Letting the mathematical variable \(N\) represent the length of the list, we write this more formally as \(O(N^2)\), or “the order of N squared”
Set-based code
- For sets, Python uses a technique called hashing to restrict the running time of the add method so that it is independent of size of the set.
  - The details of hashing are covered in CSCI 1200, Data Structures.
- The overall running time is therefore roughly proportional to the length of the set (and number of entries in the file).
- We write this as \(O(N)\).
We will discuss this type of analysis more later in the semester.
- It is covered in much greater detail in Data Structures and again in Intro. to Algorithms.

Discussion¶

Python largely hides the details of the containers — set and list in this case — and therefore it is hard to know which is more efficient and why.
For programs applied to small problems involving small data sets, efficiency rarely matters.
For longer programs and programs that work on larger data sets, efficiency does matter, sometimes tremendously. What do we do?
- In some cases, we still use Python and choose the containers and operations that make the code most efficient.
- In others, we must switch to programming languages, such as C++, that generate and use compiled code.

Summary¶

Sets in Python realize the notion of a mathematical set, with all the associated operations.
Operations can be used as method calls or, in many cases, operators.
The combined core operations of finding if a value is in a set and adding it to the set are much faster when using a set than the corresponding operations using a list.
We will continue to see examples of programming with sets when we work with dictionaries.

Extra Practice Problems¶

Write Python code that implements the following set functions using a combination of loops, the in operator, and the add function. In each case, s1 and s2 are sets and the function call should return a set.
1. union(s1,s2)
2. intersection(s1,s2)
3. symmetric_difference(s1,s2)