Lecture 15 — Sets ==================== Overview -------- - Example: finding all individuals listed in the Internet Movie Database (IMDB) - A solution based on lists - Sets and set operations - A solution based on sets. - Efficiency and set representation Reading is Section 11.1 of *Practical Programming*. Finding All Persons in the IMDB file ------------------------------------ - We are given a file extracted from the Internet Movie Database (IMDB) called ``imdb_data.txt`` containing, on each line, a person’s name, a movie name, and a year. For example, :: Kishiro, Yukito | Battle Angel | 2016 - Goal: - Find all persons named in the file - Count the number of different persons named. - Ask if a particular person is named in the file - The challenge in doing this is that many names appear multiple times. - First solution: store names in a list. We’ll start from the following code, posted on the Piazza in ``lec15_find_names_start.py``, which is part of a Lecture 15 zip file. :: imdb_file = input("Enter the name of the IMDB file ==> ").strip() name_list = [] for line in open(imdb_file, encoding = "ISO-8859-1"): words = line.strip().split('|') name = words[0].strip() and complete the code in class. - The challenge is that we need to check that a name is not already in the list before adding it. - You may access the data files and the starting code .py file from the Resources page of the Piazza site. How To Test? ------------ - The file ``imdb_data.txt`` has about 260K entries. How will we know our results are correct? - Even if we restrict it to movies released in 2010-2012 (the file ``imdb_2010-12.txt``), we still have 25K entries! - We need to generate a smaller file with results we can test by hand - I have generated ``hanks.txt`` for you and will use it to test our program before testing on the larger files. What Happens? ------------- - Very slow on the large files because we need to scan through the list to see if a name is already there. - We’ll write a faster implementation based on Python *sets*. - We’ll start with the basics of sets. Sets ---- - A Python set is an implementation of the mathematical notion of a set: - No order to the values (and therefore no indexing) - Contains no duplicates - Contains whatever type of values we wish; including values of different types. - Python set methods are exactly what you would expect. - Each has a function call syntax and many have operator syntax in addition. Set Methods ----------- - Initialization comes from a list, a range, or from just ``set()``: :: >>> s1 = set() >>> s1 set() >>> s2 = set(range(0,11,2)) >>> s2 {0, 2, 4, 6, 8, 10} >>> v = [4, 8, 4, 'hello', 32, 64, 'spam', 32, 256] >>> s3 = set(v) >>> s3 {32, 64, 4, 'spam', 8, 256, 'hello'} - The actual methods are - ``s.add(x)`` — add an element if it is not already there - ``s.clear()`` — clear out the set, making it empty - ``s1.difference(s2)`` — create a new set with the values from ``s1`` that are not in ``s2``. - Python also has an "operator syntax" for this: :: s1 - s2 - ``s1.intersection(s2)`` — create a new set that contains only the values that are in **both** sets. Operator syntax: :: s1 & s2 - ``s1.union(s2)`` — create a new set that contains values that are in either set. Operator syntax: :: s1 | s2 - ``s1.issubset(s2)`` —- are all elements of ``s1`` also in ``s2``? Operator syntax: :: s1 <= s2 - ``s1.issuperset(s2)`` — are all elements of ``s2`` also in ``s1``? Operator syntax: :: s1 >= s2 - ``s1.symmetric_difference(s2)`` — create a new set that contains values that are in ``s1`` or ``s2`` but **not in both**. :: s1 ^ s2 - ``x in s`` - evaluates to ``True`` if the value associated with ``x`` is in set ``s``. - We will explore the intuitions behind these set operations by considering - ``s1`` to be the set of actors in *comedies*, - ``s2`` to be the set of actors in *action movies* and then consider who is in the sets :: s1 - s2 s1 & s2 s1 | s2 s1 ^ s2 Exercises --------- #. Sets should be relatively intuitive, so rather than demo them in class, we’ll work through these as an exercise: :: >>> s1 = set(range(0,10)) >>> s1 >>> s1.add(6) >>> s1.add(10) >>> s2 = set(range(4,20,2)) >>> s2 >>> s1 - s2 >>> s1 & s2 >>> s1 | s2 >>> s1 <= s2 >>> s3 = set(range(4,20,4)) >>> s3 <= s2 Back to Our Problem ------------------- - We’ll modify our code to find the actors in the IMDB. The code is actually very simple and only requires a few set operations. Side-by-Side Comparison of the Two Solutions -------------------------------------------- - Neither the set nor the list is ordered. We can fix this at the end by sorting. - The list can be sorted directly. - The set must be converted to a list first. The function ``sorted`` does this for us. - What about speed? The set version is **MUCH FASTER** — to the point that the list version is essentially useless on a large data set. - We'll use some timings to demonstrate this quantitatively - We’ll then explore why in the rest of this lecture. Comparison of Running Times for Our Two Solutions ------------------------------------------------- - List-based solution: - Each time before a name is added, the code — through the method ``in`` — scans through the entire list to decide if it is there. - Thus, the work done is proportional to the size of the list. - The overall running time is therefore roughly proportional to the ``square`` of the number of entries in the list (and the file). - Letting the mathematical variable :math:`N` represent the length of the list, we write this more formally as :math:`O(N^2)`, or “the order of N squared” - Set-based code - For sets, Python uses a technique called *hashing* to restrict the running time of the ``add`` method so that it is *independent of size of the set*. - The details of hashing are covered in CSCI 1200, Data Structures. - The overall running time is therefore roughly proportional to the length of the set (and number of entries in the file). - We write this as :math:`O(N)`. - We will discuss this type of analysis more later in the semester. - It is covered in much greater detail in Data Structures and again in Intro. to Algorithms. Discussion ---------- - Python largely hides the details of the containers — set and list in this case — and therefore it is hard to know which is more efficient and why. - For programs applied to small problems involving small data sets, efficiency rarely matters. - For longer programs and programs that work on larger data sets, efficiency does matter, sometimes tremendously. What do we do? - In some cases, we still use Python and choose the containers and operations that make the code most efficient. - In others, we must switch to programming languages, such as C++, that generate and use compiled code. Summary ------- - Sets in Python realize the notion of a mathematical set, with all the associated operations. - Operations can be used as method calls or, in many cases, operators. - The combined core operations of finding if a value is in a set and adding it to the set are **much faster when using a set** than the corresponding operations using a list. - We will continue to see examples of programming with sets when we work with dictionaries. Extra Practice Problems ----------------------- #. Write Python code that implements the following set functions using a combination of loops, the ``in`` operator, and the ``add`` function. In each case, ``s1`` and ``s2`` are sets and the function call should return a set. #. ``union(s1,s2)`` #. ``intersection(s1,s2)`` #. ``symmetric_difference(s1,s2)``