Lecture 15 — Dictionaries, Part 1

Overview

  • More on IMDB
  • Dictionaries and dictionary operations
  • Solutions to the problem of counting the movies each individual is involved in
  • Other applications

How Many Movies is Each Person Involved In?

  • Goals:
    • Count movies for each person.
    • Who is the busiest?
    • What movies do two people have in common?
  • Best solved with the notion of a dictionary, but we’ll at least consider how to use a list.

List-Based Solution — Straightforward Version

  • Core data structure is a list of two-item lists, each giving a person’s name and the count of movies.

  • For example, after reading the first seven lines of our shortened hanks.txt file, we would have the list

    [ ["Hanks, Jim", 3], ["Hanks, Colin", 1],
      ["Hanks, Bethan", 1], ["Hanks, Tom", 2] ]
    
  • Just like our solution from the sets lectures, we can start from the following code:

    imdb_file = raw_input("Enter the name of the IMDB file ==> ").strip()
    count_list = []
    for line in open(imdb_file):
        words = line.strip().split('|')
        name = words[0].strip()
    
  • Like our list solution for finding all IMDB people, this solution is VERY slow: each name read from the file must be searched for in count_list, so the running time is once again O(N^2) (“order of N squared”).
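
  • One way the straightforward version might be completed is sketched below. This is only an illustration, not the in-class solution: the function name count_movies_list and the sample lines are made up, and the code assumes only that the name is the text before the first '|' on each line.

```python
def count_movies_list(lines):
    """Return a list of [name, count] pairs, in order of first appearance."""
    count_list = []
    for line in lines:
        words = line.strip().split('|')
        name = words[0].strip()
        # Linear search through count_list for this name.  Doing this
        # search once per input line is what makes the approach O(N^2).
        found = False
        for pair in count_list:
            if pair[0] == name:
                pair[1] += 1
                found = True
                break
        if not found:
            count_list.append([name, 1])
    return count_list

# Hypothetical sample lines; the text after the '|' is a placeholder.
sample_lines = [
    "Hanks, Jim | Placeholder Movie 1",
    "Hanks, Colin | Placeholder Movie 2",
    "Hanks, Jim | Placeholder Movie 3",
    "Hanks, Bethan | Placeholder Movie 4",
    "Hanks, Tom | Placeholder Movie 5",
    "Hanks, Jim | Placeholder Movie 6",
    "Hanks, Tom | Placeholder Movie 7",
]
result = count_movies_list(sample_lines)
print(result)
```

    With these seven made-up lines, the result has the same shape as the example list shown earlier in this section.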

List-Based Solution — Faster Version Based on Sorting

  • Append each name to the end of the list without checking if it is already there.
  • After reading all of the movies, sort the entire resulting list
    • As a result, all instances of each name will now be next to each other.
  • Go back through the list, counting the occurrence of each name
  • This solution is much faster than the first (sorting takes O(N log N) time), but it is also more involved to write than the dictionary-based one we are about to see.
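
  • The three steps above can be sketched as follows. The function name count_movies_sorted and the sample lines are hypothetical, and the code again assumes only that the name comes before the first '|'.

```python
def count_movies_sorted(lines):
    """Return [name, count] pairs in alphabetical order of name."""
    # Step 1: append every name without checking for duplicates.
    names = []
    for line in lines:
        names.append(line.strip().split('|')[0].strip())

    # Step 2: sort, so equal names end up next to each other.
    names.sort()

    # Step 3: walk the sorted list, counting each run of equal names.
    count_list = []
    for name in names:
        if count_list and count_list[-1][0] == name:
            count_list[-1][1] += 1
        else:
            count_list.append([name, 1])
    return count_list

# Hypothetical sample lines; the text after the '|' is a placeholder.
sample_lines = [
    "Hanks, Jim | Placeholder Movie 1",
    "Hanks, Colin | Placeholder Movie 2",
    "Hanks, Jim | Placeholder Movie 3",
    "Hanks, Bethan | Placeholder Movie 4",
    "Hanks, Tom | Placeholder Movie 5",
    "Hanks, Jim | Placeholder Movie 6",
    "Hanks, Tom | Placeholder Movie 7",
]
result = count_movies_sorted(sample_lines)
print(result)
```

    Note that the output is now ordered by name rather than by first appearance in the file.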

Introduction to Dictionaries

  • Association between “keys” (like words in an English dictionary) and “values” (like definitions in an English dictionary). The values can be anything.

  • Examples:

    >>> heights = dict()    # initialization 1
    >>> heights = {}        # initialization 2, only one or the other is necessary
    >>> heights['belgian horse'] = 162.6
    >>> heights['indian elephant'] = 280.0
    >>> heights['tiger'] = 91.0
    >>> heights['lion'] = 97.0
    >>> heights
    {'tiger': 91.0, 'belgian horse': 162.6, 'indian elephant': 280.0,
     'lion': 97.0}
    >>> 'tiger' in heights
    True
    >>> 'giraffe' in heights
    False
    >>> heights.keys()
    ['tiger', 'belgian horse', 'indian elephant', 'lion']
    
  • Details:

    • Two initializations; either would work.
    • Syntax is very much like the subscripting syntax for lists, except dictionary subscripting/indexing uses keys instead of integers!
    • The keys, in this example, are animal species (or subspecies) names; the values are floats.
    • The in operator tests only for the presence of the key, like looking up a word in the dictionary without checking its definition.
    • The keys are NOT ordered.
  • Just as in sets, the implementation uses hashing of keys.

    • Conceptually, sets are dictionaries without values.

Exercise

Hand-write or type each of the following:

  1. Form a dictionary called countries that associates the population with each of the following countries:

    • Algeria 37,100,000
    • Canada 34,945,200
    • Uganda 32,939,800
    • Morocco 32,696,600
    • Sudan 30,894,000
  2. Assuming that all of this has been done, what is the output of the following, when typed into the Python interpreter?

    >>> print len(countries)
    
    
    >>> print countries
    
    
    >>> print countries.keys()
    
    
    >>> print sorted(countries.keys())    # can you guess what this does?
    

Back to Our IMDB Problem

  • Even though our coverage of dictionaries has been brief, we already have enough tools to solve our problem of counting movies.

  • Once again we’ll use the following as a starting point:

    imdb_file = raw_input("Enter the name of the IMDB file ==> ").strip()
    count_list = []
    for line in open(imdb_file):
        words = line.strip().split('|')
        name = words[0].strip()
    
  • We will impose an ordering on the output by sorting the keys.

  • We’ll test first on our smaller data set and then again later on our larger ones.
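
  • A minimal sketch of how the dictionary version might look is given below. It is not necessarily the solution developed in class: the function name count_movies_dict and the sample lines are made up, and the code assumes only that the name is the text before the first '|'.

```python
def count_movies_dict(lines):
    """Return a dictionary mapping each name to its movie count."""
    counts = {}
    for line in lines:
        words = line.strip().split('|')
        name = words[0].strip()
        if name in counts:
            counts[name] += 1      # seen before: add one to the count
        else:
            counts[name] = 1       # first time we see this name
    return counts

# Hypothetical sample lines; the text after the '|' is a placeholder.
sample_lines = [
    "Hanks, Jim | Placeholder Movie 1",
    "Hanks, Colin | Placeholder Movie 2",
    "Hanks, Jim | Placeholder Movie 3",
    "Hanks, Bethan | Placeholder Movie 4",
    "Hanks, Tom | Placeholder Movie 5",
    "Hanks, Jim | Placeholder Movie 6",
    "Hanks, Tom | Placeholder Movie 7",
]
counts = count_movies_dict(sample_lines)

# Impose an ordering on the output by sorting the keys.
for name in sorted(counts.keys()):
    print("%s: %d" % (name, counts[name]))
```

    Each lookup and update uses hashing rather than a search through a list, which is why this version stays fast on the larger data sets.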

Key Types

  • Thus far, the keys in our dictionary have been strings.
  • Keys can be any “hashable” type: strings, ints, floats, booleans, and tuples of hashable values.
    • Lists, sets and other dictionaries cannot be keys because they are mutable.
  • Strings are by far the most common key type.
  • We will see an example of integers as the key type by the end of these notes.
  • Floats and booleans are generally poor choices. Can you think why?
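
  • As a small illustration of what “hashable” means in practice, the sketch below uses a tuple as a key (tuples are hashable when their items are) and then tries a list as a key, which Python rejects. The dictionary name grid and its contents are made up for this example.

```python
grid = {}

# A tuple of ints is hashable, so it can be used as a key.
grid[(2, 3)] = 'treasure'

# A list is mutable and therefore not hashable: using it as a key
# raises a TypeError.
try:
    grid[[2, 3]] = 'trap'
    list_key_failed = False
except TypeError:
    list_key_failed = True

print(grid[(2, 3)])
print(list_key_failed)
```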

Value Types

  • So far, the values in our dictionaries have been integers and floats.

  • But values can be of any type:

    • boolean
    • int
    • float
    • string
    • list
    • tuple
    • set
    • other dictionaries
  • Here is an example using our IMDB code and a set:

    >>> people = dict()
    >>> people['Hanks, Tom'] = set()
    >>> people['Hanks, Tom'].add('Big')
    >>> people['Hanks, Tom'].add('Splash')
    >>> people['Hanks, Tom'].add('Forrest Gump')
    >>> print people['Hanks, Tom']
    set(['Big', 'Splash', 'Forrest Gump'])
    
  • Here is another example where we store the continent and the population for a country instead of just the population:

    countries.clear()
    countries['Algeria'] =  (37100000, 'Africa')
    countries['Canada'] = (34945200, 'North America' )
    countries['Uganda'] = (32939800, 'Africa')
    countries['Morocco'] = (32696600, 'Africa')
    countries['Sudan'] = (30894000, 'Africa')
    
  • We access the values in the entries using two consecutive subscripts. For example,

    name = "Canada"
    print "The population of %s is %d" %(name, countries[name][0])
    print "It is in the continent of",  countries[name][1]
    

Removing Values: Sets and Dictionaries

  • For a set:
    • discard removes the specified element, and does nothing if it is not there
    • remove removes the specified element, but fails (throwing an exception) if it is not there
  • For a dictionary, use the del statement, as in del d[key]; it fails (throwing an exception) if the key is not there.
  • For both sets and dictionaries, the clear method empties the container.
  • We will look at toy examples in class
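
  • In the meantime, here is one possible set of toy examples. The variable names and the contents of the set are made up; the heights come from the earlier dictionary example.

```python
# Sets: discard versus remove.
letters = set(['a', 'b', 'c'])
letters.discard('b')          # 'b' is present, so it is removed
letters.discard('z')          # 'z' is absent: nothing happens, no error
try:
    letters.remove('z')       # 'z' is absent: remove raises KeyError
    remove_failed = False
except KeyError:
    remove_failed = True
letters_left = sorted(letters)

# Dictionaries: del removes one key together with its value.
heights = {'tiger': 91.0, 'lion': 97.0}
del heights['lion']
try:
    del heights['giraffe']    # missing key: del raises KeyError
    del_failed = False
except KeyError:
    del_failed = True
keys_left = sorted(heights.keys())

# clear empties either kind of container.
letters.clear()
heights.clear()
print("%d %d" % (len(letters), len(heights)))
```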

Other Dictionary Methods

  • The following dictionary methods are useful, though less so than the ones we’ve discussed.
    • get
    • pop
    • popitem
    • update
  • Use the help function in Python to figure out how to use them and to find other dictionary methods.
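
  • As a starting point for experimenting, the sketch below calls each of the four methods once on a small dictionary built from the earlier heights example. The variable names are made up for illustration.

```python
heights = {'tiger': 91.0, 'lion': 97.0}

# get: look up a key without risking an exception.
tiger_height = heights.get('tiger')            # value stored for 'tiger'
giraffe_height = heights.get('giraffe')        # missing key gives None
default_height = heights.get('giraffe', 0.0)   # or a default we supply

# pop: remove a key and return its value.
lion_height = heights.pop('lion')

# update: copy all key/value pairs from another dictionary.
heights.update({'belgian horse': 162.6, 'indian elephant': 280.0})
size_after_update = len(heights)

# popitem: remove and return some (key, value) pair; do not rely
# on which pair you get.
key, value = heights.popitem()

print(sorted(heights.keys()))
```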

Exercises

  1. Write code to discover who is the busiest individual in the IMDB.
  2. Write a function that takes the IMDB dictionary — which associates strings representing names with integers representing the count of movies — and an integer representing a min_count, and removes all individuals from the dictionary involved in fewer than min_count movies.

Summary of Dictionaries

  • Associate “keys” with “values”
  • Feels like indexing, except we are using keys instead of integer indices.
  • Makes counting and a number of other operations simple and fast.
  • Keys can be any “hashable” value, usually strings, sometimes integers.
  • Values can be any type whatsoever.