Lecture 13 — Data from Files and Web Pages
==============================================

Overview
--------

- Files on your computer
- Opening and reading files (review)
- Closing (new)
- Writing files (new)
- Accessing files across the web
- Parsing basics
- Parsing HTML

Our discussion is only loosely tied to Chapter 8 of the text.

Review — String operations often used in file parsing
--------------------------------------------------------

Let's review some very common string operations that are particularly
useful in parsing files.

- Remove characters from the beginning, the end, or both sides of a string
  with ``lstrip``, ``rstrip`` and ``strip``::

      >>> x = "red! Let's go red! Go red! Go red!"
      >>> x.strip("red!")
      " Let's go red! Go red! Go "
      >>> x.lstrip("red!")
      " Let's go red! Go red! Go red!"
      >>> x.rstrip("red!")
      "red! Let's go red! Go red! Go "
      >>> " Go red! ".strip()
      'Go red!'

  Whitespace is removed by default. The argument is treated as a set of
  characters to remove, not as a substring: stripping stops at the first
  character that is not in the set.

- Split a string using a delimiter, and get a list of strings. Any run of
  whitespace is the default delimiter::

      >>> x = "Let's go red! Let's go red! Go red! Go red!"
      >>> x.split()
      ["Let's", 'go', 'red!', "Let's", 'go', 'red!', 'Go', 'red!', 'Go', 'red!']
      >>> x.split("!")
      ["Let's go red", " Let's go red", ' Go red', ' Go red', '']
      >>> x.split("red!")
      ["Let's go ", " Let's go ", ' Go ', ' Go ', '']

  It returns the strings before and after each occurrence of the delimiter
  string in a list.

- Find the first location of a substring in a string; ``find`` returns -1 if
  the substring is not found. You can also optionally give a starting and an
  ending point for the search::

      >>> x
      "Let's go red! Let's go red! Go red! Go red!"
      >>> x.find('red')
      9
      >>> x.find('Red')
      -1
      >>> x.find('red',10)
      23
      >>> x.find('red',10,12)
      -1
      >>> 'red' in x
      True
      >>> 'Red' in x
      False

Opening and Reading Files
----------------------------

- Given the name of a file as a string, we can open it to read:

  ::

      f = open('abc.txt')

  This is the same as

  ::

      f = open('abc.txt','r')

- Variable ``f`` now “points” to the first line of file ``abc.txt``.
- The ``'r'`` tells Python we will be reading from this file; this is the
  default.

- We can read in data through three primary methods. The third, a ``for``
  loop over the file, is covered in the next section. First,

  ::

      line = f.readline()

  reads in the next line up to and including the end-of-line character, and
  “advances” ``f`` to point to the next line of file ``abc.txt``.

- By contrast,

  ::

      s = f.read()

  reads the entire **remainder** of the input file as a single string,

  - storing the one (big) string in ``s``, and
  - advancing ``f`` to the end of the file!

- When you are at the end of a file, ``f.read()`` and ``f.readline()`` will
  both return ``""`` (empty string).

Reading the contents of a file
------------------------------

- The most common way to read a file is as follows:

  ::

      f = open('abc.txt')
      for line in f:
          print line

- This for loop is equivalent to the following:

  ::

      f = open('abc.txt')
      for each line in the file:
          line is assigned the string corresponding to the
          contents of the line, including the new line

- You can combine the above steps into a single for loop:

  ::

      for line in open('abc.txt'):
          ....

Closing and Reopening Files
----------------------------

- The code below closes and reopens a file

  ::

      f = open('abc.txt')

      # Insert whatever code is needed to read from the file
      # and use its contents ...

      f.close()
      f = open('abc.txt')

- ``f`` now points again to the beginning of the file.

- This can be used to read the same file multiple times.

Writing to a File
------------------

- In order to write to a file we must first open it and associate it with a
  file variable, e.g.

  ::

      f_out = open("outfile.txt","w")

- The ``"w"`` signifies *write mode*, which causes Python to completely
  delete the previous contents of ``outfile.txt`` (if the file previously
  existed).

- It is also possible to use *append mode*:

  ::

      f_out = open("outfile.txt","a")

  which means that the contents of ``outfile.txt`` are kept and new output is
  added to the end of the file.

- Write mode is much more common than append mode.
- To actually write to a file, we use the ``write`` method:

  ::

      f_out.write("Hello world!")

- Each call to ``write`` passes only a **single string**.

- Unlike what happens when using ``print``, spacing and newline characters
  must be included explicitly.

- The string may be formatted before it is written.

- You must close the files you write! Otherwise, the changes you made may
  not be recorded, because output can sit in a buffer until the file is
  closed.

  ::

      f_out.close()

Part 1 Exercise
---------------

#. Given the file ``census_data.txt``:

   ::

       Location            2000          2011
       New York State      18,976,811    19,378,102
       New York City       8,008,686     8,175,133

   What are the values of variables ``line1``, ``line2``, ``line3``, and
   ``line4`` after the following code executes?

   ::

       f = open("census_data.txt")
       line1 = f.readline()
       line2 = f.read()
       line3 = f.readline()
       f.close()
       f = open("census_data.txt")
       line4 = f.readline()

#. For the same data above, what does the following program produce?

   ::

       f = open('census_data.txt')
       s = f.read()
       line_list = s.split('\n')
       print len(line_list)

#. Write code to print all the lines in the above file except for the header
   line (the first line).

#. Given a file containing test scores, one per line, write Python code to
   write a second file with the scores output in decreasing order, one per
   line, with the index on each line. For example, if the input file
   contains::

       75
       98
       21
       66
       83

   then the output file should contain::

       0: 98
       1: 83
       2: 75
       3: 66
       4: 21

   This can be done in 10 or fewer lines of Python code.

Opening Static Web Pages
------------------------

- We can use the :mod:`urllib` module to access web pages.

- We did this with our very first "real" example:

  ::

      import urllib
      words_file = urllib.urlopen(words_url)

- Once we have ``words_file`` we can use the ``read``, ``readline``, and
  ``close`` methods just like we did with “ordinary” files.

- When the web page is dynamic, we usually need to work through a separate
  API (application programming interface) to access the contents of the web
  site. Recall the Flickr example.

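
The claim that a page opened through ``urllib`` behaves like an ordinary file can be tried without a network connection. The sketch below is only an illustration under stated assumptions: it is written for Python 3, where the call is ``urllib.request.urlopen`` rather than the ``urllib.urlopen`` used in these notes, and where ``read`` and ``readline`` return bytes that must be decoded. The file name ``words_demo.txt``, its contents, and the helper ``read_first_line_and_rest`` are hypothetical stand-ins, and a local ``file://`` URL takes the place of a real web address.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def read_first_line_and_rest(url):
    """Open url and return (first_line, rest_of_page) as text.

    Hypothetical helper for illustration; not part of the lecture code.
    """
    page = urllib.request.urlopen(url)
    try:
        # readline() returns the next line, end-of-line character included.
        first = page.readline().decode("utf-8")
        # read() returns the entire remainder as one (big) string.
        rest = page.read().decode("utf-8")
    finally:
        # Close the handle whether or not reading succeeded.
        page.close()
    return first, rest


# Assumption: a small local file stands in for a static web page.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "words_demo.txt")
f_out = open(demo_path, "w")
f_out.write("aardvark\nabacus\nabalone\n")
f_out.close()  # close so the written data is actually on disk

demo_url = Path(demo_path).as_uri()  # builds a file:// URL for the file
first, rest = read_first_line_and_rest(demo_url)
print(repr(first))
print(rest.split())
```

Replacing ``demo_url`` with the address of a real static page should work the same way, as long as the page is encoded as UTF-8.
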
Parsing
-------

- Before writing code to read a data file or to read the contents of a web
  page, we must know the format of the data in the file.

- The work of reading a data file or a web page is referred to as *parsing*.

- Files can be of a fixed, well-known format:

  - Python code
  - C++ code
  - HTML (HyperText Markup Language, used in all web pages)
  - JSON (JavaScript Object Notation, a common data exchange format)
  - RDF (Resource Description Framework)

- Often there is a parser module for these formats that you can simply use
  instead of implementing one from scratch.

- For code, parsers check for syntax errors.

Short tour of data formats
----------------------------

- Python code:

  - Each statement is on a separate line
  - Changes in indentation are used to indicate entry/exit to blocks of
    code, e.g. within ``def``, ``for``, ``if``, ``while``...

- HTML: Basic structure is a mix of text with commands that are inside
  "tags" ``< ... >``. Example::