In this project, you will write a program that reads election data from a web site and analyzes it to see how precinct-level party preference correlates across different races (US President, US Senate, US House, and State House). For example, to what extent is it true that the more Republican-leaning a precinct was in voting for President, the more it also favored the Republican candidate for the US House of Representatives? Do some pairs of races correlate more strongly than others? The data you will work with comes from the November 4, 2008 general election and was originally on the Minnesota Secretary of State's web site; you'll be using a copy on the Gustavus web site. The techniques you use will be very similar to those shown in Section 5.3.2 of your textbook, where they are applied to stock market data.
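As a reminder of what a correlation coefficient measures, here is a minimal sketch of the Pearson correlation coefficient. It is not the code from the textbook or from the program you will be given; the function name `pearson` and the sample percentages are made up purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    assert len(xs) == len(ys) and len(xs) > 1
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up percentages for four imaginary precincts in two races.
race_a = [40.0, 50.0, 60.0, 70.0]
race_b = [42.0, 49.0, 61.0, 68.0]
r = pearson(race_a, race_b)
print(r)
```

A value near 1 means that precincts with a high percentage in one race tend to have a high percentage in the other; a value near 0 means there is little linear relationship.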
You are to do this project individually.
Data Source
The data for each of the four different races is available individually from the following URLs:
Notice that the four URLs have a lot in common. Your procedure will be told which two to use by being passed strings like 'USPresPct' or 'ussenatepct'. Your procedure should put this together with the standard prefix that all the URLs share and the .txt suffix.
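Assembling a URL from the shared prefix, the race name, and the suffix might look like the sketch below. The prefix shown is a deliberately fake placeholder, not the real one; substitute the part that the four actual URLs have in common. The names `URL_PREFIX`, `URL_SUFFIX`, and `race_url` are likewise just illustrative choices.

```python
# HYPOTHETICAL placeholder: replace with the prefix the four real URLs share.
URL_PREFIX = "http://example.invalid/election-data/"
URL_SUFFIX = ".txt"

def race_url(race_name):
    """Build the full URL for one race from its short name."""
    return URL_PREFIX + race_name + URL_SUFFIX

print(race_url("USPresPct"))
print(race_url("ussenatepct"))
```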
Here is the very first line of the USPresPct file, to illustrate the format:
MN;01;0005;0101;US PRESIDENT & VICE PRESIDENT;;0301;JOHN MCCAIN AND SARAH PALIN;;;R;54;54;499;48.921599999999998;1020
As you can see, the various fields are separated with semicolons. The first three fields indicate what precinct this data concerns: precinct number 5 within county 1 of the state of Minnesota. Another field that is important for your project is the one containing just the letter R, which is the code for the Republican party. The other party that shows up consistently across the data sets is the Democratic-Farmer-Labor party, indicated by the three-letter code DFL. The final field of interest is the second-to-last one, which in this example contains approximately 48.92. This is the percentage of the vote in this particular precinct that was cast for this particular candidate.
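To make the field layout concrete, here is a small sketch that pulls the fields of interest out of the sample line above. The field positions are read off that one sample line, so the sketch assumes every line follows the same layout; the function name `parse_line` is made up for illustration.

```python
SAMPLE_LINE = ("MN;01;0005;0101;US PRESIDENT & VICE PRESIDENT;;0301;"
               "JOHN MCCAIN AND SARAH PALIN;;;R;54;54;499;"
               "48.921599999999998;1020")

def parse_line(line):
    """Split one line into (precinct key, party code, vote percentage)."""
    fields = line.strip().split(";")
    # The first three fields identify the precinct: state, county, precinct.
    key = (fields[0], fields[1], fields[2])
    # In the sample line the party code is the eleventh field (index 10).
    party = fields[10]
    # The percentage is the second-to-last field.
    percent = float(fields[-2])
    return key, party, percent

key, party, percent = parse_line(SAMPLE_LINE)
print(key, party, round(percent, 2))  # ('MN', '01', '0005') R 48.92
```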
As you start comparing the various data files, you'll discover two complicating factors:
The precincts are not listed in the same order in all the files. (The ushousepct file groups the precincts by congressional district.) Thus, if you are going to match up corresponding data from two files, you will have to sort them into a consistent order first. Since the three fields that indicate the precinct come at the start of each line, one possibility would be to sort the entire lines.
The ussenatepct file contains one extra precinct, precinct number 5 within a mythical county number 89. (Minnesota actually has only 87 counties.) This fake precinct contains extra votes from disputed ballots that were added after the initial vote counting process was over. Had they been added to their actual precincts, it would in many cases have been possible to attribute them to specific voters whose names had become known. This extra precinct can be ignored in the same way that the code in the textbook's Listing 5.6 ignores extra stock data from a stock that goes back further in history.
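Both complications can be illustrated with a small sketch that sorts two lists of lines by their precinct fields and then pairs up only the precincts present in both. This is one possible approach, not necessarily the one Listing 5.6 takes; it sorts on the first three fields rather than on entire lines, and it assumes each list holds at most one line per precinct (for example, after keeping only one party). The function names and the shortened sample lines are made up for illustration.

```python
def precinct_key(line):
    """Return the (state, county, precinct) fields that identify a precinct."""
    return tuple(line.split(";")[:3])

def align(lines_a, lines_b):
    """Pair up lines for the same precinct, skipping unmatched precincts."""
    a = sorted(lines_a, key=precinct_key)
    b = sorted(lines_b, key=precinct_key)
    pairs = []
    i = j = 0
    while i < len(a) and j < len(b):
        key_a, key_b = precinct_key(a[i]), precinct_key(b[j])
        if key_a == key_b:
            pairs.append((a[i], b[j]))
            i += 1
            j += 1
        elif key_a < key_b:
            i += 1  # this precinct appears only in the first list
        else:
            j += 1  # this precinct appears only in the second list
    return pairs

# Made-up, shortened lines; real lines have many more fields.
first = ["MN;02;0010;R;55.0", "MN;01;0005;R;48.0"]
second = ["MN;01;0005;R;47.0", "MN;89;0005;R;40.0", "MN;02;0010;R;52.0"]
for line_a, line_b in align(first, second):
    print(line_a, "|", line_b)
```

The line for county 89 has no partner in the first list, so it is skipped rather than causing a mismatch.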
1. Start by trying out my slight variant of the book's stockCorrelate procedure (Listing 5.6). For example, by evaluating stockCorrelate('IBM', 'MSFT') you ought to be able to determine that the historical stock prices of IBM and Microsoft are highly correlated, that is, they have a correlation coefficient close to 1. In the stockCorrelate procedure, I incorporated a couple of fixes from the authors' blog, and I also made a few other small changes that I will discuss in class. I also packaged this procedure together with the supporting statistical functions that the authors had included elsewhere in the book.
2. Save a copy of the program and rename stockCorrelate to voteCorrelate. You should also change a few names within the procedure so that they stay descriptive: elections are not designated by ticker symbols, and the numbers we will be considering are no longer the price at the "close" of the market. In the following steps, you'll make the other, more substantive changes.
3. Update voteCorrelate to get its data from the URLs listed under the heading Data Source.
4. Update voteCorrelate to reflect the fact that, unlike the stock data, the data files have no header lines describing what is in each column.
5. Update voteCorrelate to reflect how the columns of data on each line are separated.
6. Update voteCorrelate to focus on a single party of your choice. (I suggest you choose either R or DFL, since the others appear rather sporadically.) Discard all data that isn't for this one party.
7. Update voteCorrelate so that instead of asserting that the two data items came from the stock market on the same date, it asserts that they came from voting in the same county and precinct.
8. Update voteCorrelate to reflect which column contains the number we are interested in.
9. You could now try your voteCorrelate procedure out. If you use two data sets where the precincts are listed in the same order, you ought to get a correlation coefficient. If, on the other hand, you use two data sets where the precincts are listed in different orders, you ought to get an error message about an assertion.
10. Update voteCorrelate to put the precincts into a standard order.
11. Use your procedure to get correlation coefficients for all six pairs of different races.
12. Of the six pairs of races, which is most strongly correlated? Which is most weakly? Of the four races, is there one that you can single out as being weakly correlated to the other three?
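Four races give six unordered pairs. One way to enumerate them without missing or repeating a pair is sketched below. It assumes a correlation procedure that takes two race names and returns a number; the stand-in `fake_correlate`, the helper `correlate_all_pairs`, and the fourth race name are all placeholders invented for this sketch, since the State House file name is not given in this text.

```python
from itertools import combinations

def correlate_all_pairs(race_names, correlate):
    """Call correlate(first, second) once for every unordered pair of races."""
    results = {}
    for first, second in combinations(race_names, 2):
        results[(first, second)] = correlate(first, second)
    return results

# The last name is a PLACEHOLDER, not the real State House file name.
races = ["USPresPct", "ussenatepct", "ushousepct", "STATE_HOUSE_PLACEHOLDER"]

def fake_correlate(first, second):
    """Stand-in for a real correlation procedure; always returns 0.0."""
    return 0.0

results = correlate_all_pairs(races, fake_correlate)
for pair in results:
    print(pair)
```

For what you turn in, you still need to show the six uses of your own procedure in the Python Shell, as described below.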
For this project, you should copy and paste the code of your procedure directly into an email message that you send to max@gustavus.edu. You should also paste in the input and output from the Python Shell window showing six uses of your procedure, one for each pair of different races. Indicate the pairs with the strongest and weakest correlations and give your answer regarding whether there is a particular race that is weakly correlated with the other three.
You will earn two points for accomplishing each of the specific tasks listed above as numbers 2-8 and 10-12. However, if you accomplish any of these tasks in a way that is excessively complex, you will only earn one of the two points.