Tuesday, October 27, 2015

They Might Be Random Fingertips

I have a bit of an obsession with Fingertips, a sequence of 21 unrelated songs mostly between 4 and 12 seconds long from They Might Be Giants' 1992 album Apollo 18. The liner notes say to put the CD on random shuffle so that the brief snippets play in a random order.

There are 51,090,942,171,709,440,000 ways to permute 21 items -- that's a little excessive, so I made YouTube videos of 100 such permutations. I selected two video clips for each track, mostly by searching for the track name in YouTube, and assigned them randomly, so there should be some visual surprises upon multiple viewings.

Here's a link to the YouTube channel I made to hold all 100 videos.

And here is Random Fingertips #1 of 100:

• • •

Monday, September 28, 2015

Methodology for comparison of Pope Francis address with presidential inaugural addresses

This post goes deeper into the methodology of this post on Prooffreader.com.

The code is in this gist.

Here is the graphic for that post:

DATASET: Links are in the gist. Note that Pope Francis's address is the one published and given to the media, not the one actually delivered. This may well be the case for inaugural addresses as well.

TFIDF: A standard technique in NLP (natural language processing): Wikipedia entry. I removed the default English stopwords in the TextBlob module (although, of course, TFIDF will give low scores to stopwords anyway.)

CALCULATION OF SIMILARITY BY PARTY: The average cosine similarities for presidents by party was:

republican 0.825
democratic 0.832
other 0.891
range: 0.743-0.968

In order to create a metric that would be useful when hacking the t-SNE, below, I calculated the % of presidents from each party who numbered among the top half of similarity scores:

republican 0.417
democratic 0.500
other 1.000

TSNE: My favorite manifold learning algorithm, but it's not without its problems, mostly in terms of reproducibility. This is less of a problem when it's used to show similarities in a group in general, because it spreads the error inherent in such dimensionality reduction around to all points (a different way each time it's run), so in general it minimizes error at each point. But this is not what I wanted here. The thrust of this analysis is to portray the similarity between one point (the pope) and every other point (the presidents); I therefore needed to privilege this point (and similarity vector), minimize its error the most, and spread whatever error I saved among the other points. In other words, the president-pope distances would be more accurate than the president-president distances.

There are several possible solutions to this problem; I went with brute-force hack! I simply re-ran the t-SNE over and over until (a) the pope point was more or less in the center, and (b) the percentages of republicans, democrats and others closest to the pope was reasonably close to that in the similarity matrix.

Oh, and there was a third criterion: the t-SNE had to look aesthetically pleasing, in a more or less globular shape without too many points overlapping so the mouseovers weren't annoying.

Then it just became a matter of tuning. If my criteria were too strict, I'd never find a solution. Less strict, and I'd find a solution every few minutes, but they might not look nice and I'd have to start again. Finally, I went with a solution (% democrats in top half - % republicans in top half > 4%, %other in top half - % democrats in top half > 8%) that took about 50 iterations and less than a minute to find, and ran it a few times till the first time I got something that 'looked nice'.

If anyone has ideas for a less hacky way to solve this problem (and yes, I tried using graphs, but they ended up just too darn symmetrical-looking), I'd love to hear it!

TOP THREE CHARACTERISTIC SIMILAR WORDS MOUSEOVER: The top three most characteristic words shared between the pope and each president was another hack; my first thought was to use a simple Dunning log-likelihood test, with one corpus the pope's speech concatenated with the president in question's speech, and the other corpus the speeches of all of the other presidents. Dunning usually does a good job of determining words that are overrepresented in one corpus, and penalizes words that are both too rare and to common to be statistically significant. But of course, the pope used a bunch of words that no president used, like 'Merton', whom he mentioned five times. So I tried a few variations on the theme, and ended up using, for corpus 1, the maximum frequency (I used a bag of words, not a TFIDF) of each word that both the pope and the president used. Again, it's a little hacky. I also added +1 to every word count in both corpuses so there were no divide-by-zero errors and the algorithm didn't ignore words that the pope and president used but no other president used -- these are definitely of interest in this analysis.

Again, anyone who has a better solution, I'm all ears!

TOP POPE WORDS & PRESIDENTS WHO USED THEM: I took the top 100 words used by the pope in terms of tf-idf, so it penalized words he used rarely, and words few or no presidents used. Then I determined the tf-idf for each word for each president, listed the top three presidents in terms of tf-idf, and indicated how many presidents in total used the word.

INTERPRETATION OF PARTY RESULTS: Here is a plot of the rank of cosine similarities to Jaccard similarities (which is simply the ratio of the intersection of terms used to the union of terms used) between the pope and each president. The rank is perfectly preserved, indicating that the preponderance of 'other' presidents among the most similar might be due to some combination of a large intercept of words in common and a small union of total words. In other words, maybe early presidents had low lexical diversity, short speeches, or both.

Plotting length of speeches and lexical diversity (number of unique words divided by total number of words) vs. cosine distance, however, shows that the green 'other' dots tend to be towards the left, but they're still spread out somewhat along the top third of the graph, indicating that perhaps the pope did indeed genuinely share a vocabulary with them more, just possible not to the extent that the cosine similarities along might indicate.

Again, anyone with better tools to analyze this is welcome to contribute!

• • •

Monday, August 17, 2015

A Python script to make choropleth grid maps

In May 2015, there was a sudden fad in the Dataviz community (on Twitter, anyway) for hexagonal grid-type choropleth maps. A choropleth (not "chloropleth") is a map in which areas are filled in with a color whose intensity and/or hue is proportional to a quantity; we've all seen them. The problem with traditional choropleths is they are dependent on area so that, for example, a choropleth of the United States will emphasize relatively deserted Wyoming over highly populated Massachusetts.

One recent partial solution to this has been to create choropleths with equal representations of subunits with squares or hexagons. They're still not proportional to population (attempts to solve this last hurdle generally result in ugly and/or unrecognizable maps), but they're pretty cool nonetheless.

As soon as the hub-bub started, I thought it would be relatively painless and an interesting exercise to code up a Python script that would make these choropleths. Since they're geometric, it's a cinch to output SVG vector markup. I got about 90% of the way through this project in three weeks, and then a new job and some health issues interfered. but I finally found a weekend to finish it off, in at least a beta, v.0.1 way.

The script is hosted on my GitHub, and here's the IPython notebook/Jupyter tutorial that goes along with it. (There's also a script called Colorbin that's supposed to remove the hassle from associating colors with quantities for the choropleths).

Feedback is welcome. Here are some examples of what the script has produced:

• • •

Saturday, August 8, 2015

Dataset: Single word frequencies per decade from Google Books

I have crunched a public English language dataset in order to remove information that is least likely to be of interest to users, and I offer it for anyone to download:

[1 GB] Google 1-grams English v.20120701 by decade, lowercase, no parts of speech (zipped csv)

The original dataset is Creative Commons Attribution 3.0 Unported License.


Google Ngram Viewer is an online tool to track the uses of English words from the 16th century to 2008, with the following caveats, among others:

  • It only contains words that were used at least 40 times in any given year, in order to preserve copyright (so you can't tell exactly what book a given word appeared in). This means, for example, if a word occurs 40 times in 1970, 39 times in 1971 and 41 times in 1972, in the database the word will occur 40 times in 1970, 0 times in 1971 and 41 times in 1972.
  • The database is mostly based on library books, so it is heavily biased towards the types of books found in libraries; this includes, for example, directories of names. It is also biased towards the availability of books in a given year, so, for example, 1994 will be much more representatively represented (so to speak) than 1731.
  • A lot of the older books have many, many typos, and many books have the wrong date. I wrote an amusing (I hope) blog post about this, 
  • Culturomics, the hyperbolically named organization that prepared the data and made the search tool, warns that data before 2000 should not be compared to data from 2000-2008; they don't explain why, but a reasonable hypothesis would be that the availability of electronic documents drastically changed the nature of the underlying documents and thus their word frequencies.
I have downloaded the over 6 GB of data, converted the words to lowercase, removed part-of-speech tags, converted dates to the decade year and aggregated the results. That means, for example, all of the following 72 entries were collapsed into only one word, "after":

aFTer  AftEr      aftEr_ADP  AfTER_DET   AFteR_NOUN
afteR  AfTEr      AftEr_ADP  aFter_DET   AFTER_NUM
AfTer  after      aFter_ADP  AfTer_NOUN  after_PRT
AFtER  AFter      after_ADP  AfTER_NOUN  AFTER_PRT
afTer  after_ADJ  afTer_ADP  afTer_NOUN  AfTER_VERB
AftER  After_ADJ  after_ADV  aFter_NOUN  AFTER_VERB
AfteR  After_ADP  AFter_DET  aFTER_NOUN  

And then, for example, the 10 entries for "after" for 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998 and 1999 are aggregated to only one entry, 1990.

This makes the dataset much easier to use, small enough to hold in memory for most computers, and it smooths out some of the weirdness like all of the different capitalizations.

If you end up using this data, I'd love it if you dropped me a line. Enjoy.
• • •

Wednesday, May 13, 2015

Let's include Europe in The Great Grid Map Debate

I've been enjoying what Nathan Yau calls 'The Great Grid Map Debate', as different chloropleths of the United States with equal-area states have been proposed. I thought we should make this a little less American-centric, so I tried my hand at Europe:
Really tiny countries like Andorra, Liechtenstein, Monaco and Vatican City were left out; I extended Europe as far to the east as the most genererous definitions of Europe I could find allowed, they can always be left out depending on the dataset. I hope I got all the ISO three-letter abbreviations right, I had memorized the three-letter IOC abbreviations when I was a kid and they kept jumping into my head.

Here's the NPR graphic that started it all:

And here's my current favorite, a four-hex grid, posted by Jason Emory Parker of The Post and Courier this morning:

• • •

Thursday, May 7, 2015

Methodology for Most characteristic words in pro- and anti-feminist tweets

Here I discuss the methodology used in the prooffreader.com post Most characteristic words in pro- and anti-feminist tweets. I both describe what we did (my team and I) and explain the reasoning behind it. I try not to make it too technical.

The team consisted of myself, Zafarali Ahmed, Jerome Boisvert-Chouinard, Dave Gurnsey, Nancy Lin, and Reda Lotfi. We won the Data Science and the Natural Language Processing prizes at the Montreal Big Data Week Hackathon on April 19, 2015.

The code for this post is available in my Github. The code we created during the hackathon is in teammate Zaf's Github, and includes some features we'd like to eventually include, such tweet frequency tracking and topic modelling. The script downloads the dataset from my website; if you want, you can download it yourself for perusal here (it's 248 MB). If you just want to see the log-likelihood results, they're here. And if you'd like to see if you agree with our manual curation, the set is here. Finally, the actual code that searches Twitter is here, I didn't put it in the Github because it's embarassingly poorly hacked together, but it works (which gives me not enough incentive to clean it up).

1. Data collection: I wrote a simple Python script that uses the free Twitter Search API and runs every 15 minutes. I have a list of about 20 search terms I'm interested in; I randomly collect the last 100 tweets of each until the limit maxes out. Additionally, at random intervals about once per day, I collect the results for only one search term, as many results as possible. Between January and April 2015, I collected 988,000 tweets containing "feminism", "feminist" and/or "feminists".

2. Curation: We used a simple script to manually curate 1,000 randomly chosen tweets into three categories: pro-feminist, anti-feminist, and other (e.g. neither, neutral, a news report, we couldn't tell, not in English, etc.)

You may have heard of sentiment analysis; this is what we called 'attitude analysis' and it's more difficult. A typical pipeline of sentiment analysis uses curated movie reviews and makes inferences from texts containing words like 'hate', 'love', 'like', 'terrific', 'awful', i.e. positive, negative, subjective and/or non-subjective language.

For example, look at these two hypothetical tweets:
    1. Man, do I ever hate feminists.
    2. I hate that my mom does not like the word 'feminism'.

Sentiment analysis will typically consider these tweets similar due to their use of 'hate', but they're diametrically opposed in attitude. By using machine learning we try to see if other words ('man', 'ever', 'feminists' vs 'feminism', 'mom', etc.) are good predictors of the underlying attitude. (We verified that sentiment analysis was not up to the task, you can see it in the Github repo).

The manual curation was time-consuming, especially since we really wanted to get it right. If we were in any doubt (e.g. if we thought the tweet might be sarcastic) we classified it as 'other'. The final count was about 50% neither, 25% pro and 25% anti.

3. Removal of 'low-effort' tweets. We wanted tweets that people spent time writing, and chose their words deliberately (if not always carefully). In other words, we wanted to eliminate anything that could be tweeted with just a click of a button, like retweets or those 'Share me' links on websites. From our original 988,000 we were left with 390,000 after this adjustment. (People sure do retweet/share a lot). (BTW, to minimize the shared tweets, we just eliminated duplicates, leaving only one copy in the database).

4. Tokenizing. Since tweets are limited to 140 characters, we did not eliminate stopwords (articles, pronouns, linking verbs, common prepositions, etc.) in the theory that they were carefully chosen and significant. We eliminated very common punctuation, but considered most non-alphabetic characters (quotation marks, question marks, special characters, emoji, etc.) as separate tokens. (This let us see if one group quoted, questioned, exclaimed, etc. more than another). We also considered 'curly quotes' different from 'straight quotes', since they are used most often in copied-and-pasted text, but we did not see a significant difference between groups once duplicate tweets were removed. (We saw a big difference in favor of pro-feminism before the removal, indicating they are the ones who are 'sharing' from traditional news sites.)

5. Classification. We used machine learning to classify the 390,000 tweets as pro- or anti-feminist (or other) based on our 1,000 tweets. We used a bag-of-words approach, i.e. taking the set of all words used in all tweets, and seeing which words are used in individual tweets and whether any of those individual words are good predictors of the class (pro-, anti-, other) of the tweet. We used a Naive Bayes classifier because it's commonly used in NLP (Natural Language Processing) for bag-of-words tasks. We sequestered 25% of our curated tweets to test our classifier; it classified about 90% of the non-sequestered tweets correctly, but only about 40% of the sequestered tweets correctly. This is called 'overfitting', and the obvious solution is to curate more tweets. (Maybe Mechanical Turk could help?). Still, one important measure of false-positive rates was around 60%, which is pretty good for twitter data, indicating that we weren't getting a lot of misclassifications inside two classes of interest, pro- and anti-; most of our errors were false negatives, i.e. real pro- and anti- tweets being classified as 'other'. We can live with that; false positives in this scenario are far worse than false negatives.

6. We calculated the log-likelihood of every word/token that appeared in both pro- and anti- tweets at least 10 times, as a measure of how characteristic they were to one class as opposed to the other. It's a measure of significance, i.e. a form of p-value, which can be misused, but we're using it properly.

My best explanation in layman's terms is, we choose a word and see how common it is in each dataset. The log-likelihood is a measure of how 'surprised' we would be if we mixed up the two datasets and divided them randomly, and then came up with a similarly unequal distribution. In other words, it's a measure of the odds that our observation is due to chance rather than the true nature of our datasets.

Log-likelihood is a handy measure because it takes into account both the ratio and the absolute values of frequency differences between sets. In other words, if "goldilocks" (to pick a word totally at random) appeared 10 times in the anti-feminist tweets and 20 times in the pro-feminist tweets, it would have a lower log-likelihood than if it appeared 100 times and 200 times, respectively, even though the ratio between them is the same. How much lower depends on the total size of the dataset; we're more 'surprised' to find differences in large datasets (which are more predictable) than in small ones.

It's been my experience that log-likelihood is pretty robust with imperfectly classified data, like this. In other words, the most characteristic words might not be exactly the same words in the same order if the classification were better, but there would be very little substantive change, i.e. words changing their relative values spectacularly.

I'm interested in comments and question, feel free! I have a thick skin and I acknowledge I'm capable of mistakes!

• • •

Monday, May 4, 2015

A simple progress 'bar' for IPython Notebooks

Doing data science, I often start loop functions without a clear idea of how long they'll take. When working with exceptionally huge datasets, it can be hours. That's why I created this quick and dirty progress bar (okay, there's no bar, it's a counter to 100) so I can judge whether I should wait, get up to make some coffee, go do laundry, or spin up a large AWS instance to get the job done.

The Github repo is here. If you're wondering how I made an IPython notebook look somewhat pretty on a blog, I blogged about it here

# code for the progress bar

import time

class ProgressBar: 
    def __init__(self, loop_length):
        import time
        self.start = time.time()
        self.increment_size = 100.0/loop_length
        self.curr_count = 0
        self.curr_pct = 0
        self.overflow = False
        print('% complete: ', end='')
    def increment(self):
        self.curr_count += self.increment_size
        if int(self.curr_count) > self.curr_pct:
            self.curr_pct = int(self.curr_count)
            if self.curr_pct <= 100:
                print(self.curr_pct, end=' ')
            elif self.overflow == False:
                print("\n* Count has gone over 100%; likely either due to:\n  - an error in the loop_length specified when " + \
                      "progress_bar was instantiated\n  - an error in the placement of the increment() function")
                print('Elapsed time when progress bar full: {:0.1f} seconds.'.format(time.time() - self.start))
                self.overflow = True

    def finish(self):
        if 99 <= self.curr_pct <= 100: # rounding sometimes makes the maximum count 99.
            print("100", end=' ')
            print('\nElapsed time: {:0.1f} seconds.\n'.format(time.time() - self.start))
        elif self.overflow == True:
            print('Elapsed time after end of loop: {:0.1f} seconds.\n'.format(time.time() - self.start))
            print('\n* End of loop reached earlier than expected.\nElapsed time: {:0.1f} seconds.\n'.format(time.time() - self.start))
# normal usage, on my slow crappy laptop

loop_length = 1000000

pbar = ProgressBar(loop_length)
for i in range(loop_length):
    # your code goes here
% complete: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 100 
Elapsed time: 1.7 seconds.

# here's what happens if the loop lengths are mismatched so that
# the progress bar expects fewer iterations than there are

loop_length = 1000000

pbar = ProgressBar(loop_length/2)
for i in range(loop_length):
    # your code goes here
% complete: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 
* Count has gone over 100%; likely either due to:
  - an error in the loop_length specified when progress_bar was instantiated
  - an error in the placement of the increment() function
Elapsed time when progress bar full: 0.8 seconds.
Elapsed time after end of loop: 1.5 seconds.

# here's what happens if the loop lengths are mismatched so that
# the progress bar expects more iterations than there are

loop_length = 1000000

pbar = ProgressBar(loop_length*2)
for i in range(loop_length):
    # your code goes here
% complete: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
* End of loop reached earlier than expected.
Elapsed time: 1.6 seconds.

• • •

Tuesday, April 28, 2015

Fan Commentary for 'Birdman' (2014 film)

How did we end up here?

Birdman is one of the more interesting films I've seen in a while; for one thing, it's presented as if it were filmed all in one take (digitally edited together), which means a lot of the traditional film visual vocabulary has to be substituted for things that will work with a single camera. Plus there's magical realism, meta-elements, references to other films and literature, some thematic ambiguity... all in all, it's exactly the kind of thing I could enthuse about or critique or speculate about for the whole of the film, and I did.

There's a YouTube version, where I took one screenshot every thirty seconds just to give it some video; the idea is you sync it up with the movie (I count down to a place near the beginning) and listen while you watch, but feel free to just watch the video if you're familiar enough with the film.

There's also a downloadable MP3/Podcast version hosted at yourlisten.

• • •