The code is in this gist.
Here is the graphic for that post:
DATASET: Links are in the gist. Note that Pope Francis's address is the one published and given to the media, not the one actually delivered. This may well be the case for inaugural addresses as well.
TFIDF: A standard technique in NLP (natural language processing): Wikipedia entry. I removed the default English stopwords in the TextBlob module (although, of course, TFIDF will give low scores to stopwords anyway).
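As a rough illustration of this step, here's a minimal pure-Python sketch of TF-IDF plus cosine similarity — not the TextBlob code in the gist — with tiny made-up "speeches" standing in for the real corpus:

```python
import math
from collections import Counter

# Toy placeholder documents, already tokenized and stopword-free.
speeches = {
    "pope": "dialogue dignity common good dialogue".split(),
    "washington": "freedom dignity common good".split(),
    "polk": "war territory expansion war".split(),
}

def tfidf(doc, docs):
    """TF-IDF vector for one document against the whole collection."""
    tf = Counter(doc)
    n = len(docs)
    return {
        w: (count / len(doc))
           * math.log(n / sum(1 for d in docs.values() if w in d))
        for w, count in tf.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = {name: tfidf(doc, speeches) for name, doc in speeches.items()}
sims = {name: cosine(vecs["pope"], vecs[name])
        for name in speeches if name != "pope"}
```

Documents sharing no weighted terms with the pope score 0; shared terms push the similarity toward 1.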
CALCULATION OF SIMILARITY BY PARTY: The average cosine similarities for presidents by party were:
republican: 0.825
democratic: 0.832
other: 0.891
range: 0.743-0.968
In order to create a metric that would be useful when hacking the t-SNE, below, I calculated the % of presidents from each party who numbered among the top half of similarity scores:
republican: 0.417
democratic: 0.500
other: 1.000
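The top-half metric can be sketched like this; the presidents, party labels, and similarity scores below are invented, not the real values:

```python
# Made-up (party, cosine similarity to the pope) pairs.
sims = {
    "lincoln": ("republican", 0.81),
    "fdr": ("democratic", 0.84),
    "washington": ("other", 0.93),
    "reagan": ("republican", 0.76),
}

# Presidents whose similarity lands in the top half of all scores.
ranked = sorted(sims, key=lambda p: sims[p][1], reverse=True)
top_half = set(ranked[: len(ranked) // 2])

# Fraction of each party's presidents that made the top half.
parties = {party for party, _ in sims.values()}
pct_in_top = {
    party: sum(1 for p in top_half if sims[p][0] == party)
           / sum(1 for p in sims if sims[p][0] == party)
    for party in parties
}
```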
TSNE: My favorite manifold learning algorithm, but it's not without its problems, mostly in terms of reproducibility. This is less of a problem when it's used to show similarities in a group in general, because it spreads the error inherent in such dimensionality reduction around to all points (a different way each time it's run), so in general it minimizes error at each point. But this is not what I wanted here. The thrust of this analysis is to portray the similarity between one point (the pope) and every other point (the presidents); I therefore needed to privilege this point (and similarity vector), minimize its error the most, and spread whatever error I saved among the other points. In other words, the president-pope distances would be more accurate than the president-president distances.
There are several possible solutions to this problem; I went with a brute-force hack! I simply re-ran the t-SNE over and over until (a) the pope point was more or less in the center, and (b) the percentages of republicans, democrats and others closest to the pope were reasonably close to those in the similarity matrix.
Oh, and there was a third criterion: the t-SNE had to look aesthetically pleasing, in a more or less globular shape without too many points overlapping so the mouseovers weren't annoying.
Then it just became a matter of tuning. If my criteria were too strict, I'd never find a solution. Less strict, and I'd find a solution every few minutes, but it might not look nice and I'd have to start again. Finally, I went with a solution (% democrats in top half - % republicans in top half > 4%, % other in top half - % democrats in top half > 8%) that took about 50 iterations and less than a minute to find, and ran it a few times until I got something that 'looked nice'.
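The rerun loop looked roughly like this; note that `fake_tsne` is a random stand-in for the real t-SNE call (e.g. scikit-learn's `TSNE`), and only criterion (a) is checked, so treat this as a sketch of the control flow, not the actual script:

```python
import random

def fake_tsne(n_points, seed):
    """Stand-in for a t-SNE embedding: random 2-D points in [-1, 1]."""
    rng = random.Random(seed)
    return [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_points)]

def acceptable(points, pope_idx=0, tol=0.4):
    # Criterion (a): the pope point is roughly centered. Criteria (b)
    # and (c) -- party percentages and aesthetics -- would go here too.
    x, y = points[pope_idx]
    return abs(x) < tol and abs(y) < tol

# Re-run with a fresh seed until the layout passes the criteria.
seed = 0
while not acceptable(fake_tsne(10, seed)):
    seed += 1
embedding = fake_tsne(10, seed)
```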
If anyone has ideas for a less hacky way to solve this problem (and yes, I tried using graphs, but they ended up just too darn symmetrical-looking), I'd love to hear it!
TOP THREE CHARACTERISTIC SIMILAR WORDS MOUSEOVER: Finding the top three most characteristic words shared between the pope and each president was another hack. My first thought was to use a simple Dunning log-likelihood test, with one corpus the pope's speech concatenated with the speech of the president in question, and the other corpus the speeches of all of the other presidents. Dunning usually does a good job of determining words that are overrepresented in one corpus, and penalizes words that are both too rare and too common to be statistically significant. But of course, the pope used a bunch of words that no president used, like 'Merton', whom he mentioned five times. So I tried a few variations on the theme, and ended up using, for corpus 1, the maximum frequency (I used a bag of words, not TFIDF) of each word that both the pope and the president used. Again, it's a little hacky. I also added +1 to every word count in both corpora so there were no divide-by-zero errors and the algorithm didn't ignore words that the pope and president used but no other president did -- these are definitely of interest in this analysis.
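Here's a sketch of that smoothed Dunning-style scoring, with toy word counts standing in for the real speeches; the exact smoothing bookkeeping in the gist may differ:

```python
import math
from collections import Counter

# Toy placeholder bags of words.
pope = Counter("mercy dialogue dialogue dream merton merton".split())
president = Counter("dream dialogue liberty liberty union".split())
others = Counter("liberty union union war commerce".split())

# Corpus 1: for each word used by BOTH the pope and this president,
# take the larger of the two raw counts.
shared = {w: max(pope[w], president[w]) for w in pope.keys() & president.keys()}

def dunning(a, b, n1, n2):
    """Dunning log-likelihood (G^2) for one word with counts a, b
    in corpora of sizes n1, n2."""
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    return 2 * (a * math.log(a / e1) + b * math.log(b / e2))

# +1 smoothing keeps words the other presidents never used in play.
n1 = sum(shared.values()) + len(shared)
n2 = sum(others.values()) + len(shared)
scores = {w: dunning(shared[w] + 1, others[w] + 1, n1, n2) for w in shared}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
```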
Again, anyone who has a better solution, I'm all ears!
TOP POPE WORDS & PRESIDENTS WHO USED THEM: I took the top 100 words used by the pope in terms of tf-idf, so it penalized words he used rarely, and words few or no presidents used. Then I determined the tf-idf for each word for each president, listed the top three presidents in terms of tf-idf, and indicated how many presidents in total used the word.
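In outline, that lookup works like this — the tf-idf dictionaries are made-up placeholders, and I take the top 2 words instead of 100 to keep the sketch small:

```python
# Invented word -> tf-idf scores; not the real values.
pope_tfidf = {"mercy": 0.9, "dialogue": 0.7, "liberty": 0.1}
presidents_tfidf = {
    "lincoln": {"liberty": 0.6, "union": 0.8},
    "kennedy": {"liberty": 0.4, "dialogue": 0.3},
    "obama": {"dialogue": 0.5, "hope": 0.7},
}

# Top pope words by tf-idf (top 2 here; top 100 in the post).
top_words = sorted(pope_tfidf, key=pope_tfidf.get, reverse=True)[:2]

# For each word: the top three presidents by tf-idf, plus a count
# of how many presidents used the word at all.
table = {}
for word in top_words:
    users = {p: v[word] for p, v in presidents_tfidf.items() if word in v}
    top3 = sorted(users, key=users.get, reverse=True)[:3]
    table[word] = (top3, len(users))
```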
INTERPRETATION OF PARTY RESULTS: Here is a plot of the ranks of the cosine similarities against the ranks of the Jaccard similarities (which is simply the ratio of the intersection of terms used to the union of terms used) between the pope and each president. The rank is perfectly preserved, indicating that the preponderance of 'other' presidents among the most similar might be due to some combination of a large intersection of words in common and a small union of total words. In other words, maybe early presidents had low lexical diversity, short speeches, or both.
Plotting length of speeches and lexical diversity (number of unique words divided by total number of words) vs. cosine distance, however, shows that the green 'other' dots tend to be towards the left, but they're still spread out somewhat along the top third of the graph, indicating that perhaps the pope did indeed genuinely share more vocabulary with them, just possibly not to the extent that the cosine similarities alone might indicate.
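To make the intersection/union intuition concrete, here's a tiny Jaccard sketch with invented word sets, showing how a small union can inflate the score:

```python
def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

pope_words = {"mercy", "dialogue", "dream"}
early_pres = {"dialogue", "dream"}                           # short, low-diversity speech
modern_pres = {"dialogue", "freedom", "union", "economy", "war"}  # longer, more diverse

# The short speech scores higher despite sharing only one more word.
early = jaccard(pope_words, early_pres)    # 2 shared / 3 total
modern = jaccard(pope_words, modern_pres)  # 1 shared / 7 total
```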
Again, anyone with better tools to analyze this is welcome to contribute!