Monday, August 17, 2015

A Python script to make choropleth grid maps

In May 2015, there was a sudden fad in the Dataviz community (on Twitter, anyway) for hexagonal grid-type choropleth maps. A choropleth (not "chloropleth") is a map in which areas are filled in with a color whose intensity and/or hue is proportional to a quantity; we've all seen them. The problem with traditional choropleths is that visual weight depends on area, so a choropleth of the United States, for example, emphasizes relatively deserted Wyoming over highly populated Massachusetts.

One recent partial solution has been to build choropleths in which every subunit gets equal representation as a square or hexagon. They're still not proportional to population (attempts to clear this last hurdle generally result in ugly and/or unrecognizable maps), but they're pretty cool nonetheless.

As soon as the hubbub started, I thought it would be relatively painless and an interesting exercise to code up a Python script that would make these choropleths. Since they're geometric, it's a cinch to output SVG vector markup. I got about 90% of the way through this project in three weeks, and then a new job and some health issues interfered. But I finally found a weekend to finish it off, in at least a beta, v.0.1 way.

The script is hosted on my GitHub, and here's the IPython notebook/Jupyter tutorial that goes along with it. (There's also a script called Colorbin that's supposed to remove the hassle of associating colors with quantities for the choropleths.)
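To give a flavor of why SVG output is such a cinch here: a hex grid is just trigonometry. Here's a minimal sketch of the idea, not the actual script; the function names, the offset-coordinate layout, and the color scheme are all my own invention:

```python
import math

def hex_points(cx, cy, r):
    """Vertices of a flat-top hexagon centered at (cx, cy) with circumradius r."""
    return [(cx + r * math.cos(math.radians(60 * i)),
             cy + r * math.sin(math.radians(60 * i))) for i in range(6)]

def hex_grid_svg(cells, r=20):
    """cells maps (col, row) grid coordinates to a fill color.
    Returns an SVG string with one <polygon> per cell."""
    polys = []
    for (col, row), color in sorted(cells.items()):
        # Offset coordinates: odd columns are shifted down half a hex height.
        cx = 1.5 * r * col + r
        cy = math.sqrt(3) * r * (row + 0.5 * (col % 2)) + r
        pts = " ".join(f"{x:.1f},{y:.1f}" for x, y in hex_points(cx, cy, r))
        polys.append(f'<polygon points="{pts}" fill="{color}" stroke="#fff"/>')
    return ('<svg xmlns="http://www.w3.org/2000/svg">'
            + "".join(polys) + "</svg>")

# Three cells, three fills -- write the result straight to a .svg file.
svg = hex_grid_svg({(0, 0): "#fee0d2", (1, 0): "#fc9272", (0, 1): "#de2d26"})
```

Because the output is plain text, there's no drawing library to depend on: build the polygon strings, concatenate, and save.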

Feedback is welcome. Here are some examples of what the script has produced:

• • •

Saturday, August 8, 2015

Dataset: Single word frequencies per decade from Google Books

I have crunched a public English language dataset in order to remove information that is least likely to be of interest to users, and I offer it for anyone to download:

[1 GB] Google 1-grams English v.20120701 by decade, lowercase, no parts of speech (zipped csv)

The original dataset is Creative Commons Attribution 3.0 Unported License.


Google Ngram Viewer is an online tool to track the uses of English words from the 16th century to 2008, with the following caveats, among others:

  • It only contains words that were used at least 40 times in any given year, in order to preserve copyright (so you can't tell exactly what book a given word appeared in). This means, for example, if a word occurs 40 times in 1970, 39 times in 1971 and 41 times in 1972, in the database the word will occur 40 times in 1970, 0 times in 1971 and 41 times in 1972.
  • The database is mostly based on library books, so it is heavily biased towards the types of books found in libraries; this includes, for example, directories of names. It is also biased towards the availability of books in a given year, so, for example, 1994 will be much more representatively represented (so to speak) than 1731.
  • A lot of the older books have many, many typos, and many books have the wrong date. I wrote an amusing (I hope) blog post about this.
  • Culturomics, the hyperbolically named organization that prepared the data and made the search tool, warns that data before 2000 should not be compared to data from 2000-2008; they don't explain why, but a reasonable hypothesis would be that the availability of electronic documents drastically changed the nature of the underlying documents and thus their word frequencies.
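The effect of the 40-occurrence cutoff in the first caveat above can be sketched in a couple of lines. This mirrors my description of the published data, not Google's actual pipeline:

```python
def apply_threshold(counts, minimum=40):
    """Mimic the Ngram corpus cutoff: a year in which a word appears
    fewer than `minimum` times is reported as zero occurrences."""
    return {year: (n if n >= minimum else 0) for year, n in counts.items()}

# The example from the text: 39 occurrences in 1971 vanish entirely.
print(apply_threshold({1970: 40, 1971: 39, 1972: 41}))
# → {1970: 40, 1971: 0, 1972: 41}
```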
I have downloaded the more than 6 GB of data, converted the words to lowercase, removed the part-of-speech tags, collapsed each year to its decade, and aggregated the results. That means, for example, all of the following 72 entries were collapsed into only one word, "after":

aFTer  AftEr      aftEr_ADP  AfTER_DET   AFteR_NOUN
afteR  AfTEr      AftEr_ADP  aFter_DET   AFTER_NUM
AfTer  after      aFter_ADP  AfTer_NOUN  after_PRT
AFtER  AFter      after_ADP  AfTER_NOUN  AFTER_PRT
afTer  after_ADJ  afTer_ADP  afTer_NOUN  AfTER_VERB
AftER  After_ADJ  after_ADV  aFter_NOUN  AFTER_VERB
AfteR  After_ADP  AFter_DET  aFTER_NOUN  

And then, for example, the 10 entries for "after" for 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998 and 1999 are aggregated to only one entry, 1990.

This makes the dataset much easier to use, small enough to hold in memory for most computers, and it smooths out some of the weirdness like all of the different capitalizations.
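The crunching described above boils down to a normalize-and-sum pass over the raw rows. A minimal sketch, assuming rows of (ngram, year, count) as in the Google 1-gram files; the function name and the underscore-based tag stripping are my own simplifications, not the exact code I ran:

```python
from collections import defaultdict

def aggregate(rows):
    """Lowercase each ngram, strip a trailing _POS tag if present,
    collapse each year to its decade, and sum the counts."""
    totals = defaultdict(int)
    for ngram, year, count in rows:
        head, sep, tail = ngram.lower().rpartition("_")
        # "after_noun" -> head "after"; plain "after" has no "_",
        # so rpartition leaves the whole word in the tail.
        word = head if sep else tail
        decade = year - year % 10
        totals[(word, decade)] += count
    return dict(totals)

# "After_NOUN" in 1994 and "after" in 1991 both land on ("after", 1990).
print(aggregate([("After_NOUN", 1994, 10), ("after", 1991, 5), ("AFTER", 2003, 2)]))
# → {('after', 1990): 15, ('after', 2000): 2}
```

The real tag set is small (_NOUN, _VERB, _ADP, and so on), so a stricter version could strip only known tags rather than anything after an underscore.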

If you end up using this data, I'd love it if you dropped me a line. Enjoy.
• • •