Tuesday, May 27, 2014

Methodology and analysis of letter distributions blog post

Link to prooffreader.com blog post: Graphing the distribution of English letters towards the beginning, middle or end of words

Here's the graphic (click to enlarge):

The code for this is in this Github repo, or you can view the IPython notebook at nbviewer.

1. Dataset

As I explained on prooffreader.com, I used the Brown corpus, which may not be everyone's first choice for some applications, but at the level of letter-by-letter analysis it seems fine. Here is a comparison of Brown with the same analysis done on the much larger and painstakingly curated COHA corpus from Brigham Young University (for which I have an academic license):

They're nearly identical; the letter z shows the most difference, probably because its rarity and its increased frequency of use in foreign words make it more sensitive to differences between corpora (for example, as a quick check, COHA has many more words with the 'sz' bigram, but the few that are in Brown are at much greater relative frequency because of the smaller size of the corpus).

2. Quantization

Graphing discrete data is nothing new, but this is the first time I've tried to graph ordinal data (i.e. 1st, 2nd, ..., last) with such widely different scales (the most common word, "the", has three letters, literally a beginning, middle and end, and then there are some words with more than 10 letters). I approached it as sort of a variation on the Spearman correlation; it's not the actual numbers that matter, but the trends and comparisons among and between them.

Here's where binning is an effective and intuitive approach; to make words of different size comparable I just used a proportionate scale; here's an example for a four-letter word with five bins:
Letter f, bin 2 would have 25% of the word's frequency added to it, and letter o, bin 2, 75%. (We want to weight the analysis by word frequency, so that "four" makes a larger contribution than "fourscore"; otherwise we're de facto weighting the analysis towards rare words.)
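The proportional split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the notebook's actual code: the names add_word and new_totals are hypothetical, the example word is assumed to be "four", and the frequency of 100 is made up. Each letter is treated as an equal slice of the interval from 0 to 1, and each bin receives the word's frequency times the fraction of the bin's width that the letter covers.

```python
from collections import defaultdict
from fractions import Fraction

def new_totals(n_bins):
    """Hypothetical container: letter -> list of per-bin weighted totals."""
    return defaultdict(lambda: [Fraction(0)] * n_bins)

def add_word(totals, word, freq, n_bins):
    """Spread each letter of `word` over `n_bins` equal-width position bins."""
    n = len(word)
    if n < 2:  # one-letter words have no beginning or end, so they are skipped
        return
    for i, letter in enumerate(word):
        lo, hi = Fraction(i, n), Fraction(i + 1, n)  # slice of [0, 1) the letter occupies
        for k in range(n_bins):
            b_lo, b_hi = Fraction(k, n_bins), Fraction(k + 1, n_bins)
            overlap = min(hi, b_hi) - max(lo, b_lo)
            if overlap > 0:
                # Fraction of this bin's width covered by the letter, times word frequency
                totals[letter][k] += freq * overlap * n_bins

totals = new_totals(5)
add_word(totals, "four", 100, 5)          # 100 is a made-up word frequency
print([float(x) for x in totals["f"]])    # [100.0, 25.0, 0.0, 0.0, 0.0]
print([float(x) for x in totals["o"]])    # [0.0, 75.0, 50.0, 0.0, 0.0]
```

Using Fraction keeps the overlaps exact, so bin 2 gets exactly 25% of the frequency from f and 75% from o, matching the figures in the text.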

Two-letter words were split between two bins, but one-letter words were omitted entirely; they don't fit the beginning-end paradigm. Here is the effect the word "a" would have had:
The effect isn't dramatic, but it's misleading towards the right side of the graph. Should the letter "a" contribute to the distribution at both the beginning and the end of a word? I really don't think so. "Aa", though (which does appear in both corpora) would.

3. Number of bins

Since the exact center of a word is an important, logical and instinctual reference, it made sense to have an odd number of bins so that a single bin straddles this point. With any histogram, too few bins means you smear out the data and hide its patterns, and too many bins means you're showing spurious noise instead of signal. My gut instinct was to have fewer bins, somewhere around the average word length (which was 4.45 letters), to reflect the somewhat fudged, semi-quantitative and ordinal nature of the data and not lead the viewer to think the data is more finely grained than it is. So I started out with five bins, and it's a good thing I've learned to verify my gut instincts, because I think it works better with more; I ended up using 15 bins. Here's a comparison of them all:

You could certainly make a case for choosing fewer than 15, and the "optimal" number of bins depends on (a) the individual letter, and (b) what you consider optimal. I think 3 or 5 bins hide some interesting details, like the fact that "x" is much more often the second letter of a word than the first; I chose 15 instead of 7 because it looks like many letters are trending towards the shape they have in 15, and noise only starts to disrupt signal there at letters like "n". (There's a spurious peak before the middle of "n" that first appears at 15 bins; it disappears if you leave the #2 word overall, "and", out of the analysis. But perhaps that's too early to be worrying about spurious peaks; there are more at 21 bins, e.g. the "f" in "before", the #89 word, and they're rampant at 51.)

4. Ordinate (y) axes

There's a fundamental problem with comparing histograms: if the y axes aren't identical, then the area the bars take up makes wider distributions look more significant because the space under the histogram is greater. In this example, all of the data in one bar on the left is distributed in nine bars; the total amount of blue ink is the same, so the graphs are comparable (pretty much; the human brain has proportion-perception issues, which we'll get to later). If you autoscale the nine bars, however, you seem to be communicating that there's more "stuff" in that graph than in the first, even though they're showing the same total quantity.
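To put numbers on the one-bar versus nine-bar example, here is a small sketch. The bar values are made up, and the ink helper is a hypothetical name for "total bar height drawn, measured in units of the plot height".

```python
def ink(bars, y_max):
    """Total bar height drawn, in units of the plot height, for a given y-axis maximum."""
    return sum(bars) / y_max

narrow = [9, 0, 0, 0, 0, 0, 0, 0, 0]  # all the data in one bar
wide = [1] * 9                        # the same total spread over nine bars

# Shared y axis: both plots use the same maximum, so the ink matches.
shared_max = max(max(narrow), max(wide))
print(ink(narrow, shared_max), ink(wide, shared_max))   # 1.0 1.0

# Autoscaled y axis: each plot is stretched to its own maximum.
print(ink(narrow, max(narrow)), ink(wide, max(wide)))   # 1.0 9.0
```

With autoscaling, the wide distribution draws nine times as much ink even though both lists sum to the same total.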

Keeping the y scale immutable is a good approach when there isn't too much variation in that scale, but in this case the frequency range of the most common letter, "e", is over 100 times that of the least common, "z". Here's a grid of all the approaches I considered (and if you have any suggestions I didn't consider, I'd be very interested to hear them). If you look at the top left, "a", and bottom left, "z", you'll see the "z" has been scaled down so far you can't see it at all.

In these cases, the common solution is to use a log scale; you can see it in the second column. The results are fine; there's nothing wrong with them at all. The only problem is that it's a data transformation that does not serve the story of the data (in my opinion). Allow me to explain; I'll just take two letters from the above, "j" and "l"; the former is very skewed toward the beginning of words, the latter is more evenly distributed.

Logarithms are counterintuitive, and data scientists who are used to them tend to forget how difficult it was at first to internalize how they work, even if they picked up the algorithms quickly. (Look at the spacing on a slide rule.) I've presented log-scale graphs to scientists with Ph.D.s, and spent more time explaining the ramifications of the logarithms than the actual data.
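A quick numerical illustration of why log axes are hard to read at a glance. The axis limits and values below are made up, and log_height is a hypothetical helper giving the fraction of the plot height at which a value is drawn.

```python
import math

def log_height(value, axis_min, axis_max):
    """Fraction of the plot height at which `value` is drawn on a log10 axis."""
    span = math.log10(axis_max) - math.log10(axis_min)
    return (math.log10(value) - math.log10(axis_min)) / span

# Hypothetical axis spanning four decades, from 1 to 10,000.
print(round(log_height(10000, 1, 10000), 3))  # 1.0
print(round(log_height(5000, 1, 10000), 3))   # 0.925
print(round(log_height(100, 1, 10000), 3))    # 0.5
```

On this axis a bar with half the frequency of the tallest one still reaches about 92% of its height, while a bar at one hundredth of the frequency reaches the halfway mark. That compression is what makes a letter's distribution look flatter than it is.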

Also, is the difference in scale the "story" of the graph? (Graphs are inherently narrative; the ease with which one can make them perfectly accurate yet entirely misleading shows this; I always say a graph has the same scope, strengths and weaknesses as a paragraph.) We can make out the "j", but its scale is compressed; is the purpose of this graph to show the relative letter frequencies, or to show their distribution? It's the latter. The two properties are interrelated, but I chose to divorce them as much as possible, while keeping the frequency information as a color scale.

The color scale accomplishes two things: it keeps the frequency information, yet abstracts it to a semi-quantitative form (color differences are inherently semi-quantitative; the human brain cannot determine whether something is "twice as red", just "more red"). So the information is there, but it's not tightly married to the actual individual graphs. I think color-scaled graphing works best when it's additional relevant data that's not absolutely crucial to understanding the visualization.

And yet, autoscaling (normalizing the y axis so it is always from 0 to 100% of the maximum frequency of each letter) did not accomplish this goal; we're back to the histogram scaling problem. In the middle column, it's obvious "l" is more frequent than "j", because it has more ink (so to speak).

Well, we could always integrate the data and normalize the area under the curve, so that every plot has exactly the same amount of ink, as in the second-last column. Now we're up against a perception problem that I alluded to earlier; there's the same amount of ink, but because the 'j' has it so clustered together, it looks like it's a lot more.
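The two normalizations can be compared on toy data. The bin values for "j" and "l" below are invented to mimic a front-loaded and an even distribution, and the function names are hypothetical.

```python
def equal_height(bins):
    """Autoscale: divide by the letter's own peak so every plot tops out at 1."""
    peak = max(bins)
    return [b / peak for b in bins]

def equal_area(bins):
    """Normalize the area: divide by the letter's own total so every plot sums to 1."""
    total = sum(bins)
    return [b / total for b in bins]

j = [8, 1, 1, 0, 0]        # made-up, skewed towards the start of words
l = [20, 20, 20, 20, 20]   # made-up, evenly spread

# Equal height: "l" draws four times as much ink as "j".
print(sum(equal_height(j)), sum(equal_height(l)))  # 1.25 5.0

# Equal area: same ink, but the "j" peak is four times taller.
print(max(equal_area(j)), max(equal_area(l)))      # 0.8 0.2
```

Each normalization removes one visual cue for frequency and leaves the other: equal height leaves more ink for the evenly spread letter, and equal area leaves a much taller peak for the clustered one.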

So I went with a compromise: the average of equal height and equal area. Often compromises in data visualization obscure data and its meaning, but I think it works in this case. The last column's values are not comparable between plots -- but the data itself is semi-quantitative binning of ordinals, so it's inherently non-comparable on a fine scale to begin with, and I think this approach accomplishes the goal of the graph.
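The post doesn't spell out exactly how the two scalings were averaged, so the following is one possible reading, not the notebook's method. It averages the equal-height curve and the equal-area curve point by point, after multiplying the equal-area curves by one shared constant so that the most peaked letter reaches a height of 1. That shared constant, the function name compromise and the toy data are all assumptions.

```python
def compromise(profiles):
    """Average of equal-height and equal-area scaling for each letter.

    profiles: dict mapping letter -> list of non-negative bin values
    (each list needs at least one non-zero value).
    """
    # Shared constant (an assumption): chosen so the most peaked letter's
    # equal-area curve tops out at exactly 1, like its equal-height curve.
    shared = min(sum(v) / max(v) for v in profiles.values())
    scaled = {}
    for letter, v in profiles.items():
        peak, total = max(v), sum(v)
        scaled[letter] = [0.5 * (x / peak + shared * x / total) for x in v]
    return scaled

toy = {
    "j": [8, 1, 1, 0, 0],        # made-up, skewed towards the start of words
    "l": [20, 20, 20, 20, 20],   # made-up, evenly spread
}
scaled = compromise(toy)
print(max(scaled["j"]), max(scaled["l"]))  # 1.0 0.625
print(sum(scaled["j"]), sum(scaled["l"]))  # 1.25 3.125
```

On this toy data the "l" plot ends up with 2.5 times the ink of "j" (versus 4 times under equal height), and the "j" peak is 1.6 times taller (versus 4 times under equal area), so both distortions are softened without being removed.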

Feel free to disagree!

5. Ordinate (y) and abscissa (x) labeling

I chose not to label the axes; anything I put would be potentially misleading. I think it's best to leave the viewer's interpretation an intuitive one, yet make the information easily available if it's needed. A few arrow labels were enough.

As always, comments are welcome. I don't claim to have the optimal solution (if there even is one, which I highly doubt), but I put a lot of thought into finding an optimal solution, and I'd like to know if I could have done it better.

• • •


  1. Great post. The notebook link is 404, though.

  2. Whoops! I reorganized that repo and forgot to fix the link! Thanks for letting me know, I fixed it.
