Tuesday, May 27, 2014

Methodology and analysis of letter distributions blog post

Link to prooffreader.com blog post: Graphing the distribution of English letters towards the beginning, middle or end of words

Here's the graphic:

The code is in this GitHub repo, or you can view the IPython notebook at nbviewer.

1. Dataset

As I explained on prooffreader.com, I used the Brown corpus. It may not be everyone's first choice for some applications, but at the level of letter-by-letter analysis it seems fine. Here is a comparison of Brown with the same analysis done on the much larger and painstakingly curated COHA corpus from Brigham Young University (for which I have an academic license):


They're nearly identical; the letter z shows the most difference, probably because its rarity and its more frequent use in foreign words make it more sensitive to differences between corpora (for example, as a quick check, COHA has many more words with the 'sz' bigram, but the few that are in Brown occur at a much greater relative frequency because the corpus is smaller).

2. Quantization

Graphing discrete data is nothing new, but this is the first time I've tried to graph ordinal data (i.e. 1st, 2nd, ..., last) with such widely different scales (the most common word, "the", has three letters, literally a beginning, middle and end, and then there are some words with more than 10 letters). I approached it as sort of a variation on the Spearman correlation; it's not the actual numbers that matter, but the trends and comparisons among and between them.

Here's where binning is an effective and intuitive approach; to make words of different size comparable I just used a proportionate scale; here's an example for a four-letter word with five bins:
Letter f, bin 2 would have 25% of the word's frequency added to it, and letter o, bin 2, 75%. (We want to weight the analysis by word frequency, so that "four" makes a larger contribution than "fourscore"; otherwise we're de facto weighting the analysis towards rare words.)
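As a minimal sketch of that proportional split (not necessarily the code in the repo), the function below gives every bin one full share of the word's frequency and divides it among the letters according to how much of the bin each letter covers. The name bin_word and the use of "four" as the example word are assumptions made for illustration; bins are zero-indexed in the code, so "bin 2" in the text is index 1.

```python
from collections import defaultdict


def bin_word(word, freq, n_bins):
    """Split a word's frequency across n_bins, letter by letter.

    Each bin receives `freq` in total, shared among the letters in
    proportion to the fraction of the bin each letter occupies.
    """
    n = len(word)
    contrib = defaultdict(lambda: [0.0] * n_bins)
    for i, letter in enumerate(word):
        # Letter i spans [i/n, (i+1)/n) of the word; express that in bin units.
        start = i * n_bins / n
        end = (i + 1) * n_bins / n
        for b in range(n_bins):
            overlap = min(end, b + 1) - max(start, b)
            if overlap > 0:
                contrib[letter][b] += freq * overlap
    return dict(contrib)


shares = bin_word("four", 1.0, 5)
print(shares["f"][1], shares["o"][1])  # 0.25 0.75
```

Under the same rule, a two-letter word gives the first letter the first half of the bins and the second letter the second half, with the central bin shared equally when the number of bins is odd.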

Two-letter words were split between two bins, but one-letter words were omitted entirely; they don't fit the beginning-end paradigm. Here is the effect the word "a" would have had:
The effect isn't dramatic, but it's misleading towards the right side of the graph. Should the letter "a" contribute to the distribution at both the beginning and the end of a word? I really don't think so. "Aa", though (which does appear in both corpora) would.

3. Number of bins

Since the exact center of a word is an important, logical and instinctual reference, it made sense to have an odd number of bins so that this point was straddled. With any histogram, too few bins means you smear out the data and hide its patterns, and too many bins means you're showing spurious noise instead of signal. My gut instinct was to have fewer bins, somewhere around the average word length (which was 4.45 letters), to reflect the somewhat fudged, semi-quantitative and ordinal nature of the data and not lead the viewer to think the data is more finely grained than it is.  So I started out with five bins, and it's a good thing I've learned to verify my gut instincts, because I think it works better with more; I ended up using 15 bins. Here's a comparison of them all:

You could certainly make a case for choosing fewer than 15, and the "optimal" number of bins depends on (a) the individual letter, and (b) what you consider optimal. I think 3 or 5 bins hides some interesting details like the fact that "x" is much more often the second letter of a word than the first; I chose 15 instead of 7 because it looks like many letters are trending towards the shape they have in 15, and noise only starts to disrupt signal there at letters like "n" (there's a spurious peak before the middle of "n" that first appears at 15 bins; it disappears if you leave the #2 word overall, "and," out of the analysis. But perhaps that's too early to be worrying about spurious peaks; there are more at 21 bins, e.g. the "f" in "before", the #89 word, and they're rampant at 51.)

4. Ordinate (y) axes

There's a fundamental problem with comparing histograms: if the y axes aren't identical, then the area the bars take up makes wider distributions look more significant, because the space under the histogram is greater. In this example, all of the data in the one bar on the left is distributed across nine bars; the total amount of blue ink is the same, so the graphs are comparable (pretty much; the human brain has proportion-perception issues, which we'll get to later). If you autoscale the nine bars, however, you seem to be communicating that there's more "stuff" in that graph than in the first, even though they're showing the same total quantity.
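The arithmetic behind that can be sketched in a few lines. The bar heights (one bar of 9 versus nine bars of 1) are invented for illustration, and ink is a hypothetical helper that measures filled area in units of one bar width times the full axis height.

```python
def ink(heights, y_max):
    """Total bar area, in units of (bar width x full axis height)."""
    return sum(h / y_max for h in heights)


one_bar = [9.0]        # all the data in a single bin
nine_bars = [1.0] * 9  # the same total spread evenly over nine bins

# Shared y axis: both plots run from 0 to the tallest bar overall.
shared_max = max(one_bar + nine_bars)
shared = (ink(one_bar, shared_max), ink(nine_bars, shared_max))

# Autoscaled: each plot's y axis stops at its own tallest bar.
autoscaled = (ink(one_bar, max(one_bar)), ink(nine_bars, max(nine_bars)))

print(shared)      # both values are (approximately) 1.0: same ink
print(autoscaled)  # the nine-bar plot now shows nine times the ink
```

With a shared axis the two plots carry the same ink; once each is autoscaled, the spread-out plot appears to hold nine times as much.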


Keeping the y scale immutable is a good approach when there isn't too much variation in that scale, but in this case the frequency of the most common letter, "e", is over 100 times that of the least common, "z". Here's a grid of all the approaches I considered (and if you have any suggestions I didn't consider, I'd be very interested to hear them). If you look at the top left, "a", and bottom left, "z", you'll see the "z" has been scaled down so far you can't see it at all.

In these cases, the common solution is to use a log scale; you can see it in the second column. The results are fine; there's nothing wrong with them at all. The only problem is that it's a data transformation that does not serve the story of the data (in my opinion). Allow me to explain; I'll just take two letters from the above, "j" and "l"; the former is very skewed toward the beginning of words, the latter is more evenly distributed.


Logarithms are counterintuitive, and data scientists who are used to them tend to forget how difficult it was at first to internalize how they work, even if they picked up the algorithms quickly. (Look at the spacing on a slide rule.) I've presented log-scale graphs to scientists with Ph.D.s, and spent more time explaining the ramifications of the logarithms than the actual data.
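To make the slide-rule point concrete, here is a small sketch of where a value lands on a linear versus a logarithmic axis. The function name axis_position and the axis limits of 1 to 1000 are arbitrary choices for illustration, not values taken from the graphs.

```python
import math


def axis_position(value, lo, hi, log=False):
    """Return how far up the axis (0 = bottom, 1 = top) a value is drawn."""
    if log:
        span = math.log10(hi) - math.log10(lo)
        return (math.log10(value) - math.log10(lo)) / span
    return (value - lo) / (hi - lo)


for v in (10, 100, 1000):
    linear = axis_position(v, 1, 1000)
    logarithmic = axis_position(v, 1, 1000, log=True)
    print(v, round(linear, 3), round(logarithmic, 3))
```

On the log axis, 10 sits a third of the way up and 100 two thirds of the way up, so equal distances mean equal ratios rather than equal differences; on the linear axis the same 10 is less than 1% of the way up.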

Also, is the difference in scale the "story" of the graph? (Graphs are inherently narrative; the ease with which one can make them perfectly accurate yet entirely misleading shows this; I always say a graph has the same scope, strengths and weaknesses as a paragraph.) We can make out the "j", but its scale is compressed; is the purpose of this graph to show the relative letter frequencies, or to show their distribution? It's the latter. The two properties are interrelated, but I chose to divorce them as much as possible, while keeping the frequency information as a color scale.

The color scale accomplishes two things: it keeps the frequency information, yet abstracts it to a semi-quantitative form (color differences are inherently semi-quantitative; the human brain cannot determine whether something is "twice as red", just "more red"). So the information is there, but it's not tightly married to the actual individual graphs. I think color-scaled graphing works best when it's additional relevant data that's not absolutely crucial to understanding the visualization.

And yet, autoscaling (normalizing the y axis so it is always from 0 to 100% of the maximum frequency of each letter) did not accomplish this goal; we're back to the histogram scaling problem. In the middle column, it's obvious "l" is more frequent than "j", because it has more ink (so to speak).

Well, we could always integrate the data and normalize the area under the curve, so that every plot has exactly the same amount of ink, as in the second-last column. Now we're up against a perception problem that I alluded to earlier; there's the same amount of ink, but because the 'j' has it so clustered together, it looks like it's a lot more.

So I went with a compromise: the average of equal height and equal area. Often compromises in data visualization obscure data and its meaning, but I think it works in this case. The last column's values are not comparable between plots -- but the data itself is semi-quantitative binning of ordinals, so it's inherently non-comparable on a fine scale to begin with, and I think this approach accomplishes the goal of the graph.
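Here is one way the three normalizations could be computed; it is a sketch under assumptions, not the code behind the graphic. In particular, the equal-area curves are rescaled so that the tallest bar of any letter is 1 before averaging, which is only one possible reading of "the average of equal height and equal area". The counts are invented for illustration: a front-loaded letter like "j" and a more even one like "l".

```python
def equal_height(rows):
    """Scale each letter's bins so its tallest bar is 1."""
    return [[v / max(row) for v in row] for row in rows]


def equal_area(rows):
    """Give every letter the same total, then fit the whole set into 0..1."""
    shares = [[v / sum(row) for v in row] for row in rows]
    tallest = max(max(row) for row in shares)
    return [[v / tallest for v in row] for row in shares]


def compromise(rows):
    """Average the equal-height and equal-area versions bin by bin."""
    return [
        [(h + a) / 2 for h, a in zip(h_row, a_row)]
        for h_row, a_row in zip(equal_height(rows), equal_area(rows))
    ]


counts = [
    [9, 1, 0],  # hypothetical front-loaded letter, like "j"
    [4, 3, 3],  # hypothetical evenly spread letter, like "l"
]

for name, scale in (("height", equal_height), ("area", equal_area), ("mix", compromise)):
    print(name, [[round(v, 2) for v in row] for row in scale(counts)])
```

The even letter's peak ends up between its equal-height value (1) and its equal-area value (about 0.44), which is the intended effect: less ink than equal height gives it, more than equal area does.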

Feel free to disagree!

5. Ordinate (y) and abscissa (x) labeling

I chose not to label the axes; anything I put would be potentially misleading. I think it's best to leave the viewer's interpretation an intuitive one, yet make the information easily available if it's needed. A few arrow labels were enough.

As always, comments are welcome. I don't claim to have the optimal solution (if there even is one, which I highly doubt), but I put a lot of thought into finding an optimal solution, and I'd like to know if I could have done it better.


• • •

97 comments:

  1. Great post. The notebook link is 404, though.

  2. Whoops! I reorganized that repo and forgot to fix the link! Thanks for letting me know, I fixed it.
