Monday, December 15, 2014

Most popular songs containing the most decade-specific words in Billboard's popular music charts

This post is an adjunct to my dataviz, "Most decade-specific words in Billboard popular song titles, 1890-2014".

Here's the viz, click to enlarge:

The code I used to download and process it is nicely formatted with NBViewer as an IPython notebook on my GitHub. The data comes not from Billboard itself, but from; I don't know much about the data source, but it certainly looks thorough and painstaking, and up to date.

Here are some observations about the methodology, written with both the non-data-expert and the cognoscenti in mind.

What's "keyness" all about? Keyness is a common measure for comparing word frequencies between two sets; it's particularly useful when the two sets are of unequal size (in this case I was comparing, for example, all of the words from the 2010s to every word in the entire dataset). I used the log-likelihood method, which returns a measure of the statistical significance of finding that word in that decade. A keyness of about 11 means there's only a 0.1% chance that you would get the same result or higher by picking words from the entire collection at random instead of restricting yourself to the words in that subset (decade), and about 15 is a 0.01% chance. There's a good, intermediate-technical description of log-likelihood here.
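For the curious, here is a minimal sketch of how a log-likelihood keyness score can be computed. It is not the code from the notebook: it assumes the standard two-set formulation of the statistic, and it assumes the score can be read against a chi-square distribution with one degree of freedom to get the percentages quoted above. The function names `log_likelihood` and `p_value` are made up for illustration.

```python
import math

def log_likelihood(count_a, total_a, count_b, total_b):
    """Log-likelihood keyness of a word seen count_a times in set A
    (total_a words) and count_b times in set B (total_b words)."""
    combined = count_a + count_b
    # Expected counts if the word were equally common in both sets.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    score = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

def p_value(keyness):
    """Chance of a score at least this high under a chi-square
    distribution with one degree of freedom (an assumption here)."""
    return math.erfc(math.sqrt(keyness / 2))

# Thresholds discussed above: roughly 0.1% and 0.01%.
print(p_value(10.83), p_value(15.13))
```

Whether the comparison set should include the decade itself is a judgment call; the sketch simply treats the two sets as given.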

The contour charts: I made these with Excel sparklines since it was so easy. The absolute scale of the y axis is different for each one; otherwise the less popular words would be invisible. If you look at some of the contours, you'll see that there are actually higher bars in other decades than the highlighted one; that's because "keyness" considers factors other than just the heights of the bars (such as the relative number of words in the decade).

Why is "f**k" censored? It was like that in the dataset, I'm no prude. Fuck fuck fuck fuck fuck.

The binning effect: It's natural to bin decades together, but it's completely arbitrary in a statistical sense. The popular songs of 1960 far, far more resembled those of the 1950s than they did what one thinks of as "sixties music". What this means for this dataset is that words that tend to respect this artificial boundary will be overrepresented. For example, if a word was particularly popular between 1972 and 1979, it will have a higher keyness than one that was popular across a decade break from 1976 to 1983. That's just a choice that has to be made for the sake of an easy-to-grasp analysis; if I were analyzing things more important than song titles I'd be more rigorous in this regard.
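A toy illustration of that effect, assuming one hit per year for each hypothetical word (the counts are invented for the example, not taken from the dataset): both words are popular for eight years, but only one of them keeps all eight inside a single decade bin.

```python
from collections import Counter

def decade_counts(years):
    """Count how many of the given years fall into each decade bin."""
    return Counter((year // 10) * 10 for year in years)

# Hypothetical word popular from 1972 through 1979: one bin gets everything.
inside = decade_counts(range(1972, 1980))

# Hypothetical word popular from 1976 through 1983: split across two bins.
straddling = decade_counts(range(1976, 1984))

print(dict(inside))
print(dict(straddling))
```

The second word is just as popular over its run, but each decade sees only half of it, so its per-decade keyness comes out lower.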

Why the top five from each decade instead of the top keyness overall? Basically because it was more interesting this way. The database goes back to 1890, and there are fewer songs overall back then with more uncommon words, which means they have higher keyness and any such list would be full of songs nobody's heard of. The top word in the 2010s, "we", is number 63 overall, and over half of the words above it are from the first three decades.
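Here is a small sketch of the difference between the two rankings. It uses only a handful of rows copied from the table at the end of this post, so it is an illustration of the idea and not a rerun of the real analysis; the helper names `top_overall` and `top_per_decade` are made up.

```python
from collections import defaultdict
import heapq

# (decade, word, keyness) rows copied from the table at the end of this post.
rows = [
    ("2010s", "We", 22), ("2010s", "Yeah", 18), ("2010s", "Hell", 18),
    ("1920s", "Blues", 153), ("1920s", "Pal", 42), ("1920s", "Sweetheart", 27),
    ("1890s", "Uncle", 59), ("1890s", "Casey", 54), ("1890s", "Josh", 53),
]

def top_overall(rows, n):
    """The n highest-keyness rows, ignoring which decade they come from."""
    return heapq.nlargest(n, rows, key=lambda row: row[2])

def top_per_decade(rows, n):
    """The n highest-keyness rows within each decade."""
    by_decade = defaultdict(list)
    for row in rows:
        by_decade[row[0]].append(row)
    return {decade: heapq.nlargest(n, group, key=lambda row: row[2])
            for decade, group in by_decade.items()}

print(top_overall(rows, 3))
print(top_per_decade(rows, 1))
```

Even in this tiny sample, the overall ranking is taken over by the early decades, while the per-decade ranking still gives the 2010s a seat at the table.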

I considered leaving out entire decades, but the number of songs in them was small though not negligible, as you can see from the following chart (which I made in about 14 seconds with Chartbuilder):

In the end, I went with the more interesting approach. Data visualization is narrative by nature; don't let anyone tell you otherwise.

Finally, here's a table with the most popular song for each decade containing the top-five word in question. "Most popular" is decided by a metric particular to the dataset source, but it seems thorough and defensible.

Dec.  Word      Ky. Max.  Most popular song
2010s We         22 1.4%  Rihanna, "We Found Love" (2011)
      Yeah       18 0.2%  Austin Mahone, "Mmm Yeah" (2014)
      Hell       18 0.3%  Avril Lavigne, "What The Hell" (2011)
      F**k       15 0.1%  Cee Lo Green, "F**K You (Forget You)" (2011)
      Die        14 0.2%  Ke$ha, "Die Young" (2012)
2000s U          71 1.1%  Usher, "U Got It Bad" (2001)
      Like       28 1.1%  T.I., "Whatever You Like" (2008)
      Breathe    25 0.2%  Faith Hill, "Breathe" (2000)
      It         24 2.4%  Usher, "U Got It Bad" (2001)
      Ya         19 0.7%  OutKast, "Hey Ya!" (2003)
1990s U          49 1.1%  Sinead O'Connor, "Nothing Compares 2 U" (1990)
      You        28 5.1%  Stevie B, "Because I Love You (The Postman Song)" (1990)
      Up         21 1.0%  Brandy, "Sittin' Up In My Room" (1996)
      Get        20 1.0%  En Vogue, "My Lovin' (You're Never Gonna Get It)" (1992)
      Thang      18 0.2%  Dr. Dre, "Nuthin' But A 'G' Thang" (1993)
1980s Love       48 3.8%  Joan Jett & The Blackhearts, "I Love Rock 'N Roll" (1982)
      Fire       24 0.5%  Billy Joel, "We Didn't Start The Fire" (1989)
      Don't      20 1.6%  The Human League, "Don't You Want Me" (1982)
      Rock       14 0.7%  Joan Jett & The Blackhearts, "I Love Rock 'N Roll" (1982)
      On         14 3.2%  Bon Jovi, "Livin' On A Prayer" (1987)
1970s Woman      33 0.6%  The Guess Who, "American Woman" (1970)
      Disco      31 0.4%  Johnnie Taylor, "Disco Lady" (1976)
      Rock       24 0.7%  Elton John, "Crocodile Rock" (1973)
      Music      24 0.6%  Wild Cherry, "Play That Funky Music" (1976)
      Dancin'    20 0.5%  Leif Garrett, "I Was Made For Dancin'" (1979)
1960s Baby       51 1.9%  The Supremes, "Baby Love" (1964)
      Twist      24 0.7%  Joey Dee & the Starliters, "Peppermint Twist - Part 1" (1962)
      Little     16 4.0%  Steve Lawrence, "Go Away Little Girl" (1963)
      Twistin'   15 0.4%  Chubby Checker, "Slow Twistin'" (1962)
      Lonely     14 0.5%  Bobby Vinton, "Mr. Lonely" (1964)
1950s Christmas  31 0.8%  Art Mooney & Orch., "(I'm Getting) Nuttin' For Christmas" (1955)
      Penny      18 0.4%  Dinah Shore & Tony Martin, "A Penny A Kiss" (1951)
      Mambo      15 0.5%  Perry Como, "Papa Loves Mambo" (1954)
      Rednosed   15 0.3%  Gene Autry, "Rudolph, the Red-Nosed Reindeer" (1950)
      Three      15 0.5%  The Browns, "The Three Bells" (1959)
1940s Polka      50 0.4%  Kay Kyser & Orch., "Strip Polka" (1942)
      Serenade   35 0.7%  Andrews Sisters, "Ferry Boat Serenade" (1940)
      Boogie     28 0.6%  Will Bradley & Orch., "Scrub Me, Mama, With a Boogie Beat" (1941)
      Blue       26 1.6%  Tommy Dorsey & Frank Sinatra, "In The Blue Of Evening" (1943)
      Christmas  22 0.8%  Bing Crosby, "White Christmas" (1942)
1930s Moon       79 1.4%  Glenn Miller & Orch., "Moon Love" (1939)
      In         38 6.5%  Ted Lewis & His Band, "In A Shanty In Old Shanty Town" (1932)
      Swing      34 0.5%  Ray Noble & Orch., "Let's Swing It" (1935)
      Sing       34 1.4%  Benny Goodman & Martha Tilton, "And the Angels Sing" (1939)
      A          30 5.8%  Ted Lewis & His Band, "In A Shanty In Old Shanty Town" (1932)
1920s Blues     153 3.1%  Paul Whiteman & Orch., "Wang Wang Blues" (1921)
      Pal        42 0.9%  Al Jolson, "Little Pal" (1929)
      Sweetheart 27 0.9%  Isham Jones & Orch., "Nobody's Sweetheart" (1924)
      Rose       25 1.4%  Ted Lewis & His Band, "Second Hand Rose" (1921)
      Mammy      23 1.0%  Paul Whiteman & Orch., "My Mammy" (1921)
1910s Gems       70 1.1%  Victor Light Opera Co., "Gems from 'Naughty Marietta'" (1912)
      Rag        52 1.2%  Original Dixieland Jazz Band, "Tiger Rag" (1918)
      Home       43 2.9%  Henry Burr, "When You're a Long, Long Way from Home" (1914)
      Land       41 0.6%  Al Jolson, "Hello Central, Give Me No Man's Land" (1918)
      Old        38 3.7%  Harry Macdonough, "Down by the Old Mill Stream" (1912)
1900s Uncle      58 4.5%  Cal Stewart, "Uncle Josh's Huskin' Bee Dance" (1901)
      Old        58 3.7%  Haydn Quartet, "In the Good Old Summer Time" (1903)
      Josh       44 3.7%  Cal Stewart, "Uncle Josh On an Automobile" (1903)
      Reuben     38 1.4%  S. H. Dudley, "When Reuben Comes to Town" (1901)
      When       33 3.8%  George J. Gaskin, "When You Were Sweet Sixteen" (1900)
1890s Uncle      59 4.5%  Cal Stewart, "Uncle Josh's Arrival in New York" (1898)
      Casey      54 3.3%  Russell Hunting, "Michael Casey Taking the Census" (1892)
      Josh       53 3.7%  Cal Stewart, "Uncle Josh at the Opera" (1898)
      Old        26 3.7%  Dan Quinn, "A Hot Time in the Old Town" (1896)
      Michael    24 2.7%  Russell Hunting, "Michael Casey Taking the Census" (1892)

• • •