About a week ago, there was a hearty discussion on Twitter among well-known music bloggers about Pitchfork’s controversial 7.6 rating of Toro y Moi’s excellent debut LP Causers of This. Since I am guilty of being more of a mathematician than a writer, I decided this was a great opportunity to dive right into the numbers and do a brief statistical study of one complete year of Pitchfork’s ratings, to see exactly where Chaz Bundick’s 7.6 stacked up against his peers. After sifting through the data most of yesterday afternoon, I have to say there are some pretty interesting finds (including some statistical anomalies) behind Pitchfork’s rating system for albums.
Before beginning, I should briefly mention how the data was collected. Initially, I was going to write a script to go through Pitchfork’s Record Reviews, logging each numbered grade between February 24, 2009 and February 24, 2010. However, p4k has an affinity for rating reissues and compilations very favorably (an unbelievable 30 reissued albums scored higher than the highest-rated contemporary album; chalk that up to the Beatles, Neil Young, and Radiohead re-releases), so I figured the only surefire way to get accurate data on non-reissued material was to look into each review, see if it fit my criteria for a new release, and jot down the score. A cumbersome process to say the least! I omitted several kinds of release when classifying an album as “original”: soundtracks, label compilations, live recordings, and of course reissues. This left a relatively large sample of 1,025 newly released, original albums to analyze. Is the result error-free? Of course not; no doubt I tallied a handful of albums as “original” when they weren’t and vice versa. However, with a sample this large and my propensity to err small, any stray mistakes should have a negligible effect on the results. The following is a histogram plotting the number of occurrences of each rating:
If you are a frequent follower of p4k, most of the plot won’t come as a surprise. The bulk of the histogram sits in the 6.5-8.5 range, with 7.0 being the most common rating (51 times). Also, because Pitchfork tends not to publish reviews of horrendously bad albums, it’s a no-brainer that the plot is significantly negatively skewed, with a long, thin tail of low scores. Similarly, exceptionally high scores (i.e. 8.7 and above) are relatively rare.
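For anyone who wants to reproduce this kind of tally, here is a minimal sketch in Python (the analysis in this post was done in Matlab, so Python is used purely for illustration). The `scores` list is a made-up ten-album sample, not the real 1,025-album data set, and the function names are hypothetical.

```python
import statistics
from collections import Counter

# Hypothetical sample of ratings; NOT the actual data set from this post.
scores = [7.0, 6.8, 7.0, 8.1, 5.4, 7.0, 6.8, 7.6, 3.2, 8.1]

def tally(ratings):
    """Count how many times each rating occurs (ratings rounded to one decimal)."""
    return Counter(round(r, 1) for r in ratings)

def most_common_rating(ratings):
    """Return (rating, count) for the most frequent rating."""
    return tally(ratings).most_common(1)[0]

counts = tally(scores)
mode, times = most_common_rating(scores)

# Crude text histogram: one row per rating, lowest rating first.
for rating in sorted(counts):
    print(f"{rating:4.1f} {'#' * counts[rating]}")
print(f"most common rating: {mode} ({times} times)")

# A mean below the median is a rough hint (not a proof) of negative skew.
print("mean < median:", statistics.mean(scores) < statistics.median(scores))
```

Run against the full data set, the same tally should put 7.0 on top with 51 occurrences, as reported above.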
Probably the most interesting result of the histogram is that whole-number ratings occur significantly more often than their x.9 and x.1 neighbors, enough to be considered a statistical anomaly. Notice how the peaks at 6.0, 7.0, and 8.0 are noticeably higher (almost twice as high in some instances) than the counts at 5.9, 6.9, and 7.9 respectively. My theory is that when it comes to “on the fence” reviews, p4k tends to give the artist the benefit of the doubt. Knowing that a rating which reaches the next whole number looks perceptually more impressive (the same effect explains why things are priced $6.99 rather than $7.00: we subconsciously think it is a lot less), they bump the score up more often to show a more positive review. Now, if it is true that individual critics are responsible for giving an album a score, rather than a collective following a loose outline of established “rules”, then this result is very interesting from both a mathematical and a sociological point of view.
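One rough way to put a number on the whole-number spike is to compare each x.0 count with the average of its x.9 and x.1 neighbors. The Python sketch below does that. Only the counts for 7.0 (51) and 8.0 (36) come from this post; every neighbor count is invented for illustration, and `spike_ratio` is a hypothetical helper, not part of the original analysis.

```python
from collections import Counter

# Counts per rating. 7.0 -> 51 and 8.0 -> 36 are stated in the post;
# all other values are made up for illustration.
counts = Counter({6.9: 27, 7.0: 51, 7.1: 30, 7.9: 20, 8.0: 36, 8.1: 22})

def spike_ratio(counts, whole):
    """Ratio of a whole-number rating's count to the mean of its x.9 and x.1 neighbors."""
    below = counts[round(whole - 0.1, 1)]
    above = counts[round(whole + 0.1, 1)]
    neighbor_mean = (below + above) / 2
    if neighbor_mean == 0:
        return float("inf")  # no neighbors observed at all
    return counts[whole] / neighbor_mean

for whole in (7.0, 8.0):
    print(f"{whole}: {spike_ratio(counts, whole):.2f}x its neighbors")
```

A ratio well above 1 points to scores piling up on whole numbers, but on its own it does not establish statistical significance; that would take a formal test on the real counts.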
To get a better idea of the breakdown of scores and a rough determination of percentiles, I generated a box plot:
This plot tells us a couple of things, most notably where the line between OK albums and great albums falls. One can see from the plot that the 3rd quartile, the cutoff for the “top” 25% of ratings, occurs at the 7.6 line. This means our beloved Toro y Moi album sits statistically right on the border of the upper tier. Confirming our natural inclination that a majority of albums are rated around the “7” mark, the box of the box plot, representing the middle 50% of scores, runs from 6.1 to 7.6. The final interesting part is that an album scoring below 3.9 is considered a statistical outlier, since by the usual box-plot convention it lies more than 1.5 times the height of the box below the box’s lower edge (meaning Lil’ Wayne can breathe easy knowing his rock album just made the cut). Refining the results further into steps of 10 percentiles, the following is established:
In my opinion, the above table gives bands a better way to understand their p4k rating than the raw numerical score can. Take, for example, a hypothetical rating of 7.7. Without any context, it is a rather meaningless number that invokes a wide range of opinions (C-grade, “better than most”, underwhelming, etc.). However, when you compare it to a large sample of past ratings and see that it sits just above the 7.6 line that marks the top quarter, meaning it is better than roughly 75% of the albums they’ve graded, you understand the score a lot better.
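The outlier cutoff and the percentile reading of a score can both be checked with a few lines of Python. This is a sketch under stated assumptions: the box edges (6.1 and 7.6) are read off the box plot in this post, the 1.5 multiplier is the common box-plot default (assumed here, not confirmed as the setting used for the plot), and `sample` is a made-up mini data set. Both function names are hypothetical.

```python
from bisect import bisect_left

def lower_fence(q1, q3, k=1.5):
    """Lower outlier fence: Q1 - k * IQR, with k = 1.5 as the usual default."""
    return q1 - k * (q3 - q1)

def percentile_rank(sample, score):
    """Percentage of ratings in `sample` that are strictly below `score`."""
    ordered = sorted(sample)
    return 100.0 * bisect_left(ordered, score) / len(ordered)

# Box edges as read off the box plot in this post.
fence = lower_fence(6.1, 7.6)
print(f"scores below about {fence:.2f} count as outliers")

# Hypothetical mini-sample; NOT the real data set.
sample = [3.2, 5.4, 6.1, 6.5, 6.8, 7.0, 7.0, 7.2, 7.6, 8.4]
print(f"a 7.6 beats {percentile_rank(sample, 7.6):.0f}% of this sample")
```

The fence works out to 6.1 - 1.5 × (7.6 - 6.1) = 3.85, which on a one-decimal scale matches the 3.9 outlier line mentioned above.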
Finally, a few observations from looking over their Best New Music selections and the seemingly arbitrary way the label is assigned. Given how much significance is attached to a BNM nod (record sales, exposure, tour upgrades), it was rather unsettling to notice some of the trends that popped up:
- All albums scoring 8.6 or higher were automatically made Best New Music.
- If you are a metal fan, you’ve gotten royally screwed over and overlooked by p4k. Only two metal albums were selected for BNM within the past year: Sunn O)))’s Monoliths & Dimensions and Isis’s Wavering Radiant (both with scores of 8.5). Adding insult to injury, of the 15 albums that scored an 8.5, 11 made BNM, and two of the four that didn’t make the cut were metal-related records (Baroness’s Blue Record and Converge’s Axe to Fall), both reviewed on days when no other record made BNM.
- Another of the four albums that scored 8.5 without being stamped BNM was contemporary jazz musician Jon Hassell’s LP, verbosely entitled Last Night the Moon Came Dropping Its Clothes in the Street, supplying another example of a high-performing album from a more obscure genre getting the shaft. In p4k’s defense, Yacht’s superb See Mystery Lights was BNMed that day, which leads me to my next point…
- If you release a great record, make sure you don’t get reviewed on the same day as another great record. I don’t have an individual statistic for this, but I often saw high-scoring albums (8.2-8.5) miss out on a BNM because an even better (or equally rated, just more hyped) album was reviewed the same day.
- If you are a hyped record or an established act, you have a better shot at getting a Best New Music when you are on the cusp. Now this seems kind of obvious, but there were some egregious instances where this occurred. Of the 41 albums that scored an 8.1 or 8.2, five were chosen as BNM: Surfer Blood’s Astro Coast, Atlas Sound’s Logos, Cass McCombs’s Catacombs, Bill Callahan’s Sometimes I Wish We Were An Eagle, and Wavves’s S/T.
- Yeah, I have no idea what they were thinking BNM-ing that Mos Def record (the lowest-scoring BNM and, out of 36 records that scored an 8.0, the only one to get BNM-ed).
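To make the drop-off concrete, the counts quoted in the list above can be turned into BNM rates per score level. A small Python sketch, using only the numbers stated in this post (the dictionary layout and the `bnm_rate` helper are hypothetical names for illustration):

```python
# (albums at that score level, albums tagged BNM), as quoted in the list above.
bnm_counts = {
    "8.5": (15, 11),
    "8.1-8.2": (41, 5),
    "8.0": (36, 1),
}

def bnm_rate(total, tagged):
    """Percentage of albums at a score level that received Best New Music."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * tagged / total

for level, (total, tagged) in bnm_counts.items():
    print(f"{level}: {tagged}/{total} = {bnm_rate(total, tagged):.1f}%")
```

That is roughly 73% at 8.5, 12% at 8.1-8.2, and under 3% at 8.0, so the odds of a BNM fall off steeply within half a point.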
This was a fun project which allowed me to brush up on some of my Matlab skillz. In the future, I would like to dive deeper and provide a more detailed analysis, but that will have to wait until I get some free time. If you would like to speculate on p4k’s ratings, or if you have any insight into how they are determined (individual vs. collective), just leave a comment. If you want a copy of my data so you can run your own analysis, I would be happy to supply it to you (EDIT :: You can download the data set here).



the percentile understanding makes complete sense. ive been thinking about pitchfork reviews this way for a while now, but i always thought that 7.7 was the cut off for a good record. seems as though i was a percentage point off. interesting little study you have done here.
another interesting aspect of the ratings system is how ratings translate to year end lists. both number ratings as well as the BNM tag. ive always been confused how a record like neko case, which scored i think a 7.6 or 7.7, found its way so high on the list, but an album like sunset rubdown’s dragonslayer, which got an 8.5 i think and a BNM tag, failed to make the year end top 50 list. i would like to see how many BNM tags dont make year end lists, and how many sub 8.0 ratings do. im certain there is some correlation to type of press the album got from other outlets, etc.
this is why i will always favour cokemachineglow over p4k, because when one of their reviewers gets it ‘wrong’, they almost always publish a counter point within a week. if i remember correctly, i think p4k did this once in 2008 when they published that article about albums they either overlooked or misjudged, cause i distinctly remember them copping to the fact that they totally underrated the black mountain album. it would be refreshing to see more of this, especially since p4k’s influence over the direction of music seems to be so huge these days.
anyways, interesting work here. look forward to seeing some other things like this in the future.
This is very cool work. Clearly a very time intensive process and you’ve taken the time to do it right. Well done.
You should consider sending p4k an email suggesting they re-distribute their ratings on a percentile system to the public and keep their present system internal. You should also ask for 20% (5.85) of all profits they see in the next year due to increased traffic.
There was also the time that Pitchfork completely retracted their initially unfavorable review of Sufjan Stevens’ “Greetings from Michigan” album and put up a much more positive review in its place. Unfortunately, I haven’t been able to find that original review… I’d really like to see it again. Anyone else remember this?
Chris, I agree, seeing a breakdown of year end lists would be interesting, and even a comparison of year end lists vs decade end lists, for instance The Rapture’s Echoes spiraling down from #1 in 2003, to #38 in the 2000-2004 list, and then #57 on the 2000s list. I’m not knocking Pitchfork for this though, I mean after 6 years you may view an album differently in the context of the decade, but would be curious to see the stats behind it.
Here is an archive of some old reviews, like the 0.8 Belle & Sebastian – Boy With The Arab Strap review, that are no longer on the site: http://web.archive.org/web/20011119181922/pitchforkmedia.com/record-reviews/b/belle-and-sebastian/boy-with-the-arab-strap.shtml
The original Sufjan review (which I don’t remember) isn’t on there though.
Oh, another interesting one, Basement Jaxx’s Rooty got a 3.8 in 2001 but then for the Best of the 2000s list it ranked #33.
all ratings are arbitrary
this was phenomenal. keep up the good work!
Pitchfork is a collection of writers with different opinions. Somehow the guy who hated Rooty got to write the review. Then all the people who loved it voted it up years later. Unbunch ya panties.
LOL @ hipster trash who use a website to determine what’s cool.
It would also be interesting to see how ratings correspond with how much advertising is done for each record. You could take a look at the number of bought ads from each label over the course of a year and compare that to how their releases score. I can guarantee you that the more ads a label buys the higher their release ratings are. P4K is a business first and foremost.
Yeah the thing that grinds my gears about p4k is that they are often dishonest. Not in what they tell you, but in the things that they don’t even ponder about. Their reviews are whatever, but I believe that a great review generates discussion. P4K seems to try to end the discussion. I’m not saying that they should install forums/comments (they shouldn’t, for many reasons), but they should at least strive for a more balanced analysis. Maybe some point/counterpoint type of thing. For ex., not everyone on the staff could have possibly loved that Wavves album, right? Right?
No one can deny that they are an important part of music at this point. Is it for the wrong or right reasons, though?
By the way, I liked this analysis, nice job.
Mike: I would love for you to inarguably prove that Pitchfork’s label ad sales actually influence their ratings. That is a hefty claim to lob at them without indisputable evidence. Until you can, it sounds like a forum-ready conspiracy theory.
Chris: I think you mean “incomplete” instead of “dishonest”. Point/counterpoint would be great; they have some of the brighter music critics around. I disagree, though, that they “end” the discussion. To counter your assertion, I believe they state their case and bring it to a larger audience. If they could provide some sort of “this writer liked this and this one didn’t and here’s a reasonable discussion about it” balance, that would be great! I wonder, though, if people would dismiss it as too “snobby” without even giving it a fair chance.
damn, this is a great article. good point about pitchfork ignoring metal… when you go on other websites that are public ratings oriented, metal is often near the top of ratings, and the albums are not usually the ones pitchfork deems ok… such as rateyourmusic.com or sputnikmusic.com, metal is always much more highly regarded, and you don’t see as much bias with re-releases and such. I think pitchfork should just drop their ratings because they are disingenuous.
try comparing given scores to their ranking in year end and decade-end lists. there’s a lot of discrepancy!
Fascinating and insightful. Puts it all in a different light…
Thanks for the insight and taking the time to code all of this up. As someone who does data analysis this raises a lot of questions. Interesting to me is, as you point out, the propensity to rate a record as a whole number. It is odd that, despite Pitchfork’s desire to use a quasi-continuum for their rating (e.g. 7.1,7.2,etc.) their reviewers tend to subconsciously return to intervals, which reminds me of some political science work on “feeling thermometers” where people tend to place themselves on the 10s. At any rate, this screws up the variance if we want to do some modelling. Lots of hypotheses to test here after you convince someone to collect more data. Again, thanks.
Thanks, I really love statistics and pitchfork, so this was perfect!
Chris / Cale: I remember vividly what happened with the Sufjan record. There was never a bad review that was retracted. What you see on the site now is the original review with the original score. What happened, which supports one of the major points in this analysis, is that the record was reviewed positively and about two weeks later, when the editor heard it, it was given the “Best New Music” tag. The review with the added tag was sitting there front and center on a Monday morning.
I always find it funny when people knock p4k. All they are is a bunch of people with opinions on music. It just so happens, if you don’t like p4k, that your opinions and tastes differ from those of p4k. I just treat p4k like a friend who’s into music and has more knowledge than I do. They offer suggestions, I listen and then agree or disagree. It just so happens I tend to agree with them more often than not. We just have similar tastes. This became readily apparent when I could easily guess their scores of a CD based off my own feelings for the same CD and usually come within a few tenths of them. So I trust p4k but they aren’t the end-all to music and they have never tried to be.
Always thought this review was strange:
http://pitchfork.com/reviews/albums/12454-the-59-sound/
It’s a pretty good album and was well-hyped in other circles, but I was certainly surprised to see it buried as a 2nd or 3rd review on the page AND receive an 8.6 rating. That’s got to be one of the highest without scoring a BNM. I believe back when this was reviewed Pitchfork was also using that strange “recommended” tag as well… in any case, interesting reading.
I don’t have any qualms with albums getting a rating and then appearing in a place on their year-end list that doesn’t quite correspond with that initial rating. Feelings change. If you rate an album released in February within its first week after release, there’s still 9-10 months in the year to re-play that album. It may sound silly, but something like a change in season can really sway the perception of an album.
I think people read too much into ratings and reviews as a whole. In the case of P4K, people are especially harsh. They have their reasons, I suppose. But the way I view these things is this: I don’t need any entity to “tell me something is cool”. But I’m also a busy dude. I appreciate having different outlets — P4K being just one of them — that monitor this stuff and slap a number on it. These sites are like a musical GPS for me. Sometimes I get to my destination and agree that its great, other times I get there and wish I hadn’t bothered. But I appreciate having things brought to my attention that I might have otherwise missed.
I was really happy to come across this article because it articulated for me a few things that I sensed by regularly reading pitchfork but couldn’t prove. Good work!
On an unrelated note…I think it’s great that pitchfork hands out their BNM tags based on feel, on instinct, rather than scores. It is true that well rated albums will be left out but it’s their ranking system to begin with. They aren’t saying ‘this album scored highest’, but they are saying that they know their audience, their core demographic and they think that the average pitchfork reader will like this. I’d be opposed to this system if they didn’t recommend me fantastic music on a regular basis. I disagree with them probably 25% of the time but they get me moreso than any other music magazine out there.
It is for people like me that p4k chose to leave certain metal albums off the BNM list and while I can appreciate that is unfair, I thank them for it anyways.
Pretty ridiculous assigning grades and numbers to music in the first place. Maybe read the review to get an idea of what the music offers. Other than “oh look, this album offers me 7.6 utils of satisfaction” and then refusing to read the actual review.
Thanks for the interesting article! That must have been some hard work. Very interesting!
Thank you very much for this analysis and your inclusion of the dataset. Though I love p4k, the website can be problematic, both on a personal level in how blindly I trust them and in how great an impact they can have on the music industry at large.