Accurately Predicting Triple J's Hottest 100 of 2015
January 31, 2016 • nickw
In 2014, a prediction was accurately made for the Hottest 100 of 2013. The results were posted on warmest100.com.au.
The author of that prediction was able to obtain accurate results because Triple J featured a social share button on their voting page, which posted your votes to your Facebook in text form. The author scraped public Facebook posts and aggregated all the votes, obtaining 1.3% (1779 entries) of the expected total vote.
Consequently, voting for the Hottest 100 of 2014 and 2015 did not include such a feature. Fortunately, voters still felt the need to share their votes with their friends, and posting a screenshot or a photo of their screen to social media became the obvious alternative. Using these images posted to Instagram, I was able to accurately predict the results of Triple J’s Hottest 100 of 2015.
Some Cool Stats Before You Continue
- Triple J tallied 2094350 votes (209435 entries) for Hottest 100 2015
- I collected a sample size of ~2.5% of all entries
  - 7191 images initially collected
  - I categorised 5529 images as votes
  - ~4900 images contained the words “vote/votes/voting”
- My Top 3 results were 100% accurate
You’ll probably find this article interesting, but if you’re super eager, you can Skip To The Results.
Taking Advantage of Social Media
I decided to only target votes that were posted to Instagram, since a large majority of the pictures hashtagged with #hottest100 were in fact votes, there was a reasonably high volume of them, and most were publicly accessible.
I needed a way to acquire all the pictures that had been posted to Instagram. Instagram has an official API, but your app must be approved before it can interface with non-sandbox users. Additionally, Instagram imposes a rate limit on both approved and non-approved apps. I did not have time to waste and wanted results immediately, so I found an alternative.
Fortunately, Instagram exposes a non-public API through the AJAX loading its website performs when you browse to a hashtag. By imitating the web browser with a simple Python script using the requests library, I managed to download all images from the latest back to a cut-off date that I specified (the day voting opened).
After scraping the hashtag #hottest100, I expanded my search to #hottest1002015 and #triplejhottest100.
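The "download everything back to a cut-off date" loop can be sketched roughly as below. This is not the script used for the article: `fetch_page` is a hypothetical callable standing in for whatever request the scraper makes (for example a `requests.get` call against the hashtag page), and the `url` and `taken_at` field names are assumptions for illustration only.

```python
from datetime import datetime, timezone


def collect_until(fetch_page, cutoff):
    """Collect image URLs from newest to oldest until `cutoff` is reached.

    `fetch_page(cursor)` is a hypothetical callable that returns a tuple
    (posts, next_cursor). Each post is a dict with assumed keys "url" and
    "taken_at" (a timezone-aware datetime). `next_cursor` is None on the
    last page.
    """
    collected = []
    cursor = None
    while True:
        posts, cursor = fetch_page(cursor)
        for post in posts:
            if post["taken_at"] < cutoff:
                # Posts arrive newest first, so everything after this one
                # predates the day voting opened.
                return collected
            collected.append(post["url"])
        if cursor is None:
            return collected  # ran out of pages before hitting the cut-off
```

Keeping the network call behind `fetch_page` means the stopping logic can be checked against canned pages before it is pointed at a live site.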
Processing Images
After downloading 7191 images from Instagram, I needed to find an accurate way to filter out the images that were not votes.
I’ve had previous experience with PIL in Python, so I used it to write a simple script that sorted the photos into 2 categories: photos that appeared white-ish, and photos that did not.
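A "white-ish" test could look something like the sketch below. The thresholds (`threshold=200`, `min_fraction=0.6`) and the function names are assumptions for illustration, not the values used in the original script. The functions work on plain RGB tuples so the logic can be checked without any images; the trailing comment shows one way the pixels could be read with PIL.

```python
def white_fraction(pixels, threshold=200):
    """Return the fraction of (r, g, b) pixels whose channels are all bright."""
    total = 0
    white = 0
    for r, g, b in pixels:
        total += 1
        if r >= threshold and g >= threshold and b >= threshold:
            white += 1
    return white / total if total else 0.0


def looks_like_vote(pixels, threshold=200, min_fraction=0.6):
    """Guess that an image is a vote screenshot if it is mostly white-ish."""
    return white_fraction(pixels, threshold) >= min_fraction


# With Pillow installed, the pixels of a file could be read like this:
#   from PIL import Image
#   img = Image.open(path).convert("RGB")
#   px = img.load()
#   pixels = (px[x, y] for y in range(img.height) for x in range(img.width))
```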
A good vote looked like this:
Unfortunately, not every image ended up in the right folder, and I ended up with both false negatives and false positives. I wasn’t too concerned about false positives, as my OCR processing step would exclude them. Instead, I was more concerned about false negatives.
As the image processing and sorting continued, I manually moved false negatives to the positives folder. I calculated that about 5% of the non-matching photos were incorrectly classified; this was due to them being pictures taken of computer screens, similar to the photo below:
Some image statistics:
- 7191 images collected initially
- 1662 images categorised as non-votes
- 5529 images categorised as votes
- ~4900 images contained the words vote/votes/voting
Improving OCR Performance
After experimenting on raw photos from Instagram, I found that OCR accuracy was poor. To remedy this, I used ImageMagick to flatten the image definition, which improved the text results.
Bringing in Tesseract (OCR)
After weeding out the junk, I still needed to turn these images into readable text.
Using Google’s Tesseract library, I slowly processed all the images and extracted the text from them.
Unfortunately, due to the layout of the Hottest 100 voting website, the two columns were broken up inconsistently across the results.
Some were processed as:
... Flight Facilities Hayden James Hermilude Major Lazer RUFUS Weeknd, The ZHU x Skrillex x THEY. Jarryd James Disclosure Kendrick Lamar Heart Attack {FL Owl Eyes) (Radio Edit) Something About You The Buzz (Ft. Malaya/Young Tapz} Lean On (Ft. Mé/DJ Snake} Innerbloom ...
And others processed as:
... Lucky Luke 1 Day Mosquito Coast Call My Name Tn ka Right By You Tuka L.D.T.E. Half Moon Run Trust Spring King City Tame Impala Let It Happen Saskwatch I‘ll Be Fine Jungle Giants. T Kooky Eyes he ...
And others just did not process at all, due to resolution, colour, skewing, or simply because they were a photo of a computer screen:
Parsing the Results
I processed the results line by line, and called each line a “term”. A term could contain a single song title, a single artist, an artist name with a song name, or just junk overhang from a previous line. Initially there were 31062 uncategorised terms.
I processed each term and aggregated the number of occurrences of each. This worked really well for songs with short names that were less prone to error, such as Hoops, but it did not correctly capture terms where the artist name and song name occurred on the same line, or where the OCR library interpreted a few characters incorrectly.
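As a rough illustration of this exact-match counting, here is a minimal sketch. It assumes each image's OCR output is available as one string, and the lowercasing and whitespace stripping mirror the cleaning step described further down the article. `count_terms` is a name made up for this example.

```python
from collections import Counter


def count_terms(ocr_texts):
    """Count how often each cleaned line ("term") appears across OCR outputs."""
    counts = Counter()
    for text in ocr_texts:
        for line in text.splitlines():
            term = line.strip().lower()
            if term:  # ignore blank lines
                counts[term] += 1
    return counts
```

A short title like Hoops survives OCR intact often enough that plain counting already surfaces it, which is why this step works well before any fuzzy matching is applied.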
OCR Inaccuracy & Levenshtein
Even with photo enhancements, the OCR accuracy was somewhat subpar for some votes. Some l’s were interpreted as t’s, i’s as l’s, etc. Additionally, the longer the name of the song, the more prone to error it was.
A technique that can be used to fix these single- or multi-character spelling errors is the Levenshtein algorithm for edit distance. Using this algorithm, we can compare 2 strings and determine how many single-character edits (insertions, deletions or substitutions) are needed to make the strings equal.
In order to perform this kind of matching, we needed an accurate list of songs that were released this year, along with a list of artists that released music this year.
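For reference, the edit distance described above can be computed with the standard dynamic-programming approach. The sketch below is a plain textbook version and not necessarily the implementation or library used for this analysis.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]
```

For example, an OCR result of "let lt happen" is one substitution away from "let it happen", so the distance between them is 1.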
Using Spotify To Help
To acquire an accurate list of songs released this year, I used Spotify and crawled various playlists from 2015. These included Spotify Charts, Triple J Hitlist, and various other genre-alike playlists.
I ended up with a songs list of 1781 songs and an artists list of 1229 artists. After the Hottest 100 aired, I compared the results of the countdown to the songs in my list, and only 6 songs that appeared in the Hottest 100 were not in my “truth” list.
During list gathering, I made sure to convert all unicode characters to their ASCII counterparts, so that characters with accents and similar would be matched correctly.
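One common way to do this kind of conversion in Python is to decompose accented characters and drop the combining marks. This is a sketch of that approach and an assumption about how the step could be done, not the exact code used here.

```python
import unicodedata


def to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Note that characters with no decomposition, such as "ø", are dropped entirely by this approach rather than mapped to a lookalike, so names containing them would need an explicit replacement table.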
Continuing Processing
With reasonably accurate artists and songs lists in hand, I continued categorisation and processing. The processing algorithm worked in the following way:
- Load all terms from every image’s .txt OCR result. Every line is a “term”.
- Clean all the terms by turning them into lowercase and stripping whitespace.
- Loop through each term:
  - If term exists in our known songs list, move the term to the songs aggregation and count the votes.
  - If term exists in our known artists list, move the term to the artists aggregation and count the votes.
  - If couldn’t find it in either of those:
    - Loop through all artists in our known artist list.
      - Check if the term starts with the current artist. If it does, split it into artist and unknown term. Add the votes to the artist aggregation.
      - If matched artist, check if the new unknown term exists in the songs list. If it does, add it to the songs aggregation. If not, add it back to the unknown. Break loop.
    - If it didn’t have a prefixed artist, just add it back to the unknown terms.
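The steps above can be sketched as follows. This is a reconstruction from the description rather than the original script: the function name, the use of `Counter`, and the requirement that an artist prefix be followed by a space are all assumptions.

```python
from collections import Counter


def first_pass(term_counts, known_songs, known_artists):
    """Split cleaned term counts into songs, artists and unknown terms.

    `term_counts` maps a cleaned term to its number of votes.
    `known_songs` is a set and `known_artists` a list of cleaned names.
    """
    songs, artists, unknown = Counter(), Counter(), Counter()
    for term, votes in term_counts.items():
        if term in known_songs:
            songs[term] += votes
        elif term in known_artists:
            artists[term] += votes
        else:
            for artist in known_artists:
                # Assumption: the artist must be followed by a space, so that
                # "run" does not match a term such as "runaway".
                if term.startswith(artist + " "):
                    artists[artist] += votes
                    rest = term[len(artist):].strip()
                    if rest in known_songs:
                        songs[rest] += votes
                    else:
                        unknown[rest] += votes
                    break
            else:
                # No artist prefix was found, so the term stays unknown.
                unknown[term] += votes
    return songs, artists, unknown
```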
At this stage, we have a reasonably accurate aggregation of results. We have not yet used Levenshtein string matching. We now have 27294 uncategorised terms, down from 31062. So far our results:
However, we still haven’t aggregated any votes that had spelling errors due to OCR inaccuracies.
Employing the Levenshtein algorithm, we continue to process the unknown terms. I configured matching to allow lenience based on the length of the term: the maximum number of edits allowed was 2/5 * length of term. The process continues:
- For all unknown terms:
  - Check term length > 3. Break if <= 3. Can’t match a short string.
  - Match Songs:
    - Loop through all songs in known songs list:
      - Compare current song to current term. Get edit distance.
      - If edit distance == 1, move votes for this term to the guessed song in our songs aggregation, then continue to the next term.
      - Add distance to a dictionary of value/distances.
    - Using our value/distances dictionary, find the closest match that satisfies our 2/5 * len(term) rule. If it matches, move the votes for this term to the guessed song in our songs aggregation, then continue to the next term.
  - Match Artists using the same method.
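A sketch of this matching step is shown below. It is reconstructed from the list above, so treat the details as assumptions: `fuzzy_match` is a made-up name, the 2/5 rule is read as "distance <= 2/5 * len(term)", and instead of a dictionary of distances it keeps only the best candidate seen so far, which picks the same winner. The edit distance function is a plain textbook version.

```python
def levenshtein(a, b):
    """Textbook dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def fuzzy_match(term, candidates):
    """Return the known name that `term` most likely refers to, or None."""
    if len(term) <= 3:
        return None  # too short to match reliably
    max_edits = 2 * len(term) / 5
    best, best_distance = None, None
    for candidate in candidates:
        distance = levenshtein(term, candidate)
        if distance == 1:
            return candidate  # a single edit is accepted immediately
        if best_distance is None or distance < best_distance:
            best, best_distance = candidate, distance
    if best is not None and best_distance <= max_edits:
        return best
    return None
```

As a worked example of the threshold, a 12-character term is allowed up to 2/5 * 12 = 4.8 edits, so "1et 1t hapen" (3 edits away from "let it happen") would still be matched.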
Some of the results of string matching, providing some reasonably accurate re-matching.
After performing this additional processing, I ended up with 18509 uncategorised terms, down from 27294.
That means we were able to successfully categorise 8785 terms via the Levenshtein distance algorithm!
Quite an improvement, but still not great. Some of the terms that could not be categorised, and which caught my attention, included:
Paying special attention to The Less | Know The: if I were to add its sum to our results, it would have placed 4th. However, the results we already have look reasonably accurate.
Final Results
Some Notes
Run appeared so high on the leaderboard because both Seth Sentry and Alison Wonderland released similar tracks titled RUN/Run. Since I lowercased all comparisons and removed special characters, these votes merged.
Improving the Analysis
After reviewing the method used for analysis, I have identified a few changes that could improve the results.
- Improved Levenshtein algorithm. The Levenshtein algorithm is great for calculating edit distance, but I could not give a lower weight to edits between similar-looking characters such as t’s, i’s and l’s, which would have improved matching in the presence of OCR inaccuracies. I expect that string matching could have been significantly improved if this had been explored.
- Songs that had long titles, such as The Less I Know The Better, were generally split across multiple lines. This caused their aggregation to not sum correctly. It would be good if I could determine whether a song was split across two lines.
- Songs that were in the format of artist song and were spelt incorrectly were most likely not picked up by string matching, as we only matched against songs and artists individually. To improve matching for these, an additional list of joined songs/artists could have been used and compared against the remaining terms.
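To make the first idea concrete, here is a sketch of an edit distance in which substituting one easily confused character for another costs less than a normal substitution. It was not part of the analysis; the set of confusable pairs and the 0.5 cost are assumptions chosen for illustration, based on the i/l/t mix-ups and the "|" for "I" misread seen in the OCR output above.

```python
# Pairs of characters that OCR commonly confuses (assumed for illustration).
CONFUSABLE = {
    frozenset("il"),
    frozenset("lt"),
    frozenset("it"),
    frozenset("i|"),
}


def substitution_cost(a, b, cheap=0.5):
    """Cost of replacing a with b; confusable pairs are discounted."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return cheap
    return 1.0


def weighted_levenshtein(s, t):
    """Edit distance where look-alike substitutions cost less than 1."""
    previous = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        current = [float(i)]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1.0,      # deletion
                current[j - 1] + 1.0,   # insertion
                previous[j - 1] + substitution_cost(cs, ct),
            ))
        previous = current
    return previous[-1]
```

With these weights, "the less | know the" is only 0.5 away from "the less i know the", whereas an unrelated substitution such as "hoops" to "hooks" still costs a full edit.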
Some Cool Stats
- Triple J tallied 2094350 votes (209435 entries)
- I collected a sample size of ~2.5% of all entries
  - I collected 7191 images initially
  - I categorised 5529 images as votes
  - ~4900 images contained the words “vote/votes/voting”
- My Top 3 results were 100% accurate