100 Warm Tunas 2018 Prediction Analysis

Over the space of six weeks, 100 Warm Tunas collected a large amount of data and chugged away at it to make some predictions about what the Hottest 100 of 2018 would look like.

In summary,

  • We collected 6,234 entries (13.6% decrease since 2017 🔻).
  • We tallied 58,463 votes across these entries (12.9% decrease since 2017 🔻).
  • 3.00% of votes were collected via Instagram direct message.
  • Triple J counted 2,758,584 votes.
  • Therefore, we collected a sample of 2.12%.
  • We successfully predicted #1.
  • We predicted 7 out of the top 10 songs.
  • We predicted 15 out of the top 20 songs.
  • We predicted 83 out of the 100 songs played in the countdown.
  • Throughout December and January, 100warmtunas.com was loaded over 105,000 times by over 28,000 users.

You can read the full technical prediction analysis over at the 100 Warm Tunas news site.

100 Warm Tunas has a new home!

100 Warm Tunas has found a new home this year! This year’s results, along with those from the past two years, can now be found at 100warmtunas.com. The existing domain (100-warm-tunas.nickwhyte.com) simply performs a 301 permanent redirect to the new domain, so all existing inbound links should be unaffected.

100 Warm Tunas 2017 Prediction Analysis

Over the space of six weeks, 100 Warm Tunas collected a large amount of data and chugged away at it to make some predictions about what the Hottest 100 of 2017 would look like. Along the way I encountered a bug in the collection process; however, once the data was backfilled, it showed that I had collected a sample around the same size as in 2016.

Summary

  • 100 Warm Tunas collected 7,216 entries (7.3% less than in 2016 🔻).
  • 100 Warm Tunas tallied 67,085 votes across these entries (2.6% more than in 2016 🔺). This is due to improvements in 100 Warm Tunas’ counting and recognition process.
  • Triple J counted 2,386,133 votes.
  • Therefore, 100 Warm Tunas collected a sample of 2.8%. Not bad! (The same as in 2016.)
  • Warm Tunas predicted 8 out of the top 10 songs (same as 2016, ignoring order).
  • Warm Tunas predicted 16 out of the top 20 songs (3 less than in 2016, where 19 out of 20 were predicted, ignoring order).
  • Warm Tunas predicted 83 out of the 100 songs played in the countdown (1 less than in 2016, ignoring order).

Overall, even though the sample size was reasonably consistent between 2016 and 2017, it is clear that the results collected in 2016 were more accurate.

Technical Analysis

The results this year show a more accurate first-place prediction (HUMBLE. was correctly picked to win), whereas last year the top two positions were placed out of order. Looking at the data, however, all other aspects of the prediction stayed almost the same.

To start this analysis, let’s take a look at the top 20 of the official countdown and match it up with their predicted places in Warm Tunas:

| Artist | Title | ABC Rank | Tunas Rank | Difference |
|---|---|---|---|---|
| Kendrick Lamar | HUMBLE. | 1 | 1 | 0 |
| Gang Of Youths | Let Me Down Easy | 2 | 3 | 1 |
| Angus & Julia Stone | Chateau | 3 | 6 | 3 |
| Methyl Ethel | Ubu | 4 | 4 | 0 |
| Gang Of Youths | The Deepest Sighs, The Frankest Shadows | 5 | 2 | 3 |
| Lorde | Green Light | 6 | 8 | 2 |
| PNAU | Go Bang | 7 | 5 | 2 |
| Thundamentals | Sally {Ft. Mataya} | 8 | 10 | 2 |
| Vance Joy | Lay It On Me | 9 | 15 | 6 |
| Gang Of Youths | What Can I Do If The Fire Goes Out? | 10 | 13 | 3 |
| BROCKHAMPTON | SWEET | 11 | 7 | 4 |
| Peking Duk & AlunaGeorge | Fake Magic | 12 | 16 | 4 |
| Khalid | Young Dumb & Broke | 13 | 24 | 11 |
| Lorde | Homemade Dynamite | 14 | 30 | 16 |
| Vera Blue | Regular Touch | 15 | 11 | 4 |
| Jungle Giants, The | Feel The Way I Do | 16 | 32 | 16 |
| Baker Boy | Marryuna {Ft. Yirrmal} | 17 | 12 | 5 |
| Ball Park Music | Exactly How You Are | 18 | 14 | 4 |
| Killers, The | The Man | 19 | 19 | 0 |
| Peking Duk | Let You Down {Ft. Icona Pop} | 20 | 38 | 18 |

Let’s pull apart this table and grab some statistics about how we did with our prediction:

| Predicted | Out Of Top N | Percentage |
|---|---|---|
| 8 | 10 | 80.0% |
| 16 | 20 | 80.0% |
| 22 | 30 | 73.3% |
| 33 | 40 | 82.5% |
| 42 | 50 | 84.0% |
| 50 | 60 | 83.3% |
| 62 | 70 | 88.6% |
| 68 | 80 | 85.0% |
| 78 | 90 | 86.7% |
| 83 | 100 | 83.0% |

So from the above data, it’s apparent that once again:

  • The average error for the top ten ranks was 2.2 positions (an increase from 2016’s 1.9 positions)
  • Warm Tunas predicted 8 out of the top 10 songs
  • Warm Tunas predicted 16 out of the top 20 songs
  • Warm Tunas predicted 83 out of the 100 songs played in the countdown.

That’s not a bad result at all!

The average rank prediction error, grouped into divisions of 10, is provided below. It shows that it’s difficult to predict where songs will place once you leave the top 50:

| ABC Position | Warm Tunas Avg Error |
|---|---|
| 1-10 | 1.9 |
| 11-20 | 8.2 |
| 21-30 | 14.3 |
| 31-40 | 12.5 |
| 41-50 | 15.2 |
| 51-60 | 24.7 |
| 61-70 | 18.2 |
| 71-80 | 29.9 |
| 81-90 | 34.1 |
| 91-100 | 29.5 |

To compare Warm Tunas’ predictions against the actual rankings, a scatter plot is provided below. We can see that as we get closer to rank 1, the 100 Warm Tunas prediction gets better and converges upon the actual rankings played out on the day.

Fortunately, this time around 100 Warm Tunas was able to successfully predict the winner of the countdown. This prediction could be made because the sample collected clearly indicated HUMBLE. as an outlier – an entire 5% higher than the next track, which was predicted to place 2nd.

Anyway, that’s a wrap. See you later this year for 100 Warm Tunas 2018 edition!

100 Warm Tunas 2017 Update 🔥💯

100 Warm Tunas has been happily chugging away for the last month or so. I’ve obtained a fair amount of media coverage too.

A couple of days back, I posted the site to the triplej subreddit. Someone replied to the post telling me my vote count was significantly less than what they had been counting by hand, which made me somewhat suspicious – was there a bug in my Instagram scraping library that I built?

Well, after a bit of debugging early this morning, I found that there was indeed a bug. Not a bug with my scraping library, but rather a bug with how I was using the library:

-    for page in ig.fetch_pages('triplej', per_page=10):
+    for page in ig.fetch_pages(hashtag, per_page=10):
         for post in page.posts:
             if post.is_video:
                 logger.info("Skipping {} because it's a video".format(post.shortcode))

For those who are programmers, you’ll probably spot the issue here. For those who aren’t, the issue is that I had been using a hardcoded hashtag string to collect Instagram votes, when I thought I was collecting from a handful of hashtags.

This has now been rectified and I have kicked off a full re-scrape to back-fill the data.

100 Warm Tunas 2017

Last year I predicted the top 3 in Triple J’s Hottest 100 (ignoring order). This year I’m back at it once again with an updated webpage and a Spotify playlist.

Results are collected, optimised, and processed multiple times per day. Instagram images tagged with #hottest100 and a few others are included for counting.

Happy voting!

Feel free to check out the results from 2016, 2016 results analysis, and the process taken in 2015.

100 Warm Tunas 2016 Prediction Analysis

It’s been a long time since the Hottest 100 of 2016 was aired. Unfortunately, I never really got around to publishing some analysis I performed on the prediction results. Fortunately, I managed to find some time recently!

Looking from afar, the results don’t look fantastic (at least when compared to my results from 2015). The prediction unfortunately placed the top two out of order, but it did manage to predict third place correctly.

Let’s take a look at the top 10 of Triple J’s list and match it up with 100 Warm Tunas:

Triple J Rank vs Tuna Rank

Looking at this, we can draw a few learnings:

  • The average error for the top ten rank was 1.9 rank positions.
  • If 100 Warm Tunas ignored rank and simply guessed the top ten, it would have predicted 8 of the top 10 songs.
  • If 100 Warm Tunas ignored rank and simply guessed the top 3 songs to win, it would have predicted all 3 songs. Woo!

Let’s dive into a chart that shows the error for all ranks:

Rank error per position

From this chart, we can deduce that the further away from position 1 we become, the higher the error. This information alone isn’t very useful. We can get a better understanding of error by finding the average for each ranking group:

Average Rank error per group

As we get closer to rank 1, the results become more and more accurate, however they are not perfect. This is more obvious if we use a scatter plot to compare Triple J ranks against Warm Tunas predictions:

Triple J vs Tunas Scatter Plot

It’s clear now that as we get closer to rank 1, the 100 Warm Tunas prediction gets better and converges upon the actual rankings played out on the day. Unfortunately, this year the difference between rank 1 and rank 2 was way too close to call – just 0.67% of voting volume separated the two, a margin not large enough to provide an accurate prediction of the winner.

Overall, whilst 100 Warm Tunas 2016 did get the top two positions out of order, it’s understandable why this happened. Hopefully this year there is a greater difference between ranks, making it easier to predict the winner in position #1.

Reverse Engineering a 433MHz Motorised Blind RF Protocol

I’ve been doing a fair bit of DIY home automation hacking lately across many different devices - mostly interested in adding DIY homekit integrations. A couple of months ago, my dad purchased a bulk order of RAEX 433MHz RF motorised blinds to install around the house, replacing our existing manual roller blinds.

RAEX Motorised Blind

Note: If you are based in Australia, you can purchase these in bulk or individually via www.raexaustralia.com (Full disclosure – my father runs the site).

The blinds are a fantastic addition to the house, and allow me to be super lazy opening/closing my windows. However, in order to control them you need to purchase the RAEX brand remotes. RAEX manufacture many different types of remotes, of which I have access to two, depicted below:

R Type Remote

R Type Remote (YRL2016)

X Type Remote

X Type Remote (YR3144)

Having a remote in every room of the house isn’t feasible, since many channels would be unused on these remotes and thus a waste of $$$ purchasing all the remotes. Instead, multiple rooms are programmed onto the same remote. Unfortunately due to this, remotes are highly contended for.

An alternate solution to using the RAEX remotes is a piece of hardware called the RM Pro, which allows you to control the blinds from your smartphone using the accompanying app:

RM Pro Home Screen
RM Pro Blind Control Screen

The app is slow, buggy and for me, doesn’t fit well into the home-automation ecosystem. I want my roller blinds to be accessible via Apple Homekit.

In order to control these blinds, I knew I’d need to either:

  1. Reverse engineer how the RM Pro App communicated with the RM Pro and piggy-back onto this
  2. Reverse engineer the RF protocol the remotes used to communicate with the blinds.

I attempted option 1 for a little while, but ruled it out as I was unable to intercept the traffic used to communicate between the iPhone and the hub. Therefore, I began my adventure to reverse engineer the RF protocol.

I purchased a 433MHz transmitter/receiver pair for Arduino on eBay. In case that link stops working, try searching eBay for “433MHz RF transmitter receiver link kit for Arduino”.

Initial Research

A handful of Google searches didn’t yield many results for finding a technical specification of the protocol RAEX were using.

  • I could not find any technical specification of the protocol via FCC or patent lookup.
  • Emailed RM Pro to obtain technical specification; they did not understand my English.
  • Emailed RAEX to obtain technical specification; they would not release it without a confidentiality agreement.
  • I did find that RFXTRX was able to control the blind via their BlindsT4 mode, which appears to also work for Outlook Motion Blinds.
  • After opening one of the remotes and identifying the micro-controllers in use, I was unable to find any documentation explaining a generic RF encoding scheme being used.
  • It may have been possible to reverse engineer the firmware on a remote by taking an I2C dump of the ROM chip. It seems similar remotes allow dumping at any point after boot.

Capturing the data

Once my package had arrived I hooked up the receiver to an Arduino and began searching for an Arduino sketch that could capture the data being transmitted. I tried many things that all failed, however eventually found one that appeared to capture the data.

Once I captured what I deemed to be enough data, I began analysing it. It was really difficult to make any sense of this data, and I didn’t even know if what had been captured was correct.

I did some further reading and read a few RF reverse engineering write-ups. A lot of them experimented with the idea of using Audacity to capture the signal via the receiver plugged into the microphone port of the computer. I thought, why not, and began working on this.

The RF capturing setup

Audacity capture

This captures a lot of data. I captured 4 different R type remotes, along with 2 different X type remotes, and to make things even more fun, 8 different device pairings from the Broadlink RM Pro (B type).

From this, I was able to determine a few things:

  1. The transmissions did not have a rolling code. Therefore, I could simply replay captured signals and make the blind do the exact same thing each time. This would be the worst-case scenario if I could not reverse engineer the protocol.
  2. The transmissions were repeated at least 3 times (this changed depending on the remote type being used).

Zooming into the waveform, we can see the different parts of a captured transmission. This example below is the capture of Remote 1, Channel 1, for the pairing action:

R1, CH1 PAIR capture

Zooming in:

Zoomed R1, CH1 PAIR capture

In the zoomed image you can see that the transmission begins with an oscillating 0101 AGC pattern, followed by a further double-width preamble pattern, followed by a longer header pattern, and then by the data.

This preamble, header, and data sequence is repeated 3 times for R type remotes (the AGC pattern is only sent once at the beginning of the transmission). This can be seen in the first image.

Looking at this data by eye isn’t very useful. I needed a way to digitise it so I could analyse the bits and determine patterns between different remotes, channels, and actions.

Decoding the waveform

We need to determine how the waveform is encoded. It’s very common for these kinds of hardware applications to use a simple scheme such as raw fixed-width bits or Manchester encoding.

By doing some research, I was able to determine that the encoding used was most likely Manchester encoding. Let’s keep this in mind for later.

Digitising the data

I began processing the data using the raw scheme mentioned above (even though I believed it was Manchester). The reason for this is that if it turned out not to be Manchester, I could try decoding it again with another scheme. (Also, writing out raw bits by hand was easier than doing Manchester decoding in my head.)

I wrote out each capture into a Google Sheets spreadsheet. It took about 5 minutes to write out each action for each channel, and there were 6 channels per remote. I began to think this would take a while to actually get enough data to analyse. (Considering I had 160 captures to digitise)

I stopped once I collected all actions from 8 different channels across 2 remotes. This gave me 32 captures to play with. From this much data, I was able to infer a few things about the raw bits:

  • Some bits changed per channel
  • Some bits changed per remote.
  • Some bits changed seemingly randomly for each channel/remote/action combination.
    • Could this be some sort of checksum?

I still needed more data, but I had way too many captures to decode by hand. To get anywhere with this, I needed a script to process the WAV files I had captured via Audacity. I wrote a script that detected headers and extracted the data as its raw encoding equivalent (as I had been doing by hand). This script produced output in JSON so I could add additional metadata and cross-check the captures with the waveform:

[
  {
    "filename": "/Users/nickw/Dropbox/RF_Blinds/Export_Audio2/tracks2/R1_CH1.wav",
    "captures": [
      {
        "data": "01100101100110011001100101101001011010010110011010011010101010101010101010011001101010101010101010101010101",
        "header_pos": 15751,
        "preamble_pos": 15071
      },
      {
        "data": "01100101100110011001100101101001011010010110011010100110101010101001101010011001101010101010101010101010101",
        "header_pos": 46307,
        "preamble_pos": 45628
      },
      {
        "data": "01100101100110011001100101101001011010010110011010010110101010101010011010011001101010101010101010101010101",
        "header_pos": 73514,
        "preamble_pos": 72836
      },
      {
        "data": "01100101100110011001100101101001011010010110011010101010101010100101010101101001011010101010101010101010101",
        "header_pos": 103575,
        "preamble_pos": 102895
      }
    ]
  }
]
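The heart of that script is simple enough to sketch. Below is a simplified, hypothetical version of the pulse-extraction step (the threshold, bit period, and file name are assumptions; the real script also located the preamble and header before emitting data):

import wave
import struct

def read_samples(path):
    """Read a mono, 16-bit WAV file and return (sample_rate, samples)."""
    with wave.open(path, 'rb') as wav:
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack('<{}h'.format(len(frames) // 2), frames)
    return rate, samples

def extract_raw_bits(samples, rate, threshold=5000, bit_us=400):
    """Binarise each sample against a threshold, then turn run lengths
    into raw bits. threshold and bit_us (microseconds per raw bit) are
    assumptions; in practice they were tuned against the captures."""
    samples_per_bit = rate * bit_us / 1_000_000
    bits = []
    run_level, run_length = None, 0
    for sample in samples:
        level = 1 if sample > threshold else 0
        if level == run_level:
            run_length += 1
        else:
            if run_level is not None:
                # Emit one raw bit per elapsed bit period in the run.
                bits.extend([run_level] * round(run_length / samples_per_bit))
            run_level, run_length = level, 1
    bits.extend([run_level] * round(run_length / samples_per_bit))
    return ''.join(str(b) for b in bits)

rate, samples = read_samples('R1_CH1.wav')
print(extract_raw_bits(samples, rate))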

Once verified, I tabulated this data and inserted it into my spreadsheet for further processing. Unfortunately, there were too many bits per capture to keep myself sane:

Raw captures inside a spreadsheet

I decided it would be best if I decoded this as Manchester. To do this, I wrote a script that processes the raw capture data into Manchester (or other encoding types). Migrating this data into my spreadsheet, it began to make a lot more sense.
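The conversion step itself is tiny: in Manchester encoding every data bit is transmitted as a pair of raw bits, so each '01' or '10' pair collapses to a single decoded bit. A minimal sketch (whether '10' maps to 1 or 0 is a convention I’m assuming here; swap the arguments if the output looks inverted):

def manchester_decode(raw, one='10', zero='01'):
    """Collapse pairs of raw bits into Manchester-decoded bits."""
    bits = []
    for i in range(0, len(raw) - 1, 2):
        pair = raw[i:i + 2]
        if pair == one:
            bits.append('1')
        elif pair == zero:
            bits.append('0')
        else:
            # '00' or '11' means we've lost alignment with the bit clock.
            raise ValueError('invalid pair {!r} at offset {}'.format(pair, i))
    return ''.join(bits)

print(manchester_decode('011010100110'))  # -> '011101'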

Manchester captures inside a spreadsheet

Looking at this data we can immediately see some relationship between the bits and their purpose:

  • 6 bits for channel (C)
  • 2 bits for action (A)
  • 6 bits for some checksum, which appears to be a function of action and channel: F(A, C)
    • Changes when action changes
    • Changes when channel changes.
    • Cannot be certain it changes across remotes, since no channels are equal.
  • 1 bit appears to be a function of Action F(A)
  • 1 bit appears to be a function of F(A), thus, G(F(A)). It changes depending on F(A)’s value, sometimes 1-1 mapping, sometimes inverse mapping.

After some further investigation, I determined that for the same remote and channel, F(A, C) increased by 1 for each different action (if you consider the bits to be big-endian).

Encoded value increasing per different action

Looking a bit more into this, I also determined that for adjacent channels, the bits associated with C (Channel) count upwards/backwards (X type remotes count upwards, R type remotes count backwards). Additionally, F(A, C) increases/decreases in step. Pay attention to the C column.

Encoded value increasing with adjacent channels

From this, I can confirm a relationship between F(A, C) and C, such that F(PAIR, C0) == F(PAIR, C1) ± 1. After this discovery, I also determined that there’s another mathematical relationship between F(A, C) and A (Action).

Making More Data

From the information we’ve now gathered, it seems plausible that we can create new remotes by changing the 6 bits of channel data and mutating the checksum accordingly, following the mathematical relationship found above. This means we can generate 64 channels from a single seed channel. That many channels is enough to control all the blinds in the house; however, I really wanted to fully decode the checksum field and, in turn, be able to generate an (almost) unlimited number of remotes.

I wrote a tool to output all channels for a seed capture:

./remote-gen generate 01000110110100100001010110111111111010101
...

My reasoning behind generating more data was that maybe we could determine how the checksum is formed if we can view different remotes on the same channel. I.e. R0CH0, R1CH0, X1CH0, etc…

Essentially, what I wanted to do was solve for the function G in the following equation:

F(ACTION_PAIR, CH0) == G(F(ACTION_PAIR, CH0))

However, looking at all Channel 0’s PAIR captures, the checksum still appeared to be totally jumbled/random:

Identical channels / action jumbled checksums

Whilst looking at this data, however, another pattern stands out. G(F(A)) sits an entire byte offset (8 bits) away from F(A). Additionally, the first 2 bits of F(A, C) sit at the byte boundary and also align with A (Action). As Action increases, so does F(A, C). Let’s line up all the bits at their byte boundaries and see what prevails:

Identified Boundaries
Colours denoting byte boundaries
Aligned byte boundaries
Aligned boundaries

From here, we need to determine some function that produces the known checksum based on the first 4 bytes. Initially, I tried XOR’ing across the bytes:

Attempt to find checksum function via XOR

Not so successful. The output appeared random, and XOR’ing the output with the checksum did not produce a constant key. Therefore, I deduced the checksum isn’t produced via XOR. How about mathematical addition? We’ve already seen some addition/subtraction relationships above.

Attempt to find checksum function via addition

This appeared to be more promising - there was a constant difference between channels for identical type remotes. Could this constant be different across different type remotes because my generation program had a bug? Were we not wrapping the correct number of bits or using the wrong byte boundaries when mutating the channel or checksum?

It turns out that this was the reason 😑.

Solving the Checksum

Looking at the original captures and performing the same modulo additions, we determine that the checksum is computed by summing the leading 4 bytes and adding 3. I can’t determine why a 3 is used here, other than RAEX wanting to make their checksum harder to decode, or to ensure a correct transmission pattern.

I refactored my application to handle the boundaries we had just identified:

type RemoteCode struct {
    LeadingBit uint // Single bit
    Channel    uint8
    Remote     uint16
    Action     uint8
    Checksum   uint8
}

Looking at the data like this began to make more sense. It turns out that F(A) wasn’t a function of A (Action), it was actually part of the action data being transmitted:

type BlindAction struct {
    Name  string
    Value uint8
}

var validActions = []BlindAction{
    BlindAction{Value: 127, Name: "PAIR"},
    BlindAction{Value: 252, Name: "DOWN"},
    BlindAction{Value: 253, Name: "STOP"},
    BlindAction{Value: 254, Name: "UP"},
}

Additionally, the split between channel and remote probably isn’t necessary. Instead, this could just be an arbitrary 24-bit integer; however, it is easier to work with when split into an 8-bit int and a 16-bit int. Based on this, I can deduce that the protocol has room for 2^24 remotes (~16.7 million). That’s a lot of blinds!

I formally write out the checksum function:

func (r *RemoteCode) GuessChecksum() uint8 {
    return r.Channel + r.Remote.GetHigh() + r.Remote.GetLow() + r.Action.Value + 3
}

Additional Tooling

My remote-gen program was good for the purpose of generating codes using a seed remote (although, incorrect due to wrapping issues), however it now needed some additional functionality.

I needed a way to extract information from the captures and verify that all their checksums align with our rule-set for generating checksums. I wrote an info command:

./remote-gen info 00010001110001001101010111011111101010100 --validate
Channel:    196
Remote:     54673
Action:     STOP
Checksum:          42
Guessed Checksum:  42

Running with --validate exits with an error if the guessed checksum != checksum. Running this across all of our captures proved that our checksum function was correct.
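For illustration, the same parsing and validation logic fits in a few lines of Python. The field widths mirror the RemoteCode struct above (1 leading bit, 8-bit channel, 16-bit remote, 8-bit action, 8-bit checksum); reading each field least-significant-bit first (i.e. reversing its bits) reproduces the values in the example capture above:

ACTIONS = {127: 'PAIR', 252: 'DOWN', 253: 'STOP', 254: 'UP'}

def parse_code(bits):
    """Split a 41-bit capture into its fields and validate the checksum."""
    def field(start, width):
        # Each field is transmitted least-significant-bit first.
        return int(bits[start:start + width][::-1], 2)

    channel = field(1, 8)
    remote = field(9, 16)
    action = field(25, 8)
    checksum = field(33, 8)
    guessed = (channel + (remote >> 8) + (remote & 0xFF) + action + 3) % 256
    return {
        'channel': channel,
        'remote': remote,
        'action': ACTIONS.get(action, action),
        'checksum': checksum,
        'checksum_ok': checksum == guessed,
    }

print(parse_code('00010001110001001101010111011111101010100'))
# {'channel': 196, 'remote': 54673, 'action': 'STOP', 'checksum': 42, 'checksum_ok': True}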

Another piece of functionality the tool needed was the ability to generate arbitrary codes to create our own remotes:

./remote-gen create --channel=196 --remote=54654 --verbose
00010001101111110101010111111111010011001    Action: PAIR
00010001101111110101010110011111101101000    Action: DOWN
00010001101111110101010111011111111101000    Action: STOP
00010001101111110101010110111111100011000    Action: UP

I can now generate any remote I deem necessary using this tool.

Wrapping Up

There you have it, that’s how I reverse engineered an unknown protocol. I plan to follow up this post with some additional home-automation oriented blog posts in the future.

From here I’m going to need to build my transmitter to transmit my new, generated codes and build an interface into homekit for this via my homebridge program.

You can view all the work related to this project in the nickw444/homekit/blindkit repo.

As mentioned above, if you are based in Australia, you can purchase these blinds and associated accessories in bulk or individually via www.raexaustralia.com (Full disclosure – my father runs the site)

100 Warm Tunas 2016

Last year I predicted the top 3 results, in order, in Triple J’s Hottest 100. This year I’m back at it again, however now with a webpage and a Spotify playlist.

Results are collected, optimised, and processed multiple times per day. Instagram images tagged with #hottest100 and a few others are included for counting.

Happy voting!

You can read about the process last year here. However, vote collection is a fair bit more accurate this year.




Edit: Woohoo, the Spotify playlist now has just over 1200 followers, and the website has had over 30,000 hits! That’s massive, thanks everyone!

Understanding and Tweaking some GPX data

As a casual bike rider, I enjoy tracking my rides with Strava so I can take a look at how my ride went and how well I performed throughout.

However, very occasionally the Strava tracking application crashes, or gets killed by iOS on my phone, during a ride. This means that no data was recorded between the point at which the app died and the point at which I became aware it had died.

If we plot this type of failure, it looks something like this:

Map with missing data

Fortunately, in this case there wasn’t too much missing data. However, I was still determined to learn about the GPX format and see if I could patch up the GPX file programmatically.

In the specific case of the above map, I was riding north-west when Strava crashed. Between this point and when I pulled out my phone to check my progress, no points were plotted. Google Maps interprets this lack of data as a straight line between the 2 points (as per the GPX specification).

If we crack open the GPX file and take a look, we can see exactly what this looks like:

...
<trkpt lat="-33.9014420" lon="151.1066810">
    <ele>6.6</ele>
    <time>2016-10-28T23:38:50Z</time>
</trkpt>
<trkpt lat="-33.8802920" lon="151.0702190">
    <ele>20.9</ele>
    <time>2016-10-28T23:51:14Z</time>
</trkpt>
...

In its simplest form, a GPX file is an XML document that contains a sequence of GPS points (with associated metadata like elevation and more, depending on the tracker). This makes it reasonably simple for us to get our hands dirty and begin fixing the data set.

In order to add the missing data back into the GPX file, we need 3 things:

  • The last coordinate recorded before the app crashed
  • The coordinate when the app was revived
  • A list of points of the track we want to use for our data points.

Fortunately, I was able to obtain a list of coordinates for the missing data since I travelled the same path on the return journey (As can be seen on the map above).

The other 2 points of interest – where the app crashed and where it was revived – are reasonably easy to find: just look for 2 consecutive data points with a (reasonably) large time gap between them.

In order to process the data, I used a python library called gpxpy which provided some good utilities for reading and processing a GPX file.

With this library, I was able to find the crash point, the revival point, and the list of the points of the track. With this data, I interpolated the start/end times of the crash points onto the track data, and spliced it back into the dataset.
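A rough sketch of that splice using gpxpy is shown below. The file names, the one-minute gap threshold, and the load_return_leg_coordinates helper are all assumptions for illustration; the real script dealt with a few more edge cases:

import datetime
import gpxpy
import gpxpy.gpx

with open('ride.gpx') as f:
    gpx = gpxpy.parse(f)

points = gpx.tracks[0].segments[0].points

# Find the crash gap: two consecutive points with a large time difference.
GAP = datetime.timedelta(minutes=1)  # assumption: over a minute means missing data
gap_index = next(i for i in range(len(points) - 1)
                 if points[i + 1].time - points[i].time > GAP)
start, end = points[gap_index], points[gap_index + 1]

# (lat, lon) pairs for the missing stretch, lifted from the return leg of
# the same ride. load_return_leg_coordinates() is a hypothetical helper.
fill_coords = load_return_leg_coordinates()

# Spread the missing timestamps evenly between the crash and revival points.
step = (end.time - start.time) / (len(fill_coords) + 1)
new_points = [
    gpxpy.gpx.GPXTrackPoint(lat, lon, time=start.time + step * (i + 1))
    for i, (lat, lon) in enumerate(fill_coords)
]

# Splice the interpolated points back into the segment and export.
points[gap_index + 1:gap_index + 1] = new_points
with open('ride-fixed.gpx', 'w') as f:
    f.write(gpx.to_xml())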

After exporting the data set, we achieve a map that looks like:

Map with resolved data

Quite clearly, this approach has a few limitations – for example, the calculated velocity through the interpolated data points is simply an average. However, it did provide me with an improved dataset which I could re-upload to Strava.

You can find all the source for this script on my GitHub.

Accurately Predicting Triple J's Hottest 100 of 2015

In 2014, a prediction was accurately made for the Hottest 100 of 2013. The results were posted on warmest100.com.au.

The author of the prediction in 2014 managed to acquire accurate results because Triple J featured a social share button on their voting page, which posted your votes to your Facebook in text form. The author scraped results from public Facebook posts and aggregated all the votes. They managed to obtain 1.3% (1779 entries) of the expected total vote.

Consequently, voting for the Hottest 100 in 2014 and 2015 did not include such a feature. Fortunately, voters still felt the need to share their votes with their friends, and taking a screenshot or a photo of their screen and posting it to social media was a convenient alternative. Using these images posted to Instagram, I was able to accurately predict the results of Triple J’s Hottest 100 of 2015.

Some Cool Stats Before You Continue

  • Triple J tallied 2,094,350 votes (209,435 entries) for the Hottest 100 of 2015
  • I collected a sample size of ~2.5% of all entries
    • 7,191 images initially collected
    • I categorised 5,529 images as votes
    • ~4,900 images contained the words “vote/votes/voting”
  • My top 3 results were 100% accurate

You’ll probably find this article interesting, but if you’re super eager, you can Skip To The Results.

Taking Advantage of Social Media

I decided to only target votes posted to Instagram, since the vast majority of pictures hashtagged with #hottest100 were in fact votes, there was a reasonably high volume of them, and most were publicly accessible.

I needed a means of acquiring all relevant pictures that had been posted to Instagram. Instagram has an official API; however, your app’s usage must be approved before it can interface with non-sandbox users. Additionally, Instagram imposes a rate limit on non-approved apps, as well as approved apps. I did not have time to waste and wanted results immediately, so I found an alternative.

Fortunately, Instagram exposes a non-public API through the AJAX loading their website performs when you browse to a hashtag. By imitating the web browser with a simple Python script using the requests library, I managed to download all images from the latest back to a cut-off date I specified (the day voting opened).

After scraping the hashtag #hottest100, I expanded my search to #hottest1002015 and #triplejhottest100.

Processing Images

After downloading 7191 images from Instagram, I needed to find an accurate way to filter out the images that were not votes.

I’ve had previous experience with PIL in Python, so I used it to write a simple script that sorted the photos into 2 categories: photos that appeared white-ish, and photos that did not.
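That script amounted to little more than checking how much of each image was close to white. A simplified sketch of the idea (the 200 brightness cut-off and 60% fraction are assumptions, tuned by eye at the time, and the votes/not-votes folders are assumed to exist):

import os
import shutil
from PIL import Image

def looks_like_a_vote(path, white_level=200, min_fraction=0.6):
    """Return True if most of the image's pixels are close to white.

    Screenshots of the voting page are mostly white background, whereas
    photos of screens and unrelated pictures generally are not."""
    img = Image.open(path).convert('L')  # greyscale
    histogram = img.histogram()          # 256 buckets of pixel counts
    white_pixels = sum(histogram[white_level:])
    return white_pixels / sum(histogram) >= min_fraction

for name in os.listdir('downloads'):
    src = os.path.join('downloads', name)
    dest = 'votes' if looks_like_a_vote(src) else 'not-votes'
    shutil.move(src, os.path.join(dest, name))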

A good vote looked like this:

A Good Vote

Unfortunately, not every image ended up in the right folder, and I ended up with both false negatives and false positives. I wasn’t too concerned about false positives, as my OCR processing step would exclude them; I was more concerned about false negatives.

As the image processing and sorting continued, I manually moved false negatives to the positives folder. I calculated that about 5% of the non-matching photos were incorrectly classified, mostly because they were pictures taken of computer screens, similar to the photo below:

A Bad Vote

Some image statistics:

  • 7,191 images collected initially
  • 1,662 images categorised as non-votes
  • 5,529 images categorised as votes
  • ~4,900 images contained the words vote/votes/voting

Improving OCR Performance

After experimenting on raw photos from Instagram, I found that OCR accuracy was poor. To remedy this, I used ImageMagick to flatten and clean up the images, which improved the text results.

An improved image

Bringing in Tesseract (OCR)

After weeding out the junk, I still needed to turn these images into readable text.

Using Google’s Tesseract library, I slowly processed all the images and extracted the text from them.
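These days the easiest way to drive Tesseract from Python is the pytesseract wrapper. The sketch below is not a claim about exactly how I invoked Tesseract at the time; it just shows the shape of the step, including a greyscale/threshold pass similar in spirit to the ImageMagick clean-up above (the 150 threshold is an assumption):

import glob
from PIL import Image, ImageOps
import pytesseract

def ocr_vote(path, threshold=150):
    """Run Tesseract over a cleaned-up version of the screenshot."""
    img = Image.open(path).convert('L')                      # greyscale
    img = ImageOps.autocontrast(img)                         # stretch the contrast
    img = img.point(lambda p: 255 if p > threshold else 0)   # hard threshold
    return pytesseract.image_to_string(img)

for path in glob.glob('votes/*.jpg'):
    with open(path + '.txt', 'w') as out:
        out.write(ocr_vote(path))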

Unfortunately, due to the layout of the Hottest 100 voting website, the two columns were broken up inconsistently across the results.

Some were processed as:

Line by Line processing
...
Flight Facilities
Hayden James
Hermilude
Major Lazer
RUFUS
Weeknd, The
ZHU x Skrillex x THEY.
Jarryd James
Disclosure
Kendrick Lamar
Heart Attack {FL Owl Eyes)
(Radio Edit)
Something About You
The Buzz (Ft. Malaya/Young
Tapz}
Lean On (Ft. Mé/DJ Snake}
Innerbloom
...

And others processed as:

Song/Artist line by line
...
Lucky Luke 1 Day
Mosquito Coast Call My Name
Tn ka Right By You
Tuka L.D.T.E.
Half Moon Run Trust
Spring King City
Tame Impala Let It Happen
Saskwatch I‘ll Be Fine
Jungle Giants. T Kooky Eyes
he
...

And others just did not process at all, due to resolution, colour, skewing, or simply because they were a photo of a computer screen:

Bad Image
'VHotllne Bling
Regardless (Ft. Julia Stone)

Parsing the Results

I processed the results line by line, calling each line a “term”. A term could contain a single song title, a single artist, an artist name together with a song name, or just junk overhang from a previous line. Initially there were 31,062 uncategorised terms.

I processed each term and aggregated the number of votes for each. This worked really well for songs with short names that were less prone to error, such as Hoops, but it did not correctly capture terms where the artist name and song name occurred on the same line, or where the OCR library interpreted a few characters incorrectly.

OCR Inaccuracy & Levenshtein

Even with photo enhancements, the OCR accuracy was somewhat subpar for some votes. Some l’s were interpreted as t’s, i’s as l’s, etc. Additionally, the longer the name of the song, the more prone to error it was.

Fiesh Without Blood
L D R U Keepmo Score Fl Pavqe IV)
Yam: unpala The Les I Knew The Bauer
The Tlouble Wilh US

A technique that can be used to fix these single- and multi-character spelling errors is the Levenshtein edit distance algorithm. Using this algorithm, we can compare 2 strings and determine how many edits need to be made to make the strings equal each other.

In order to perform this kind of matching, we needed an accurate list of songs that were released this year, along with a list of artists that released music this year.

Using Spotify To Help

To acquire an accurate list of songs released this year, I used Spotify and crawled various playlists from 2015. These included Spotify Charts, Triple J Hitlist, and various other genre-alike playlists.

In the end, I had a songs list with 1,781 songs and an artists list with 1,229 artists. After the Hottest 100 aired, I compared the results of the countdown to the songs found in my list, and only 6 songs that occurred in the Hottest 100 were not in my “truth” list.

During list gathering, I made sure to convert all unicode characters to their ASCII counterparts, so that characters with accents and similar would be matched correctly.
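That conversion is essentially a one-liner with the standard library; a sketch of the normalisation applied to every song and artist name (lowercasing included, since all comparisons were done in lowercase):

import unicodedata

def to_ascii(text):
    """Strip accents and other non-ASCII characters, then lowercase."""
    normalised = unicodedata.normalize('NFKD', text)
    return normalised.encode('ascii', 'ignore').decode('ascii').lower()

print(to_ascii('RÜFÜS'))  # rufus
print(to_ascii('Mé'))     # me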

Continuing Processing

Now, carrying reasonably accurate artists and songs lists, we continue categorisation and processing. The processing algorithm worked in the following way (a simplified sketch follows the list):

  1. Load all terms from every image’s .txt OCR result. Every line is a “term”.
  2. Clean all the terms by turning them into lowercase and stripping whitespace.
  3. Loop through each term:
    1. If term exists in our known songs list, move the term to the songs aggregation and count the votes.
    2. If term exists in our known artists list, move the term to the artists aggregation and count the votes.
    3. If couldn’t find it in either of those:
      1. Loop through all artists in our known artists list:
        1. Check if the term starts with the current artist. If it does, split it into the artist and a remaining unknown term, and add the votes to the artist aggregation.
        2. If an artist matched, check if the remaining unknown term exists in the songs list; if it does, add it to the songs aggregation, otherwise add it back to the unknowns. Break out of the loop.
      2. If the term didn’t have a prefixed artist, just add it back to the unknown terms.
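A condensed sketch of that loop is below. Names like terms, known_songs, and known_artists are illustrative only (dicts of vote counts and sets of names), not the exact structures I used:

def categorise(terms, known_songs, known_artists):
    """Move exact and artist-prefixed matches out of the unknown terms."""
    song_votes, artist_votes, unknown = {}, {}, {}
    for term, votes in terms.items():
        term = term.strip().lower()
        if term in known_songs:
            song_votes[term] = song_votes.get(term, 0) + votes
        elif term in known_artists:
            artist_votes[term] = artist_votes.get(term, 0) + votes
        else:
            for artist in known_artists:
                if term.startswith(artist + ' '):
                    # Split into artist + remaining term.
                    artist_votes[artist] = artist_votes.get(artist, 0) + votes
                    rest = term[len(artist):].strip()
                    if rest in known_songs:
                        song_votes[rest] = song_votes.get(rest, 0) + votes
                    else:
                        unknown[rest] = unknown.get(rest, 0) + votes
                    break
            else:
                # No artist prefix matched; keep it as an unknown term.
                unknown[term] = unknown.get(term, 0) + votes
    return song_votes, artist_votes, unknown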

At this stage, we have a reasonably accurate aggregation of results, and we have not yet used Levenshtein string matching. We now have 27,294 uncategorised terms, down from 31,062. So far, our results:

==       Results       ==
1   Hoops                          998
2   King Kunta                     765
3   Lean On                        750
4   The Buzz                       646
5   Like Soda                      568
6   Never Be                       484
7   Let It Happen                  476
8   Magnets                        465
9   Do You Remember                409
10  Ocean Drive                    405
==    853 unique terms    ==

==       Top Unknown Terms       ==
1   Your Hottest 100 Votes:        2279
2   Your Votes                     2127
3   }                              320
4   Hottest Io                     248
5   V                              231
6   Throne                         222
7   Triple J?                      209
8   D] Snake                       203
9   The Less | Know The Better     203
10  Asap Rocky                     199
==    27294 unique terms    ==

However, we still haven’t aggregated any votes that had spelling errors due to OCR inaccuracies.

Employing the Levenshtein algorithm, we continue to process the unknown terms. I configured matching to allow leniency based on the length of the term – the maximum number of edits allowed was 2/5 * the length of the term. The process continues as follows (a sketch of this matching loop follows the list):

  1. For all unknown terms:
    1. Check term length > 3. Break if <= 3. Can’t match a short string.
    2. Match Songs:
      1. Loop through all songs in known songs list:
        1. Compare current song to current term. Get edit distance.
        2. If edit distance == 1, move votes for this term to the guessed song in our songs aggregation, then continue to the next term.
        3. Add distance to a dictionary of value/distances
      2. Using our value/distances dictionary, find the closest match that satisfies our 2/5 * len(term) rule. If it matches, move the votes for this term to the guessed song in our songs aggregation, then continue to the next term.
    3. Match Artists using the same method.
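A simplified sketch of that fuzzy-matching pass is below; the levenshtein helper is a plain dynamic-programming edit distance (a library such as python-Levenshtein does the same thing much faster), and the immediate "distance == 1" shortcut from the list above is omitted for brevity:

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def best_match(term, candidates):
    """Return the closest candidate within the 2/5 * len(term) budget."""
    if len(term) <= 3:
        return None  # too short to match safely
    distances = {c: levenshtein(term, c) for c in candidates}
    closest = min(distances, key=distances.get)
    if distances[closest] <= 2 * len(term) / 5:
        return closest
    return None

print(best_match('hoine bling', ['hotline bling', 'lean on']))  # hotline bling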

Some of the results of the string matching, showing some reasonably accurate re-matching:

[A] weekncl, the -> weeknd, the with distance 2
[A] mm m. -> ms mr with distance 2
[S] km; kunta -> king kunta with distance 3
[A] macklelllore ex ryan lewis -> macklemore & ryan lewis with distance 5
[A] eulsch duke) -> deutsch duke with distance 3
[A] bloc pany -> bloc party with distance 2
[S] nommg's forevev -> nothing's forever with distance 5
[S] t he hllns -> the hills with distance 3
[S] emocons -> emoticons with distance 2
[S] better off without you -> better with you with distance 7
[S] - the less | know the better -> the less i know the better with distance 3
[S] vancejoy fire and the fiood -> fire and the flood with distance 10
[S] too much me togglhu -> too much time together with distance 6
[A] of mons-us and m. -> of monsters and men with distance 5
[S] gmek tragedy -> greek tragedy with distance 2
[S] marks to prove 1t -> marks to prove it with distance 1
[A] rlighx facilities -> flight facilities with distance 2
[A] gang 01 youth: -> gang of youths with distance 3
[A] fka lwlgs -> fka twigs with distance 2
[S] hoine bling -> hotline bling with distance 2

After performing this additional processing, I ended up with 18,509 uncategorised terms, down from 27,294.

That means we were able to successfully categorise 8,785 terms via the Levenshtein distance algorithm!

==       Results       ==
1   Hoops                          1011
2   King Kunta                     1008
3   Lean On                        793
4   The Buzz                       667
5   Let It Happen                  637
6   Like Soda                      617
7   The Less I Know The Better     602
8   Magnets                        521
9   Never Be                       520
10  The Trouble With Us            501
==    1143 unique terms    ==

==       Top Unknown Terms       ==
1   Your Hottest 100 Votes:        2279
2   }                              320
3   Hottest Io                     248
4   V                              231
5   Throne                         222
6   Triple J?                      209
7   Thanks For Voting!             174
8   Tapz)                          170
9   Suddenly                       155
10  Once                           140
==    18509 unique terms    ==

Quite an improvement, but still not great. Some of the terms that couldn’t be categorised caught my attention:

9   Suddenly                       155
16  Big Jet Plane                  123
17  Heart Attack                   120
18  True Friends                   114
23  Rumour Mill                    107
35  The Less | Know The            76
63  & Chet Faker The Trouble With Us 46

Paying special attention to The Less | Know The: if I were to add its count to our results, it would have placed 4th; however, the results we already have look reasonably accurate.

Final Results

==       Results       ==
1   Hoops                          1011
2   King Kunta                     1008
3   Lean On                        793
4   The Buzz                       667
5   Let It Happen                  637
6   Like Soda                      617
7   The Less I Know The Better     602
8   Magnets                        521
9   Never Be                       520
10  The Trouble With Us            501
11  Do You Remember                480
12  Ocean Drive                    463
13  Can'T Feel My Face             457
14  You Were Right                 444
15  Middle                         423
16  Magnolia                       381
17  Young                          380
18  The Hills                      369
19  Hotline Bling                  356
20  Keeping Score                  321
21  Embracing Me                   319
22  Mountain At My Gates           318
23  Loud Places                    300
24  Run                            298
25  I Know There'S Gonna Be        287
26  Some Minds                     287
27  Say My Name                    283
28  Fire And The Flood             280
29  Visions                        275
30  Greek Tragedy                  274
31  Long Loud Hours                272
32  Shine On                       254
33  Asleep In The Machine          249
34  Leave A Trace                  242
35  Like An Animal                 235
36  Something About You            224
37  Dynamite                       224
38  All My Friends                 218
39  Deception Bay                  217
40  Downtown                       210
41  Ghost                          200
42  Son                            196
43  Hold Me Down                   196
44  No One                         196
45  Kamikaze                       196
46  Puppet Theatre                 192
47  Vice Grip                      191
48  Forces                         185
49  Better                         185
50  Counting Sheep                 184
==    1143 unique terms    ==

Some Notes

  • Run appeared so high on the leaderboard because both Seth Sentry and Alison Wonderland released similar tracks titled RUN/Run. Since I lowercased all comparisons and removed special characters, these votes merged.

Improving the Analysis

After reviewing the method used for analysis, I have identified a few places for improvement that could possibly improve the results.

  1. Improved Levenshtein algorithm. The Levenshtein algorithm is great for calculating edit distance; however, I could not weight edits between visually similar characters such as t’s, i’s, and l’s more cheaply, which would have improved matching against OCR inaccuracies. I expect string matching could have been significantly improved if this were explored.
  2. Songs that had long titles, such as The Less I Know The Better generally were split across multiple lines. This caused their aggregation to not sum correctly. It would be good if I could determine if a song was split across two lines.
  3. Songs that were in the format of artist song and were spelt incorrectly were most likely not picked up by string matching, as we only matched against songs and artists individually. In order to improve matching for this, an additional list for joined songs/artists could have been used and compared against for remaining terms.

Some Cool Stats

  • Triple J tallied 2,094,350 votes (209,435 entries)
  • I collected a sample size of ~2.5% of all entries
    • 7,191 images collected initially
    • I categorised 5,529 images as votes
    • ~4,900 images contained the words “vote/votes/voting”
  • My top 3 results were 100% accurate

Sprinkle

Here’s a quick demo of something I jammed together over the weekend for my Dad. More info to come, along with additional pictures, circuitry, and some proper screenshots.

Basically it’s an iOS app to control solenoid valves via a Raspberry Pi over a JSONRPC interface.

CSESoc Hackathon

A couple of weeks back, the society I am a member of at uni hosted a hackathon event, sponsored by Freelancer. For the uninitiated, a hackathon is an event where programmers literally turn pizza and drink into applications/code. (But in all seriousness, it’s an event where programmers develop a cool idea in a small timeframe and compete to be the ‘best’ product.)

I formed a team with 2 friends from Uni. We set out to build a web platform for students of UNSW to list projects they have worked on in an easy to use web directory that they could use for employment and their own portfolio.

The webapp is written in Python using Flask, uses MySQL as the backend (because Mongo hates many-to-many relationships), and uses Bootstrap to style the frontend, which is statically served from the server.

We wanted the following features from the service:

  • A project has:
    • Web URL
    • Download URL
    • Marketing URL
    • Markdown formatted description
    • Ability to upload screenshots of the project
    • A project can have multiple contributors
  • Project Page:
    • Showcase of all projects the user has worked on
    • About me for the user
    • Show who the user follows
    • Show who is following the user
  • Home Page/General:
    • A-Z listing of all projects
    • Show latest 3 projects on the home page “ShowCase”
  • Logins use UNSW’s LDAP service, so it’s all UNSW SSO.

There are some additional features we wish to work into it, such as reading README.md from github projects.

There are a few bugs hanging around still, along with some non-implemented features, such as multi contributors for a project. We’ll eventually get around to these, and finally launch it!

We plan to put it up on http://showc.se/, a domain I purchased for the project. It’s a nice play on words: read as a regular expression it matches “ShowCase”, and it’s also a nod to CSE – Computer Science and Engineering.

It’s probably important to note that we came first in the Hackathon, each of the team members winning a UE Boom portable bluetooth speaker thanks to Freelancer!

Stick around for more – I’ll update this post when it’s live!

CTF Season

It’s currently CTF season, and as a member of UNSW’s security society, that means I get to play!

We began the season with CSAW CTF, where we (team K17) placed 1st in Australia/10th overall.

I did not participate in this CTF as much as I would have liked to, since I was already preoccupied with the CSESoc Hackathon. However, I did lend a hand with Web 500 – a fake dating website where the aim was to recover Donald Trump’s TOTP key as well as his password. I managed to solve half of the challenge by finding an SQL injection in a CSP reporting endpoint, which let me dump a password hash and other info about the account. We recovered the password using a dictionary attack; however, the full solution required dumping the source code to determine how the TOTP key was generated, which another member of the team did, and thus solved the challenge.

The following weekend, Trend Micro CTF was running, which K17 also played in. We ended up coming in at 1st place globally out of 359 teams – a fantastic effort. Once again, I only participated lightly in this CTF. I worked on an Android APK reversing challenge, which I solved over the space of 2 hours. I will post a write-up of this challenge soon!

Additionally, I was selected to play in CySCA (Australia’s Cyber Security Challenge) for UNSW3. UNSW entered 5 teams. My team (of 4) placed 3rd overall in the competition, but the entire UNSW effort was also amazing:

  • 1st: UNSW1
  • 2nd: UNSW2
  • 3rd: UNSW3
  • 4th: UNSW4
  • 29th: UNSW5

I’ll be posting my write-ups over the next few weeks, explaining my solutions to the problems that I solved in these CTFs!

A Brilliant Explanation of PID

PlaylistGrabber 1.1 (UI Revisions)

Due to the large response to the initial release of PlaylistGrabber, I have quickly revised some of the UI and functionality to bring it up to scratch with user expectations.

Changelog:

  • Added App Icon (Green iTunes yeah close enough)
  • You can now select a playlist by clicking the entire row, not just the checkbox
  • PlaylistGrabber now remembers what playlists you selected last time
  • You can save your progress of playlist selection by pressing “Save Selected”
  • PlaylistGrabber now remembers what XML File you last used. Quick and Easy Startup.
  • PlaylistGrabber skips copying files that already exist in the destination (good if you’re using a cloud service). It will however re-generate M3U Files so you will get playlist updates.
  • Tableview now has pretty icons
  • Tableview nests items inside folders at their level as depicted within iTunes. Wish I could indent the icon too.
  • “About” window updated.
  • Auto build incrementing (current release is build 110)
  • Automatically quits app on window close.

Things I’d Like it to do better

  • Nest icons for folder indentation levels
  • Delete songs when a playlist is de-selected. This could prove quite tricky

I’m pretty happy with what I have achieved over the few hours I’ve worked on this project. And after all, I’ve learnt how to program for Mac OSX.

You can obtain the source and self-compile here.

PlaylistGrabber for OSX/iTunes

Spotify was great. I had my music everywhere, could add new music without a computer, however it lacked in a major area – Playlists. It seems to be a growing trend of music players to suck at this. Being unable to shuffle all music on the device is also a massive drawback. I will be cancelling my Spotify subscription once my 3 month trial is over.

Let me introduce PlaylistGrabber for OSX. This is my first Cocoa application, targeted at 10.8 and upwards (not tested much). PlaylistGrabber reads your iTunes library XML file and allows you to choose playlists to export. It creates a folder structure that you can drag and drop onto your device (or export directly to the device if it has mass storage capabilities). It exports playlists in M3U format and understands that duplicate songs in different playlists are the SAME song – so no stupid duplicates in your library, just as iTunes handles it.
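For context, an M3U playlist is just a plain-text list of file paths, optionally with an #EXTM3U header and per-track #EXTINF metadata, so two playlists can simply point at the same copy of a song. The paths below are purely illustrative:

#EXTM3U
#EXTINF:214,Some Artist - Some Song
../Music/Some Artist/Some Album/01 Some Song.mp3
#EXTINF:198,Another Artist - Another Song
../Music/Another Artist/Another Album/03 Another Song.mp3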

Due to the nature and simplicity of M3U playlists, most music players understand these, including PowerApp, Samsung Music App and Google Play Music. This is good news, as now you are free to roam to other solutions than DoubleTwist for all your iTunes Syncing needs.

Eventually I will tidy up the application, however at this present time, I do not have enough time to do so. Eventually I would like to make the app do the following:

  • Save chosen playlist preferences for re-loading later on if a user decides they want to re-sync. (Implemented in the newest version.)
  • Sync daemon – watches for library changes and writes them to a sync directory, from where you could auto-sync with Google Drive
  • Wifi sync (with a client app on the phone)
  • Better async handling so the program doesn’t appear to “lock up”

If you come across any bugs or have any suggestions, let me know, it’ll be nice to track them in the future for new releases in summer this year.

Latest version is available here

Bridging the Gap: iTunes to Android

I hate proprietary things. Specifically iTunes. It’s a great music manager, it’s robust, stable, and has a brilliant store. It also offers fantastic integration with iOS devices. As much as that is great for iOS users, it sucks for anyone on Android. In my opinion this is driving people to move to streaming music services such as Google Play Music or Spotify (both of which I am contemplating).

There are various tools available which bridge the gap. Apps such as DoubleTwist for Mac & Android do the job, however they’re just too clunky.

For me, leaving my iTunes Library and picking up some new music management tool would be a big hassle. iTunes has all my music, including over 300 of my playlists. I need a program to bridge iTunes and my Android Device.

At the moment I’m on the verge of writing a tool to extract music from iTunes, track playlists, and keep music in sync on my Android phone. My roadmap for the tool is as follows:

  1. Read iTunes Library & Copy Playlists onto device without duplicates
  2. Synchronise Playlists with the device, so songs removed in iTunes are removed on the Phone.
  3. Write metadata to the device for playlists
  4. OR Write a music playing app – however this may be overkill.

Anyway, this will be a learning process. I’m aware tools are available to do this task; however, building one tailored to my needs is possibly the best solution in my case. Looking at the iTunes Library XML files, they look relatively easy to work with, and maybe eventually I’ll set up an auto importer for my “external purchases” to automatically add them to my library and put them in the right folder on my computer.

4 Monitors, 1 GTX660TI

The other day my VESA wall mount arrived in the mail. It’s a simple $16AUD wall mount, able to hold about 10KG. I drilled the holes in my wall, and hung up my 4th monitor – another acer, except it’s only 20″ (V203H).

I managed to get all four monitors to run from my GTX660TI inside OSX (finally) by installing an NVIDIA driver for CUDA (which I guess comes with a display driver). Two monitors are running from DVI, one from HDMI to DVI, and another using DisplayPort to DVI.