Your boy has got some time on his hands recently so decided to try out Shopee Code League, which is basically a series of challenges ranging from analytics to alogrithms with some prize money.
The first challenge came in the form of detecting order brushing. For who those who aren’t acquainted with the term (like me), it basically means sellers making fake orders to boost their ranking in the product listings.
The data set provided has columns orderid
, shopid
, userid
and event_time
, where every row is a transaction. The definition of brushing here is when there are 3 or more orders per unique user in a 1 hour window.
I’ll skip the finer details of the submission criteria but essentially all we need to do is to surface the users that are brushing for each shop.
Moving on to my solution; what I had in mind was pretty basic and goes something like this:
- Sort the orders in chronological order grouped by shopids
- For each shop, iterate through every order looking for the next n orders in the 1 hour time window
- Check the n orders against the order brushing definition and pick out the guilty userids
This logic seems pretty bullet-proof to me at that point so off I went to write the code. The time given was 3 hours but since I haven’t touched Python in a while and I can’t remember the last time I typed import pandas as pd
, yours truly took a staggering 2 days.
Anyway finally after 2 days of debugging and hairpulling, I went to submit my solutions which I thought would be a perfect score (I mean the logic I described was how anyone would have done it right?)
Here’s the part where you’d know I was wrong. I got a 0.788 score out of 1.0.
But how?
Well after looking around at some discussion and checking my answers against those who did better, it turns out there’s some nuance to setting the 1 hour window.
Let me paint the picture.
These are the orders for a particular shop. Now if I followed my logic above, I would have identified the highlighted 4 rows as being in the same 1 hour window, by using the first highlighted row as the reference to start the counting. And since there are 2 orders per unique user here, we have no brushing and we move on!
But.
What if the point of reference started at 2019-12-29 22:44:00
instead? Then we would have have only the first 3 rows of that order in the 1 hour window and it most definitely fits the definition of order brushing.
And so the error in my logic is falsely assuming that the 1 hour window always has to start from a order when in fact it doesn’t!
Having realized that, the next thing to do is to rewrite the code. Well that’s the tricky part, how do you translate that logic into code?
Do you iterate through every single possible starting point, which is down to the second? That’s too computationally inefficient.
Maybe we still stick to using every order as a starting reference but we look back from there and see how far back we can adjust the starting reference without including the previous order in the time window?
This sounds like it could be the solution.