This page will be about how I spent my day yesterday debugging an interesting problem. I got to work... and then a p0 bug dropped onto me. The CEO of the company was like - get this fixed ASAP. It's kinda scary to me because bug fixing to me... is mostly investigation. And thus, there's like no real eta. I'm searching for the problem and I have no idea how long that will take. Once I find the problem, I can probably estimate the time it'll take to implement the fix. Anyways... they wanted it fixed within a day.
So bug fixing under time pressure probably made me perform worse because I'm like... kinda nervous and anxious which hurts my cool logical mind. I think when you do programming you need to always maintain your calm and whenever you get frustrated you need to take a break.
So anyways. The issue is that a bunch of our notifications were not getting sent. So what did I do? Of course I looked through the code. Hmmm. The code looks okay. I traced through it. Everything looks fine. There is an asynchronous task that triggers when the notifications get sent out. The task was running to completion no problem. And... when I manually ran what the task would've done... I got a notification on my phone. Hmmmm. This shit should be working.
Fuck. I'm kinda in nervous mode because. WTF. The code should work. The task runs. When I run the code.. I get a push. But, undoubtedly... the notifications aren't being sent. So I go through the logs. Sometimes this asynchronous task doesn't run to completion. Hmmm. Maybe that's it? I go talk to one of the tech leads about this idea.
Yo. This task sometimes doesn't finish. Maybe that's why notifications aren't being sent out? The answer: but the notifications are asynchronous. They just get dumped into a queue that a separate task handles. And besides, the failure is random while the notifications are consistently failing. Right. Guess that's not it.
So now. I'm still somewhat like WTF. Ugh. I have no real idea why shit ain't working. Then my previous experience kicks in and reminds me of a time this happened before. At my old company the notifications broke because of scaling. At some point... our users exceeded our memory allocation. So when you tried to load all the users at once... the task would crap out. I look at the code to see if it fails on large batches of notifications. Nope. Memory is fine. Ugh. Back to square one.
At this point this is what I know. The task runs. The task runs to completion most of the time. The notifications get sent when I manually run the code. And then I'm thinking... theory isn't working out here. I tried looking at this from a theoretical code-flow perspective and it doesn't work. Time to do some testing. Real life testing. I credit one of the tech leads / mentors who recently refreshed me on this technique. Sometimes if you don't know what's gonna happen. Just code and try shit out. Wonder how your query will perform? Run it on staging.
Okay. How would this shit break? I see there's a countdown parameter on the celery task. We use celery for our asynchronous workers. Countdown is a parameter that lets you delay the execution of a task. What was happening was that we were delaying each push by a counter that went up by one for every notification. So if there were 1000 notifications, the 1000th one would get started on the 1000th second.
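To make the staggering concrete, here's a rough sketch of the idea in plain Python. The function and notification names are made up for illustration and this is not our actual code. In the real thing each delay would be handed to celery as the `countdown` argument of `apply_async`.

```python
from typing import List, Tuple


def stagger_countdowns(notification_ids: List[str]) -> List[Tuple[str, int]]:
    """Pair each notification with a delay (seconds) equal to its 1-based position.

    Hypothetical helper: it only models the "counter as countdown" idea.
    """
    return [(nid, position) for position, nid in enumerate(notification_ids, start=1)]


# 1000 fake notifications, named purely for the example.
schedule = stagger_countdowns([f"notification-{n}" for n in range(1, 1001)])

print(schedule[0])   # the first notification waits 1 second
print(schedule[-1])  # the 1000th notification waits 1000 seconds
```

So the delay grows linearly with the size of the batch, which is harmless on its own and only bites once something else puts a cap on how long a task may wait.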
Okay. Let's try running the code manually with a countdown of 200. I checked production values... we have over 1000 users that we should be notifying on production. I wait for 200 seconds. Nothing. Hmmmmm. Could this be it? I'm hopeful. So then I look for celery documentation on countdown and expiry. We use expiry to prevent outdated notifications from being sent. And our expiry was set to 60 seconds... while with 1000 users the countdown would be 1000 seconds. Maybe? Google the celery documentation. Check stack overflow. Fuck. There's nothing on whether it'll work if countdown > expiry. Would it add the countdown to the expiry?
I dig into the documentation. It looks like the expiry gets eventually saved as a datetime. And the countdown... looks like it also gets saved as a datetime. Even though you can specify both as timedelta values. Hmmm. Okay. I give up on googling. Why is the documentation so horrible? Maybe I should've looked at the source code. But... fuck it. Just run some tests.
Well... I tried a countdown of 70 against the expiry of 60. Nothing. Okay. It fails consistently. Code is deterministic. It ain't magic. I believe that this is the issue. I make a PR with my code and tag people to review.
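Here's how I'd model that experiment without a broker. It's a sketch under one assumption, which is the thing I was trying to confirm: both countdown and expiry get turned into absolute times measured from the moment the task is queued, and a task whose start time lands after its expiry time never runs. `will_run` is a made-up name and the timestamp is arbitrary.

```python
from datetime import datetime, timedelta


def will_run(countdown_s: float, expires_s: float, queued_at: datetime) -> bool:
    """Assumed model: resolve both values to absolute times, then compare.

    The task is treated as dead if it would start after it has expired.
    """
    eta = queued_at + timedelta(seconds=countdown_s)
    expires_at = queued_at + timedelta(seconds=expires_s)
    return eta <= expires_at


queued_at = datetime(2024, 1, 1, 12, 0, 0)  # arbitrary timestamp for the example

print(will_run(70, 60, queued_at))  # False: starts 10 seconds after it expired
print(will_run(10, 60, queued_at))  # True: starts well before the expiry
```

Under this model the countdown is not added to the expiry, so any delay longer than the expiry silently kills the task. That matches what I saw in the 70 vs 60 test.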
Two people look at my code and they're kinda confused. I explained my thinking kind of weakly. Basically I said.. through testing I think this is the problem. So let's try this out and see what happens. Neither of them felt comfortable giving me the thumbs up to deploy it.
Time to get one of the top 2 gosu tech leads to take a look. He takes a look at the PR and confirms my beliefs. If you have countdown > expiry then the task won't run because it'll expire. He knows this because he tested it out. Sweet. He's fairly confident that my fix is correct. Even I don't have total confidence that it'll fix the issue. I'm kind of operating on a hunch. I know that it'll fail for counter > 70, and the counter is > 70 for ~90% of the notifications.
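The "~90%" is back-of-the-envelope math. A quick sketch, assuming exactly 1000 notifications with countdowns of 1 through 1000 seconds (production had "over 1000", so treat the numbers as a ballpark). `expired_fraction` is a name I made up for this.

```python
def expired_fraction(total: int, threshold_s: int) -> float:
    """Fraction of tasks whose countdown (1..total seconds) exceeds the threshold."""
    expired = sum(1 for countdown in range(1, total + 1) if countdown > threshold_s)
    return expired / total


print(expired_fraction(1000, 60))  # 0.94 -> countdown beyond the 60 second expiry
print(expired_fraction(1000, 70))  # 0.93 -> countdown beyond the 70 I actually tested
```

Either way, the vast majority of the batch sits past the expiry, which lines up with notifications failing consistently rather than randomly.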
So I deploy the fix and I just wait. Wait to see if the task will trigger and I'll get notified. I have to wait for over 24 hours because I think it triggers quite rarely. All this time.. I'm wondering if I actually implemented the correct fix. It's because the code is part of the system that I don't have a complete understanding of. Some parts of it are somewhat magical to me. Even though I know in theory how it works. Like how my two coworkers weren't comfortable signing off on my code. The amazing tech lead didn't see the code as magic. He believed strongly that it would fix it. More so than me. I think that when you see code as magic... it's a sign that you're unfamiliar with the technology / code (in this case it affected all 3 of us) or that you don't believe in programming. Programming isn't magic. Even though sometimes it seems like it.
I guess what I want to say is. Although I fixed the issue and believed that I'd be capable of fixing it, there was still some element of guessing and kinda taking a leap of faith. I tested the case where it fails, and the change should in theory fix it. But... I didn't have complete belief in the code. I guess that's something that I still need to work on. I just got a notification earlier today so my code works. ^_^
The good thing is that the other two engineers who I asked to review my code are really good. And even they couldn't figure it out... or they weren't confident in it either. So I guess it's okay for me to be not as confident. I think that as I get better and better... it's like increasingly hard to get people to help you. Because you need people who are better than you to magically help you. Well.. I guess I should just work to reach the magic level of the rock star tech lead.
tl;dr - interesting bug. external pressure and deadline made me nervous. fixed it correctly based on testing / guessing / a hunch. I feel kinda badass