In any sport, you have practice and you have competitions. Practice helps you to improve, but what about competition? Is it just about ranking yourself against others? Or is it also a mechanism to improve?
We've done a lot of coderetreats, which are mainly about practice, but this February (2022), we did something different. We participated in a Lean Poker event, which is much more of a competition.
Everybody has a plan until they get punched in the mouth
--Mike Tyson
It turns out you can learn something from competition, but most of that learning doesn't come when you win; it comes after you lose and reflect on what went wrong.
So what was the result? We got punched in the face (metaphorically)...
Competition Format
Lean Poker is a battle royale of poker bots. A new battle to the death commences every 30 seconds, and the bots play poker until only one remains. The winning team gets 5 points added to their score on a global scoreboard that determines the overall winner for the week (the 2nd-place team also gets 3 points).
Every team starts with a fully functioning bot in the language of their choice, but the only thing this bot does is fold.
Note: Lean Poker normally runs in a single day, but we decided to do it in one hour a day sessions over a 5 day week, to accommodate a remote global audience.
The Players
We had 6 teams comprising a total of 22 players. All of them were specially invited, and the teams were balanced according to competencies, but the only two teams that matter are ours, so that's what we are going to tell you about :-)
Both of our teams chose a language where at least one team member was an expert (Ruby for one and Java for the other). Both of our teams chose to mob. Both of our teams followed most XP practices, including TDD.
... things started out the same ...
Session 1
Moments after the competition started, we felt the pressure. All of our bots were folding and losing. Most teams released their first change within 5 minutes.
After we had a bot that did more than just fold, we started to make it smarter. This involved crafting code and objects to represent our hands and help us decide how much to bet. The bots were still very basic: bet on a good hand, fold on a bad one.
The logic defining a good or bad hand was also rather basic. At this point both teams' logic was pretty similar, but that's where things started to diverge.
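To give a flavor of how basic this was, here's a rough sketch (in Java, one of the two languages in play; the names and thresholds are made up for illustration, and folding is modeled as betting zero, so this is not either team's actual code):

```java
// Sketch of the "bet on a good hand, fold on a bad hand" starting point.
// Ranks run 2..14, where 11 = Jack, 12 = Queen, 13 = King, 14 = Ace.
class StarterBot {
    int betRequest(int firstRank, int secondRank, int ourStack) {
        boolean pair = firstRank == secondRank;
        boolean twoHighCards = firstRank >= 11 && secondRank >= 11;
        if (pair || twoHighCards) {
            return ourStack / 2; // "good" hand: bet
        }
        return 0;                // "bad" hand: fold by betting nothing
    }
}
```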
The Middle
Llewellyn's team
We decided that we needed better logic to recognize better hands. We could easily detect a pair, and since three of a kind and four of a kind both contain a pair, we got those for free. We thought we could differentiate ourselves by being able to detect a straight.
If you have only 5 cards ordered from low to high, a straight is easy to detect: [5, 6, 7, 8, 9] is a straight because each card is exactly one higher than the previous one. It's slightly harder with 7 cards, like [2, 5, 6, 6, 7, 8, 9], but we figured it out - eventually.
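For illustration, here's a simplified reconstruction of that kind of 7-card straight detection (not our actual code, and it ignores the ace-low straight A-2-3-4-5):

```java
import java.util.TreeSet;

// Sort the ranks, drop duplicates, then look for a run of five
// consecutive values, e.g. [2, 5, 6, 6, 7, 8, 9] contains 5-6-7-8-9.
class StraightDetector {
    static boolean hasStraight(int[] ranks) {
        TreeSet<Integer> unique = new TreeSet<>();
        for (int rank : ranks) {
            unique.add(rank);
        }
        int runLength = 1;
        Integer previous = null;
        for (int rank : unique) {          // iterates in ascending order
            if (previous != null && rank == previous + 1) {
                runLength++;
                if (runLength >= 5) {
                    return true;
                }
            } else {
                runLength = 1;             // gap in the ranks: restart the run
            }
            previous = rank;
        }
        return false;
    }
}
```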
However, even though we were deploying consistently, we weren't doing better in the battles.
Then we realized that none of the games were even getting to the point of having 5 cards in play. Other bots were going all in on the initial two cards, so it was impossible to raise after the 1st round.
Everything we had written was waste.
Jacqueline's team
With a poker expert on our team, we realized the most important cards to assess were the initial 2 cards (the pocket cards). Our strategy focused on refining our bet based on these 2 cards, but we were still losing... Why?
Our algorithm seemed solid, but when we finally looked at the playbacks, we realized we were consistently being overbet. We would have a good hand and bet 80%, only to have another team bet 100%, and then we would fold.
This forced us to change our strategy to all-in or fold based on those first two cards.
When we did this, we started to improve.
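For the curious, here's a minimal sketch of what such an all-in-or-fold rule could look like (the criteria for a "shove-worthy" hand are invented for illustration; our real criteria came from the poker expert on the team):

```java
// All-in-or-fold based only on the two pocket cards: either match the
// table's pre-flop shoves or get out of the hand entirely.
class AllInOrFoldBot {
    int betRequest(int firstRank, int secondRank, boolean suited, int ourStack) {
        boolean pair = firstRank == secondRank;
        boolean twoHighCards = firstRank >= 10 && secondRank >= 10;
        boolean suitedFaceCard = suited && Math.max(firstRank, secondRank) >= 12;
        if (pair || twoHighCards || suitedFaceCard) {
            return ourStack;  // go all in
        }
        return 0;             // otherwise fold
    }
}
```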
The End
Llewellyn's team
We now had a bunch of code, but we weren't getting the results we wanted, and we realized we needed a better understanding of what was happening in the battles. It was late to start adding logging to our system, but better late than never. Sometimes our logs didn't seem to match what our code should have done, so we also started labeling each version, which let us match production behavior back to the code that produced it.
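Even something as simple as the following (a hypothetical illustration, not our actual code) would have given us what we needed:

```java
// Tag every log line with a version label that gets bumped on each deploy,
// so production output can be traced back to the exact code that produced it.
class VersionedLog {
    static final String VERSION = "v42-straight-detection"; // bumped on each deploy

    static void log(String message) {
        System.out.println("[" + VERSION + "] " + message);
    }
}
```

A call like VersionedLog.log("hand=PAIR bet=200") then shows up in the playback tagged with the deploy it came from.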
This telemetry should have been one of the first things we did, but the other teams were already winning, and we didn't want to be left behind. Our early shortcuts ended up hurting us.
Jacqueline's team
Even in the short amount of time available, we managed to write some fairly confusing code. We knew it did what we wanted, because we used TDD, but we had been skipping the refactoring step.
Everyone on the team wanted to refactor, yet each of us wondered why everyone else seemed to be avoiding it. Finally, we made time for a retrospective, where we discovered that the pressure to release a better bot kept getting in the way. We decided to spend the last session cleaning up the code.
Once we had finished refactoring, it became obvious and simple to improve the code. These improvements helped - we gained 30% on the competition, but by that point, it was too late.
After the Game
Neither of our teams won, but in our failures, there was a lot of learning. Some retrospectives lasted up to 6 hours after the final match. All of us agreed that we made stupid mistakes because of the pressure we were under to perform. All of us could relate to that pressure in our day-to-day jobs.
This pressure resulted in cutting corners and favoring short-term wins that created long-term problems (even just 5 hours later). It also dramatically reduced the amount of experimentation and exploration.
Worst of all, we all knew better, but we still didn't act better.
Ironically, at work, it often seems that this pressure is constructed artificially in the form of deadlines to try to drive better performance.
An odd observation
One question that came up in the final retro was:
"What if 1 team was only allowed to release every 30 minutes?"
Everyone agreed this would put that team at a major disadvantage.
It's obvious in the game that this restriction would be crippling. Yet we normalize monthly deploys in the business world, and no one seems to complain.
Details
Just in case you'd like to see the details of our match: