Rewriting a framework!

My experience as the architect of a custom ML framework, and its rewrite/refactor journey over 2 years.

Have you ever decided to reorganise your house/place? Writing a framework, or any big project, is somewhat like decorating/organizing your house.

Confused? Ok, I will try to explain my analogy...

When you buy or rent a house, you have a clear idea (most of the time) of how it should look, where to put what, and so on. Once you start living there, you realize your initial idea did not consider some specific scenarios, like space for storing winter jackets, clothes, shoes and all. So you accommodate those things in ad-hoc cupboards or squeeze them into the existing space (which makes everything messy), and you move on with life. In a couple of years, you look back and realize that some of those ad-hoc things should be properly stored, and maybe a 6-seater dining table is not needed. Do you see my point?

I have been writing software for 15+ years, in multiple languages. I have noticed a trend: no matter how much time you spend on writing nice and thoughtful software, within 2-3 years so many ad-hoc things get added (read: features needed as a side effect of existing features) that the code looks out of shape. Some APIs are out of sync (for a valid reason), and you have some redundant code that nobody is using but that you are afraid to remove because it's a nice feature (assuming you have a test suite to avoid issues like "I don't know if this is being used?"). Do you see a messy house, one that you are partially (or fully) responsible for?

In Mar'22 I moved to a new team to write an ML framework for Quants. I joined with no prior experience in building an ML framework; however, I had built 3-4 deployment tools (like Ansible), and just before joining this team I had worked on a QA testing framework. To prepare for the new role, I started reading about existing ML libraries like scikit-learn and TF, and about the tools the team was already using to train their models. I am not going to lie: it took some time to understand what the ML framework should look like.

Fast forward to Mar'23. By this time I had most of the framework ready. It could build 70% of the existing use cases and satisfied the major requirements of being a cloud-native framework with a modular workflow, but unfortunately it had one major blocker! Even though the workflow was as flexible as required, that flexibility was achieved in such a way that adding a new data source or transformation meant creating a new flow file with additional steps. You could use inheritance to avoid duplicating things, but this ended up producing many (30+) flow files with an unwanted hierarchy. I wasn't expecting this many files within 3-4 months of usage. Something was wrong, and it was my understanding of the use case! It was time to re-organize the house...
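To make the problem concrete, here is a minimal sketch of how flow files multiply under inheritance. The class and step names (`BaseFlow`, `DbFlow`, `load_csv` and so on) are made up for illustration and are not the framework's real API.

```python
# Hypothetical illustration: every new data source or transformation
# needs its own flow class, so the hierarchy grows with each variation.


class BaseFlow:
    """A fixed pipeline: load, transform, train."""

    def steps(self):
        return ["load_csv", "scale", "train"]


class DbFlow(BaseFlow):
    """Same pipeline, but reading from a database instead of a CSV file."""

    def steps(self):
        # Replace only the first (loading) step of the parent flow.
        return ["load_db"] + super().steps()[1:]


class DbFlowWithLags(DbFlow):
    """DbFlow plus one extra transformation just before training."""

    def steps(self):
        inherited = super().steps()
        return inherited[:-1] + ["add_lags", inherited[-1]]


# Three variations already mean three classes (or three flow files).
all_steps = [flow.steps() for flow in (BaseFlow(), DbFlow(), DbFlowWithLags())]
```

Each small variation costs one more class, and to know what `DbFlowWithLags` actually runs you have to read its whole ancestry.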

So, after talking to stakeholders and looking at other ML frameworks in the company, we all agreed to come up with a graph-based schema that describes the steps to take and their order. At this time, one of the teams was already migrating their existing models to my newly built framework (which did not scale in terms of model complexity), the business was pushing to finish the framework ASAP, and I was thinking of rewriting it. Given this situation, I decided to fit the new graph-based schema into the existing code so that it stayed backwards compatible. I was not particularly happy about this decision, but losing backwards compatibility within the first 6 months of release was not worth it.
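A graph-based schema of this kind can be sketched in a few lines. This is only an illustration under my own assumptions: the step names are hypothetical, the graph is a plain dict mapping each step to the steps it depends on, and the ordering comes from Python's standard-library `graphlib` (Python 3.9+), not from the framework itself.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical schema: each step maps to the steps it depends on.
model_graph = {
    "load_prices": [],
    "load_fundamentals": [],
    "join": ["load_prices", "load_fundamentals"],
    "scale": ["join"],
    "train": ["scale"],
}


def execution_order(graph):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(model_graph)
```

With this shape, adding a new data source is one more entry in the dict rather than a new flow file.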

Fast forward to Dec'23. I had a code base with all the features required to migrate all existing models without compromising flexibility, but then a new feature popped up: we needed a simulator! The simulator had been planned since the beginning, but it was way down the line. Somehow it got bumped up, and guess what the problem was? If you thought it was the new graph schema, you were right. So, where did things go wrong this time around?

Back in April, when I was working on the graph schema, a couple of the design drivers were:

  1. Steps and their order should be defined by a graph.

  2. Graphs will be created with a low-level API, so creating them should be easy and the graphs should be compact. There are thousands of models and all these configs will live in the DB, so a small size would be beneficial.

I was wrong on point #2. As I started releasing beta changes to the Quants, they started writing scripts with a high-level API so abstract that, to the end user, it was just 2-3 lines of calls. These wrapper scripts became part of the framework, giving it a nicer API for making model graphs. I still have no regrets about choosing a compact graph, but after frequent use I realized that in making it compact I had made it more complex than necessary. This was partially due to maintaining backwards compatibility and partially due to time pressure.
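Here is a rough sketch of what such a high-level wrapper could look like. The function names (`linear_model_graph`, `to_compact_json`) and the step names are hypothetical, and the "step -> dependencies" layout is my assumption for the example, not the real schema.

```python
import json


def linear_model_graph(sources, transforms, model="train"):
    """Hypothetical high-level helper: build a step graph from a few arguments."""
    # Every source is a root step with no dependencies.
    graph = {source: [] for source in sources}
    previous = list(sources)
    # Chain the transforms and the final model step one after another.
    for step in transforms + [model]:
        graph[step] = previous
        previous = [step]
    return graph


def to_compact_json(graph):
    """Serialize without extra whitespace to keep the stored config small."""
    return json.dumps(graph, separators=(",", ":"), sort_keys=True)


# From the end user's point of view, building a model graph is two calls.
graph = linear_model_graph(["load_prices", "load_fundamentals"], ["join", "scale"])
config = to_compact_json(graph)
```

Once users only ever touch a helper like this, how pleasant the stored format is to write by hand matters much less than how easy it is for the framework to read.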

Fast forward to Feb'24, approximately 2 years since I started on the framework. I had never thought about traversing the graphs; the only reason it came up was that we started working on the simulator. It's like designing a house for 2 people, and then you have a kid and suddenly the same house is not kid-proof.
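As one example of the kind of traversal a simulator might need, the sketch below finds every step affected when a single input changes. It is an illustration with hypothetical step names and an assumed "step -> dependencies" layout, not the framework's actual code.

```python
from collections import deque


def downstream_steps(graph, start):
    """Return every step that directly or indirectly depends on `start`."""
    # Invert the "step -> dependencies" mapping into "step -> dependents".
    dependents = {step: [] for step in graph}
    for step, dependencies in graph.items():
        for dependency in dependencies:
            dependents[dependency].append(step)

    # Breadth-first walk from the starting step.
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in dependents[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical graph: each step maps to the steps it depends on.
model_graph = {
    "load_prices": [],
    "load_fundamentals": [],
    "join": ["load_prices", "load_fundamentals"],
    "scale": ["join"],
    "train": ["scale"],
}

affected = downstream_steps(model_graph, "load_prices")
```

A schema that is easy to store is not automatically easy to walk like this, which is roughly the trap I fell into.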

This is the 2nd time I am rewriting or refactoring the code base, but this time around I decided on no backwards compatibility and a simpler graph schema. Also, after 2 years all the use cases (different kinds of models) had been built, so I could make better decisions. The team also supported me in this, as the refactor was almost like rewriting from scratch.

It is like a house: no matter how thoughtful you are about the layout or the furniture, in 3-4 years you grow, or completely new things come up, and you need to change things around. I am sure I did a better job this time, and I hope the code base is stable enough for the next 2-3 years of use cases.

This is my opinion after looking at how languages have evolved (py2 to 3), how async support was added to languages, and how popular libraries have gone through major rewrites. I could be wrong, but if you know a popular codebase which has not changed since its inception and is still being actively developed, please let me know. I am sure there will be something to learn from it.