Reproduce everything!
Reproducibility is a must-have in development, research and all software systems.
Yet, we often forget about it.
In other fields, experts struggle without it and wish their battlefields were more reproducible. The natural sciences, and many things concerning the human body and mind, are very difficult to reproduce. In software systems, we are in the very favorable position where most things are deterministic. Thus, they should be fully reproducible too.
Yet, they often are not.
Here is this article's message in one sentence: make everything you do easily reproducible.
The community will benefit, because it will be easier to learn from what you've done.
And if the community benefits, you benefit too---reputation, visibility, status.
Last but not least: every now and then you will want to use something you did a long time ago, and you will wish you had done it in a reproducible way. I know because it happens to me very often.
But then... why isn't it?
Why isn't making everything reproducible the default, normal thing to do?
I have some guesses about this.
I believe that making a project reproducible requires more careful planning than usual. In other words, it requires thinking before hacking.
It's way easier to jump into some easy scripting that solves the problem at hand, and then move on.
Reproducibility requires a slightly slower approach, in the sense that one must plan ahead and ask some questions. For instance:
- Of all this, what will I want to reproduce?
- What parts of this project can be useful to me, or to others, shortly?
- What about 6 months from now? And in 1 year?
Sometimes the rush to "getting it done" leaves little space for serious planning.
Here is the good news: making reproducibility a habit is very easy!
It's easy because it's also fun, so after you've done it a couple of times it will just feel very natural. I know because that's what happened to me.
Also, there are great tools to make it easier, and more fun.
A recent (coding) story
This past week I was doing some machine-learning experiments and was searching for existing work on text paraphrasing: you have a sentence, and you want to change some words in it while keeping the same meaning.
I googled a bit and ended up on this GitHub repository: Vamsi995/Paraphrase-Generator
It seemed like solid work to me. Sometimes projects that use large deep network models are very difficult to run locally, and would otherwise cost a lot of money if run in a cloud environment. But this one seemed easy enough, even for a newbie like me.
Then I saw that the project is mentioned on the Hugging Face website, which is more or less a guarantee of good work.
The only problem? It required installing a few things manually.
Not too many, I must say, but still I was hoping to run something in a couple of clicks, or maybe even to just use an API to experiment a bit. That wasn't the case.
Well, so I cloned the project, followed the instructions and was able to get what I needed on my laptop.
There were a few twists along the way, especially because the code was meant to be used via a UI (built with Streamlit), but I wasn't very interested in that. So I looked inside the code and found the block that was used by the UI, basically the backend. That's what I needed.
I made a couple of changes to the code so that it would run locally as a small JSON API, without the UI, just for sending a sentence to it and receiving back the text paraphrase.
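For illustration, here is a minimal sketch of what such a small JSON API could look like. It is not the code from the repository: the `/paraphrase` path, the `sentence` and `paraphrase` field names, and the `paraphrase()` stub (which simply echoes its input instead of calling the model) are all assumptions made for this example. It uses only Python's standard library, and port 5000 is chosen to match the `docker run` command shown later.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def paraphrase(sentence):
    """Hypothetical stand-in for the real model call: echoes the input."""
    return sentence


def handle_payload(raw):
    """Turn a raw JSON request body into (HTTP status, response dict)."""
    try:
        sentence = json.loads(raw or b"{}")["sentence"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing key, or a body that is not a JSON object.
        return 400, {"error": "expected a JSON object with a 'sentence' field"}
    if not isinstance(sentence, str):
        return 400, {"error": "'sentence' must be a string"}
    return 200, {"paraphrase": paraphrase(sentence)}


class ParaphraseHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only one (hypothetical) endpoint is served.
        if self.path != "/paraphrase":
            status, body = 404, {"error": "not found"}
        else:
            length = int(self.headers.get("Content-Length", 0))
            status, body = handle_payload(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(host="0.0.0.0", port=5000):
    """Build the HTTP server; call .serve_forever() on the result to run it."""
    return HTTPServer((host, port), ParaphraseHandler)

# To start it: make_server().serve_forever()
```

Keeping the request logic in `handle_payload` separate from the HTTP handler means the real model call could be swapped in for `paraphrase()` without touching the server code.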
It worked, and it was fun!
Using this small API running on my laptop, I ran the experiments I had wanted to do before I got my hands on that project. My experiments were just nonsense, but that is another story...
Once I was done, I stopped for a minute and thought: Could somebody else need the same thing I needed?
My go-to rule is that if something is useful for me, then there must be somebody else who needs it too.
So I wrote a Dockerfile to reproduce the entire project (its backend, actually) in just one command. The Dockerfile was just 7 lines!! Here it is:
FROM tensorflow/tensorflow
RUN apt-get update
COPY ./requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
COPY . .
WORKDIR ./Server
CMD ["python", "server.py"]
It means that even you, without any knowledge of what the code is doing, could clone the project and have a working text-paraphrase API on your laptop in just 2 lines:
docker build -t paraphrase .
docker run -p 5000:5000 paraphrase
Once I was done, I submitted a pull request to the project maintainer. My request was approved in less than one hour!
Here is why I did this, and what I got:
- If I ever need the same project, now I can reproduce it in just 2 lines. Less than five minutes total time!
- Community contribution: that project on GitHub was already very good, and now it can be used by anyone, machine-learning expert or not, in no time.
- One more pull-request accepted in my GitHub profile!
The tools
You guessed it: my favorite tool at the moment to make my projects reproducible is Docker.
I do some work as an API developer, and in that case I always have a Dockerfile handy for spinning up a Flask, Django, or Go API. I have one Dockerfile template for each of these cases, and I usually need to change it just a little for the specific project I am working on.
Primarily, I am a systems engineer. So I work not just on APIs, but on medium-to-large-scale software systems whose parts interact with each other. When I say "software systems" I also include databases, cloud services, and several other things.
In these cases I still use Docker, though usually I end up with a docker-compose file to handle a network of containers.
If you are a data scientist or a machine learning developer, you almost certainly use Jupyter and/or RStudio.
These are great tools for reproducibility too! I often find myself cloning somebody's repository to run their JupyterLab setup locally and quickly prototype something.
Maybe that was your repo!
What are your thoughts on reproducibility?