At first glance, the deployment of data science seems trivial: just run it on the production server! Closer examination reveals that what was built during data science creation is not what is being put into production.
Think of a chef designing recipes in their experimental kitchen. Similar to the data scientist experimenting in the lab with different data sources and testing and optimizing parameters, the path to the perfect recipe involves the chef trying out new ingredients and optimizing quantities and cooking times.
It is the final result that gets moved into production: the scientist’s best model or the chef’s recipe.
This move into production is where the gap is usually biggest. Why?
Ask yourself, for example, whether you can use the same set of tools for both data science creation and deployment, or whether one of the two setups covers only a subset of the other.
Most tools allow only a subset of possible models to be exported and even ignore certain preprocessing steps completely. Can you deploy automatically into a service (e.g., REST) or scheduled job, or is the deployment only a library/model that needs to be embedded elsewhere?
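To make the preprocessing problem concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and chosen for illustration only: the toy threshold model, the function names such as `export_artifact` and `predict_from_artifact`, and the JSON artifact format. It assumes a model trained on standardized values and compares what happens when the model is applied with and without that standardization step.

```python
import json
from statistics import mean, pstdev


def fit_scaler(values):
    """Preprocessing learned during creation: mean and standard deviation."""
    return {"mean": mean(values), "std": pstdev(values) or 1.0}


def scale(scaler, values):
    """Standardize raw values with the parameters learned during creation."""
    return [(v - scaler["mean"]) / scaler["std"] for v in values]


def predict(scaled_values):
    """Toy model: class 1 when the standardized value is above zero."""
    return [1 if v > 0 else 0 for v in scaled_values]


def export_artifact(scaler):
    """Export the preprocessing parameters together with a model identifier."""
    return json.dumps({"scaler": scaler, "model": "threshold_at_zero"})


def predict_from_artifact(artifact, raw_values):
    """Production side: restore the scaler, then apply the model."""
    bundle = json.loads(artifact)
    return predict(scale(bundle["scaler"], raw_values))


training_values = [10.0, 20.0, 30.0, 40.0]
artifact = export_artifact(fit_scaler(training_values))

new_values = [12.0, 38.0]
# Model applied without its preprocessing: raw inputs are all above zero.
print(predict(new_values))                          # [1, 1]
# Model applied with its preprocessing: inputs are standardized first.
print(predict_from_artifact(artifact, new_values))  # [0, 1]
```

The two calls disagree on the first value, and nothing fails loudly. That is what makes a dropped preprocessing step hard to spot once the model is in production.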
All too often, what is exported is not ready to use but needs to be adjusted manually. For the chef, this is not a huge issue, as the recipe book is updated infrequently, and the chef can spend a day translating the results of the experimentation into a recipe that works in a typical kitchen at home.
For the data science team, this is a much bigger problem. The team needs to be able to update models, deploy new tools, and use new data sources much more frequently, which could easily be on a daily or even hourly basis.
Adding manual steps slows this process to a crawl and allows errors to creep in. Large organizations can’t afford for this to happen, and small to medium-sized businesses even less so.
What kind of strategies can close the gap?
An “integrated deployment” approach helps by bringing the deployment process into the data science cycle. The data scientist can model both creation and production within the same environment by capturing the parts of the process that are needed for deployment.
If the model changes, the necessary adjustments can be made and the revised data science process can be deployed in less than a minute: instantaneous deployment from the exact same environment that was used to create the data science process.
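One way to picture the capturing idea is the rough sketch below. It is not the API of any particular tool: the `Step` class, the `capture` flag, and `capture_for_deployment` are all hypothetical names introduced for illustration. The assumption is that each step in the creation workflow is marked as either needed in production or not, and that the production process is derived from those marks rather than rebuilt by hand.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[list], list]
    capture: bool  # True if the step is also needed in production


def run_all(steps, rows):
    """Run every given step in order, as done during creation."""
    for step in steps:
        rows = step.run(rows)
    return rows


def capture_for_deployment(steps):
    """Derive the production process from the steps flagged for capture."""
    captured = [s for s in steps if s.capture]

    def deployed(rows):
        return run_all(captured, rows)

    return deployed


def drop_missing(rows):
    return [r for r in rows if r is not None]


def inspect(rows):
    # Exploration aid used only while building the process.
    print(f"inspecting {len(rows)} rows")
    return rows


def score(rows):
    # Stand-in for applying the trained model.
    return [r * 2 for r in rows]


workflow = [
    Step("drop_missing", drop_missing, capture=True),
    Step("inspect", inspect, capture=False),
    Step("score", score, capture=True),
]

deployed = capture_for_deployment(workflow)
print(deployed([1, None, 3]))  # [2, 6]
```

Because the production process is derived from the same step definitions, changing `score` or adding a captured step only requires calling `capture_for_deployment` again. There is no separate export to adjust manually.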