What are some key ways to automate and optimize data science processes?

Q:

What are some key ways to automate and optimize data science processes?

A:

Data science processes in the context of machine learning and AI can be divided into four distinct phases:

  1. data acquisition and exploration,
  2. model building,
  3. model deployment and
  4. online evaluation and refinement.

In my experience, the phases that most often slow down a machine-learning-based data science process are data acquisition and model deployment. Here are two ways to optimize them:

1. Establish a highly accessible datastore.

In most organizations, data is not stored in one central location. Take customer information as an example: you have customer contact information, customer support emails, customer feedback and, if your business is a web application, customer browsing history. This data is naturally scattered because each piece serves a different purpose. The pieces may reside in different databases; some may be fully structured, others unstructured, and some may even be stored as plain text files.

Unfortunately, this scattering severely limits data science work, because data is the basis of all NLP, machine learning and AI problems. Having all this data in one place, the datastore, is therefore paramount to accelerating model development and deployment. Because this is a crucial piece of every data science process, organizations should hire qualified data engineers to help build their datastores. A datastore can start off as simple data dumps into one location and gradually grow into a well-thought-out data repository that is fully documented and queryable, with utility tools to export subsets of data into different formats for different purposes.
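To make the "simple data dumps into one location" starting point concrete, here is a minimal sketch in Python. It assumes a single SQLite database as the datastore, and the table layout and the function names (`create_datastore`, `dump_records`, `export_subset`) are hypothetical choices for illustration, not a recommended schema. A real datastore would likely need richer metadata and access controls.

```python
import csv
import io
import json
import sqlite3


def create_datastore(path=":memory:"):
    """Open (or create) one SQLite database acting as the central datastore."""
    conn = sqlite3.connect(path)
    # Hypothetical layout: one row per record, tagged with its source system.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "customer_id TEXT, source TEXT, payload TEXT)"
    )
    return conn


def dump_records(conn, source, rows):
    """Dump (customer_id, text) pairs from one source system into the datastore."""
    conn.executemany(
        "INSERT INTO records (customer_id, source, payload) VALUES (?, ?, ?)",
        [(customer_id, source, text) for customer_id, text in rows],
    )
    conn.commit()


def export_subset(conn, source, fmt="json"):
    """Export all records from one source as a JSON or CSV string."""
    rows = conn.execute(
        "SELECT customer_id, payload FROM records WHERE source = ? ORDER BY rowid",
        (source,),
    ).fetchall()
    if fmt == "json":
        return json.dumps([{"customer_id": c, "payload": p} for c, p in rows])
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(["customer_id", "payload"])
        writer.writerows(rows)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```

With this in place, support emails and feedback that used to live in separate systems can each be loaded with `dump_records` and later pulled back out with `export_subset(conn, "feedback", fmt="csv")` in whatever format a given experiment needs.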

2. Expose your models as a service for seamless integration.

In addition to enabling access to data, it’s also important to be able to integrate the models developed by data scientists into the product. It can be extremely difficult to integrate models developed in Python with a web application that runs on Ruby. Moreover, the models may have many data dependencies that your product may not be able to provide.

One way to deal with this is to set up a strong infrastructure around your model and expose just enough functionality for your product to use the model as a “web service.” For example, if your application needs sentiment classification on product reviews, all it should need to do is invoke the web service with the relevant text; the service returns the appropriate sentiment classification, which the product can use directly. This way, the integration takes the form of a simple API call. Decoupling the model from the product that uses it also makes it easy for new products to use these models with little hassle.
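As a rough illustration of the sentiment example, the sketch below wraps a model behind a small HTTP endpoint using only the Python standard library. Everything named here is an assumption made for illustration: the `/sentiment` path, the JSON request and response shapes, and especially `classify_sentiment`, which is a toy keyword counter standing in for a real trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical placeholder model: a real service would load a trained classifier.
POSITIVE_WORDS = {"great", "good", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible"}


def classify_sentiment(text):
    """Toy keyword-count classifier standing in for the real model."""
    words = text.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def handle_payload(raw_body):
    """Turn a JSON request body into an (HTTP status, JSON response) pair."""
    try:
        text = json.loads(raw_body)["text"]
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "expected a JSON object with a 'text' field"})
    if not isinstance(text, str):
        return 400, json.dumps({"error": "'text' must be a string"})
    return 200, json.dumps({"sentiment": classify_sentiment(text)})


class SentimentHandler(BaseHTTPRequestHandler):
    """Exposes the model at POST /sentiment (an assumed path)."""

    def do_POST(self):
        if self.path != "/sentiment":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def run_server(port=8000):
    """Serve requests on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), SentimentHandler).serve_forever()
```

Calling `run_server()` starts the service. A product written in Ruby, or any other language, would then only need to POST a body such as `{"text": "I love this phone"}` to `/sentiment` and read the `sentiment` field of the response, without knowing anything about how the model was built.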

Now, setting up the infrastructure around your model is a whole other story and requires a heavy initial investment from your engineering teams. Once the infrastructure is there, it’s just a matter of building models in a way that fits into the infrastructure.

Written by Kavita Ganesan

Kavita Ganesan is currently a Senior Data Scientist at GitHub. She has spent over a decade building scalable Natural Language Processing, Machine Learning and Search systems. She is highly passionate about helping companies leverage their unstructured data to improve their products and services with AI-driven models and insights. She also actively writes articles related to Natural Language Processing and Text Mining and speaks at industry conferences. Kavita holds a Ph.D. in Text Mining, Analytics and Search from the University of Illinois at Urbana-Champaign.