During the last decade, data science projects in the enterprise have developed a reputation for being complex and expensive. However, the last few years have seen an explosion in new machine learning and big data infrastructure technologies that have helped lower the entry point for implementing data science solutions in the enterprise. Despite the technical evolution, enterprise data science projects remain relatively complex compared to traditional areas of investment in enterprise IT.
Similar to other groundbreaking technologies in enterprise IT, implementing successful data science solutions requires a combination of strong processes, delivery methodologies and technologies. Our experience implementing dozens of successful enterprise data science and machine learning solutions has allowed us to develop a certain perspective about the patterns we think help optimize the success of data science projects in the enterprise. The following list provides a short summary of best practices in enterprise data science projects. Some of them might seem trivial, but they can be difficult to enforce in real-world implementations.
Build For the Future: Build on Technologies You Can Innovate Upon
Data science platforms are one of the fastest growing areas in the technology ecosystem. As a result, new platforms, machine learning algorithms, data visualization technologies, etc. are constantly emerging, bringing new value propositions to enterprise solutions. Additionally, the requirements for enterprise data science solutions are constantly changing based on new market trends.
Building on a technology stack that facilitates innovation, extensibility and scalability is essential to the success of enterprise data science projects. In that sense, when selecting a data science platform, organizations should evaluate not only its technical capabilities but also complementary factors such as developer community, open source contributions and talent availability.
No Model is Right: Implement Various Models for the Same Scenario
One of the most common mistakes in machine learning projects is deciding on a specific prediction or classification algorithm before implementing the solution. Many times, the optimal algorithm is not discovered until several models are tested and evaluated with the real data. In that sense, it is a good practice to implement the first iteration of the solution by running several machine learning algorithms concurrently and comparing the results over time.
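As a rough illustration of comparing several models on the same scenario, the following sketch trains three toy candidates on the same training data and ranks them by error on the same held-out data. Everything here is hypothetical and chosen only for illustration: the `train_*` functions, the `CANDIDATES` registry, the `compare_models` helper and the choice of mean absolute error as the metric. A real project would plug in estimators from its own machine learning stack and repeat the comparison as new data arrives.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical convention: a trainer takes training inputs and targets and
# returns a predictor function. Real projects would wrap library estimators.
Predictor = Callable[[float], float]
Trainer = Callable[[Sequence[float], Sequence[float]], Predictor]


def train_mean(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Baseline: always predict the average training target."""
    avg = mean(ys)
    return lambda x: avg


def train_linear(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    var = sum((x - x_bar) ** 2 for x in xs)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = cov / var if var else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def train_nearest(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Predict the target of the closest training input."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mean_absolute_error(predict: Predictor,
                        xs: Sequence[float],
                        ys: Sequence[float]) -> float:
    return mean(abs(predict(x) - y) for x, y in zip(xs, ys))


def compare_models(candidates: Dict[str, Trainer],
                   train: Tuple[Sequence[float], Sequence[float]],
                   test: Tuple[Sequence[float], Sequence[float]]
                   ) -> List[Tuple[str, float]]:
    """Train every candidate on the same data and rank by held-out error."""
    train_x, train_y = train
    test_x, test_y = test
    scores = []
    for name, trainer in candidates.items():
        predictor = trainer(train_x, train_y)
        scores.append((name, mean_absolute_error(predictor, test_x, test_y)))
    # Lowest error first.
    return sorted(scores, key=lambda item: item[1])


CANDIDATES: Dict[str, Trainer] = {
    "mean": train_mean,
    "linear": train_linear,
    "nearest": train_nearest,
}

if __name__ == "__main__":
    ranking = compare_models(
        CANDIDATES,
        train=([0, 1, 2, 3, 4], [1, 3, 5, 7, 9]),
        test=([5, 6], [11, 13]),
    )
    for name, error in ranking:
        print(f"{name}: mean absolute error = {error:.2f}")
```

Keeping the candidates behind a common interface is the design choice that matters here: adding or retiring an algorithm becomes a one-line change to the registry, and the ranking can be recomputed whenever fresh data is available.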
Continuous Data Science: Deliver Results Every Week and the First MVP in a Month
Enterprise data science projects are notorious for taking a long time and being extremely expensive. Also, it is not uncommon for stakeholders to wait months before seeing the first results of a data science solution, which, more often than not, need to be improved. To mitigate some of those challenges, we always recommend structuring projects in a way that delivers weekly results to stakeholders.
In addition to delivering weekly results, we always recommend focusing on delivering a minimum viable product (MVP) within the first month of the project. Sometimes, this model requires cutting a few corners on the infrastructure side in the early days, but it ensures constant feedback from the end users, which helps to continuously improve the data science solution.
Test Test Test: Make the Models Testable
Complementing the previous point, it is very important to provide mechanisms to continuously test and validate machine learning algorithms even when the solution is running in production. Building tests for models is an often overlooked aspect of enterprise data science projects, but one that becomes critical to the evolution of the solution.
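One possible way to make a model testable is to keep a set of labeled examples and re-run the model against it on a schedule or before every release. The sketch below assumes a classifier exposed as a plain function. The `validate_model` helper, the `ValidationReport` type and the `min_accuracy` threshold are hypothetical names introduced for illustration, and accuracy is only one of many metrics a real project might check.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class ValidationReport:
    accuracy: float
    passed: bool
    # Each failure is (features, expected_label, predicted_label).
    failures: List[Tuple[Any, Any, Any]] = field(default_factory=list)


def validate_model(predict: Callable[[Any], Any],
                   examples: Sequence[Tuple[Any, Any]],
                   min_accuracy: float = 0.9) -> ValidationReport:
    """Score a model on labeled examples and compare against a threshold."""
    if not examples:
        raise ValueError("at least one labeled example is required")

    failures = []
    for features, expected in examples:
        predicted = predict(features)
        if predicted != expected:
            failures.append((features, expected, predicted))

    accuracy = 1.0 - len(failures) / len(examples)
    return ValidationReport(
        accuracy=accuracy,
        passed=accuracy >= min_accuracy,
        failures=failures,
    )


if __name__ == "__main__":
    # Toy classifier: flags a score of 0.5 or higher as positive.
    def toy_model(score: float) -> bool:
        return score >= 0.5

    labeled = [(0.9, True), (0.1, False), (0.6, True), (0.4, True)]
    report = validate_model(toy_model, labeled, min_accuracy=0.9)
    print(f"accuracy={report.accuracy:.2f} passed={report.passed}")
    for features, expected, predicted in report.failures:
        print(f"  input={features!r} expected={expected!r} got={predicted!r}")
```

Returning the failing examples alongside the score gives the team something concrete to investigate when a check fails, rather than just a number that dropped.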
Monitor Everything: Implement Operational Monitoring in Your Data Science Solutions
Monitoring the execution of machine learning models, data inputs and outputs, model failures, etc. becomes essential for the production readiness of an enterprise data science project. In that sense, IT organizations should consider implementing the correct operational monitoring and instrumentation infrastructure as part of any data science project. While conceptually obvious, incorporating these capabilities in a data science solution is far from trivial, as most operational monitoring platforms are still not integrated with machine learning and data science stacks.
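As a minimal sketch of what such instrumentation could look like, the wrapper below records call counts, failures and latency for any prediction function, and logs inputs and outputs through Python's standard `logging` module. The `MonitoredModel` and `ModelMetrics` names are hypothetical and only meant for illustration. In a real deployment the collected metrics would be forwarded to whatever monitoring platform the organization already operates.

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List

logger = logging.getLogger("model_monitor")


@dataclass
class ModelMetrics:
    calls: int = 0
    failures: int = 0
    latencies_ms: List[float] = field(default_factory=list)


class MonitoredModel:
    """Wraps a prediction function and records basic operational metrics."""

    def __init__(self, name: str, predict: Callable[[Any], Any]) -> None:
        self.name = name
        self._predict = predict
        self.metrics = ModelMetrics()

    def predict(self, features: Any) -> Any:
        self.metrics.calls += 1
        start = time.perf_counter()
        try:
            result = self._predict(features)
        except Exception:
            # Count and log the failure, then let the caller handle it.
            self.metrics.failures += 1
            logger.exception("model=%s failed input=%r", self.name, features)
            raise
        finally:
            # Latency is recorded for successful and failed calls alike.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.metrics.latencies_ms.append(elapsed_ms)
        logger.info("model=%s input=%r output=%r", self.name, features, result)
        return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    model = MonitoredModel("demo", lambda x: 1 / x)
    model.predict(2)
    try:
        model.predict(0)  # Raises ZeroDivisionError, counted as a failure.
    except ZeroDivisionError:
        pass
    print(model.metrics.calls, model.metrics.failures)
```

Wrapping the model rather than modifying it keeps the monitoring concern separate, so the same wrapper can be applied to every model in the solution.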
Start Small, Fail Fast and Iterate
Machine learning and data science solutions are new initiatives for most enterprises, and ones that require new skillsets and practices. In that sense, it is important to approach these projects in a highly iterative manner and to allocate room for initial failures. While the limitations of legacy data science technology stacks prevented organizations from applying agile and lean development practices to data science projects, this is no longer the case. Today, most modern data science and machine learning stacks provide enough capabilities to allow organizations to start delivering results extremely fast with a minimal investment.