The steps to a successful machine learning project

The risk of a project failing is much higher when an organization adopts a new and unfamiliar technology. The benefits of machine learning have only recently become attractive to businesses, and many of them have not yet mastered the methodology that machine learning projects require. These projects therefore have to be executed carefully. This article walks through the steps that will help your firm manage a machine learning project successfully. If you still have questions after reading it, or if you are particularly interested in the topic, feel free to come to our free annual event Innovation Leader 2018 – Machine Learning Business Applications on April 26th.

The danger of being tech-driven

Of course, having the right data and a team with the right skills is important for a machine learning project. These two components do not guarantee success, though. CRISP-DM (the cross-industry standard process for data mining) is of great help when it comes to structuring a machine learning project.

Process diagram showing the relationship between the different phases of CRISP-DM

Source: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf

Quality issues in any one of the steps presented will directly affect the quality of the entire outcome, so here are the essential points to focus on.

Business understanding

Business understanding is very important in machine learning, yet it is often under-invested in.
The danger is to be “tech-driven” instead of “business-driven”. To avoid getting blinded by technological possibilities, start by analyzing the business problem you need to solve instead of investing in resources and infrastructure you may not need.

You can then define the success metrics. These metrics will help you measure the success of the project and will strongly influence the solution you choose. It is recommended to come up with two sets of metrics. The first set is business related (for example, minimizing the amount of time urgent emails wait in the queue before receiving a response). Note that these metrics can also influence which alternatives to machine learning are worth considering. The second set is machine learning related and will help you build and validate models properly. When defining the two sets, always think about how they relate to each other.
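As a rough illustration of keeping the two sets of metrics side by side, the sketch below computes one business metric (average wait of urgent emails) and two machine learning metrics (precision and recall of an urgency classifier) on the same sample. The data, field layout and function names are hypothetical and only serve the example.

```python
from statistics import mean

# Hypothetical labelled sample: (is_urgent, predicted_urgent, minutes_waited)
emails = [
    (True, True, 12),
    (True, False, 95),
    (False, False, 240),
    (True, True, 8),
    (False, True, 15),
]

def recall(rows):
    """Machine learning metric: share of truly urgent emails the model flagged."""
    flagged = [pred for urgent, pred, _ in rows if urgent]
    return sum(flagged) / len(flagged) if flagged else 0.0

def precision(rows):
    """Machine learning metric: share of flagged emails that were truly urgent."""
    truths = [urgent for urgent, pred, _ in rows if pred]
    return sum(truths) / len(truths) if truths else 0.0

def mean_urgent_wait(rows):
    """Business metric: average minutes an urgent email waits for a response."""
    waits = [wait for urgent, _, wait in rows if urgent]
    return mean(waits) if waits else 0.0

urgent_recall = recall(emails)
urgent_precision = precision(emails)
urgent_wait = mean_urgent_wait(emails)
print(f"recall={urgent_recall:.2f} precision={urgent_precision:.2f} "
      f"wait={urgent_wait:.1f} min")
```

In this toy sample the one urgent email the model missed is also the one that waited longest, which is the kind of link between the two sets of metrics worth watching.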

Because business understanding is the key to a good start in your machine learning project, later in this article we list the questions you should ask yourself before investing serious effort in the next steps.

Identification of the data sets required

You will then need to clearly identify the data sets required to solve the problem you analyzed in the first step. To do so, you will need the help of your data strategist, who has the expertise to understand the identified problem and will make sure your firm has the right data.

Choose the product solution and determine the right architecture

Once you have identified the data sets needed, it’s time to figure out which product solution should be built. After that, with the help of your data engineer, you will determine how best to stream data into your platform depending on your available infrastructure and technology.

Work on your data

Data is the foundation of any machine learning project, so you have to take good care of it.
Data formatting, data cleaning (usually 90% of the job), data anonymization (e.g. when working with healthcare and banking data) and similar preparation work will be the key to getting more precise results from an applied machine learning model. Dr. Sébastien Foucaud, former Head of Data Innovation at Naspers, recommends reducing the number of free features with feature engineering if you have people who can do it (it requires advanced domain knowledge). This is essential to reduce the complexity of models.
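To make the preparation steps more concrete, here is a minimal sketch of formatting, cleaning and replacing a direct identifier on a handful of records. The records, field names and salt are hypothetical. Note that a salted hash is strictly speaking pseudonymization rather than full anonymization, so regulated data may need stronger measures.

```python
import hashlib

# Hypothetical raw records; field names are illustrative only.
raw_records = [
    {"customer_id": "C-001", "age": " 34 ", "country": "be"},
    {"customer_id": "C-002", "age": "", "country": "BE "},
    {"customer_id": "C-003", "age": "51", "country": "Fr"},
]

# Assumption: in practice the salt is a secret stored outside the code base.
SALT = "replace-with-a-secret-value"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a truncated salted SHA-256 digest."""
    return hashlib.sha256((SALT + identifier).encode("utf-8")).hexdigest()[:16]

def clean(record):
    """Return a formatted, pseudonymized record, or None if the age is unusable."""
    age_text = record["age"].strip()
    if not age_text.isdigit():
        return None  # cleaning: drop rows without a usable age
    return {
        "customer_key": pseudonymize(record["customer_id"]),
        "age": int(age_text),                          # formatting: text to integer
        "country": record["country"].strip().upper(),  # formatting: consistent codes
    }

cleaned = [row for row in map(clean, raw_records) if row is not None]
print(cleaned)
```

Dropping incomplete rows is only one possible cleaning policy; depending on the problem, filling in missing values may be the better choice.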

Build a simple model

When the time comes to build the right model for your project, it is often advised to keep it simple. As the first model provides the biggest boost to your product, it doesn’t need to be fancy. Moreover, simple models are usually easier to debug, easier to explain and easier to implement. So choose a simple one that suits your business needs sufficiently rather than a fancy one with a very high level of accuracy.
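As an example of how small a first model can be, the sketch below fits a single-threshold rule (sometimes called a decision stump) on one feature. The training data and the "urgency keyword count" feature are hypothetical; the point is only that such a model is easy to debug and to explain.

```python
# Hypothetical training data: (number of urgency keywords in the email, is_urgent)
training = [
    (0, False), (1, False), (1, False), (2, True),
    (3, True), (4, True), (2, False),
]

def fit_threshold(samples):
    """Pick the threshold t that maximizes accuracy of the rule: urgent if count >= t."""
    best_threshold, best_accuracy = 0, -1.0
    for threshold in sorted({count for count, _ in samples}):
        correct = sum((count >= threshold) == label for count, label in samples)
        accuracy = correct / len(samples)
        if accuracy > best_accuracy:  # ties keep the lower threshold
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy

threshold, train_accuracy = fit_threshold(training)

def predict(keyword_count):
    """Apply the learned rule to a new email."""
    return keyword_count >= threshold

print(f"urgent if keyword count >= {threshold} "
      f"(training accuracy {train_accuracy:.2f})")
```

The whole model is one number, so anyone on the team can read it, question it and compare later, more complex models against it.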

Build the product

The key element at this stage is the interaction between the members of your team (product manager, software engineers, data scientists). Dr. Sébastien Foucaud advises opting for lean development and a small, minimum viable product (MVP) so that you can iterate quickly.

Test and adjust

When you launch a product, the best way to proceed is to set a hypothesis about the impact it will have and to measure success by collecting more data.
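One common way to check such a hypothesis, shown here as a sketch rather than a prescribed method, is to compare a success rate before and after launch (or between a control group and a test group) with a two-proportion z-test. The counts below are hypothetical.

```python
from math import erf, sqrt

def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for H0: both groups share the same success rate."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_error
    # Standard normal CDF expressed through the error function
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value

# Hypothetical counts: urgent emails answered within one hour,
# group A without the new product and group B with it.
z_score, p_value = two_proportion_z_test(420, 1000, 480, 1000)
print(f"z={z_score:.2f}, p={p_value:.4f}")
```

A small p-value suggests the observed difference is unlikely to be chance alone; whether the difference is large enough to matter is a business question that goes back to the success metrics defined earlier.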

Questions that can help you make a good start

We saw that business understanding, the first phase in the CRISP-DM process diagram, is often under-invested in machine learning projects. Shahar Cohen, co-founder of YelowRoad, came to the same conclusion while advising many organizations on machine learning and running such projects within his own firm.

To avoid losing time and money, Shahar Cohen and his team developed a list of questions they use in every machine learning project. They do not invest serious effort in the following steps until they have good answers to these questions:

  • What are we trying to achieve, business wise? Why is it important?
  • What are the inputs and outputs for the task that we are trying to solve?
  • Given a hypothetical solution to that task, how would it affect our operations? (Another way to ask this question: assuming that I have a perfect solution to your machine learning task, how will you use it?)
  • Do we already have the ability to act based on such a solution, or do we also need to develop that ability? (If the ability is there, study it carefully. If not, keep in close contact with the team that is responsible for developing it.)
  • How are we going to measure a suggested solution? (KPIs)
  • What would make it a success?
  • Do we have the input data available? How hard is it to extract? Are we allowed to use it?
  • Are we experienced in building similar solutions? Do we understand what it takes?
  • Do we have hard budget and timeline constraints?
  • Who will develop the solution? Do we have the required skills in-house?

Conclusion