How Can Data Scientists Write Production-quality Machine Learning Code?

Data science is a field that encapsulates a range of skills. However, coding can be an underrated tool in the data scientist’s toolbox. To derive optimal results, a data scientist may choose to focus on the mathematical or conceptual component of a data science task. This strategy can pay off for tasks that centre on research or on testing models. In commercial production environments, however, the picture may be entirely different.

When it comes to one-off analysis, code quality may not be very important. However, when developing analytics applications that will be used by actual users and mission-critical systems, code quality is critical, and even a slight increase in code quality can boost productivity and profits.

A study on the business impact of code quality reported that the Time-in-Development for Alert level [1] codebases is 124% more than that of Healthy level code. [2] Furthermore, technical debt in machine learning piles up so fast that poor-quality code can set even the most experienced teams back by half a year. In a survey of C-level executives conducted by Stripe, it was reported that bad code costs companies $85 billion annually. [3]

In this article, we discuss the value of high-quality machine learning model code in the data science process, and how organisations can enable data scientists to develop production-quality model code.

We start by understanding what makes code high quality.

Four key features of high-quality code

We believe the following four features are of utmost importance to streamline the transition of data science research into production:

1) Reliability: When a model is developed to carry out a specific machine learning task, it is vital that the written code does what it is supposed to do, without any major failures. Unreliable code seriously undermines the validity and efficiency of the model and the predictions it makes. Fixing faulty code can take up a significant part of the development timeline. That is why it is sometimes more beneficial to spend time designing or architecting the code before delving into the actual coding.

2) Availability of documentation: In an industry environment, multiple stakeholders are involved in taking the model from initiation to deployment. High-quality code includes documentation that allows members of different teams to read and understand what a data scientist is aiming to achieve with a given codebase. In the event of a model failure, documentation makes the bug-fixing process easier.

3) Clarity and consistency: Model code that is clear and consistent is easy to read and comprehend. Code that follows a clear and consistent style saves development teams from investing large amounts of time, resources, and energy into understanding and implementing models.

4) Versioning data and code: As the product goes through successive integration phases, it can be hard to keep track of the release-candidate models if no proper versioning system for source code and data is in place. Research shows that companies that adopt some sort of versioning system are able to move from releasing once every three months to releasing hundreds of times per day.
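The four features above can be made concrete with a minimal sketch. It is illustrative only, not a prescribed implementation: the function names `min_max_scale` and `fingerprint` are hypothetical, and hashing the training data is just one simple way to record which data a model was built on. A dedicated versioning tool would normally do this job in practice.

```python
"""Sketch: a small preprocessing helper written with the four features in mind."""
import hashlib
import json
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Scale values linearly into the range [0, 1].

    Documentation: the docstring states the contract for other teams.
    Reliability: invalid input fails loudly instead of producing bad output.

    Raises:
        ValueError: if `values` is empty.
    """
    if len(values) == 0:
        raise ValueError("values must not be empty")
    lowest, highest = min(values), max(values)
    if lowest == highest:
        # A constant column has no spread; return zeros rather than
        # dividing by zero.
        return [0.0 for _ in values]
    span = highest - lowest
    return [(value - lowest) / span for value in values]


def fingerprint(rows: Sequence[Sequence[float]]) -> str:
    """Return a SHA-256 hex digest that identifies this exact dataset.

    Versioning: storing the digest next to a trained model records
    which data the model was built on (a hypothetical convention).
    """
    payload = json.dumps([list(row) for row in rows]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Storing the digest alongside the commit identifier of the code gives a lightweight record of which code and data produced a given release candidate.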

Four pain points of low-quality model production code

1) Increase in development time and resource demands: One of the biggest pain points of model code that does not meet quality standards is the time and resources that must go into refining the codebase. An average developer may spend around 60% of their time on programme comprehension activities. [4] This shows how many developer hours may have to be invested to remedy issues arising from non-standard code, ultimately extending the research-to-production timelines of model-building processes.

2) Difficulty in reproducing models: Producing non-standard code for short-term testing and research purposes may not have long-term repercussions. However, if data scientists are required to repeat a research task, the lack of high-quality code may make it difficult to understand previously completed work and to reproduce model code.

3) Models failing in non-ideal scenarios: Often, testing models in research environments cannot entirely expose code vulnerabilities. In the hands of end-users, however, models may fail on unexpected data or in different environments, or they may take longer to train or make predictions. High-quality code helps insulate models from failure when they are exposed to such non-ideal scenarios.

4) Failure to scale: Even when writing research code or code for a proof of concept (PoC), a data scientist should think about the potential of scaling the code to multiple users and environments. It pays off to spend some initial time on a code and model design that will be easy to put into production and scale.
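Two of the pain points above, reproducibility and failure on unexpected data, lend themselves to a short sketch. The helper names (`validate_features`, `predict_safely`, `reproducible_split`) are hypothetical, and returning a fallback value on invalid input is only one possible policy. Some systems would log the problem or raise an error instead.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def validate_features(features: Sequence[float], expected_length: int) -> None:
    """Raise ValueError if the feature vector is malformed."""
    if len(features) != expected_length:
        raise ValueError(
            f"expected {expected_length} features, got {len(features)}"
        )
    for index, value in enumerate(features):
        if not math.isfinite(value):
            raise ValueError(f"feature {index} is not finite: {value}")


def predict_safely(
    model: Callable[[Sequence[float]], float],
    features: Sequence[float],
    expected_length: int,
    fallback: float,
) -> float:
    """Call the model only on valid input; otherwise return the fallback."""
    try:
        validate_features(features, expected_length)
    except ValueError:
        return fallback
    return model(features)


def reproducible_split(
    n_rows: int, test_fraction: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Return (train, test) row indices; the same seed gives the same split."""
    rng = random.Random(seed)  # local generator, no global state touched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    test_size = round(n_rows * test_fraction)
    cut = n_rows - test_size
    return indices[:cut], indices[cut:]
```

Fixing the seed means a colleague repeating the experiment obtains the same train and test rows, and validating inputs at the prediction boundary keeps malformed production data from reaching the model.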

How do we solve the data science vs. coding dilemma?

At the outset of this article, we mentioned how data scientists may choose to invest their time and effort in the mathematics and data science components of the machine learning lifecycle, rather than in developing high-quality code.

What would be ideal is to enable data scientists to focus on the data science part of model building, while finding a way to write production-quality model code without jeopardising developer time or resources.

evoML is an AI optimisation platform developed by TurinTech which embeds our proprietary research in code optimisation as well as state-of-the-art AI research. It enables businesses to build efficient machine learning models at speed by automating the entire model building process. A key feature of evoML is the ability to provide production-quality code that users can download and embed in their own software systems. Data scientists can use evoML to build, optimise and evaluate machine learning models. The model code can then be downloaded, customised and used by software engineers in production. This is a solution to the dilemma of investing effort in data science tasks vs. investing time to write high-quality model code.

Using evoML to build machine learning models with high-quality model code reduces the back-and-forth between data scientists and software engineers on guided development, shortening the time-to-market of the organisation’s AI/ML projects. It not only enables data scientists to focus more on their scientific tasks but also saves software engineers’ time in deploying the machine learning models into the end product.

Model code generated on evoML is easily customisable. Messy code makes customisation difficult, but with evoML, engineers can spend less time comprehending code and can more easily modify it to fit their preferred tasks. Particularly for projects that are sensitive in nature, evoML gives organisations full ownership of the model code, so they can modify codebases without having to resort to third-party input.

About the Author

Malithi Alahapperuma | TurinTech Technical Writer

Researcher, writer and teacher. Curious about the things that happen at the intersection of technology and the humanities. Enjoys reading, cooking, and exploring new cities.

1 Code health category denoting code with low quality

2 https://arxiv.org/pdf/2203.04374.pdf

3 https://stripe.com/files/reports/the-developer-coefficient.pdf

4 X. Xia, L. Bao, D. Lo, Z. Xing, A. E. Hassan and S. Li, “Measuring Program Comprehension: A Large-Scale Field Study with Professionals,” in IEEE Transactions on Software Engineering, vol. 44, no. 10, pp. 951-976, 1 Oct. 2018, doi: 10.1109/TSE.2017.2734091.


© 2024 · TurinTech AI. All rights reserved.
