Towards Jupyter Notebook to Structured Code Repository Conversion

Jupyter Notebook is a popular computational notebook for exploratory analysis because it natively keeps code and results side by side and offers rich documentation features. This interactivity gives data scientists the flexibility that exploratory work needs, but notebooks tend to become messy over time because no standard guidelines are followed during the analysis. This becomes a problem once a notebook is stable and has to be converted into a structured repository so that a software component can be developed from it. The objective of this study is to identify the common messes encountered during analysis in a Jupyter notebook and the best practices for the different phases of an exploratory analysis, so that the gap between a notebook and a structured repository can be bridged by applying the best practices and avoiding the common messes.

Through a systematic literature review, this study identified a set of common messes and best practices across multiple phases of exploratory analysis, and proposed a conceptual architecture for an assistive tool that generates a repository from a stable notebook, together with the tool's possible limitations, pros, and cons. Data scientists encounter common issues when working inside a Jupyter notebook, mainly because of their coding practices and the limitations of the notebook's kernel. Reproducibility is the cornerstone that must be preserved when a notebook is converted to a repository. A set of guidelines needs to be followed during analysis to preserve reproducibility and to avoid common messes while working in a Jupyter notebook. Further studies are required to implement and evaluate the proposed assistive tool.
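As one illustration of how kernel behaviour can undermine reproducibility, cells that were run out of order (or never run) leave a notebook whose top-to-bottom execution may not match what the author saw. The sketch below is not the tool proposed in the thesis; it is a minimal heuristic, assuming the standard nbformat 4 JSON layout in which each code cell carries an `execution_count`. The function name `execution_issues` and the example notebook are hypothetical.

```python
def execution_issues(notebook):
    """Return (cell_index, reason) pairs for code cells that hint at hidden state.

    Heuristic only: a code cell is flagged if it has no execution count, or if
    its count is not greater than the highest count seen in an earlier cell.
    """
    issues = []
    last_count = 0
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no execution order
        count = cell.get("execution_count")
        if count is None:
            issues.append((index, "never executed"))
        elif count <= last_count:
            issues.append((index, "executed out of order"))
        else:
            last_count = count
    return issues


# Hypothetical in-memory notebook: the helper `load` was defined (count 2)
# in a cell that sits below the cell that calls it (count 3).
example = {
    "cells": [
        {"cell_type": "code", "execution_count": 1, "source": "import pandas as pd"},
        {"cell_type": "markdown", "source": "Load the data"},
        {"cell_type": "code", "execution_count": 3, "source": "df = load()"},
        {"cell_type": "code", "execution_count": 2, "source": "def load(): ..."},
        {"cell_type": "code", "execution_count": None, "source": "df.plot()"},
    ]
}

print(execution_issues(example))
```

A real `.ipynb` file could be read with `json.load` and passed to the same function. A check of this kind only signals that a notebook may not rerun cleanly from top to bottom; it does not by itself establish reproducibility.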

Project information

Status:

Finished

Thesis for degree:

Master

Student:

Mehtanin Rashikh

Supervisor:
Part of research project:

SE4ML - Processes, People and Tools

Id:

2022-010