Towards Automatic Jupyter Notebook to Structured Code Repository Conversion

currently reserved

Motivation

To tackle a given ML problem, researchers experiment with potential solutions. Their preferred tool is the Jupyter Notebook, which allows fast, interactive prototyping by combining text, executable code blocks, code outputs, and data visualizations in one document.

However, a notebook cannot easily be used in production. Once a notebook is finished, its code must be transferred manually to a structured repository and refactored according to activities such as preprocessing, training, and predicting. This procedure is not only laborious but also prone to human error. Developers must also guess which versions of the required Python packages were used.

Goal

The goal of this thesis is to create a tool that converts a Jupyter notebook into a structured code repository (using data science repository templates, such as Cookiecutter). In the process, the code must be refactored based on phases, artifacts, and visualizations.
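
A first step toward such a converter could look like the following minimal sketch. It assumes the notebook has already been loaded as a dictionary in the nbformat 4 JSON layout (for example with json.load) and uses only the standard library. The helper names code_cells and imported_packages are hypothetical, and dropping lines that start with % or ! is a simplification for IPython magics and shell escapes.

```python
import ast


def code_cells(notebook: dict) -> list[str]:
    """Return the source text of every code cell in a notebook dict."""
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        # The source is stored either as one string or as a list of lines.
        sources.append("".join(src) if isinstance(src, list) else src)
    return sources


def imported_packages(source: str) -> set[str]:
    """Collect the top-level package names imported by one cell."""
    # Simplification: magics and shell escapes are not valid Python, so drop them.
    lines = [
        line for line in source.splitlines()
        if not line.lstrip().startswith(("%", "!"))
    ]
    tree = ast.parse("\n".join(lines))
    packages = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            packages.add(node.module.split(".")[0])
    return packages


# Hypothetical in-memory notebook; a real one would come from json.load(...).
example_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Load data"]},
        {
            "cell_type": "code",
            "source": [
                "%matplotlib inline\n",
                "import pandas as pd\n",
                "from sklearn.model_selection import train_test_split\n",
            ],
        },
        {"cell_type": "code", "source": "df = pd.read_csv('data.csv')"},
    ]
}

cells = code_cells(example_notebook)
packages = set().union(*(imported_packages(cell) for cell in cells))
print(sorted(packages))
```

The collected names are import names, not installable distributions. Mapping them to package names and pinned versions is a separate step that this sketch does not attempt.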

 

Tasks

  • Developing a strategy to assign code statements to phases, e.g., by designing clustering or pattern matching techniques
  • Designing required refactoring strategies
  • Implementing an automatic Jupyter notebook to repository (e.g., Cookiecutter) converter
  • Evaluating the prototype on provided real-world Jupyter notebooks
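
As an illustration of the pattern matching option named in the first task, the following sketch assigns each top-level statement of a cell to a phase based on the functions and methods it calls. The keyword table PHASE_PATTERNS and the helper names are assumptions made for this example and are not part of the thesis description. A real tool would likely need a much richer rule set or a learned model.

```python
import ast

# Hypothetical keyword table: phase name -> call names that hint at it.
# Phases are checked in this order; the first match wins.
PHASE_PATTERNS = {
    "preprocessing": {"read_csv", "dropna", "fillna", "fit_transform", "train_test_split"},
    "training": {"fit", "compile"},
    "predicting": {"predict", "predict_proba"},
    "visualization": {"plot", "show", "hist", "scatter"},
}


def called_names(stmt: ast.stmt) -> set[str]:
    """Return the names of all functions and methods called in a statement."""
    names = set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                names.add(func.attr)  # e.g. model.fit(...) -> "fit"
            elif isinstance(func, ast.Name):
                names.add(func.id)  # e.g. train_test_split(...)
    return names


def assign_phases(source: str) -> list[tuple[str, str]]:
    """Label each top-level statement with the first phase whose keywords match."""
    labelled = []
    for stmt in ast.parse(source).body:
        names = called_names(stmt)
        phase = next(
            (name for name, keywords in PHASE_PATTERNS.items() if names & keywords),
            "unassigned",
        )
        labelled.append((phase, ast.get_source_segment(source, stmt)))
    return labelled


cell = """df = pd.read_csv("data.csv")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
plt.plot(y_pred)
print(df.shape)
"""

for phase, code in assign_phases(cell):
    print(f"{phase:>14}: {code}")
```

Matching on call names alone ignores data flow, so a statement that only prepares a variable for a later phase stays unassigned. Handling such cases is part of what the assignment and refactoring strategies would have to address.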

 

Required Knowledge

  • Python (intermediate)
  • Data Science basics

 

Literature

Lanubile, F., Calefato, F., Quaranta, L., Amoruso, M., Fumarola, F., & Filannino, M. (2021). Towards Productizing AI/ML Models: An Industry Perspective from Data Scientists. 129–132.

Venkatesh, A. P. S., & Bodden, E. (2021). Automated cell header generator for Jupyter notebooks. AISTA 2021 – Proceedings of the 1st ACM International Workshop on AI and Software Testing/Analysis, Co-Located with ECOOP/ISSTA 2021, 17–20.

  • Type: Master Thesis
  • Status: Open

Supervisor