So recently, I have started working with DVC to create reproducible pipelines. In my previous post about DVC I wrote about dealing with multiple different types of models in one repository.

In this article, I would like to look at another thing I ran into when I started working with DVC: at least when setting up the different stages, it requires quite a bit of

typing....

Especially if you are creating stages that depend on the outputs of one or more previous stages, have one or more metrics, use multiple parameters, and finally also produce one or more output files. In this case, you will have to either create a very long dvc run -n STAGE_NAME command or provide the details in the dvc.yaml file.

As you can see in my DVC test project on GitHub, I am using separate python scripts in the dvc_scripts folder. Every stage that I add to my dvc.yaml file gets its own file in this folder. This allows me to easily separate the dvc related code that executes the stages from the actual implementation (which can be found in the src folder). The actual implementations can then also be used in Jupyter notebooks when I am playing around with different things.
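To make this a bit more concrete, for the train_test_split stage discussed below the relevant part of the repository looks roughly like this (other stages and files are omitted, so treat this as a sketch rather than the full layout):

dvc_scripts/
  train_test_split.py          <- thin CLI wrapper, referenced from dvc.yaml
src/
  train_test_split/
    train_test_split.py        <- actual implementation, also usable from notebooks
data/
  raw/
    raw_data.pkl               <- raw input data
  train_test_split/
    train.pkl                  <- outputs written by the stage
    test.pkl
dvc.yaml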

In order to add a new stage train_test_split, which depends on a raw input file and generates the two files train.pkl and test.pkl based on parameters for the random state and the size of the test set, I need to run the following command:

dvc run -n train_test_split \
  -d ./dvc_scripts/train_test_split.py \
  -d ./src/train_test_split/train_test_split.py  \
  -d ./data/raw/raw_data.pkl  \
  -p train_test_split.random_state,train_test_split.test_size \
  -o ./data/train_test_split/train.pkl \
  -o ./data/train_test_split/test.pkl \
  python dvc_scripts/train_test_split.py  \
    --raw-input-file ./data/raw/raw_data.pkl  \
    --train-output-file ./data/train_test_split/train.pkl \
    --test-output-file ./data/train_test_split/test.pkl

As you can see from the above, the command that I want to use for this stage is the train_test_split.py script from the dvc_scripts folder. Therefore, I should add this script to the list of dependencies for the stage (i.e. if I make any change to this file, it should invalidate the current stage). This CLI script itself calls the file src/train_test_split/train_test_split.py, which holds the actual implementation of the train test split stage, so that file also needs to be part of the dependencies.

Besides that, there are the parameters for this stage, as well as the outputs that it generates.

As mentioned before, this requires quite a bit of typing AND I should not forget any of the python scripts that the stage depends on, which is a mistake that is easily made. In order to solve this problem, I go back to what I like doing most: AUTOMATE things :)

Automating things...

So the python script for each stage in the dvc_scripts folder knows everything about the input, output, and metrics files, because I provide them as commandline arguments. For example, if I run dvc_scripts/train_test_split.py with the --help argument, you will see the following:

$ python dvc_scripts/train_test_split.py  --help 

usage: train_test_split.py [-h] --raw-input-file file --train-output-file file
                           --test-output-file file

optional arguments:
  -h, --help            show this help message and exit
  --raw-input-file file
                        Name of the raw input file
  --train-output-file file
                        Name of the train output file
  --test-output-file file
                        Name of the test output file
  --show-dvc-add-command
                        After the script is done, it will show the command you
                        will need to use for adding stage                      

As you can see, I have introduced commandline arguments for each of the items that are relevant:

  • The file holding the raw input
  • Where to write the training subset to
  • Where to write the test subset to

Generating the dvc run -n STAGE command

You also see that I have one more argument, namely --show-dvc-add-command. This is the automation part: by adding this argument, the python script will use all of the information provided in the commandline arguments to build up the string that you have to add as an argument to the dvc run -n STAGE command.
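The exact wrapper code is not shown here, but a minimal sketch of how such a script could define these arguments with the standard argparse module (so that it produces roughly the help output shown above) looks like this:

import argparse


def build_parser():
    # One argument per input/output file, plus the flag that triggers the
    # generation of the dvc run commandline
    parser = argparse.ArgumentParser()
    parser.add_argument("--raw-input-file", metavar="file", required=True,
                        help="Name of the raw input file")
    parser.add_argument("--train-output-file", metavar="file", required=True,
                        help="Name of the train output file")
    parser.add_argument("--test-output-file", metavar="file", required=True,
                        help="Name of the test output file")
    parser.add_argument("--show-dvc-add-command", action="store_true",
                        help="After the script is done, it will show the command "
                             "you will need to use for adding stage")
    return parser

Because the wrapper receives every input and output location this way, it already has all the information needed to reconstruct the dvc run arguments.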

If we execute the dvc_scripts/train_test_split.py as follows:

$ python dvc_scripts/train_test_split.py  \
    --raw-input-file ./data/raw/raw_data.pkl  \
    --train-output-file ./data/train_test_split/train.pkl \
    --test-output-file ./data/train_test_split/test.pkl \
    --show-dvc-add-command

we will get the following output:

Size train set: 712
Size test set: 179


Please copy paste the items below AFTER command to create stage:
    dvc run -n STAGE_NAME \

  -d ./data/raw/raw_data.pkl \
  -d dvc_scripts/train_test_split.py \
  -d src/train_test_split/train_test_split.py \
  \
  -o ./data/train_test_split/train.pkl \
  -o ./data/train_test_split/test.pkl \
  \
  -p train_test_split \
  \
  \
  python dvc_scripts/train_test_split.py \
    --raw-input-file ./data/raw/raw_data.pkl \
    --train-output-file ./data/train_test_split/train.pkl \
    --test-output-file ./data/train_test_split/test.pkl

The first two lines give information about the split between train and test. After that, you get the text to add after a dvc run statement. As you can see, the first dependency on the raw_data.pkl file is taken from the commandline, but the other two are determined automatically while the script runs.

You can now copy these lines and easily generate your new stage based on the way you call the script in the dvc_scripts folder.

Implementation of the generator function

The implementation of the function uses the importlib python module and adds all imported modules that live under the current working directory to the list of dependencies.

import importlib
import os
import sys


def get_dvc_stage_info(deps, outputs, metrics, params, all_args):
    """Build up the commandline that should be added after dvc run -n STAGE
    for the provided dependencies, outputs, metrics, parameters, and arguments
    provided to the script.

    This function will also add all python script source files that are imported
    at runtime to the list of dependencies

    Args:
        deps: list of manual dependencies (i.e. not code) that should be added with a
            -d argument to the dvc run commandline. Typically these will be the input
            files that this script depends on, as the python scripts will be
            automatically determined
        outputs: list of all the output files that should be added with a -o argument
            to the dvc run commandline
        metrics: list of the metrics files that should be added with a -M argument
            to the dvc run commandline
        params: list of the parameters that should be added with a -p argument to the
            dvc run commandline.
        all_args: List of all of the commandline arguments that were given to this
            dvc_script script. All of these arguments are used to build up the final
            cmd provided as argument to the dvc run commandline

    Returns:
        String holding the complete text that should be added directly after
        dvc run -n STAGE_NAME
    """

    python_deps = []
    _modules = sorted(list(set(sys.modules)))

    # For each unique module, we try to import it and get the file in which
    # it is defined. For some of the builtins this will fail, hence the except
    for i in _modules:
        try:
            imported_lib = importlib.import_module(i)
            if not os.path.relpath(imported_lib.__file__, "").startswith(".."):
                python_deps.append(os.path.relpath(imported_lib.__file__, ""))
        except (ImportError, AttributeError, ValueError):
            pass

    # Now create a unique sorted list and concatenate the user provided deps
    # and the python module deps
    python_deps = sorted(list(set(python_deps)))
    all_deps = deps + python_deps

    # Start building the dvc run commandline
    # 1. Add all dependencies
    ret = ""
    for i in all_deps:
        ret += f"  -d {i} \\\n"
    ret += f"  \\\n"

    # 2. Add all outputs
    for i in outputs:
        ret += f"  -o {i} \\\n"
    ret += f"  \\\n"

    # 3. Add all parameters
    ret += f"  -p {','.join(params)} \\\n  \\\n"

    # 4. Add all metrics
    for i in metrics:
        ret += f"  -M {i} \\\n"
    ret += f"  \\\n"

    # We want to create newlines at every new argument that was provided to the script
    all_args = map(lambda x: f"\\\n    {x}" if x[0] == "-" else x, all_args)

    # Now build up the final string
    ret += "  python " + " ".join(all_args) 
    return ret
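
To show how this function ties everything together, below is a sketch of what the main entry point of dvc_scripts/train_test_split.py could look like. The module name dvc_helpers and the function name train_test_split inside the src package are assumptions for the sake of the example; the real repository may organise this differently. Note that the script path itself (sys.argv[0]) is kept in all_args so that it ends up in the generated python command, while the --show-dvc-add-command flag is stripped out:

import sys

# Hypothetical imports: the real module and function names may differ
from dvc_helpers import build_parser, get_dvc_stage_info
from src.train_test_split.train_test_split import train_test_split


def main():
    args = build_parser().parse_args()   # parser sketched earlier

    # Call the actual implementation that lives in the src folder
    train_test_split(args.raw_input_file,
                     args.train_output_file,
                     args.test_output_file)

    if args.show_dvc_add_command:
        # Keep sys.argv[0] (the script path) so it appears in the generated
        # command, but strip the flag itself so it is not part of the stage
        cli_args = [a for a in sys.argv if a != "--show-dvc-add-command"]
        print("Please copy paste the items below AFTER command to create stage:")
        print("    dvc run -n STAGE_NAME \\\n")
        print(get_dvc_stage_info(
            deps=[args.raw_input_file],   # manual (data) dependencies only
            outputs=[args.train_output_file, args.test_output_file],
            metrics=[],                   # this stage has no metrics files
            params=["train_test_split"],  # parameter group for this stage
            all_args=cli_args))


if __name__ == "__main__":
    main()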

Conclusion

Although you would normally not need to write these lines very often, I really do want to automate everything that I can!

Also, instead of printing the part that should be placed after the dvc run statement, the above code could be adapted to show the yaml lines that are to be copied into the dvc.yaml file for this stage. The academic statement for this would be "Left as an exercise for the reader" ;-)