So recently, I have started working with DVC to create reproducible pipelines. In my previous post about DVC I wrote about dealing with multiple different types of model in one repository.
In this article, I would like to look at one other thing I ran into when I started working with DVC: setting up the different stages requires quite a bit of typing. This is especially true if you are creating stages that depend on the outputs of one or more previous stages, have one or more metrics, multiple parameters, and finally also one or more output files. In that case, you will have to either create a very long dvc run -n STAGE_NAME command or provide the details in the dvc.yaml file.
As you will be able to see in my DVC test project on GitHub, I am using separate Python scripts in the dvc_scripts folder. Every stage that I add to my dvc.yaml file has its own file in this folder. This allows me to easily split the DVC-related code that executes the stages from the actual implementation (which can be found in the src folder). The actual implementations can then also be used in Jupyter notebooks when I am playing around with different things.
In order to add a new stage train_test_split that depends on a raw input file and generates two files, train.pkl and test.pkl, based on parameters indicating the random state and the size of the test set, I need to run the following command:
dvc run -n train_test_split \
-d ./dvc_scripts/train_test_split.py \
-d ./src/train_test_split/train_test_split.py \
-d ./data/raw/raw_data.pkl \
-p train_test_split.random_state,train_test_split.test_size \
-o ./data/train_test_split/train.pkl \
-o ./data/train_test_split/test.pkl \
python dvc_scripts/train_test_split.py \
--raw-input-file ./data/raw/raw_data.pkl \
--train-output-file ./data/train_test_split/train.pkl \
--test-output-file ./data/train_test_split/test.pkl
As you can see from the above, the command that I want to use for this stage is the train_test_split.py script from the dvc_scripts folder. Therefore, I should add this script to the list of dependencies for the stage (i.e. if I make any change to this file, it should invalidate the current stage). This CLI script itself will call the main file src/train_test_split/train_test_split.py holding the actual implementation for the train/test split stage, so this file for sure also needs to be part of the dependencies. Besides that, there are the parameters for this stage, as well as the outputs that it generates.
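For reference, the stage entry that DVC creates in dvc.yaml from the dvc run command above looks roughly like this (the exact formatting may differ per DVC version, but the cmd/deps/params/outs structure is the same):
stages:
  train_test_split:
    cmd: >-
      python dvc_scripts/train_test_split.py
      --raw-input-file ./data/raw/raw_data.pkl
      --train-output-file ./data/train_test_split/train.pkl
      --test-output-file ./data/train_test_split/test.pkl
    deps:
      - ./data/raw/raw_data.pkl
      - ./dvc_scripts/train_test_split.py
      - ./src/train_test_split/train_test_split.py
    params:
      - train_test_split.random_state
      - train_test_split.test_size
    outs:
      - ./data/train_test_split/train.pkl
      - ./data/train_test_split/test.pkl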
As mentioned before, this requires quite a bit of typing AND I should not forget to include any of the Python scripts that the stage depends on, which is where you can easily make a mistake. In order to solve this problem, I go back to what I like doing most: AUTOMATE things :)
So the Python script for each stage in the dvc_scripts folder knows everything about the input, output, and metrics files, because I provide them as commandline arguments. For example, if I run dvc_scripts/train_test_split.py with the --help argument, you will see the following:
$ python dvc_scripts/train_test_split.py --help
usage: train_test_split.py [-h] --raw-input-file file --train-output-file file
                           --test-output-file file

optional arguments:
  -h, --help            show this help message and exit
  --raw-input-file file
                        Name of the raw input file
  --train-output-file file
                        Name of the train output file
  --test-output-file file
                        Name of the test output file
  --show-dvc-add-command
                        After the script is done, it will show the command you
                        will need to use for adding the stage
As you can see, I have introduced commandline arguments for each of the items that are relevant:
- The file holding the raw input
- Where to write the training subset to
- Where to write the test subset to
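The argument parsing in dvc_scripts/train_test_split.py boils down to something like the following minimal sketch (simplified; the real script of course also calls into the implementation in src, and may differ in details):
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--raw-input-file", metavar="file", required=True,
                        help="Name of the raw input file")
    parser.add_argument("--train-output-file", metavar="file", required=True,
                        help="Name of the train output file")
    parser.add_argument("--test-output-file", metavar="file", required=True,
                        help="Name of the test output file")
    parser.add_argument("--show-dvc-add-command", action="store_true",
                        help="After the script is done, it will show the command "
                             "you will need to use for adding the stage")
    return parser.parse_args()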
Generating the dvc run -n STAGE command
You also see that I have one more argument, namely --show-dvc-add-command. This is the automation part: by adding this argument, the Python script will use all of the information provided in the commandline arguments to build up the string that you have to add as an argument to the dvc run -n STAGE command.
If we execute the dvc_scripts/train_test_split.py script as follows:
$ python dvc_scripts/train_test_split.py \
--raw-input-file ./data/raw/raw_data.pkl \
--train-output-file ./data/train_test_split/train.pkl \
--test-output-file ./data/train_test_split/test.pkl \
--show-dvc-add-command
we will get the following output:
Size train set: 712
Size test set: 179
Please copy paste the items below AFTER command to create stage:
dvc run -n STAGE_NAME \
-d ./data/raw/raw_data.pkl \
-d dvc_scripts/train_test_split.py \
-d src/train_test_split/train_test_split.py \
\
-o ./data/train_test_split/train.pkl \
-o ./data/train_test_split/test.pkl \
\
-p train_test_split \
\
\
python dvc_scripts/train_test_split.py \
--raw-input-file ./data/raw/raw_data.pkl \
--train-output-file ./data/train_test_split/train.pkl \
--test-output-file ./data/train_test_split/test.pkl
The first two lines give information about the split between train and test. After that, you get the text to add after a dvc run statement. As you can see, the first dependency on the raw_data.pkl file is taken from the commandline, but the two other ones are determined automatically while the script runs.
You can now copy these lines and easily generate your new stage based on the way you call the script in the dvc_scripts folder.
Implementation of the generator function
The implementation of the function uses the importlib Python module and adds every imported module whose source file lives under the current working directory to the list of dependencies.
import importlib
import os
import sys


def get_dvc_stage_info(deps, outputs, metrics, params, all_args):
    """Build up the commandline that should be added after dvc run -n STAGE
    for the provided dependencies, outputs, metrics, parameters, and arguments
    provided to the script.

    This function will also add all python script source files that are loaded
    at runtime and live under the current working directory to the list of
    dependencies.

    Args:
        deps: list of manual dependencies (i.e. not code) that should be added with a
            -d argument to the dvc run commandline. Typically these will be the input
            files that this script depends on, as the python scripts will be
            determined automatically
        outputs: list of all the output files that should be added with a -o argument
            to the dvc run commandline
        metrics: list of the metrics files that should be added with a -M argument
            to the dvc run commandline
        params: list of the parameters that should be added with a -p argument to the
            dvc run commandline
        all_args: list of all of the commandline arguments that were given to this
            dvc_scripts script. All of these arguments are used to build up the final
            cmd provided as argument to the dvc run commandline

    Returns:
        String holding the complete text that should be added directly after a
        dvc run -n STAGE_NAME command
    """
    python_deps = []
    _modules = sorted(set(sys.modules))
    # For each unique module, we try to import it and try to get the file in
    # which it is defined. For some of the builtins (no __file__), namespace
    # packages (__file__ is None), or files on another drive this will fail,
    # hence the except
    for i in _modules:
        try:
            imported_lib = importlib.import_module(i)
            if not os.path.relpath(imported_lib.__file__, "").startswith(".."):
                python_deps.append(os.path.relpath(imported_lib.__file__, ""))
        except (ImportError, AttributeError, TypeError, ValueError):
            pass
    # Now create a unique sorted list and concatenate the user-provided deps
    # and the python module deps
    python_deps = sorted(set(python_deps))
    all_deps = deps + python_deps
    # Start building the dvc run commandline
    # 1. Add all dependencies
    ret = ""
    for i in all_deps:
        ret += f" -d {i} \\\n"
    ret += " \\\n"
    # 2. Add all outputs
    for i in outputs:
        ret += f" -o {i} \\\n"
    ret += " \\\n"
    # 3. Add all parameters
    ret += f" -p {','.join(params)} \\\n \\\n"
    # 4. Add all metrics
    for i in metrics:
        ret += f" -M {i} \\\n"
    ret += " \\\n"
    # We want to create newlines at every new argument that was provided to the script
    all_args = map(lambda x: f"\\\n {x}" if x.startswith("-") else x, all_args)
    # Now build up the final string
    ret += " python " + " ".join(all_args)
    return ret
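To show how this function is used: in each dvc_scripts file I call it from the main block with the parsed arguments once --show-dvc-add-command is given. The snippet below is a simplified sketch of how that could look for train_test_split.py (using the parse_args sketch from earlier and the imports at the top of the module; the actual script in the repository may differ in details):
if __name__ == "__main__":
    args = parse_args()
    # ... run the actual split via the implementation in src/train_test_split/ ...
    if args.show_dvc_add_command:
        # Rebuild the stage command from sys.argv, but leave out the
        # --show-dvc-add-command flag itself, as it is not part of the stage command
        cmd_args = [a for a in sys.argv if a != "--show-dvc-add-command"]
        print("Please copy paste the items below AFTER command to create stage:")
        print("dvc run -n STAGE_NAME \\")
        print(get_dvc_stage_info(
            deps=[args.raw_input_file],
            outputs=[args.train_output_file, args.test_output_file],
            metrics=[],
            params=["train_test_split"],
            all_args=cmd_args,
        ))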
Conclusion
Although you would normally not need to write these lines very often, I really want to automate everything I can!
Also, instead of generating the part that should be placed after the dvc run statement, the above code could be adapted to show the YAML lines that are to be copied into the dvc.yaml file for this stage. The academic statement for this would be "Left as an exercise for the reader" ;-)