Notebook to train a model¶

We will train a model on a Kaggle dataset to detect credit card fraud. The model will then be uploaded as a versioned model artifact so that it can be used in a production context.

Creation of a pex with python dependencies¶

Some basic Python modules are already available in Punch images (e.g. pandas). If you need specific dependencies, you must generate a pex. The same pex is used both in development and in production, which limits version drift between the two environments.

Here we only need the scikit-learn and mlflow modules, but you can pass any list of modules if needed (see punch_pex).

In [1]:
%punch_pex -l scikit-learn mlflow --group demo --artifact dependencies -v 1.0.0 -o
  adding: dependencies-1.0.0.pex (deflated 2%)
  adding: metadata.yml (deflated 26%)
++ java -Xmx1g -Xms256m -Dlog4j.configurationFile=/punch/conf/log4j2/log4j2-stdout.xml -cp /punch/resourcectl.jar com.github.punchplatform.resourcectl.ResourceCtl -u http://artifacts-server.punch-artifacts:4245 upload -f /punch/punch_pex/dependencies-1.0.0.zip -o
Resource uploaded : additional-pex:demo:dependencies:1.0.0
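If you need more libraries, simply extend the list passed to -l. The line below is a hypothetical example (the extra xgboost dependency and the bumped version are illustrative, not part of this demo):

%punch_pex -l scikit-learn mlflow xgboost --group demo --artifact dependencies -v 1.1.0 -o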

Adding dependencies to the environment¶

This notebook only needs the pex created above, so we load it via the punch_dependencies magic cell.

In [ ]:
%%punch_dependencies
additional-pex:demo:dependencies:1.0.0
++ java -Xmx1g -Xms256m -Dlog4j.configurationFile=/punch/conf/log4j2/log4j2-stdout.xml -cp /punch/resourcectl.jar com.github.punchplatform.resourcectl.ResourceCtl -u http://artifacts-server.punch-artifacts:4245 download -r additional-pex:demo:dependencies:1.0.0 -o /usr/share/punch/extlib/python
Resource additional-pex:demo:dependencies:1.0.0 downloaded to /usr/share/punch/extlib/python/dependencies-1.0.0.pex

Importing modules¶

In [1]:
from sklearn import tree
from sklearn.metrics import accuracy_score
import mlflow

Reading data from s3¶

Punch provides magic cells to read data from different sources. If your Jupypunch was deployed with preconfigured databases, you do not need to re-enter your login credentials.

Here, the training dataset is downloaded from a MinIO bucket named "demo". We read the file and store the data in a variable called "train". The test dataset is loaded into the "test" variable (see punch_source).

In [2]:
%%punch_source --type s3 --name train -o 
bucket: demo
prefix: train/train.csv
Data is available in train variable.
Execution time: 0:00:00.418062
In [3]:
%%punch_source --type s3 --name test -o 
bucket: demo
prefix: test/test.csv
Data is available in test variable.
Execution time: 0:00:00.165612
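As a quick sanity check, the loaded data can be inspected like any pandas DataFrame. This is a sketch, assuming the magic exposes the data as pandas DataFrames (as the head() calls below suggest):

# Inspect the shape and columns of the loaded training data;
# the _ppf_* bookkeeping columns added by the source node appear alongside the features.
print(train.shape)
print(train.columns.tolist())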

Removing unused columns¶

The Punch source node adds some columns, such as _ppf_path and _ppf_last_modified, which are useful in some contexts but unnecessary for our example, so we keep only the feature and label columns.

In [4]:
train = train[['distance_from_home', 'distance_from_last_transaction',
       'ratio_to_median_purchase_price', 'repeat_retailer', 'used_chip',
       'used_pin_number', 'online_order', 'fraud']]
train.head(2)
Out[4]:
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud
0 4.805367 1.379477 1.236960 1.0 0.0 0.0 0.0 0.0
1 27.052054 1.766070 0.415689 1.0 0.0 0.0 0.0 0.0
In [5]:
test = test[['distance_from_home', 'distance_from_last_transaction',
       'ratio_to_median_purchase_price', 'repeat_retailer', 'used_chip',
       'used_pin_number', 'online_order', 'fraud']]
test.head(2)
Out[5]:
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud
0 11.188842 0.067784 1.659848 1.0 0.0 0.0 1.0 0.0
1 8.359728 0.186258 0.495259 1.0 1.0 0.0 0.0 0.0
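Alternatively, instead of listing the columns to keep, one could drop only the Punch-specific columns. A minimal sketch, assuming _ppf_path and _ppf_last_modified are the only extra columns added by the source node:

# Drop only the Punch-added bookkeeping columns and keep every feature plus the label.
# errors="ignore" avoids a KeyError if a column is missing from one of the datasets.
train = train.drop(columns=["_ppf_path", "_ppf_last_modified"], errors="ignore")
test = test.drop(columns=["_ppf_path", "_ppf_last_modified"], errors="ignore")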

Training the model¶

We train a decision tree classifier on the training data.

In [6]:
model = tree.DecisionTreeClassifier()
model = model.fit(train.drop("fraud", axis=1).values, train["fraud"].values)
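The classifier is trained with its default hyperparameters. For a reproducible run, or to limit overfitting, hyperparameters can be set explicitly; the values below are hypothetical and not tuned for this dataset:

# Hypothetical hyperparameters: cap the tree depth and fix the random seed for reproducibility.
model = tree.DecisionTreeClassifier(max_depth=10, random_state=42)
model = model.fit(train.drop("fraud", axis=1).values, train["fraud"].values)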

Testing the model¶

We evaluate the model's accuracy on the test data.

In [7]:
prediction = model.predict(test.drop("fraud", axis=1))
accuracy_score(test["fraud"], prediction)
/root/.pex/installed_wheels/a4f411ae99491abcca22021235750d2a9cc0bfbab39d5aa7e62f9861f905f58a/scikit_learn-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl/sklearn/base.py:402: UserWarning: X has feature names, but DecisionTreeClassifier was fitted without feature names
  warnings.warn(
Out[7]:
0.9999866666666667
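The UserWarning above is raised because the model was fitted on a plain NumPy array (.values) while predict received a DataFrame with column names. Passing an array to predict as well keeps both calls consistent and silences the warning:

# Predict on a plain array, matching how the model was fitted, to avoid the feature-names warning.
prediction = model.predict(test.drop("fraud", axis=1).values)
accuracy_score(test["fraud"], prediction)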

Saving the model and uploading it as an artifact¶

Once we are satisfied with the model's results, we can upload it in the desired packaging format (here MLflow) via a lambda function.

In [8]:
%%punch_upload_model -g demo -n credit_card -v 1.0.0 -o
lambda path: mlflow.sklearn.save_model(model, path)
2022/12/14 10:59:39 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/punch_upload_model/demo/credit_card/1.0.0/model.pkl, flavor: sklearn), fall back to return ['scikit-learn==1.2.0', 'cloudpickle==2.2.0']. Set logging level to DEBUG to see the full traceback.
/root/.pex/installed_wheels/57f6f22bde4e042978bcd50176fdb381d7c21a9efa4041202288d3737a0c6a54/setuptools-65.6.3-py3-none-any.whl/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
2022/12/14 10:59:39 WARNING mlflow.utils.environment: Failed to resolve installed pip version. ``pip`` will be added to conda.yaml environment spec without a version specifier.
++ java -Xmx1g -Xms256m -Dlog4j.configurationFile=/punch/conf/log4j2/log4j2-stdout.xml -cp /punch/resourcectl.jar com.github.punchplatform.resourcectl.ResourceCtl -u http://artifacts-server.punch-artifacts:4245 upload -f /tmp/punch_upload_model/demo/credit_card/1.0.0/artifact_credit_card_1.0.0.zip -o
Resource uploaded : model:demo:credit_card:1.0.0
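Before consuming the artifact in production, the locally saved MLflow model can be reloaded as a sanity check. This sketch assumes the local path reported in the MLflow log above is the saved model directory; the actual path may differ depending on the magic's internals:

# Hypothetical: reload the MLflow-packaged model from the local path shown in the log above
# and verify that it still produces predictions on the test features.
reloaded = mlflow.sklearn.load_model("/tmp/punch_upload_model/demo/credit_card/1.0.0/model.pkl")
print(reloaded.predict(test.drop("fraud", axis=1).values)[:5])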