We will train a model on a Kaggle dataset to detect credit card fraud. The model will then be uploaded as a versioned model artifact so that it can be used in production.
Some basic Python modules (for example pandas) are already available in Punch images. If you need additional dependencies, you must generate a pex. The same pex is used in development and in production, which limits version drift between the two environments.
Here we need scikit-learn and mlflow, but you can list any modules you need (see punch_pex).
%punch_pex -l scikit-learn mlflow --group demo --artifact dependencies -v 1.0.0 -o
adding: dependencies-1.0.0.pex (deflated 2%)
adding: metadata.yml (deflated 26%)
++ java -Xmx1g -Xms256m -Dlog4j.configurationFile=/punch/conf/log4j2/log4j2-stdout.xml -cp /punch/resourcectl.jar com.github.punchplatform.resourcectl.ResourceCtl -u http://artifacts-server.punch-artifacts:4245 upload -f /punch/punch_pex/dependencies-1.0.0.zip -o
Resource uploaded : additional-pex:demo:dependencies:1.0.0
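The identifier printed above appears to follow a type:group:artifact:version pattern that mirrors the --group, --artifact and -v flags. As a rough illustration, the sketch below splits such an identifier into named fields. The ResourceId class, its field names and the four-part assumption are hypothetical and are not part of the Punch API.

```python
from typing import NamedTuple


class ResourceId(NamedTuple):
    """Hypothetical holder for the four colon-separated parts of an identifier."""
    kind: str
    group: str
    artifact: str
    version: str


def parse_resource_id(text: str) -> ResourceId:
    # Assumption: identifiers always have exactly four colon-separated parts.
    parts = text.strip().split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated parts, got {len(parts)}: {text!r}")
    return ResourceId(*parts)


rid = parse_resource_id("additional-pex:demo:dependencies:1.0.0")
print(rid.group, rid.artifact, rid.version)
```

Keeping the parts named makes it easier to build the identifier expected by later cells from the same group, artifact and version values.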
This notebook only needs the pex created above, so we load it with the punch_dependencies magic cell.
%%punch_dependencies
additional-pex:demo:dependencies:1.0.0
++ java -Xmx1g -Xms256m -Dlog4j.configurationFile=/punch/conf/log4j2/log4j2-stdout.xml -cp /punch/resourcectl.jar com.github.punchplatform.resourcectl.ResourceCtl -u http://artifacts-server.punch-artifacts:4245 download -r additional-pex:demo:dependencies:1.0.0 -o /usr/share/punch/extlib/python
Resource additional-pex:demo:dependencies:1.0.0 downloaded to /usr/share/punch/extlib/python/dependencies-1.0.0.pex
from sklearn import tree
from sklearn.metrics import accuracy_score
import mlflow
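Since the point of the pex is to keep versions identical between development and production, it can be useful to record which versions the notebook actually sees. A minimal sketch using the standard library, assuming the pex exposes standard distribution metadata (installed_versions is a hypothetical helper, not a Punch function):

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for the given distribution names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not visible from this interpreter.
            versions[name] = None
    return versions


# Distribution names as passed to %punch_pex above.
print(installed_versions(["scikit-learn", "mlflow"]))
```

A None value would suggest that the pex was not loaded into the current kernel.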
Punch provides magic cells to read data from different sources. If your Jupypunch was deployed with preconfigured databases, you do not need to re-enter your login credentials.
Here, the training dataset is downloaded from a MinIO bucket named "demo". We read the file and store the data in a variable called "train". The test dataset is loaded the same way into the "test" variable (see punch_source).
%%punch_source --type s3 --name train -o
bucket: demo
prefix: train/train.csv
Data is available in train variable. Execution time: 0:00:00.418062
%%punch_source --type s3 --name test -o
bucket: demo
prefix: test/test.csv
Data is available in test variable. Execution time: 0:00:00.165612
The Punch source node adds extra columns such as _ppf_path and _ppf_last_modified. They are useful in some contexts but unnecessary for our example, so we keep only the feature columns and the fraud label.
train = train[['distance_from_home', 'distance_from_last_transaction',
'ratio_to_median_purchase_price', 'repeat_retailer', 'used_chip',
'used_pin_number', 'online_order', 'fraud']]
train.head(2)
| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.805367 | 1.379477 | 1.236960 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 27.052054 | 1.766070 | 0.415689 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
test = test[['distance_from_home', 'distance_from_last_transaction',
'ratio_to_median_purchase_price', 'repeat_retailer', 'used_chip',
'used_pin_number', 'online_order', 'fraud']]
test.head(2)
| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 11.188842 | 0.067784 | 1.659848 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 8.359728 | 0.186258 | 0.495259 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
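The same column list is applied to both datasets. One possible way to avoid repeating it, sketched here on a tiny stand-in frame rather than the real data, is to define the list once and apply it through a small helper (FEATURES, LABEL and select_columns are illustrative names, not part of Punch):

```python
import pandas as pd

FEATURES = [
    "distance_from_home",
    "distance_from_last_transaction",
    "ratio_to_median_purchase_price",
    "repeat_retailer",
    "used_chip",
    "used_pin_number",
    "online_order",
]
LABEL = "fraud"


def select_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the feature and label columns, failing early if one is missing."""
    wanted = FEATURES + [LABEL]
    missing = [c for c in wanted if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return df[wanted]


# Stand-in frame with one extra column, mimicking what the source node adds.
raw = pd.DataFrame({**{c: [0.0, 1.0] for c in FEATURES + [LABEL]}, "_ppf_path": ["a", "b"]})
clean = select_columns(raw)
print(list(clean.columns))
```

In the notebook this would read train = select_columns(train) and test = select_columns(test), which guarantees both frames share the same column order.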
We train a decision tree classifier on the training data.
model = tree.DecisionTreeClassifier()
model = model.fit(train.drop("fraud", axis=1).values, train["fraud"].values)
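With default settings, scikit-learn's decision tree permutes features randomly at each split, so two fits on the same data can produce different trees when several splits are equally good. If a reproducible artifact matters, one option is to fix random_state. The sketch below uses synthetic data, not the tutorial's dataset, and the seed values are arbitrary:

```python
import numpy as np
from sklearn import tree

# Synthetic stand-in data: 200 rows, 7 features, binary label.
rng = np.random.default_rng(0)
X = rng.random((200, 7))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Fixing random_state makes the choice between equally good splits repeatable.
model_a = tree.DecisionTreeClassifier(random_state=42).fit(X, y)
model_b = tree.DecisionTreeClassifier(random_state=42).fit(X, y)

# Compare the structure of the two fitted trees, not just their predictions.
same_structure = (
    model_a.tree_.node_count == model_b.tree_.node_count
    and np.array_equal(model_a.tree_.feature, model_b.tree_.feature)
)
print(same_structure)
```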
We evaluate the model on the test data, passing a NumPy array as we did for training.
# Use .values so predict() receives the same input type as fit() (avoids a feature-name warning)
prediction = model.predict(test.drop("fraud", axis=1).values)
accuracy_score(test["fraud"], prediction)
0.9999866666666667
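Fraud datasets are usually heavily imbalanced, so accuracy alone can look excellent even when few frauds are caught. As a minimal sketch, on made-up labels rather than the tutorial's data, precision, recall and the confusion matrix from sklearn.metrics give a more complete picture:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up labels: 8 legitimate transactions (0) and 2 frauds (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A classifier that catches only one of the two frauds.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)    # share of correct predictions
prec = precision_score(y_true, y_pred)  # of predicted frauds, how many are real
rec = recall_score(y_true, y_pred)      # of real frauds, how many were caught
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print(acc, prec, rec)
print(cm)
```

In the notebook, the same functions could be called with test["fraud"] and prediction in place of the made-up lists.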
Once we are satisfied with the model's results, we can upload it in the packaging format of our choice (here mlflow). The magic cell takes a lambda function that receives a destination path and saves the model there.
%%punch_upload_model -g demo -n credit_card -v 1.0.0 -o
lambda path: mlflow.sklearn.save_model(model, path)
2022/12/14 10:59:39 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/punch_upload_model/demo/credit_card/1.0.0/model.pkl, flavor: sklearn), fall back to return ['scikit-learn==1.2.0', 'cloudpickle==2.2.0']. Set logging level to DEBUG to see the full traceback.
/root/.pex/installed_wheels/57f6f22bde4e042978bcd50176fdb381d7c21a9efa4041202288d3737a0c6a54/setuptools-65.6.3-py3-none-any.whl/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
2022/12/14 10:59:39 WARNING mlflow.utils.environment: Failed to resolve installed pip version. ``pip`` will be added to conda.yaml environment spec without a version specifier.
++ java -Xmx1g -Xms256m -Dlog4j.configurationFile=/punch/conf/log4j2/log4j2-stdout.xml -cp /punch/resourcectl.jar com.github.punchplatform.resourcectl.ResourceCtl -u http://artifacts-server.punch-artifacts:4245 upload -f /tmp/punch_upload_model/demo/credit_card/1.0.0/artifact_credit_card_1.0.0.zip -o
Resource uploaded : model:demo:credit_card:1.0.0
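The upload cell takes a function of a single path argument, and the log shows that the result ends up in a zip archive. The sketch below illustrates how such a callback could be consumed in general. It is an assumption for illustration only, not Punch's actual implementation: package_with and save_as_pickle are hypothetical, and pickle stands in for mlflow.sklearn.save_model.

```python
import pickle
import tempfile
import zipfile
from pathlib import Path
from typing import Callable


def package_with(save_fn: Callable[[str], None], archive: Path) -> Path:
    """Call save_fn on a fresh path, then zip whatever it wrote (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "model"
        save_fn(str(target))  # the callback decides the on-disk format
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for file in sorted(target.rglob("*")):
                if file.is_file():
                    zf.write(file, file.relative_to(target).as_posix())
    return archive


def save_as_pickle(path: str) -> None:
    # Stand-in for a real saver: creates a directory containing one file.
    Path(path).mkdir(parents=True)
    with open(Path(path) / "model.pkl", "wb") as fh:
        pickle.dump({"kind": "stand-in model"}, fh)


out_dir = Path(tempfile.mkdtemp())
archive = package_with(save_as_pickle, out_dir / "artifact.zip")
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
print(names)
```

Passing the saver as a function keeps the packaging step independent of the model format, which is presumably why the cell body is a lambda rather than a fixed call.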