1nceuponapanda commited on
Commit
87bdcf0
·
1 Parent(s): cd9523e

Upload predicting-fraud-in-financial-payment-services.ipynb

predicting-fraud-in-financial-payment-services.ipynb ADDED
@@ -0,0 +1 @@
 
 
+ {"metadata": {"kernelspec": {"name": "python3", "language": "python", "display_name": "Python 3"}, "language_info": {"name": "python", "mimetype": "text/x-python", "pygments_lexer": "ipython3", "file_extension": ".py", "version": "3.6.4", "nbconvert_exporter": "python", "codemirror_mode": {"name": "ipython", "version": 3}}, "anaconda-cloud": {}}, "cells": [{"source": ["This dataset is presently only one of four on Kaggle with information on the rising risk of digital financial fraud, emphasizing the difficulty in obtaining such data. The main technical challenge it poses to predicting fraud is the highly imbalanced distribution between positive and negative classes in 6 million rows of data. Another stumbling block to the utility of this data stems from the possible discrepancies in its description <a href='https://www.kaggle.com/ntnu-testimon/paysim1/discussion/35004'>[1]</a>, <a href='https://www.kaggle.com/lightcc/money-doesn-t-add-up/'>[2]</a>, <a href='https://www.kaggle.com/ntnu-testimon/paysim1/discussion/32786'>[3]</a>. The goal of this analysis is to solve both these issues by a detailed data exploration and cleaning followed by choosing a suitable machine-learning algorithm to deal with the skew. I show that an optimal solution based on feature-engineering and extreme gradient-boosted decision trees yields an enhanced predictive power of 0.997, as measured by the area under the precision-recall curve. Crucially, these results were obtained without artificial balancing of the data making this approach suitable to real-world applications.\n", "\n", "Update: This notebook got the Kaggle kernel award for the first week of October 2017."], "cell_type": "markdown", "metadata": {"_cell_guid": "c7b45d97-d213-4bff-a39a-fb973be3d1c2", "_uuid": "40baadcf98e87b9fa263ec2c80804a6298d6472e"}}, {"source": ["<a id='top'></a>\n", "#### Outline: \n", "#### 1. <a href='#import'>Import</a>\n", "#### 2. <a href='#EDA'>Exploratory Data Analysis</a>\n", "21. 
<a href='#fraud-trans'>Which types of transactions are fraudulent?</a>\n", "22. <a href='#isFlaggedFraud'>What determines whether the feature *isFlaggedFraud* gets set or not?</a>\n", "23. <a href='#merchant'>Are expected merchant accounts accordingly labelled?</a>\n", "24. <a href='#common-accounts'>Are there account labels common to fraudulent TRANSFERs and CASH_OUTs?</a>\n", "\n", "#### 3. <a href='#clean'>Data Cleaning</a>\n", "31. <a href='#imputation'>Imputation of Latent Missing Values</a>\n", "\n", "#### 4. <a href='#feature-eng'>Feature Engineering</a>\n", "#### 5. <a href='#visualization'>Data Visualization</a>\n", "51. <a href='#time'>Dispersion over time</a>\n", "52. <a href='#amount'>Dispersion over amount</a>\n", "53. <a href='#error'>Dispersion over error in balance in destination accounts</a>\n", "54. <a href='#separation'>Separating out genuine from fraudulent transactions</a>\n", "55. <a href='#correlation'>Fingerprints of genuine and fraudulent transactions</a>\n", "\n", "#### 6. <a href='#ML'>Machine Learning to Detect Fraud in Skewed Data</a>\n", "61. <a href='#importance'>What are the important features for the ML model?</a>\n", "62. <a href='#decision-tree'>Visualization of ML model</a>\n", "63. <a href='#learning-curve'>Bias-variance tradeoff</a>\n", "\n", "#### 7. <a href='#conclusion'>Conclusion</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "a58a2de7-d514-49f1-bfd4-a170d073fc58", "_uuid": "b08b5ca36efff32b32e4c77d1ec858a15e7469c5"}}, {"source": ["<a id='import'></a>\n", "#### 1. 
Import"], "cell_type": "markdown", "metadata": {"_cell_guid": "aae6646e-bcaa-4afb-85bc-2f67e314d5b0", "_uuid": "d821b548edc31ad27967b8419af17bfa4f0bae48"}}, {"source": ["import pandas as pd\n", "import numpy as np\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import matplotlib.lines as mlines\n", "from mpl_toolkits.mplot3d import Axes3D\n", "import seaborn as sns\n", "from sklearn.model_selection import train_test_split, learning_curve\n", "from sklearn.metrics import average_precision_score\n", "from xgboost.sklearn import XGBClassifier\n", "from xgboost import plot_importance, to_graphviz"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "03afcaca-4105-4c95-9a38-9a922f592814", "collapsed": true, "_uuid": "f936a10915bc042e678e1148923e493f92b518c3"}}, {"source": ["import warnings\n", "warnings.filterwarnings(\"ignore\", category=DeprecationWarning)"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "9bc1119f-d883-4cc6-b569-a707359abe01", "collapsed": true, "_uuid": "b7a3f0c8714f9e3d78a898a014f5ad81ecee1063"}}, {"source": ["Import data and correct spelling of original column headers for consistency"], "cell_type": "markdown", "metadata": {"_cell_guid": "da58e56c-7ba7-46b2-8b84-d1bcea8c7e42", "_uuid": "435ddb566ca27f5ce720f0fcb52e62e99dc829dd"}}, {"source": ["df = pd.read_csv('../input/PS_20174392719_1491204439457_log.csv')\n", "df = df.rename(columns={'oldbalanceOrg':'oldBalanceOrig', 'newbalanceOrig':'newBalanceOrig', \\\n", " 'oldbalanceDest':'oldBalanceDest', 'newbalanceDest':'newBalanceDest'})\n", "print(df.head())"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "9dc91648-ab99-44c3-a1f1-bea967df390f", "collapsed": true, "_uuid": "6e1110dd564e4a38eb9d8e83c4575dcc3d26abe8"}}, {"source": ["Test if there are any missing values in the DataFrame. 
It turns out there are no\n", "obvious missing values but, as we will see below, this does not rule out proxies by a numerical\n", "value like 0."], "cell_type": "markdown", "metadata": {"_cell_guid": "51d76f51-53ed-4ba5-ae39-588909fcb531", "_uuid": "d0987a74164e0d432df7291eb91338a6a6b17fe2"}}, {"source": ["df.isnull().values.any()"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "665e996b-39b8-4f47-bb2a-6a31643cde22", "collapsed": true, "_uuid": "4b7ffd4888d1df69961a0f2a58aa5102e694bc8b"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "ee5bc661-4f56-47f0-8771-9d033a17f347", "_uuid": "f1de48b059ede362e51020b263b01ed1f2dda83d"}}, {"source": ["<a id='EDA'></a>\n", "#### 2. Exploratory Data Analysis\n", "In this section and until section 4, we wrangle with the data exclusively using Dataframe methods. This is the most succinct way to gain insights into the dataset. More elaborate visualizations follow in subsequent sections. "], "cell_type": "markdown", "metadata": {"_cell_guid": "1f073780-2735-4481-a80c-87ed3cfc9d6a", "_uuid": "4409f927b742283f1024982d00585deae5e57976"}}, {"source": ["<a id='fraud-trans'></a>\n", "##### 2.1. Which types of transactions are fraudulent? \n", "We find that of the five types of transactions, fraud occurs only in two of them (see also kernels by <a href='https://www.kaggle.com/netzone/eda-and-fraud-detection'>Net</a>, <a href='https://www.kaggle.com/philschmidt/where-s-the-money-lebowski'>Philipp Schmidt</a> and <a href='https://www.kaggle.com/ibenoriaki/three-features-with-kneighbors-auc-score-is-0-998'>Ibe_Noriaki</a>):\n", "'TRANSFER' where money is sent to a customer / fraudster\n", "and 'CASH_OUT' where money is sent to a merchant who pays the customer / \n", "fraudster\n", "in cash. 
Remarkably, the number of \n", "fraudulent TRANSFERs almost equals the number of fraudulent CASH_OUTs (see the right half of the plot in section <a href='#time'>5.1</a>). These\n", "observations appear, at first, to bear out\n", "the description provided on Kaggle for the modus operandi of fraudulent \n", "transactions in \n", "this dataset, namely, fraud is committed by first transferring out funds\n", "to another account which subsequently cashes it out. We will return to this issue later in section <a href='#common-accounts'>2.4</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "cff74f1a-5121-44c2-a6fe-6d69fd904a45", "_uuid": "856038be1f75ffd6e0b132c80eac2a35bb36f9d5"}}, {"source": ["print('\\n The types of fraudulent transactions are {}'.format(\\\n", "list(df.loc[df.isFraud == 1].type.drop_duplicates().values))) # only 'CASH_OUT' \n", " # & 'TRANSFER'\n", "\n", "dfFraudTransfer = df.loc[(df.isFraud == 1) & (df.type == 'TRANSFER')]\n", "dfFraudCashout = df.loc[(df.isFraud == 1) & (df.type == 'CASH_OUT')]\n", "\n", "print ('\\n The number of fraudulent TRANSFERs = {}'.\\\n", " format(len(dfFraudTransfer))) # 4097\n", "\n", "print ('\\n The number of fraudulent CASH_OUTs = {}'.\\\n", " format(len(dfFraudCashout))) # 4116"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "a10a3577-4716-49dc-8dda-d567fb395a74", "collapsed": true, "_uuid": "abc0207b9075a0defd828fe8732d63cce8da33c6"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "dfae99f3-ed49-47e3-8036-afafda366f0b", "_uuid": "744f972b4966ebf9f207382f1e3bb900c370f0d6"}}, {"source": ["<a id='isFlaggedFraud'></a>\n", "##### 2.2. What determines whether the feature *isFlaggedFraud* gets set or not? 
\n", "It turns out that the origin of *isFlaggedFraud* is unclear, contrasting\n", "with the description provided.\n", "The 16 entries (out of 6 million) where the *isFlaggedFraud* feature \n", "is set \n", "do not seem to correlate with any\n", "explanatory variable. The data is described as *isFlaggedFraud* being set when\n", "an attempt is made to 'TRANSFER' an 'amount' greater than 200,000. \n", "In\n", "fact, as shown below, *isFlaggedFraud* can remain not set despite this condition being met."], "cell_type": "markdown", "metadata": {"_cell_guid": "a2207bbf-53cf-463c-bb24-f3c871091f75", "_uuid": "04b0c4acefb484866fb54a19d04c67dc62699c09"}}, {"source": ["print('\\nThe type of transactions in which isFlaggedFraud is set: \\\n", "{}'.format(list(df.loc[df.isFlaggedFraud == 1].type.drop_duplicates()))) \n", " # only 'TRANSFER'\n", "\n", "dfTransfer = df.loc[df.type == 'TRANSFER']\n", "dfFlagged = df.loc[df.isFlaggedFraud == 1]\n", "dfNotFlagged = df.loc[df.isFlaggedFraud == 0]\n", "\n", "print('\\nMin amount transacted when isFlaggedFraud is set= {}'\\\n", " .format(dfFlagged.amount.min())) # 353874.22\n", "\n", "print('\\nMax amount transacted in a TRANSFER where isFlaggedFraud is not set=\\\n", " {}'.format(dfTransfer.loc[dfTransfer.isFlaggedFraud == 0].amount.max())) # 92445516.64"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "81e27fe8-a42b-40a1-a629-7201f5d4bbd0", "collapsed": true, "_uuid": "42a56e2d6de87ea04c9b03ab7e32b4c806ba00c6"}}, {"source": ["Can *oldBalanceDest* and *newBalanceDest* determine *isFlaggedFraud* being\n", "set?\n", "The old is identical to the new balance in the origin and destination \n", "accounts, for every TRANSFER where *isFlaggedFraud* is set. This is presumably because \n", "the \n", "transaction is halted <a href='https://www.kaggle.com/lightcc/money-doesn-t-add-up/comments#187011'>[4]</a>.\n", "Interestingly, *oldBalanceDest* = 0 in every such transaction. 
However, as shown below, since\n", "*isFlaggedFraud* can remain not set in TRANSFERS where\n", "*oldBalanceDest* and *newBalanceDest* can both be 0, these conditions do not\n", "determine the state of *isFlaggedFraud*.\n", "\n"], "cell_type": "markdown", "metadata": {"_cell_guid": "a7c69e69-8329-474e-802c-d0d57d970226", "_uuid": "6343cbcd0a3d9f52aa43a7f9b04cd2de8b5d0c48"}}, {"source": ["print('\\nThe number of TRANSFERs where isFlaggedFraud = 0, yet oldBalanceDest = 0 and\\\n", " newBalanceDest = 0: {}'.\\\n", "format(len(dfTransfer.loc[(dfTransfer.isFlaggedFraud == 0) & \\\n", "(dfTransfer.oldBalanceDest == 0) & (dfTransfer.newBalanceDest == 0)]))) # 4158"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "791158f2-7bca-427c-ac9b-6adde1f1db75", "collapsed": true, "_uuid": "86b3e7ba00dd65c3c7e715bd5a1dc92c854a5708"}}, {"source": ["*isFlaggedFraud* being set cannot be thresholded on *oldBalanceOrig* since\n", "the corresponding range of values overlaps with that for TRANSFERs where *isFlaggedFraud* is not set (see below). 
Note that we do not need\n", "to consider *newBalanceOrig* since it is updated only after the transaction,\n", "whereas *isFlaggedFraud* would be set before the transaction takes place."], "cell_type": "markdown", "metadata": {"_cell_guid": "02430c3d-7b18-43e0-8bda-a5345e82b7f4", "_uuid": "e8903ad23847ae97bb707e7de12fe092cfa07d05"}}, {"source": ["print('\\nMin, Max of oldBalanceOrig for isFlaggedFraud = 1 TRANSFERs: {}'.\\\n", "format([round(dfFlagged.oldBalanceOrig.min()), round(dfFlagged.oldBalanceOrig.max())]))\n", "\n", "print('\\nMin, Max of oldBalanceOrig for isFlaggedFraud = 0 TRANSFERs where \\\n", "oldBalanceOrig = \\\n", "newBalanceOrig: {}'.format(\\\n", "[dfTransfer.loc[(dfTransfer.isFlaggedFraud == 0) & (dfTransfer.oldBalanceOrig \\\n", "== dfTransfer.newBalanceOrig)].oldBalanceOrig.min(), \\\n", "round(dfTransfer.loc[(dfTransfer.isFlaggedFraud == 0) & (dfTransfer.oldBalanceOrig \\\n", " == dfTransfer.newBalanceOrig)].oldBalanceOrig.max())]))"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "bcd12ad3-206d-4705-92c9-c6fd2afe8233", "collapsed": true, "_uuid": "a1ecf24b9ab510ea8f0d52db863a01fe8baf5c41"}}, {"source": ["Can *isFlaggedFraud* be set based on seeing a customer transacting more than\n", "once? Note that duplicate customer names don't exist within transactions \n", "where *isFlaggedFraud* is set, but duplicate customer names exist within\n", "transactions where *isFlaggedFraud* is not set. It turns out that originators\n", "of transactions that have *isFlaggedFraud* set have transacted only once.\n", "Very few destination accounts of transactions that have *isFlaggedFraud* set\n", "have transacted more than once."], "cell_type": "markdown", "metadata": {"_cell_guid": "0fbe89fb-9683-4802-bb00-80dcc63e7c29", "_uuid": "92fcee6f07af5bd9e47e7cb8944f414771698a62"}}, {"source": ["print('\\nHave originators of transactions flagged as fraud transacted more than \\\n", "once? 
{}'\\\n", ".format((dfFlagged.nameOrig.isin(pd.concat([dfNotFlagged.nameOrig, \\\n", " dfNotFlagged.nameDest]))).any())) # False\n", "\n", "print('\\nHave destinations for transactions flagged as fraud initiated\\\n", " other transactions? \\\n", "{}'.format((dfFlagged.nameDest.isin(dfNotFlagged.nameOrig)).any())) # False\n", "\n", "# Since only 2 destination accounts of 16 that have 'isFlaggedFraud' set have been\n", "# destination accounts more than once,\n", "# clearly 'isFlaggedFraud' being set is independent of whether a \n", "# destination account has been used before or not\n", "\n", "print('\\nHow many destination accounts of transactions flagged as fraud have been \\\n", "destination accounts more than once?: {}'\\\n", ".format(sum(dfFlagged.nameDest.isin(dfNotFlagged.nameDest)))) # 2"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "b6d950f3-6c4e-443b-8096-0106a02929f8", "collapsed": true, "_uuid": "0202615ffa8bfee7af1c9cc8b13c54a936cf7df8"}}, {"source": ["It can be easily seen that transactions with *isFlaggedFraud* \n", "set occur at\n", "all values of *step*, similar to the complementary set of transactions. 
Thus\n", "*isFlaggedFraud* does not correlate with *step* either and is therefore\n", "seemingly unrelated to any explanatory variable or feature in the data."], "cell_type": "markdown", "metadata": {"_cell_guid": "475ab1f1-7959-4608-8adc-981416aa53cc", "_uuid": "5ac2326b82ed2efdde2f1a0d4019611aea8b6e4d"}}, {"source": ["*Conclusion*: Although *isFraud* is always set when *isFlaggedFraud* is set, since\n", "*isFlaggedFraud* is set just 16 times in a seemingly meaningless way, we \n", "can treat this feature as insignificant and discard it from the dataset \n", "without losing information."], "cell_type": "markdown", "metadata": {"scrolled": true, "_cell_guid": "2fa84d92-9173-4caa-9386-05e5c7d09e5d", "_uuid": "3fc43eef9e317cef7818e60bc09eeb2c8febc1ed"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "e599bb1d-89ae-4c01-9ec7-de6d2960d4ad", "_uuid": "5a5d627e0d65abba7a8209209fdeb73b0e0f6ca3"}}, {"source": ["<a id='merchant'></a>\n", "##### 2.3. Are expected merchant accounts accordingly labelled?"], "cell_type": "markdown", "metadata": {"_cell_guid": "ca4202ed-e1a5-48af-90e3-6a1324c18d95", "_uuid": "1abcdc6816fdb410029556cd5f610ed4c5ee59a0"}}, {"source": ["It was stated <a href='http://www2.bth.se/com/edl.nsf/pages/phd-dissertation'>[5]</a> that CASH_IN involves being paid by\n", "a merchant (whose name is prefixed by 'M'). However, as shown below, the present data does not have\n", "merchants making CASH_IN transactions to customers."], "cell_type": "markdown", "metadata": {"_cell_guid": "8abe931c-56c9-427a-af3b-690c36cedc66", "_uuid": "91f8c24828731f9b9cfd8332992c54b7b501983e"}}, {"source": ["print('\\nAre there any merchants among originator accounts for CASH_IN \\\n", "transactions? 
{}'.format(\\\n", "(df.loc[df.type == 'CASH_IN'].nameOrig.str.contains('M')).any())) # False"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "16426d05-83d1-4ca0-a7c0-e65d0f63d640", "collapsed": true, "_uuid": "57f3a4f8b1a96a3904b190e75119869ecba9f441"}}, {"source": ["Similarly, it was stated that CASH_OUT involves paying \n", "a merchant. However, for CASH_OUT transactions\n", "there are no merchants among the destination accounts."], "cell_type": "markdown", "metadata": {"_cell_guid": "c6a8a232-ffaf-4ca1-88a2-1049f9e151ad", "_uuid": "af083b445624b979b4c60213a9c228b3b75fada5"}}, {"source": ["print('\\nAre there any merchants among destination accounts for CASH_OUT \\\n", "transactions? {}'.format(\\\n", "(df.loc[df.type == 'CASH_OUT'].nameDest.str.contains('M')).any())) # False"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "1e8005fc-f470-4146-811a-53bf144bbebe", "collapsed": true, "_uuid": "a3fee89c25edcf20f1cac787c8c4a5c0804fb5d4"}}, {"source": ["In fact, there are no merchants among any originator accounts. Merchants are\n", "only\n", "present in destination accounts for all PAYMENTS."], "cell_type": "markdown", "metadata": {"_cell_guid": "0ba20cd9-4739-481c-8e24-1e0524a07533", "_uuid": "17821cd06c2b6c86940950bd3bcd534700c2b3d4"}}, {"source": ["print('\\nAre there merchants among any originator accounts? {}'.format(\\\n", " df.nameOrig.str.contains('M').any())) # False\n", "\n", "print('\\nAre there any transactions having merchants among destination accounts\\\n", " other than the PAYMENT type? 
{}'.format(\\\n", "(df.loc[df.nameDest.str.contains('M')].type != 'PAYMENT').any())) # False"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "198925f6-559c-45f0-8ecf-386b80bc98c7", "collapsed": true, "_uuid": "0b933060f76aafc9cdf5c9a83f25b4ccd6503f13"}}, {"source": ["*Conclusion*: Among the account labels *nameOrig* and *nameDest*, for all transactions, the merchant prefix of 'M' occurs in an unexpected way."], "cell_type": "markdown", "metadata": {"_cell_guid": "97b0473c-d403-4b10-bcef-377d9d5aaff3", "_uuid": "8f9e378300ccc8270e46970394d04d8511072101"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "9f3a2096-47f3-418c-8639-93c5bb97dc71", "_uuid": "021d68ea1c89f435d49ffe3339f22848a9006b32"}}, {"source": ["<a id='common-accounts'></a>\n", "##### 2.4. Are there account labels common to fraudulent TRANSFERs and CASH_OUTs?"], "cell_type": "markdown", "metadata": {"_cell_guid": "eac018aa-712e-4162-89c1-a0fd42d5868c", "_uuid": "4a1bf9add032d3d8e0c290f84cde5db96e737d4e"}}, {"source": ["From the data description, the modus operandi for committing fraud involves \n", "first making a TRANSFER to a (fraudulent) account which in turn \n", "conducts a CASH_OUT.\n", "CASH_OUT involves transacting with a merchant who\n", "pays out cash.\n", "Thus, within this two-step process, the fraudulent account would be both, \n", "the destination in a TRANSFER\n", "and the originator in a CASH_OUT. However, the data shows below that there are no \n", "such common accounts among \n", "fraudulent transactions. Thus, the data is not imprinted with\n", "the expected modus-operandi."], "cell_type": "markdown", "metadata": {"_cell_guid": "a5ddbc33-3da5-4095-b4bc-06ee673afb5e", "_uuid": "77dfff4adb09874c1bc4b3e775a60bb98beeba93"}}, {"source": ["print('\\nWithin fraudulent transactions, are there destinations for TRANSFERS \\\n", "that are also originators for CASH_OUTs? 
{}'.format(\\\n", "(dfFraudTransfer.nameDest.isin(dfFraudCashout.nameOrig)).any())) # False\n", "dfNotFraud = df.loc[df.isFraud == 0]"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "ba9c708c-2f0a-4598-9563-7abaf50120f9", "collapsed": true, "_uuid": "8598fa5d577c02ff00638da238a59663d8259a7d"}}, {"source": ["Could destination accounts for fraudulent TRANSFERs originate CASH_OUTs that\n", "are not detected and are labeled as genuine? It turns out there are 3 such\n", "accounts."], "cell_type": "markdown", "metadata": {"_cell_guid": "d6a65aec-f9c9-41e6-b565-aefbe26029d1", "_uuid": "0d4af6d00edfb657a1aaf8ccc1df4d1f983dae4d"}}, {"source": ["print('\\nFraudulent TRANSFERs whose destination accounts are originators of \\\n", "genuine CASH_OUTs: \\n\\n{}'.format(dfFraudTransfer.loc[dfFraudTransfer.nameDest.\\\n", "isin(dfNotFraud.loc[dfNotFraud.type == 'CASH_OUT'].nameOrig.drop_duplicates())]))"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "58d72273-5415-4391-9d01-c7e9cc09c2c9", "collapsed": true, "_uuid": "61386b6444d1dde29c2ba77944806fb08e146a1d"}}, {"source": ["However, 2 out of 3 of these accounts first make a genuine CASH_OUT and \n", "only later \n", "(as evidenced by the\n", "time step) receive a fraudulent TRANSFER. 
Thus, fraudulent transactions are\n", "not indicated by the *nameOrig* and *nameDest* features."], "cell_type": "markdown", "metadata": {"_cell_guid": "b9e3ad76-30c0-47fa-a060-2bd0b60dfc35", "_uuid": "815445401224d21b17b7989e7d39eb9dc34f1138"}}, {"source": ["print('\\nFraudulent TRANSFER to C423543548 occurred at step = 486 whereas \\\n", "genuine CASH_OUT from this account occurred earlier at step = {}'.format(\\\n", "dfNotFraud.loc[(dfNotFraud.type == 'CASH_OUT') & (dfNotFraud.nameOrig == \\\n", " 'C423543548')].step.values)) # 185"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "f3ad49db-c8af-4145-96a6-8f05fd94b377", "collapsed": true, "_uuid": "1f0e7510db0998f2bec79069eedd4757466d41f2"}}, {"source": ["*Conclusion*: The *nameOrig* and *nameDest* features neither indicate fraudulent transactions\n", "nor, as noted in section <a href='#merchant'>2.3</a> above, encode merchant accounts in the expected way.\n", "Below, we drop these features from the data since they are meaningless."], "cell_type": "markdown", "metadata": {"_cell_guid": "eb8fa972-8045-4b21-b1ea-9444f726d9d5", "_uuid": "a231a128247d1a6231ba17ad20c34d615993b2b0"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "c7625b95-f1e5-4458-9ed8-d53bcba685c1", "_uuid": "0f4fa4c3bfca8795883c2eb17d2340feb209247d"}}, {"source": ["<a id='clean'></a>\n", "#### 3. Data cleaning"], "cell_type": "markdown", "metadata": {"_cell_guid": "cb401745-33d3-42df-a226-12a84fb04bd9", "_uuid": "86a79448bf138b3b4c249c5be4b812dbce492e9a"}}, {"source": ["From the exploratory data analysis (EDA) of section <a href='#EDA'>2</a>, we know that fraud only occurs in \n", "'TRANSFER's and 'CASH_OUT's. 
So we assemble only the corresponding data in X\n", "for analysis."], "cell_type": "markdown", "metadata": {"_cell_guid": "cdd94133-5730-4f31-9a7e-c23c1322edb2", "_uuid": "a5b91ecc33d7f51bec7e5e357edef789e2459d12"}}, {"source": ["X = df.loc[(df.type == 'TRANSFER') | (df.type == 'CASH_OUT')]\n", "\n", "randomState = 5\n", "np.random.seed(randomState)\n", "\n", "#X = X.loc[np.random.choice(X.index, 100000, replace = False)]\n", "\n", "Y = X['isFraud']\n", "del X['isFraud']\n", "\n", "# Eliminate columns shown to be irrelevant for analysis in the EDA\n", "X = X.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)\n", "\n", "# Binary-encoding of labelled data in 'type'\n", "X.loc[X.type == 'TRANSFER', 'type'] = 0\n", "X.loc[X.type == 'CASH_OUT', 'type'] = 1\n", "X.type = X.type.astype(int) # convert dtype('O') to dtype(int)"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "bc50e596-b0f3-4e63-877d-d506e2528cd1", "collapsed": true, "_uuid": "42807f63bb86782daedce8dfa7ba1385458c84c9"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "9e10d9ff-0cc1-4c4b-9dd2-6772b6ffbc12", "_uuid": "51ae3efbe461831c91c595096021a362928312fe"}}, {"source": ["<a id='imputation'></a>\n", "##### 3.1. Imputation of Latent Missing Values"], "cell_type": "markdown", "metadata": {"_cell_guid": "8068fa22-b0d9-46b7-86ff-21ca4dd212f5", "_uuid": "520ca37c8d0db0422edf31ecbac74667276494ab"}}, {"source": ["The data has several transactions with zero balances in the destination \n", "account both before and after a non-zero amount is transacted. 
The fraction\n", "of such transactions, where zero likely denotes a missing value, is much \n", "larger\n", "for fraudulent (50%) than for genuine transactions (0.06%)."], "cell_type": "markdown", "metadata": {"_cell_guid": "080ba9dc-2879-4e44-8cb8-a73b7d87cd04", "_uuid": "9953274969e078811390f2bd275b41e5219e87fb"}}, {"source": ["Xfraud = X.loc[Y == 1]\n", "XnonFraud = X.loc[Y == 0]\n", "print('\\nThe fraction of fraudulent transactions with \\'oldBalanceDest\\' = \\\n", "\\'newBalanceDest\\' = 0 although the transacted \\'amount\\' is non-zero is: {}'.\\\n", "format(len(Xfraud.loc[(Xfraud.oldBalanceDest == 0) & \\\n", "(Xfraud.newBalanceDest == 0) & (Xfraud.amount != 0)]) / (1.0 * len(Xfraud))))\n", "\n", "print('\\nThe fraction of genuine transactions with \\'oldBalanceDest\\' = \\\n", "\\'newBalanceDest\\' = 0 although the transacted \\'amount\\' is non-zero is: {}'.\\\n", "format(len(XnonFraud.loc[(XnonFraud.oldBalanceDest == 0) & \\\n", "(XnonFraud.newBalanceDest == 0) & (XnonFraud.amount != 0)]) / (1.0 * len(XnonFraud))))"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "ebff49a7-a155-4fcf-b6a1-69fedffed60b", "collapsed": true, "_uuid": "add520480eb9c8c0249aebaede714f1afae0f274"}}, {"source": ["Since the destination account balances being zero is a strong indicator of\n", "fraud, we do not impute the account balance (before the transaction is made)\n", "with a statistic or from a distribution with a subsequent adjustment for \n", "the amount transacted. Doing so would mask this\n", "indicator of fraud and make fraudulent transactions appear genuine. 
Instead,\n", "below we\n", "replace the value of 0 with -1\n", "which will be more useful to a suitable machine-learning (ML) algorithm detecting \n", "fraud."], "cell_type": "markdown", "metadata": {"_cell_guid": "ac9c5868-7817-4c03-a898-9da026776ca7", "_uuid": "0d575a669ee174dc04e34dcd40b6989c1b536f27"}}, {"source": ["X.loc[(X.oldBalanceDest == 0) & (X.newBalanceDest == 0) & (X.amount != 0), \\\n", " ['oldBalanceDest', 'newBalanceDest']] = - 1"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "41351678-070d-4ed8-8386-760baeafc5bf", "collapsed": true, "_uuid": "b277e238d9470a7c1231ac2e59f9d0a0a25089ff"}}, {"source": ["The data also has several transactions with zero balances in the originating\n", "account both before and after a non-zero amount is transacted. In this case,\n", "the fraction of such transactions is much smaller in fraudulent (0.3%)\n", "compared to genuine transactions (47%). Once again, from similar reasoning as\n", "above, instead of imputing a \n", "numerical value we replace the value of 0 with a null value."], "cell_type": "markdown", "metadata": {"_cell_guid": "e40457d1-cf12-480b-99c4-316f470a2205", "_uuid": "c63f1339566c32356fb65ec9f2ac1abbd87f4d11"}}, {"source": ["X.loc[(X.oldBalanceOrig == 0) & (X.newBalanceOrig == 0) & (X.amount != 0), \\\n", " ['oldBalanceOrig', 'newBalanceOrig']] = np.nan"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "a950ca90-ce03-4973-811c-96b714e09335", "collapsed": true, "_uuid": "57ee6691c420dc2f5261f99d8b02f2c17ea947f1"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "8a5432e2-f4f2-4e4f-8ad4-c3533c32ef7d", "_uuid": "f3619f3f8200f0ba02cab0ef2241faabbaa08d02"}}, {"source": ["<a id='feature-eng'></a>\n", "#### 4. 
Feature-engineering"], "cell_type": "markdown", "metadata": {"_cell_guid": "6f02a6bb-08db-41b6-b0d1-2076e49a47e8", "_uuid": "b8e9aab218faffb19fa309eb3ef754d2cc58c4a4"}}, {"source": ["Motivated by the possibility of zero-balances serving to differentiate between\n", "fraudulent and genuine transactions, we take the data-imputation of section <a href='#imputation'>3.1</a> a\n", "step further and create 2 new features (columns) recording errors in the \n", "originating and\n", "destination accounts for each transaction. These new features turn out to be \n", "important in obtaining the best performance from the ML algorithm that we will\n", "finally use."], "cell_type": "markdown", "metadata": {"_cell_guid": "ee52641e-2440-4a48-b312-77369d9f2841", "_uuid": "f31f4503e62234aedc3782a64374207a1a6366a7"}}, {"source": ["X['errorBalanceOrig'] = X.newBalanceOrig + X.amount - X.oldBalanceOrig\n", "X['errorBalanceDest'] = X.oldBalanceDest + X.amount - X.newBalanceDest"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "51aac74d-bdd1-4eeb-8819-3fc0ec58ea20", "collapsed": true, "_uuid": "7ab7061e23ae18e435fd88b9f1fb418f88865d96"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "66741241-13b4-43a1-8d7c-cd7d00b0a9b0", "_uuid": "163be4b95059fc219b9544b8fc674744278b1e19"}}, {"source": ["<a id='visualization'></a>\n", "#### 5. Data visualization"], "cell_type": "markdown", "metadata": {"_cell_guid": "9e3b61cc-9119-4483-8def-39503b6a0ce8", "_uuid": "a4da30990990d41b9c2232644fb09ca310edb004"}}, {"source": ["The best way of\n", "confirming that the data contains enough information so that a ML algorithm \n", "can make strong predictions, is to try and directly visualize the \n", "differences between fraudulent and genuine transactions. 
Motivated by this\n", "principle, I visualize these differences in several ways in the plots below."], "cell_type": "markdown", "metadata": {"_cell_guid": "bb8ba23b-948a-4f01-bf85-805b12095f37", "_uuid": "111b6b0f3dce94fac8ea6552761d522561692305"}}, {"source": ["limit = len(X)\n", "\n", "def plotStrip(x, y, hue, figsize = (14, 9)):\n", "    \n", "    fig = plt.figure(figsize = figsize)\n", "    colours = plt.cm.tab10(np.linspace(0, 1, 9))\n", "    with sns.axes_style('ticks'):\n", "        ax = sns.stripplot(x = x, y = y, \\\n", "             hue = hue, jitter = 0.4, marker = '.', \\\n", "             size = 4, palette = colours)\n", "        ax.set_xlabel('')\n", "        ax.set_xticklabels(['genuine', 'fraudulent'], size = 16)\n", "        for axis in ['top','bottom','left','right']:\n", "            ax.spines[axis].set_linewidth(2)\n", "\n", "        handles, labels = ax.get_legend_handles_labels()\n", "        plt.legend(handles, ['Transfer', 'Cash out'], bbox_to_anchor=(1, 1), \\\n", "               loc=2, borderaxespad=0, fontsize = 16);\n", "    return ax"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "1676906a-a4f0-43bf-8db9-3131e4928343", "collapsed": true, "_uuid": "84a8bf0739d49fadf2fbc9dfd2ccce8fd0abc0cc"}}, {"source": ["<a id='time'></a>\n", "##### 5. 1. Dispersion over time"], "cell_type": "markdown", "metadata": {"_cell_guid": "ad15006e-1c46-4cc5-9c7d-08c3ab223e20", "_uuid": "286795bf2fd7fa5fd718adc424fce231cfff323f"}}, {"source": ["The plot below shows how the fraudulent and genuine transactions yield different \n", "fingerprints when their dispersion is viewed over time. It is clear that\n", "fraudulent transactions are more homogeneously distributed over time compared to \n", "genuine\n", "transactions. Also apparent is \n", "that CASH_OUTs outnumber TRANSFERs in genuine transactions, in contrast to \n", "a balanced distribution between them in fraudulent transactions. 
Note that\n", "the width of each \n", "'fingerprint' is set by the 'jitter' parameter in the plotStrip function above,\n", "which attempts to separate out and plot transactions\n", "occurring at the same time with different abscissae."], "cell_type": "markdown", "metadata": {"_cell_guid": "38ffb3fe-6505-4649-826a-e820226bcbf6", "_uuid": "2c0f08e24c70daf2794a4ea8857470103a0013ea"}}, {"source": ["ax = plotStrip(Y[:limit], X.step[:limit], X.type[:limit])\n", "ax.set_ylabel('time [hour]', size = 16)\n", "ax.set_title('Striped vs. homogeneous fingerprints of genuine and fraudulent \\\n", "transactions over time', size = 20);"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "955c350d-7e4d-4e74-83d9-6dedd426333a", "collapsed": true, "_uuid": "a63848e6a25238259d38a750646e99ad30dfea6a"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "f2665309-126f-421c-a996-992b6ea504a2", "_uuid": "d4dbdb5d2268d4e4b840f13d60dc16b1111e1768"}}, {"source": ["<a id='amount'></a>\n", "##### 5. 2. 
Dispersion over amount"], "cell_type": "markdown", "metadata": {"_cell_guid": "eb256ef3-e56f-4bca-8308-1af5290bf98d", "_uuid": "1562b2d9d6cbff930b1505301d995850ddef7b4c"}}, {"source": ["The two plots below show that although the presence of fraud in a transaction\n", "can be discerned by the original *amount* feature, the new\n", "*errorBalanceDest* feature is more effective at making a distinction."], "cell_type": "markdown", "metadata": {"_cell_guid": "2ebd6fa3-c63d-4115-ac81-7baf2fd74c02", "_uuid": "6268491fc5cf2f438b44d082bcadc463dd4736fa"}}, {"source": ["limit = len(X)\n", "ax = plotStrip(Y[:limit], X.amount[:limit], X.type[:limit], figsize = (14, 9))\n", "ax.set_ylabel('amount', size = 16)\n", "ax.set_title('Same-signed fingerprints of genuine \\\n", "and fraudulent transactions over amount', size = 18);"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "6a96c405-0fe5-4e5d-82e7-6e98400e579b", "collapsed": true, "_uuid": "0bcad3a83d79c0852a18992b1a7d996c1ce7dbd4"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "135942c5-8f86-4272-9eab-f611ea1ec0f7", "_uuid": "999e4e0f6f76825860d0449aebad58dffed9b325"}}, {"source": ["<a id='error'></a>\n", "##### 5. 3. 
Dispersion over error in balance in destination accounts"], "cell_type": "markdown", "metadata": {"_cell_guid": "eeddcb11-d295-4966-ab81-c956c4654413", "_uuid": "9a1a04c20c8b61037d3686064823bafdbf1a7f35"}}, {"source": ["limit = len(X)\n", "ax = plotStrip(Y[:limit], - X.errorBalanceDest[:limit], X.type[:limit], \\\n", " figsize = (14, 9))\n", "ax.set_ylabel('- errorBalanceDest', size = 16)\n", "ax.set_title('Opposite polarity fingerprints over the error in \\\n", "destination account balances', size = 18);"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"scrolled": false, "_cell_guid": "d6698020-c416-4c48-b5d9-af8ee81b355a", "collapsed": true, "_uuid": "b4736a82a85d7a7ca5527a6e16f588e7f843aaa7"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "4f23c844-1396-48c4-a61f-7e936effc68f", "_uuid": "36545d44b71167ff24c410e7f97a40222e2b598f"}}, {"source": ["<a id='separation'></a>\n", "##### 5. 4. Separating out genuine from fraudulent transactions"], "cell_type": "markdown", "metadata": {"_cell_guid": "9a1a6267-2d03-4217-8ddd-2e000f10d6b1", "_uuid": "edc74751225b04027fded2411ac9d5d61c96e9a0"}}, {"source": ["The 3D plot below distinguishes best between fraud and non-fraud data\n", "by using both of the engineered error-based features. Clearly, the\n", "original *step* feature is ineffective in separating out fraud. 
Note\n", "the striped nature of the genuine data vs time which was anticipated\n", "from the figure in section <a href='#time'>5.1</a>."], "cell_type": "markdown", "metadata": {"_cell_guid": "e5eb1a5d-71e6-4dc5-9f8d-c7ac3ed3e31f", "_uuid": "8c0e7c0fffd13e924cfe7d1558b8d5962e4913ce"}}, {"source": ["# Long computation in this cell (~2.5 minutes)\n", "x = 'errorBalanceDest'\n", "y = 'step'\n", "z = 'errorBalanceOrig'\n", "zOffset = 0.02\n", "limit = len(X)\n", "\n", "sns.reset_orig() # prevent seaborn from over-riding mplot3d defaults\n", "\n", "fig = plt.figure(figsize = (10, 12))\n", "ax = fig.add_subplot(111, projection='3d')\n", "\n", "ax.scatter(X.loc[Y == 0, x][:limit], X.loc[Y == 0, y][:limit], \\\n", " -np.log10(X.loc[Y == 0, z][:limit] + zOffset), c = 'g', marker = '.', \\\n", " s = 1, label = 'genuine')\n", " \n", "ax.scatter(X.loc[Y == 1, x][:limit], X.loc[Y == 1, y][:limit], \\\n", " -np.log10(X.loc[Y == 1, z][:limit] + zOffset), c = 'r', marker = '.', \\\n", " s = 1, label = 'fraudulent')\n", "\n", "ax.set_xlabel(x, size = 16); \n", "ax.set_ylabel(y + ' [hour]', size = 16); \n", "ax.set_zlabel('- log$_{10}$ (' + z + ')', size = 16)\n", "ax.set_title('Error-based features separate out genuine and fraudulent \\\n", "transactions', size = 20)\n", "\n", "plt.axis('tight')\n", "ax.grid(1)\n", "\n", "noFraudMarker = mlines.Line2D([], [], linewidth = 0, color='g', marker='.',\n", " markersize = 10, label='genuine')\n", "fraudMarker = mlines.Line2D([], [], linewidth = 0, color='r', marker='.',\n", " markersize = 10, label='fraudulent')\n", "\n", "plt.legend(handles = [noFraudMarker, fraudMarker], \\\n", " bbox_to_anchor = (1.20, 0.38 ), frameon = False, prop={'size': 16});"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "10b656da-cfb9-4b12-a91a-0e6530b77730", "collapsed": true, "_uuid": "65cd616c7035047d3e960cda1fe20f046393c19c"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": 
{"_cell_guid": "d146f9e0-9838-4651-9d5c-effda2383e6c", "_uuid": "51cf0645b33f79289623c119f09ae27dc577d516"}}, {"source": ["<a id='correlation'></a>\n", "##### 5. 5. Fingerprints of genuine and fraudulent transactions"], "cell_type": "markdown", "metadata": {"_cell_guid": "a4dc25e0-d621-4b5f-aa90-53835f5d08da", "_uuid": "8b699e030a7a53ecf3aacb931fcc3dc312d6513a"}}, {"source": ["Smoking gun and comprehensive evidence embedded in the dataset of the \n", "difference between fraudulent\n", "and genuine transactions is obtained by examining their respective\n", "correlations in the heatmaps below."], "cell_type": "markdown", "metadata": {"_cell_guid": "adc9e2e6-9877-4990-8300-8a6ceca6acaa", "_uuid": "702684862a956dccae5959eb95c9293364a10bb6"}}, {"source": ["Xfraud = X.loc[Y == 1] # update Xfraud & XnonFraud with cleaned data\n", "XnonFraud = X.loc[Y == 0]\n", " \n", "correlationNonFraud = XnonFraud.loc[:, X.columns != 'step'].corr()\n", "mask = np.zeros_like(correlationNonFraud)\n", "indices = np.triu_indices_from(correlationNonFraud)\n", "mask[indices] = True\n", "\n", "grid_kws = {\"width_ratios\": (.9, .9, .05), \"wspace\": 0.2}\n", "f, (ax1, ax2, cbar_ax) = plt.subplots(1, 3, gridspec_kw=grid_kws, \\\n", " figsize = (14, 9))\n", "\n", "cmap = sns.diverging_palette(220, 8, as_cmap=True)\n", "ax1 =sns.heatmap(correlationNonFraud, ax = ax1, vmin = -1, vmax = 1, \\\n", " cmap = cmap, square = False, linewidths = 0.5, mask = mask, cbar = False)\n", "ax1.set_xticklabels(ax1.get_xticklabels(), size = 16); \n", "ax1.set_yticklabels(ax1.get_yticklabels(), size = 16); \n", "ax1.set_title('Genuine \\n transactions', size = 20)\n", "\n", "correlationFraud = Xfraud.loc[:, X.columns != 'step'].corr()\n", "ax2 = sns.heatmap(correlationFraud, vmin = -1, vmax = 1, cmap = cmap, \\\n", " ax = ax2, square = False, linewidths = 0.5, mask = mask, yticklabels = False, \\\n", " cbar_ax = cbar_ax, cbar_kws={'orientation': 'vertical', \\\n", " 'ticks': [-1, -0.5, 0, 0.5, 1]})\n", 
"ax2.set_xticklabels(ax2.get_xticklabels(), size = 16); \n", "ax2.set_title('Fraudulent \\n transactions', size = 20);\n", "\n", "cbar_ax.set_yticklabels(cbar_ax.get_yticklabels(), size = 14);"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "66d77c97-6da1-48f5-83ca-ef3412e2d2bd", "collapsed": true, "_uuid": "a9c3802f68e765be4450b6f48178e99c0ec9bf68"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "efe41a6c-472d-4135-b759-6e929dcb1bfd", "_uuid": "4473f9e3365ae8dadc0538c2451b6bcf37abaca4"}}, {"source": ["<a id='ML'></a>\n", "#### 6. Machine Learning to Detect Fraud in Skewed Data"], "cell_type": "markdown", "metadata": {"_cell_guid": "3f2ee030-3b87-4d26-b635-c1eeb7a4fd20", "_uuid": "12707e17588726f8c19aa299cad9bf910ab26340"}}, {"source": ["Having obtained evidence from the plots above that the data now contains \n", "features that\n", "make fraudulent transactions clearly \n", "detectable, the remaining obstacle for training a robust ML model is the highly \n", "imbalanced\n", "nature of the data."], "cell_type": "markdown", "metadata": {"_cell_guid": "20171a3c-f922-424d-84ac-86a3805ee540", "_uuid": "7cc3d6277f20cdd9e3f4c8ff7d76070a087e57ca"}}, {"source": ["print('skew = {}'.format( len(Xfraud) / float(len(X)) ))"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "1de6d509-4ea6-4aec-a465-b06bc2beafa8", "collapsed": true, "_uuid": "3696671c29b0d206491e2405650cd0830a86997c"}}, {"source": ["*Selection of metric*: \n", "Since the data is highly skewed, I use the area under the precision-recall curve (AUPRC) rather than the conventional area under the receiver operating characteristic (AUROC). 
This is because the AUPRC is more sensitive than the AUROC to differences between algorithms and their parameter settings (see <a href='http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf'>Davis and Goadrich, 2006</a>)."], "cell_type": "markdown", "metadata": {"_cell_guid": "6e3d45e6-5bf7-4c97-845f-3a3bce01252c", "_uuid": "ac08bdf806601f6b0921d60c2a37aa58f4798b82"}}, {"source": ["*Selection of ML algorithm*: A first approach to deal with imbalanced data is to balance it by discarding most of the majority class before applying an ML algorithm. The disadvantage of undersampling is that a model trained in this way will not perform well on real-world skewed test data since almost all the information was discarded. A better approach might be to oversample the minority class, say by the synthetic minority oversampling technique (SMOTE) contained in the 'imblearn' library. Motivated by this, I tried a variety of anomaly-detection and supervised learning approaches. I find, however, that the best result is obtained on the original dataset by using an ML algorithm based on ensembles of decision trees that intrinsically performs well on imbalanced data. Such algorithms not only allow for constructing a model that can cope with the missing values in our data, but they naturally allow for speedup via parallel-processing. Among these algorithms, the extreme gradient-boosted (XGBoost) algorithm used below slightly outperforms random-forest. 
Finally, XGBoost, like several other ML algorithms, allows for weighting the positive class more heavily than the negative class, a setting that also helps account for the skew in the data."], "cell_type": "markdown", "metadata": {"_cell_guid": "4fd4a2b9-6263-4ea1-af2f-c48ede10185d", "_uuid": "9f13c320f7146a08509f7edeb5d2b885057d6de4"}}, {"source": ["Split the data into training and test sets in an 80:20 ratio"], "cell_type": "markdown", "metadata": {"_cell_guid": "f8e836d8-94e9-44eb-ac73-27260555583f", "_uuid": "cbbd628b5ec1f1bc7346be24e27b8d02f8206cd1"}}, {"source": ["trainX, testX, trainY, testY = train_test_split(X, Y, test_size = 0.2, \\\n", " random_state = randomState)"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "d42493c5-061b-4b5e-8a51-ac4b77ddb7e8", "collapsed": true, "_uuid": "86740e8f7845c4eb12f979f5e7af87be19022110"}}, {"source": ["# Long computation in this cell (~1.8 minutes)\n", "weights = (Y == 0).sum() / (1.0 * (Y == 1).sum())\n", "clf = XGBClassifier(max_depth = 3, scale_pos_weight = weights, \\\n", " n_jobs = 4)\n", "probabilities = clf.fit(trainX, trainY).predict_proba(testX)\n", "print('AUPRC = {}'.format(average_precision_score(testY, \\\n", " probabilities[:, 1])))"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "7a1c6c5c-564d-4f10-9b48-02d2e287d82f", "collapsed": true, "_uuid": "bd071708fd0b0b93fdbd973fce9126f91f8dd871"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "c4cd5e93-8038-4cba-a48d-f3e227db1f54", "_uuid": "ed0275e8371cc951382844a61def5a9c994e4d78"}}, {"source": ["<a id='importance'></a>\n", "##### 6.1. What are the important features for the ML model?\n", "The figure below shows that the new feature *errorBalanceOrig* that we created is the most relevant feature for the model. 
The features are ordered based on the number of samples affected by splits on those features."], "cell_type": "markdown", "metadata": {"_cell_guid": "6fa793ac-7f64-43cb-aaaf-6851c04d392f", "_uuid": "94c7e03bafa5665d824d279c1786e5325667ec4a"}}, {"source": ["fig = plt.figure(figsize = (14, 9))\n", "ax = fig.add_subplot(111)\n", "\n", "colours = plt.cm.Set1(np.linspace(0, 1, 9))\n", "\n", "ax = plot_importance(clf, height = 1, color = colours, grid = False, \\\n", " show_values = False, importance_type = 'cover', ax = ax);\n", "for axis in ['top','bottom','left','right']:\n", " ax.spines[axis].set_linewidth(2)\n", " \n", "ax.set_xlabel('importance score', size = 16);\n", "ax.set_ylabel('features', size = 16);\n", "ax.set_yticklabels(ax.get_yticklabels(), size = 12);\n", "ax.set_title('Ordering of features by importance to the model learnt', size = 20);"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "e9af0368-f772-4170-b827-4ef4ccb1faca", "collapsed": true, "_uuid": "b35967bb8e6fb4beb337404eca84963632461b3a"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "22e04a16-6c8a-4d38-8ceb-cb1dfd3ff678", "_uuid": "d8214340ed92a6a3f0d26a56d48d3f6590caee5a"}}, {"source": ["<a id='decision-tree'></a>\n", "##### 6.2. 
Visualization of ML model\n", "The root node\n", "in the decision tree visualized below is indeed\n", "the feature *errorBalanceOrig*, \n", "as would be expected from its high significance to the\n", "model."], "cell_type": "markdown", "metadata": {"_cell_guid": "ae38a6b4-c275-4f8a-a4aa-6a4d91310d8d", "_uuid": "1544cb3d28eccc1b87ddb997a5bfbc262cc94b03"}}, {"source": ["to_graphviz(clf)"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "04396421-7f99-48fe-9cfb-46ddfbd7e5cc", "collapsed": true, "_uuid": "e1a97efd238503a62344a85123dc40ca14bd74ae"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "8fd6c0a7-fcf8-4156-bfc3-b5750a14ab4d", "_uuid": "de5a2f6830171ff4eb6a40b988889b3f3c5481ac"}}, {"source": ["<a id='learning-curve'></a>\n", "##### 6.3. Bias-variance tradeoff"], "cell_type": "markdown", "metadata": {"_cell_guid": "2c86eea7-b47b-41ab-a017-dbfae8d786a7", "_uuid": "3f76f0347b415c84510e13acb215137ecce33ab5"}}, {"source": ["The model we have learnt has a degree of bias and is slightly underfit. This is indicated by the levelling in AUPRC as the size of the training set is increased in the cross-validation curve below. The easiest way to improve the performance of the model still further is to increase the *max_depth* parameter of the XGBClassifier at the expense of the longer time spent learning the model. 
Other parameters of the classifier that can be adjusted to correct for the effect of the modest underfitting include decreasing *min_child_weight* and decreasing *reg_lambda*."], "cell_type": "markdown", "metadata": {"_cell_guid": "b803b53d-5252-41d1-b4fd-fbd7385f8c0b", "_uuid": "16cc77c6138aa736c7792c1f8d574cf627ae2c36"}}, {"source": ["# Long computation in this cell (~6 minutes)\n", "\n", "trainSizes, trainScores, crossValScores = learning_curve(\\\n", "XGBClassifier(max_depth = 3, scale_pos_weight = weights, n_jobs = 4), trainX,\\\n", " trainY, scoring = 'average_precision')"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "b88b43b8-89fe-4cee-819f-b6b46361571b", "collapsed": true, "_uuid": "e4798a9cbd5bfa22d0f8ba16b3854e98e7e92039"}}, {"source": ["trainScoresMean = np.mean(trainScores, axis=1)\n", "trainScoresStd = np.std(trainScores, axis=1)\n", "crossValScoresMean = np.mean(crossValScores, axis=1)\n", "crossValScoresStd = np.std(crossValScores, axis=1)\n", "\n", "colours = plt.cm.tab10(np.linspace(0, 1, 9))\n", "\n", "fig = plt.figure(figsize = (14, 9))\n", "plt.fill_between(trainSizes, trainScoresMean - trainScoresStd,\n", " trainScoresMean + trainScoresStd, alpha=0.1, color=colours[0])\n", "plt.fill_between(trainSizes, crossValScoresMean - crossValScoresStd,\n", " crossValScoresMean + crossValScoresStd, alpha=0.1, color=colours[1])\n", "plt.plot(trainSizes, trainScores.mean(axis = 1), 'o-', label = 'train', \\\n", " color = colours[0])\n", "plt.plot(trainSizes, crossValScores.mean(axis = 1), 'o-', label = 'cross-val', \\\n", " color = colours[1])\n", "\n", "ax = plt.gca()\n", "for axis in ['top','bottom','left','right']:\n", " ax.spines[axis].set_linewidth(2)\n", "\n", "handles, labels = ax.get_legend_handles_labels()\n", "plt.legend(handles, ['train', 'cross-val'], bbox_to_anchor=(0.8, 0.15), \\\n", " loc=2, borderaxespad=0, fontsize = 16);\n", "plt.xlabel('training set size', size = 16); \n", "plt.ylabel('AUPRC', 
 size = 16)\n", "plt.title('Learning curves indicate slightly underfit model', size = 20);"], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"_cell_guid": "ca081d48-0036-4f48-a15d-80cb56e982a7", "collapsed": true, "_uuid": "b422b057cef6b2cced3c65faac5182232350bf70"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "cd22e56b-b8b3-40d9-94f4-51dc715e9223", "_uuid": "81f70a751db2e89b9aeda667d09f3fd7c332a0d1"}}, {"source": ["<a id='conclusion'></a>\n", "#### 7. Conclusion"], "cell_type": "markdown", "metadata": {"_cell_guid": "9bc93ea2-99cf-45ad-96e8-de74b99069b2", "_uuid": "8ecccea57c03137cdd21bc4cc88dcb972c918507"}}, {"source": ["We thoroughly interrogated the data at the outset to gain insight into which features could be discarded and those which could be valuably engineered. The plots provided visual confirmation that the data could indeed be discriminated with the aid of the new features. To deal with the large skew in the data, we chose an appropriate metric and used an ML algorithm based on an ensemble of decision trees which works best with strongly imbalanced classes. The method used in this kernel should therefore be broadly applicable to a range of such problems.\n", "\n", "*Acknowledgements*: Thanks to Edgar Alonso Lopez-Rojas for posting this dataset.\n", "\n", "*Hope you enjoyed reading this kernel as much as I had fun writing it. Please feel free to fork, upvote, and leave your comments to make my day* :-)"], "cell_type": "markdown", "metadata": {"_cell_guid": "d3aed614-e781-4b7b-b3ed-d51b0e90acc5", "_uuid": "865f045bae5f6570f9c0a430fbef484a73606a74"}}, {"source": ["<a href='#top'>back to top</a>"], "cell_type": "markdown", "metadata": {"_cell_guid": "9a61ccc1-8b89-46e4-93e0-fa50826c4c66", "_uuid": "1d03269b9fa12761dc250cf307899b2f13700072"}}], "nbformat_minor": 1, "nbformat": 4}