Prediction using Orange (.ows) on Loan Status.

Analyzing the Factors and Requisites that can influence the loan status and finally classify whether the person paid the loan or is charged off.

Aryan Bajaj
4 min read · Nov 16, 2020
Orange

This project analyzed a loan dataset under the given parameters to classify whether a person has fully paid the loan or has been charged off. The analysis was carried out in Orange, using a wide variety of its widgets to arrive at that conclusion.

Firstly, the .CSV file was loaded and the target column, LOAN STATUS, was selected. Then the Rank widget (from the Data category) was used, as ranking gives a quick sense of which features matter most for a given target. The top 11 features were then selected according to their ranks.
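The article does this through Orange's Rank widget; a rough equivalent in code, sketched here with scikit-learn on synthetic stand-in data (the real loan columns are not reproduced), scores each feature against the target and keeps the 11 best-ranked ones:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for the loan dataset: 20 candidate features,
# a binary target (Fully Paid vs. Charged Off).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Score every feature against the target, much as the Rank widget does
# (here using mutual information as the ranking criterion).
scores = mutual_info_classif(X, y, random_state=0)

# Keep the 11 best-ranked features, mirroring the article's choice.
top11 = np.argsort(scores)[::-1][:11]
X_selected = X[:, top11]
print(X_selected.shape)  # (500, 11)
```

Orange's Rank widget offers several scoring methods (information gain, gain ratio, Gini, and others); mutual information is just one reasonable choice.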

Next, the data was checked with the Data Table widget, where it was observed that 8.3% of the values were missing. The missing values therefore needed to be filled in with the Impute widget, using the mean of each numeric column and the mode of each categorical one.
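In code, the same mean/mode imputation can be sketched with scikit-learn's `SimpleImputer` (the column names below are hypothetical stand-ins for the loan data's fields):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing values, standing in for the loan data.
df = pd.DataFrame({
    "Credit Score": [720, np.nan, 680, 700, np.nan, 750],
    "Term":         ["Short", "Long", np.nan, "Short", "Long", "Long"],
})

# Mean for numeric columns, mode (most frequent) for categorical ones.
df["Credit Score"] = SimpleImputer(strategy="mean").fit_transform(
    df[["Credit Score"]]).ravel()
df["Term"] = SimpleImputer(strategy="most_frequent").fit_transform(
    df[["Term"]]).ravel()

print(df.isna().sum().sum())  # 0 — no missing values remain
```

Orange's Impute widget lets you set a per-column strategy in the same spirit; the choice of mean for numbers and mode for categories matches what the article describes.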

The classification task was then performed with two different models, which were evaluated using Test & Score.

The two models used were:

  1. Combination of Naive Bayes & Decision Tree
  2. Random Forest

- Naive Bayes & Decision Tree were used because:

The Naive Bayes model can deal with both continuous and discrete data. It scales well with the number of predictors and data points, it is fast enough to be used for real-time predictions, and it is not sensitive to irrelevant features.

The Decision Tree can handle both numerical and categorical prediction problems. Its drawback is that it tends to overfit the data. However, overfitting can be avoided with a pre-pruning approach, for instance by growing a tree with fewer leaves and branches.
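The pre-pruning idea can be made concrete in code: a tree is simply capped before it grows to purity. The sketch below (scikit-learn, synthetic data; the specific `max_depth` and `min_samples_leaf` values are illustrative, not the article's settings) compares an unpruned tree with a pre-pruned one under cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)

# Unpruned tree: grows until every leaf is pure, prone to overfitting.
full = DecisionTreeClassifier(random_state=0)

# Pre-pruned tree: depth cap and a minimum leaf size give it
# fewer leaves and branches, as suggested above.
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                                random_state=0)

print(cross_val_score(full, X, y, cv=5).mean())
print(cross_val_score(pruned, X, y, cv=5).mean())
```

Orange's Tree widget exposes the same knobs ("Limit the maximal tree depth", "Do not split subsets smaller than ...") in its options panel.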

The combination of Naive Bayes & Decision Tree was used because Naive Bayes has some strengths that the Decision Tree lacks, and vice versa. For instance, Naive Bayes performs well on problems such as text classification and spam filtering, while Decision Trees suit pattern recognition, sequence analysis, and financial problems. Together they are strong.
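The article does not specify how the two learners were combined in Orange; one common way to realize such a combination in code is a soft-voting ensemble that averages the two models' predicted probabilities. A sketch under that assumption, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Soft voting: average the class probabilities from Naive Bayes
# and a (pre-pruned) Decision Tree, then pick the higher one.
combo = VotingClassifier(
    estimators=[("nb", GaussianNB()),
                ("tree", DecisionTreeClassifier(max_depth=4, random_state=0))],
    voting="soft",
)
print(cross_val_score(combo, X, y, cv=5, scoring="roc_auc").mean())
```

An alternative reading is that the two learners were simply wired to Test & Score side by side for comparison; either way, the idea is that each model covers the other's weak spots.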

- The Random Forest model was used because:

Random Forest is a tree-based learning algorithm with the power to form accurate decisions because it combines many decision trees; as its name says, it is a forest of trees. Hence, Random Forest takes more training time than a single decision tree. Each tree in the forest works on a random subset of the features to predict the output; the algorithm then combines the predictions of all the individual trees to generate the final prediction. It can also deal with missing values.
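The same idea in code, again sketched with scikit-learn on synthetic data (Orange's Random Forest widget wraps an equivalent learner):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# A forest of 100 trees: each tree is trained on a bootstrap sample
# and considers a random subset of features at every split; the
# forest's final prediction combines all the trees' votes.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(len(forest.estimators_))  # 100 individual decision trees
```

The 100-tree default is illustrative; Orange exposes the number of trees in the widget's settings.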

After Test & Score, the Confusion Matrix widget was used to inspect the true positives, false negatives, and the other cells. Lastly, the Distributions visualization was used to examine the results.
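The Test & Score / Confusion Matrix pair can be approximated in code by generating cross-validated predictions and tabulating them against the true labels (synthetic data again, standing in for the loan set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0)

# Cross-validated predictions, as Test & Score produces in Orange,
# then the confusion matrix of true/false positives and negatives.
pred = cross_val_predict(model, X, y, cv=5)
cm = confusion_matrix(y, pred)
print(cm)  # rows: actual class, columns: predicted class
```

For the loan problem, the two classes would be Fully Paid and Charged Off, so the off-diagonal cells show the misclassified borrowers.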

Conclusion

The two models' final results turned out to be different, so an average of both results was taken (including an average of the total population, which comes to 79.25k). On that averaged result, only 5.52% of the population falls under CHARGED OFF, and the rest, i.e., 94.48% of the population, falls under FULLY PAID.

It can also be said that RANDOM FOREST is a better model than the DECISION TREE and NAIVE BAYES combination, because it has a better AUC, and AUC is a good basis for the comparison because:

  • AUC is scale-invariant, i.e., it measures how well predictions are ranked, irrespective of their absolute values.
  • AUC is classification-threshold-invariant, i.e., it measures the quality of the model's predictions irrespective of which classification threshold is chosen.
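Scale-invariance is easy to verify directly: because AUC depends only on how the scores rank the examples, any strictly increasing transform of the scores leaves it unchanged. A small check (toy labels and scores, not the loan data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# Any strictly increasing transform preserves the ranking of the
# scores, and therefore the AUC.
auc_raw = roc_auc_score(y_true, scores)
auc_scaled = roc_auc_score(y_true, 100 * scores + 7)   # rescaled
auc_monotone = roc_auc_score(y_true, np.exp(scores))   # warped

print(auc_raw == auc_scaled == auc_monotone)  # True
```

The threshold-invariance point follows from the same property: the ROC curve sweeps over all thresholds at once, so no single cutoff has to be chosen.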
Workflow screenshots, in order:

  1. Viewing the data according to the RANK and then SELECTING the data from the FILE.
  2. Looking for MISSING VALUES and then IMPUTING the data to rectify them.
  3. The table now has NO MISSING VALUES.
  4. Using two different methods to get the best possible result (1st image: the Naive Bayes & Decision Tree combination; 2nd image: the Random Forest model).
  5. The final predictions differ, so averages are taken (of the total population, of Charged Off, and of Fully Paid).
  6. Random Forest is the better model here, as it has a HIGHER AUC (Area Under the Curve).
  7. The full MODEL VIEW.

This dataset (.CSV file) is taken from Kaggle.

Filename: Credit_train

Contacts

In case you have any questions or any suggestions on what my next article should be about, please leave a comment below or mail me at aryanbajaj104@gmail.com

If you want to keep updated with my latest articles and projects, follow me on Medium.

Connect with me via:

LinkedIn

Instagram


Aryan Bajaj

Passionate about studying how to improve performance and automate tasks.