When working on a machine learning project, the most important part is the dataset used to build the respective training and test sets. For this project, I looked for the models that are being used in universities and research centers for breast cancer analysis.
The project was divided into two parts:
Work with variables: share.streamlit.io/felipelx/hackathon/IDC_D..
Work with images: share.streamlit.io/felipelx/hackathon/IDC_D..
Nowadays, the best place to look for a dataset is Kaggle. For the variables model, the chosen dataset was the Breast Cancer Wisconsin (Diagnostic) Data Set.
The dataset contains 32 columns; the diagnosis column indicates whether the measurements come from a benign/normal or a malignant tumor. To decide which model to use, almost every classic classification approach was tested and the one with the highest accuracy was chosen (see the comparison sketch after the training code below).
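As a starting point, the CSV downloaded from Kaggle can be loaded and inspected with pandas. This is only a minimal sketch; the file name data.csv is an assumption, so adjust it to the actual download.
import pandas as pd

# Load the Kaggle CSV (file name is an assumption) and take a first look
df = pd.read_csv('data.csv')
print(df.shape)                        # rows x columns
print(df['diagnosis'].value_counts())  # how many 'B' (benign) vs 'M' (malignant) cases
print(df.columns[2:32])                # the 30 feature columns used for training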
Division of the dataset:
from sklearn.model_selection import train_test_split
X = df.iloc[:, 2:32]   # the 30 feature columns (id and diagnosis are excluded)
y = df['diagnosis']    # 'B' for benign, 'M' for malignant
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Checking the models:
# Using LogisticRegression of the linear_model module for the Logistic Regression algorithm
from sklearn.linear_model import LogisticRegression
classifier_reg = LogisticRegression(random_state=0)
classifier_reg.fit(X_train, y_train)

# Using KNeighborsClassifier of the neighbors module for the Nearest Neighbors algorithm
from sklearn.neighbors import KNeighborsClassifier
classifier_nei = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier_nei.fit(X_train, y_train)

# Using SVC of the svm module for the Support Vector Machine algorithm (linear kernel)
from sklearn.svm import SVC
classifier_svc_linear = SVC(kernel='linear', random_state=0)
classifier_svc_linear.fit(X_train, y_train)

# Using SVC of the svm module for the Kernel SVM algorithm (RBF kernel)
classifier_svc_rbf = SVC(kernel='rbf', random_state=0)
classifier_svc_rbf.fit(X_train, y_train)

# Using GaussianNB of the naive_bayes module for the Naive Bayes algorithm
from sklearn.naive_bayes import GaussianNB
classifier_gau = GaussianNB()
classifier_gau.fit(X_train, y_train)

# Using DecisionTreeClassifier of the tree module for the Decision Tree algorithm
from sklearn.tree import DecisionTreeClassifier
classifier_tre = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier_tre.fit(X_train, y_train)

# Using RandomForestClassifier of the ensemble module for the Random Forest algorithm
from sklearn.ensemble import RandomForestClassifier
classifier_for = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier_for.fit(X_train, y_train)
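To see which of the trained models performs best, each one can be scored on the held-out test set. The comparison sketch below is my framing (the original post only reports the outcome) and reuses the variable names from the snippets above:
from sklearn.metrics import accuracy_score

# Score every trained classifier on the held-out test set
classifiers = {
    'LogisticRegression': classifier_reg,
    'KNeighborsClassifier': classifier_nei,
    'SVC (linear)': classifier_svc_linear,
    'SVC (rbf)': classifier_svc_rbf,
    'GaussianNB': classifier_gau,
    'DecisionTreeClassifier': classifier_tre,
    'RandomForestClassifier': classifier_for,
}
for name, clf in classifiers.items():
    print(name, accuracy_score(y_test, clf.predict(X_test)))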
The results showed that the RandomForestClassifier was the most accurate, so the application was built with this model. The interaction lets the user define the 30 variables that come from a patient's breast cancer exam. With that input, the model predicts the result as a percentage and tells the user whether the combination of variables points to a benign/normal or a malignant case.
# dtc is the trained model loaded in the Streamlit app (the RandomForestClassifier chosen above)
# user_input_variables holds the 30 values entered by the user, as a single-row 2D array
prediction = dtc.predict(user_input_variables)
# target_name = ['B', 'M']; column 0 of predict_proba is the probability of class 'B'
result_cross = str((dtc.predict_proba(user_input_variables)[:, 0] * 100).round(2))
result_cross = result_cross.replace('[', '').replace(']', '')
result_cross_str = result_cross + '%'
print('result_cross', result_cross_str)

if prediction[0] == 'B':
    result = 'Benign/Normal'
else:
    result = 'Malign'

# c is a Streamlit container/column defined elsewhere in the app
c.metric("Result", result, delta=result_cross_str, delta_color='normal')
The repository with the code: github.com/felipeLx/hackathon/blob/master/p..