Client Background

Client: A Leading Tech Firm in the USA

Industry Type: IT Consulting

Services: Software, Consulting

Organization Size: 100+

Project Objective

  1. Classify the medical research paper into 0 if the medical research paper cannot be used in future medical research and 1 if the medical research paper can be used in research based on some research-related phrases.
  2. Train an ML/DL model on classified data.

Project Description 

We have given an excel sheet of medical research paper text and provided some phrases to identify research papers that can be used for future medical research. If the phrase is not present in a research paper then it will not be used for research. After annotation, we need to find the best ML/DL model to train research data and evaluate the model on test data.

Our Solution

We have created a python script that can compare all medical research paper text to research phrases and annot 0 if research phrases are not present in a medical research paper and 1 if research phrases present in medical research paper. 

After annotation we have trained different machine learning and deep learning models like Bert base uncased using Tensorflow, bert large, XGBoost Classifier, Random Forest Classifier and Logistic Regression. Among these models we have chosen the best accuracy  parameters model. In our case the bert-base model performed good and gave 95% test accuracy. 

Project Deliverables

ML/DL model which is trained on medical research classification data to classify other medical research papers.

Tools used

Google Colab notebooks, Tensorflow, PyTorch, Transformers, MS Excel

Language/techniques used

Python, Machine learning, Deep learning, Data Science, Natural Language Processing (NLP).

Models used

Tensorflow-Bert model, PyTorch LSTM model, Random Forest Classifier, XGBoost Classifier, Logistic Regression.  

Skills used

Machine Learning, Deep learning, NLP, Python programming. 

Databases used

used ms excel data

What are the technical Challenges Faced during Project Execution

There are various technical challenges faced during project execution:

  1. The research paper has a huge amount of text data so the model was giving space errors in colab notebooks.
  2. Find the best threshold value which gives best test accuracy. 

How the Technical Challenges were Solved

  1. To solve space error we have trained the model with lower batch size so this solved the error.
  2. To find the best threshold value we created the ROC AUC curve and Precision  Recall curve and checked best points where accuracy will be higher.