Client Background

Client: A leading financial firm in the USA

Industry Type: Financial services & Consulting

Services: Financial consulting

Organization Size: 100+

Project Objective

The objective of the project “Sentimental Analysis on Shareholder Letter of Companies” was to score the shareholder letter columns for polarity and subjectivity, and then to classify each letter as positive, negative or neutral in tone.

Project Description

The project ‘Sentimental Analysis on Shareholder Letter of US Companies’ involved cleaning the shareholder letters of different companies: lemmatization, lower case conversion, removal of special characters, \n, \t, punctuation, numbers and single characters, and tokenization. Polarity and subjectivity columns were then generated for the letter 1 and letter 2 columns using the TextBlob library, which is built on top of NLTK. Finally, each letter was categorized as positive, neutral or negative based on its polarity.
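The cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration using only Python's standard `re` module, not the project's actual notebook code: the `CONTRACTIONS` map and the function names `clean_letter` and `tokenize` are hypothetical, and lemmatization is left out because it needs an external lemmatizer (for example NLTK with its WordNet data).

```python
import re

# Hypothetical, deliberately tiny contraction map; a real one would be much larger.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "we're": "we are",
}


def clean_letter(text):
    """Return a lower-cased letter containing only words of two or more letters."""
    # Missing values (None or a float NaN) become the neutral placeholder "None".
    if text is None or (isinstance(text, float) and text != text):
        text = "None"
    text = str(text).lower()
    # Normalise curly apostrophes so the contraction map can match.
    text = text.replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Newlines, tabs and carriage returns become plain spaces.
    text = re.sub(r"[\n\t\r]+", " ", text)
    # Special characters, punctuation and numbers: keep only letters and spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Drop words that are a single character long.
    text = re.sub(r"\b[a-z]\b", " ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Whitespace tokenization of an already cleaned letter."""
    return text.split()
```

For a word count per letter, `len(tokenize(clean_letter(raw_text)))` would be enough under these assumptions.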

Our Solution

  1. Analysed the variation in letter text length.
  2. Applied contraction mapping to the dataset.
  3. Replaced missing values with a neutral-tone string such as None so that cleaning does not raise any errors.
  4. Data cleaning and preprocessing, which involved:

            i.  Lemmatisation

            ii.  Lower case conversion

            iii.  Removing special characters

            iv.  Removing \n, \t etc.

            v.  Removing punctuation, numbers and single characters

            vi.  Building the list of cleaned letters, using tqdm to track progress

  5. Tokenization and word count.
  6. Used the TextBlob library, which is built on top of NLTK, for sentiment analysis.
  7. Created polarity and subjectivity columns for the Letter1 and Letter2 columns.
  8. Based on the polarity of letter 1, created a letter1_type column with the values “positive”, “neutral” and “negative”.
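The scoring step could look roughly like the sketch below, assuming TextBlob is installed. `TextBlob(text).sentiment` returns a polarity in [-1.0, 1.0] and a subjectivity in [0.0, 1.0]. The helper names, the lower-case column names and the `_polarity` / `_subjectivity` suffixes are assumptions made for illustration, and rows are plain dictionaries here rather than a pandas DataFrame so the example stays self-contained.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def textblob_sentiment(text: str) -> Tuple[float, float]:
    """Return (polarity, subjectivity) from TextBlob's default analyzer."""
    from textblob import TextBlob  # third-party dependency, imported on first use

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def add_sentiment_columns(
    rows: Iterable[Dict[str, object]],
    text_columns: Tuple[str, ...] = ("letter1", "letter2"),  # assumed column names
    analyzer: Callable[[str], Tuple[float, float]] = textblob_sentiment,
) -> List[Dict[str, object]]:
    """Copy each row and add <column>_polarity and <column>_subjectivity keys."""
    scored = []
    for row in rows:
        new_row = dict(row)
        for column in text_columns:
            polarity, subjectivity = analyzer(str(new_row[column]))
            new_row[column + "_polarity"] = polarity
            new_row[column + "_subjectivity"] = subjectivity
        scored.append(new_row)
    return scored
```

Passing the analyzer as a parameter is a design choice of this sketch: it keeps the TextBlob call in one place and lets a different scorer be substituted without touching the column logic.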

Project Deliverables

  1. Output IPython notebook file
  2. Preprocessed Dataset

Tools used

● Jupyter Notebook

● Anaconda

● Notepad++

● Sublime Text

● Brackets

● Python 3.4

Language/techniques used

  1. Python
  2. Machine Learning
  3. NLP (Natural Language Processing)

Models used

The project ‘Sentimental Analysis on Shareholder Letter of Companies’ was developed following a software model chosen to keep the work high quality, reliable and cost effective.

● Software Model : Waterfall Model

● The project ‘Sentimental Analysis on Shareholder Letter of US Companies’ follows the Waterfall Model because its workflow is strictly sequential and never loops back from the end to the start: the data is cleaned, then TextBlob is applied to produce the sentiment, polarity and subjectivity outputs.

Skills used

  • Pandas Operations
  • Data Chunking and Integration
  • Data Visualization

Databases used

No database was used to complete this project.

Web Cloud Servers used

No web cloud server was required for this work.

What are the technical Challenges Faced during Project Execution

I had worked on similar tasks before, so no major challenges arose. The data cleaning, however, differed from earlier work and took time to complete.

How the Technical Challenges were Solved

As discussed, no technical challenges were faced during this project.

Project Snapshots 

Figure 1: Input Data Schema

Figure 2: Output Data Schema

Figure 3: Sample Input Dataset

Figure 3 is a pandas DataFrame fetched from a Google Cloud database; it has 7 columns and 13,290 rows.

Figure 4: Sample Output Dataset

Figure 4 is the output pandas DataFrame after data cleaning and sentiment modeling; it has 13 columns and 13,290 rows.

Figure 5: Sentiments assignment based on polarity

Figure 5 shows how sentiment and tone are identified from polarity and subjectivity. If polarity > 0 the sentiment type is positive, if polarity < 0 it is negative, and if polarity = 0 it is neutral.
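The rule in Figure 5 can be written as a small function. This is a sketch of the stated thresholds only; the function name `sentiment_type` is hypothetical, and the exact comparison with zero follows the description above rather than any tolerance band.

```python
def sentiment_type(polarity: float) -> str:
    """Map a polarity score to the label used in the letter1_type column."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"  # polarity is exactly 0
```

With a pandas DataFrame, the same function could be applied to a polarity column with `Series.apply` to build the letter1_type column.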

Figure 6:  Histogram Representation of Length of Shareholder Letter 1

Figure 6 is a histogram of the length of shareholder letter 1 in the final output dataset.

Figure 7:  Histogram Representation of Length of Shareholder Letter 2

Figure 7 is a histogram of the length of shareholder letter 2 in the final output dataset.
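The counts behind histograms like Figures 6 and 7 could be computed as sketched below. The source does not say whether length means words or characters, so measuring length in whitespace-separated words is an assumption, as are the function name `length_histogram` and the default bin width.

```python
from collections import Counter
from typing import Dict, Iterable


def length_histogram(letters: Iterable[str], bin_width: int = 500) -> Dict[int, int]:
    """Count letters per length bin; each key is the lower edge of a bin."""
    counts = Counter(
        (len(letter.split()) // bin_width) * bin_width  # assumed: length in words
        for letter in letters
    )
    return dict(sorted(counts.items()))
```

The resulting dictionary maps each bin's lower edge to the number of letters in that bin, which is the same information a plotted histogram of letter lengths displays.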

Figure 8: Flow Chart