Client Background
Client: A leading financial institution in the USA
Industry Type: Financial services & Consulting
Services: Financial consultant
Organization Size: 100+
Project Objective
>To process two JSON files: stocktwits_legacy_msg_2015_10.txt (file size = 2 GB) and stocktwits_legacy_msg_2015_10.txt (file size = 3.5 GB).
>To handle the nested JSON in both files and, after converting them into one merged DataFrame, to structure the data.
>To read the JSON files in Jupyter Notebook in chunks, because the files are large enough to freeze the PC if loaded all at once.
>To perform Exploratory Data Analysis on the data after preprocessing.
>To use conditional programming to transfer data to a particular folder based on column values.
Project Description
During the training period I worked on two live projects. One of them, ‘Stocktwits Data Structurization’, required processing a large JSON dataset that had already been collected. The data was nearly 5 GB, so it had to be read in chunks of 20,000 rows at a time. The files contain nested JSON within their attributes, so I extracted the data from the nested columns into a new DataFrame, including the more complex columns that were themselves produced from nested JSON. Missing data was then handled by mapping it against another index dataset; the remaining missing values for certain attributes were filled with the mean value or with 0. The task involved numerous pandas operations and multiple Python functions. Finally, I performed Exploratory Data Analysis on the cleaned dataset, computing a correlation matrix and plotting seaborn graphs for strongly correlated attributes.
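The missing-data step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`symbol_id`, `symbol`, `followers`, `likes`), the lookup table and the helper `fill_missing` are all hypothetical, chosen only to show mapping from an index dataset followed by mean and 0 substitution.

```python
import pandas as pd


def fill_missing(df, lookup, key, mapped_col, mean_cols, zero_cols):
    """Fill gaps using a lookup table, then mean and zero substitution."""
    out = df.copy()
    # Map missing values from the lookup (index) dataset, matched on the key.
    mapping = lookup.drop_duplicates(key).set_index(key)[mapped_col]
    out[mapped_col] = out[mapped_col].fillna(out[key].map(mapping))
    # Numeric columns where the column mean is an acceptable stand-in.
    for col in mean_cols:
        out[col] = out[col].fillna(out[col].mean())
    # Count-like columns where a missing value is treated as 0.
    for col in zero_cols:
        out[col] = out[col].fillna(0)
    return out


# Hypothetical sample data (not taken from the Stocktwits files).
messages = pd.DataFrame({
    "symbol_id": [1, 2, 1, 3],
    "symbol": ["AAPL", None, None, "TSLA"],
    "followers": [10.0, None, 30.0, 20.0],
    "likes": [None, 2.0, None, 1.0],
})
symbol_index = pd.DataFrame({
    "symbol_id": [1, 2, 3],
    "symbol": ["AAPL", "MSFT", "TSLA"],
})

cleaned = fill_missing(
    messages, symbol_index,
    key="symbol_id", mapped_col="symbol",
    mean_cols=["followers"], zero_cols=["likes"],
)
print(cleaned)
```

Which attributes get the mean and which get 0 is a per-column decision; the split shown here is only an example.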
Our Solution
Accessed the JSON data and performed a tree analysis on a sample of it.
Both files were too big to read and process in Jupyter Notebook at once, so stocktwits_legacy_messages_2015_10.txt was read in chunks of 20,000 rows at a time. The same approach was tried for the other file.
Created a list of all the chunks of JSON data and concatenated them.
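A minimal sketch of the chunk-and-concatenate step, assuming the file is line-delimited JSON (one object per line), which is what pandas requires for chunked reading. The helper name `read_in_chunks` and the small temporary sample file are made up for illustration; the real input would be the Stocktwits file.

```python
import json
import os
import tempfile

import pandas as pd


def read_in_chunks(path, chunk_size=20000):
    """Read a line-delimited JSON file chunk by chunk into one DataFrame."""
    chunks = []
    # With lines=True and chunksize set, read_json yields DataFrames
    # of at most chunk_size rows instead of loading the whole file.
    with pd.read_json(path, lines=True, chunksize=chunk_size) as reader:
        for chunk in reader:
            chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)


# Hypothetical sample file standing in for the real data.
records = [{"id": i, "body": f"message {i}", "user": {"id": 100 + i}}
           for i in range(5)]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")
    sample_path = fh.name

df = read_in_chunks(sample_path, chunk_size=2)
os.remove(sample_path)
print(df.shape)
```

Collecting every chunk and concatenating still ends with the full table in memory, so in practice each chunk would usually be reduced (columns dropped, nested fields flattened) before it is appended.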
The file has nested JSON data within its attributes, so the data from the nested columns was extracted into a new DataFrame. This included the complex columns that were themselves extracted from nested JSON.
Renamed the columns with an identifying prefix (e.g. ‘id’ became ‘entities_id’, and likewise for the others) so that merging the data does not create name conflicts. Completed the preprocessed CSV file for the first JSON file, which is Output2015.csv.
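The extraction and renaming described above might look something like this. It is a sketch under assumptions: the helper `flatten_column` and the sample columns (`user`, `entities`) are hypothetical, and the real files may nest their attributes differently.

```python
import pandas as pd


def flatten_column(df, column):
    """Expand a column of dicts into '<column>_<key>' columns."""
    # Rows without a dict become empty dicts so every row yields a record.
    records = [v if isinstance(v, dict) else {} for v in df[column]]
    flat = pd.json_normalize(records, sep="_").add_prefix(f"{column}_")
    flat.index = df.index
    # Replace the nested column with its prefixed, flattened columns.
    return df.drop(columns=[column]).join(flat)


# Hypothetical sample rows with two nested attributes.
messages = pd.DataFrame({
    "id": [1, 2],
    "body": ["$AAPL up", "$TSLA down"],
    "user": [{"id": 10, "username": "a"}, {"id": 11, "username": "b"}],
    "entities": [{"id": 7, "sentiment": {"basic": "Bullish"}}, None],
})

for col in ("user", "entities"):
    messages = flatten_column(messages, col)

print(list(messages.columns))
```

Prefixing each flattened column with the name of its parent keeps the three different `id` fields apart, which is the conflict the renaming step is meant to avoid.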
The second file was larger than 3 GB, so it was split into ten parts. The nested JSON in each part was resolved individually, as in the first file, and the parts were then concatenated into one. After that, the columns were rearranged, unwanted columns were removed, and the dictionary representation was removed from the entity_sentiments column. Completed the preprocessed CSV file for the second JSON file, which is Output_Stocktwits_2017.csv.
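One way to split a large file into parts without loading it into memory is sketched below. It assumes the file holds one JSON object per line; the helper `split_lines`, the part file names and the sample data are hypothetical, and the project may have split the file by other means.

```python
import os
import tempfile


def split_lines(path, parts, out_dir):
    """Split a line-delimited file into at most `parts` files."""
    os.makedirs(out_dir, exist_ok=True)
    # First pass only counts lines, so the file never has to fit in memory.
    with open(path, "r", encoding="utf-8") as src:
        total = sum(1 for _ in src)
    per_part = max(1, -(-total // parts))  # ceiling division
    out_paths = []
    out = None
    with open(path, "r", encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % per_part == 0:
                # Start a new part file every per_part lines.
                if out is not None:
                    out.close()
                out_path = os.path.join(
                    out_dir, f"part_{i // per_part:02d}.txt")
                out_paths.append(out_path)
                out = open(out_path, "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return out_paths


# Hypothetical 20-line sample file split into ten parts.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "sample.txt")
    with open(source, "w", encoding="utf-8") as fh:
        fh.writelines(f'{{"id": {i}}}\n' for i in range(20))
    part_paths = split_lines(source, parts=10,
                             out_dir=os.path.join(tmp, "parts"))
    part_sizes = []
    for part in part_paths:
        with open(part, encoding="utf-8") as fh:
            part_sizes.append(sum(1 for _ in fh))

print(part_sizes)
```

Each part can then go through the same nested-JSON handling as the first file before the results are concatenated.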
Performed Exploratory Data Analysis on the cleaned dataset, computing a correlation matrix and plotting seaborn graphs for strongly correlated attributes. Used conditional programming to transfer data to a particular folder based on column values.
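The two steps above can be illustrated with a short sketch. Everything named here is an assumption made for the example: the helpers `strong_pairs` and `export_by_value`, the 0.7 threshold, the sample columns, and the folder layout (one folder per distinct value of a column).

```python
import os
import tempfile

import pandas as pd


def strong_pairs(df, threshold=0.7):
    """Return (col_a, col_b, r) for numeric pairs with |r| >= threshold."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs


def export_by_value(df, column, base_dir, filename="messages.csv"):
    """Write each group of rows to base_dir/<value>/<filename>."""
    written = {}
    for value, group in df.groupby(column):
        folder = os.path.join(base_dir, str(value))
        os.makedirs(folder, exist_ok=True)
        target = os.path.join(folder, filename)
        group.to_csv(target, index=False)
        written[value] = target
    return written


# Hypothetical cleaned data.
data = pd.DataFrame({
    "followers": [10, 20, 30, 40],
    "likes": [1, 2, 3, 4],
    "ideas": [5, 1, 4, 2],
    "sentiment": ["Bullish", "Bearish", "Bullish", "Bearish"],
})

strong = strong_pairs(data)
with tempfile.TemporaryDirectory() as tmp:
    written = export_by_value(data, "sentiment", tmp)
    row_counts = {value: len(pd.read_csv(path))
                  for value, path in written.items()}

print(strong, row_counts)
```

The pairs returned by `strong_pairs` are the natural candidates for the seaborn plots, and the full correlation matrix can be passed to `seaborn.heatmap` for an overview.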
Project Deliverables
- Categorized Preprocessed CSV Files
- Python Script
- IPython notebook with comments on each code cell.
Tools used
● Jupyter Notebook
● Anaconda
● Notepad++
● Sublime Text
● Brackets
● JsonViewer
Language/techniques used
● Python Programming
Models used
My project ‘Stocktwits Data Structurization’ was developed following a software model intended to keep the project high quality, reliable and cost effective.
● Software Model: RAD (Rapid Application Development) model
● This project follows the RAD model because the work did not loop back from the end to the start, and it was based on prototyping without detailed upfront planning. In the RAD model, less attention is paid to planning and more priority is given to development tasks. It targets developing software in a short span of time.
● Advantages of RAD Model:
o Changing requirements can be accommodated.
o Progress can be measured.
o Iteration time can be short with use of powerful RAD tools.
o Productivity with fewer people in a short time.
o Reduced development time.
o Increases reusability of components.
o Quick initial reviews occur.
o Encourages customer feedback.
o Integration from the very beginning solves a lot of integration issues.
Skills used
● Data Mining
● Data Wrangling
● Data Visualization
● Python Programming including OOP and Exception Handling
Databases used
No databases were used; all the data was stored on Google Drive and on a local device.
Web Cloud Servers used
No cloud servers were used.
What are the technical Challenges Faced during Project Execution
● Handling Huge Data and Data Cleaning
● JSON Data Serialization.
● Solving Complex Nested JSON among the data provided.
How the Technical Challenges were Solved
● Handling Huge Data and Data Cleaning
Solved by breaking the dataset into 10 parts, as the data was too large to read easily in Jupyter Notebook.
● JSON Data Serialization
Solved by data chunking with chunk_size=20000, which means the data is read and processed 20,000 rows at a time.
● Solving Complex Nested JSON among the data provided.
Viewed the structure of a part of the data in JSON Viewer, then converted the data into proper standard JSON format. After reading the JSON data, normalized the nested JSON, setting the maximum level of normalization and specifying the proper orient. The nested JSON that remained unresolved after normalization was resolved using dictionary conversions and restructuring of the data.
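The normalization approach described for the nested JSON can be sketched as follows. This is an illustration under assumptions, not the project's code: the sample records, the field names and the helper `to_dict` are hypothetical. It uses `pandas.json_normalize` with `max_level` to flatten a fixed number of levels, then converts whatever is still nested with plain dictionary handling.

```python
import ast
import json

import pandas as pd

# Hypothetical records in standard JSON format.
raw = """[
  {"id": 1, "user": {"id": 10, "stats": {"followers": 5}},
   "entities": {"sentiment": {"basic": "Bullish"}}},
  {"id": 2, "user": {"id": 11, "stats": {"followers": 9}},
   "entities": {"sentiment": null}}
]"""
records = json.loads(raw)

# max_level=1 flattens one level of nesting; deeper dicts stay as dicts.
flat = pd.json_normalize(records, max_level=1, sep="_")


def to_dict(value):
    """Return a dict for dict values or dict-like strings, else {}."""
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        # Dicts saved to CSV come back as strings such as "{'a': 1}".
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return {}
        return parsed if isinstance(parsed, dict) else {}
    return {}


# Resolve the nesting that normalization left behind.
flat["user_followers"] = flat["user_stats"].map(
    lambda v: to_dict(v).get("followers"))
flat["entities_sentiment"] = flat["entities_sentiment"].map(
    lambda v: to_dict(v).get("basic"))
flat = flat.drop(columns=["user_stats"])

print(flat)
```

Handling strings in `to_dict` matters once intermediate results have been written to CSV, because a dictionary column is read back as text rather than as a dict.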
Project Snapshots
Figure 1 Sample Input Dataframe After Converting Outer JSON
Figure 2 Sample Output Dataframe After Solving Nested JSON and Data Preprocessing