Client Background
Client: A leading venture capital and private equity firm with a global presence
Industry Type: Venture Capital and Private Equity Principals
Services: Private Equity, Venture Capital, Data Analysis, Fund Performance, Alternative Assets, Competitive Intelligence, Limited Partners, Customized Benchmarks, Service Providers, Fund of Funds, M&A, and Financial Services
Organization Size: 100+
The Problem
- Extract funding-related data, such as company name, funding amount, and participating investors, from news articles published on 1000+ websites.
- Create a web app to manage the extraction of funding data.
Our Solution
- The funding-related articles came from 1000+ websites, so writing a dedicated crawler for each site was not feasible. Instead, we used the built-in web crawler provided by Elasticsearch. Once the articles were crawled, we needed to extract funding-related fields such as company name, funding amount, and participating investors. We chose an NLP question-answering approach, which required fine-tuning transformer models for the extraction. First, we used keyword-based rules to create labels for each field, which gave us training data. We then trained a DistilBERT model on the labeled data on an AWS EC2 GPU server, and repeated this for every field we needed to extract. We achieved 90%+ accuracy for the company name field and 80%+ accuracy for the other fields.
- To manage and view the extracted funding data, we built a web app with Python Flask. It includes pages showing the raw data extracted by the crawler, the cleaned data after applying cleaning functions, and the final output containing all fields. We also built admin dashboard pages showing the daily crawling status, the number of articles processed per day, the total final output, and similar statistics.
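The keyword-based labeling step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual rules: the regular expressions, the question wording, and the names `FIELD_RULES` and `make_label` are all assumptions. The output uses the SQuAD-style layout (`context`, `question`, `answers`) that question-answering fine-tuning commonly expects.

```python
import re
from typing import Optional

# Hypothetical keyword rules: one (question, pattern) pair per field.
# Group 1 of each pattern is treated as the answer span.
FIELD_RULES = {
    "company": (
        "Which company raised funding?",
        # Capitalized name directly followed by a funding verb (assumed cue words).
        re.compile(
            r"\b([A-Z][\w&.-]*(?:\s[A-Z][\w&.-]*)*)\s+"
            r"(?:raises|raised|secures|secured)\b"
        ),
    ),
    "amount": (
        "How much funding was raised?",
        # Currency symbol, number, and an optional magnitude suffix.
        re.compile(
            r"([$€£]\s?\d+(?:[.,]\d+)*"
            r"(?:\s?(?:million|billion|thousand|[MBK])\b)?)",
            re.IGNORECASE,
        ),
    ),
}


def make_label(article: str, field: str) -> Optional[dict]:
    """Return a SQuAD-style training example, or None if no rule matches."""
    question, pattern = FIELD_RULES[field]
    match = pattern.search(article)
    if match is None:
        return None  # leave the article unlabeled for this field
    return {
        "context": article,
        "question": question,
        "answers": {
            "text": [match.group(1)],
            "answer_start": [match.start(1)],
        },
    }


if __name__ == "__main__":
    sample = "Acme Robotics raises $12 million in Series A led by Example Ventures."
    print(make_label(sample, "company"))
    print(make_label(sample, "amount"))
```

Rules like these are noisy by design; they only need to be precise enough to bootstrap training data, and articles where no rule fires are simply skipped for that field.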
Solution Architecture
Deliverables
- Flask Web app
- Elasticsearch crawler
Tools used
Flask, spaCy, NLTK, pandas, NumPy, Transformers, Elasticsearch, etc.
Language/techniques used
Question answering in NLP, web scraping, Flask web application, Python
Models used
DistilBERT, en_core_web_sm (pretrained spaCy model)
Skills used
NLP, Data Analysis, Flask web app, pandas, NumPy, Transformers, FastAPI, Elasticsearch, etc.
Databases used
Elasticsearch database
Web Cloud Servers used
AWS
Technical Challenges Faced during Project Execution
- The client wanted data from 1000+ different websites. A custom crawler typically works for only one website, so building 1000+ separate crawlers was not feasible.
- Extracting funding information from an article is hard to do with plain Python code and predefined keywords, because every website structures and words its articles differently.
How the Technical Challenges were Solved
- To solve the crawling problem, we used the Elasticsearch web crawler, which is fast and can crawl multiple websites at a time. It requires creating an engine and adding the websites to be scraped. We then added keywords so that only funding-related articles were extracted, and scheduled the crawler to run every hour so that new articles arrive hourly.
- To extract funding-related information, we collected articles from different websites and created labels for each field we wanted to extract. We then fine-tuned the DistilBERT transformer model on our labeled data and used the resulting models for extraction. We also wrote an automated Python script that runs these models on every crawled article and extracts the funding-related information.
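The automated extraction script mentioned above might look something like the sketch below. It is an illustration under stated assumptions: the keyword list, question wording, field names, and the `min_score` threshold are hypothetical. Each entry in `models` is assumed to be a callable with the same calling convention as a Hugging Face `pipeline("question-answering", ...)` object, that is, called with `question=` and `context=` and returning a dict containing `answer` and `score`.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Assumed keywords for keeping only funding-related articles.
FUNDING_KEYWORDS = ("funding", "raises", "raised", "series a", "seed round", "investment")

# Assumed question per field; one fine-tuned model per field is expected.
QUESTIONS = {
    "company_name": "Which company raised funding?",
    "funded_amount": "How much funding was raised?",
    "investors": "Who invested in the round?",
}

QAModel = Callable[..., dict]


def is_funding_article(text: str) -> bool:
    """Cheap keyword filter applied before running any model."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in FUNDING_KEYWORDS)


def extract_fields(
    article: str, models: Dict[str, QAModel], min_score: float = 0.3
) -> Dict[str, Optional[str]]:
    """Ask each field's model its question; drop low-confidence answers."""
    record: Dict[str, Optional[str]] = {}
    for field, question in QUESTIONS.items():
        result = models[field](question=question, context=article)
        # min_score is an illustrative cutoff, not a value from the project.
        record[field] = result["answer"] if result["score"] >= min_score else None
    return record


def process_articles(
    articles: Iterable[str], models: Dict[str, QAModel]
) -> List[Dict[str, Optional[str]]]:
    """Run extraction on every article that passes the keyword filter."""
    return [extract_fields(a, models) for a in articles if is_funding_article(a)]
```

In a real run, each value in `models` could be built with `transformers.pipeline("question-answering", model=...)` pointing at the fine-tuned checkpoint for that field; passing plain callables keeps the script easy to test without loading any model.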
Business Impact
The funding data can be used in two ways. Companies can use it to find suitable investors for their startups, searching by criteria such as industry and vertical.
Investors can use it to find startups that match their investment preferences, such as industry and vertical.
Project Snapshots
Contact Details
Here are my contact details:
Email: ajay@blackcoffer.com
Skype: asbidyarthy
WhatsApp: +91 9717367468
Telegram: @asbidyarthy
For project discussions and daily updates, would you prefer Slack, Skype, Telegram, or WhatsApp? Please let us know what works best for you.