Client Background

Client: A leading tech firm in the USA

Industry Type:  IT

Services: SaaS, Products

Organization Size: 100+

Project Objective

To extract the data for the given keywords from the listed websites https://www.ferguson.com/, https://www.bakerdist.com/, https://www.hajoca.com/, https://www.carrier.com/residential/en/us/, https://www.gemaire.com/, https://www.fwwebb.com/ and store the count of each keywords for each website it in an Excel File.

Project Description

A list of websites is provided from which we were supposed to find out the mentioned keywords and store their respective counts for each website in an Excel sheet with different tabs for different set of keywords.

Our Solution

We used Selenium as well as Bs4(Beautiful Soup) to extract data from the given websites. To accomplish the given task, 2 tools were developed for each website.

  1.  Search tool was developed to search the keyword in the website’s search bar and count displayed for keywords of each category was stored in separate files.
  2. Content tool was developed which scraped full text from each url obtained from the respective sitemaps. Along with the text visible on the page, data from meta keywords, meta description and title was also scrapped.

Extracted content from all the websites was stored in their respective text files. After that number of keywords in the text were counted using substring and count method and stored the keyword and its corresponding count in an Ordered Dictionary and then the count was transferred to a list and Excel file was created for the same. Counts received from search tool and content tool were combined and final output file was created.

Project Deliverables

  1. Python Scripts for each website to extract the count of keywords.
  2. Excel Sheet name HVAC_Report Test.xlsx having counts for each set of keywords for each website.

Tools used

Python Interpreter

Language/techniques used

Language Used: Python

Libraries used: BeautifulSoup, collection.OrderedDict, pandas, requests, xlsxwriter, selenium.webdriver

What are the technical Challenges Faced during Project Execution

Some of the websites cannot be accessed using Indian IP address as it was having captchas. Also, we cannot go to each and every page by clicking the results and get the count.

How the Technical Challenges were Solved

To bypass the captcha and reach the website, we need to use VPN of Singapore. And to get access to each and every page of the website, we found out sitemap for each website which includes link to every page present in it.

Project Snapshots

Contact Details

Here are my contact details:

Email: ajay@blackcoffer.com

Skype: asbidyarthy

WhatsApp: +91 9717367468

Telegram: @asbidyarthy 

For project discussions and daily updates, would you like to use Slack, Skype, Telegram, or Whatsapp? Please recommend, what would work best for you.