Client Background
Client: A leading tech firm in the USA
Industry Type: IT
Services: SaaS, Products
Organization Size: 100+
Project Objective
To extract data for the given keywords from the listed websites (https://www.ferguson.com/, https://www.bakerdist.com/, https://www.hajoca.com/, https://www.carrier.com/residential/en/us/, https://www.gemaire.com/, https://www.fwwebb.com/) and store the count of each keyword for each website in an Excel file.
Project Description
We were given a list of websites and several sets of keywords. For each website, we had to find the keywords and record their counts in an Excel workbook, with a separate tab for each set of keywords.
Our Solution
We used Selenium and Beautiful Soup (bs4) to extract data from the given websites. Two tools were developed for each website:
- Search tool: searches for each keyword in the website's search bar and stores the result count displayed for the keywords of each category in separate files.
- Content tool: scrapes the full text of each URL obtained from the website's sitemap. Along with the text visible on the page, it also scrapes the meta keywords, meta description, and title.
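The text-gathering step of the content tool can be sketched as follows. This is a minimal illustration, not the project's actual script: it uses Python's standard-library html.parser in place of Beautiful Soup so that it runs without extra installs, and the names PageTextExtractor and extract_page_text are hypothetical. It assumes the page HTML has already been downloaded as a string.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collects the title, meta keywords/description, and visible text of a page."""

    # Tags whose contents are never shown to the reader (assumed list).
    SKIPPED_TAGS = {"script", "style", "noscript"}
    META_NAMES = {"keywords", "description"}

    def __init__(self):
        super().__init__()
        self.chunks = []       # text fragments in document order
        self._skip_depth = 0   # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in self.META_NAMES and attr.get("content"):
                self.chunks.append(attr["content"])

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # The <title> text arrives here as well, so it is kept automatically.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_page_text(html):
    """Return title, meta keywords/description, and visible text as one string."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The returned string can then be appended to the website's text file for counting later.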
The content extracted from each website was stored in its own text file. The occurrences of each keyword in the text were then counted using substring matching with the count method. Each keyword and its count were stored in an OrderedDict, the counts were transferred to a list, and an Excel file was created from it. Finally, the counts from the search tool and the content tool were combined into the final output file.
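The counting step described above might look like the sketch below. The function names count_keywords and to_rows are hypothetical, the case-insensitive matching is an assumption, and the "total" column is only one possible way of combining the two tools' counts; the report does not state how the combination was actually done.

```python
from collections import OrderedDict


def count_keywords(text, keywords):
    """Count non-overlapping substring matches of each keyword, in keyword order.

    Matching is case-insensitive here (an assumption, not stated in the report).
    """
    lowered = text.lower()
    return OrderedDict((kw, lowered.count(kw.lower())) for kw in keywords)


def to_rows(search_counts, content_counts):
    """Build [keyword, search count, content count, total] rows for a sheet."""
    keywords = list(content_counts) + [
        kw for kw in search_counts if kw not in content_counts
    ]
    rows = []
    for kw in keywords:
        searched = search_counts.get(kw, 0)
        scraped = content_counts.get(kw, 0)
        rows.append([kw, searched, scraped, searched + scraped])
    return rows
```

Note that plain substring counting also matches inside longer words, so a keyword such as "heat" would be counted in "heater". The resulting rows are in a shape that pandas can turn into a DataFrame and write to one tab per keyword set.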
Project Deliverables
- Python scripts for each website to extract the keyword counts.
- Excel workbook named HVAC_Report Test.xlsx containing the counts for each set of keywords for each website.
Tools used
Python Interpreter
Language/techniques used
Language Used: Python
Libraries used: BeautifulSoup (bs4), collections.OrderedDict, pandas, requests, xlsxwriter, selenium.webdriver
What are the technical Challenges Faced during Project Execution
Some of the websites could not be accessed from an Indian IP address because they presented captchas. Also, it was not feasible to reach every page of a website by clicking through the search results to get the counts.
How the Technical Challenges were Solved
To avoid the captchas and reach the websites, we connected through a VPN server located in Singapore. To get access to every page of a website, we located its sitemap, which includes a link to every page on the site.
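Collecting page links from a sitemap can be sketched as below. This is an illustration under assumptions rather than the project's code: it presumes the sitemap follows the standard sitemaps.org XML layout and has already been downloaded as text, and the function name parse_sitemap is hypothetical.

```python
import xml.etree.ElementTree as ET


def _local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (page_urls, child_sitemap_urls) found in one sitemap document.

    A <urlset> lists pages directly; a <sitemapindex> lists further sitemaps
    that must be fetched and parsed in turn.
    """
    root = ET.fromstring(xml_text)
    locs = []
    for entry in root:            # each <url> or <sitemap> entry
        for child in entry:       # only the entry's own <loc>, not nested ones
            if _local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    if _local_name(root.tag) == "sitemapindex":
        return [], locs
    return locs, []
```

When a sitemap index is returned, each child sitemap URL would be fetched and passed through the same function until only page URLs remain.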
Project Snapshots
Contact Details
Here are my contact details:
Email: ajay@blackcoffer.com
Skype: asbidyarthy
WhatsApp: +91 9717367468
Telegram: @asbidyarthy