Client Background
Client: A Leading Legal Tech Firm in the USA
Industry Type: AI-Simulation / Complex Systems Modelling
Products & Services: Platform for system simulation, decision intelligence, and digital twins for supply chains, assets, and enterprise value chains
Organization Size: 100+
The Problem
The client needed to build a retrieval-augmented generation (RAG) agent focused on AI regulation, laws, and rules. To enable this, data had to be collected from 7+ sources covering AI regulation. The challenge was to gather and structure large volumes of regulatory content reliably. One major difficulty: on sources such as the U.S. Securities and Exchange Commission (SEC), pages were interconnected and carried metadata, comments, and categories, and sometimes linked to documents instead of exposing direct text. The scraping process therefore had to handle dynamic layouts, nested links, metadata extraction, and variable schemas.
Our Solution
A dynamic web-scraper framework that adapts to page layouts, identifies key fields (date, name, category, summary, description), extracts the data, and ingests it into a structured schema in a Supabase database, preparing downstream flow for vector embeddings and RAG indexing.
Solution Architecture
- The scraper triggers against target sources (e.g., EUR-Lex and SEC)
- For each source page: detect whether it contains plain text or a link to a document, and parse accordingly
- Extract metadata (date, categories, comments), summary and full description
- Normalize into standard schema: {date, name, category, summary, description, source_url}
- Insert cleaned and structured records into Supabase for later embedding workflows
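The normalization step above can be sketched as a pure function that maps source-specific keys onto the standard schema. The alias table and date formats below are illustrative assumptions, not taken from the client codebase:

```python
from datetime import datetime

STANDARD_FIELDS = ("date", "name", "category", "summary", "description", "source_url")

def normalize_record(raw: dict, source_url: str) -> dict:
    """Map a scraped record with source-specific keys onto the unified schema."""
    # Source-specific key aliases (illustrative; real sources vary).
    aliases = {
        "date": ("date", "published", "issued"),
        "name": ("name", "title"),
        "category": ("category", "categories", "type"),
        "summary": ("summary", "abstract"),
        "description": ("description", "body", "full_text"),
    }
    record = {}
    for field, keys in aliases.items():
        value = next((raw[k] for k in keys if raw.get(k)), None)
        record[field] = value.strip() if isinstance(value, str) else value
    record["source_url"] = source_url
    # Normalize dates to ISO 8601 when one of the common formats matches.
    if record["date"]:
        for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"):
            try:
                record["date"] = datetime.strptime(record["date"], fmt).date().isoformat()
                break
            except ValueError:
                continue
    return record
```

Keeping normalization separate from scraping means each new source only needs its aliases registered, while downstream embedding code sees one fixed schema.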
Deliverables
- A dynamic web-scraper capable of handling varied page layouts and nested links
- Data extraction pipeline for regulatory content from two high-complexity sources (EUR-Lex + SEC)
- Schema design and ingestion for structured records into Supabase
- Pipeline ready for vector embedding preparation
Tech Stack
Tools Used
- Web scraping libraries (e.g., Python requests + BeautifulSoup + Selenium)
- Supabase API for data insertion
Language / Techniques Used
- Python
- Layout adaptation logic for varying HTML structures
- Metadata and content extraction
- Schema normalization
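The layout-adaptation logic listed above includes deciding whether a linked target is a downloadable document or another HTML page. A minimal sketch of that dispatch, assuming classification by file extension (a real scraper might also check Content-Type headers):

```python
from urllib.parse import urlparse

# Extensions treated as downloadable documents rather than HTML pages
# (illustrative set; not taken from the client codebase).
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx"}

def classify_target(url: str) -> str:
    """Return 'document' for links to downloadable files, else 'page'."""
    path = urlparse(url).path.lower()
    return "document" if any(path.endswith(ext) for ext in DOCUMENT_EXTENSIONS) else "page"
```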
Models Used
- None at this stage (pre-embedding)
Skills Used
- Web scraping of complex regulatory websites
- Data modelling and schema design
- Database ingestion and API integration
Databases Used
- Supabase (PostgreSQL backend)
Web / Cloud Servers Used
- Hosted Python script environment (e.g., AWS Lambda / EC2 or similar)
What Were the Technical Challenges Faced During Project Execution
- Handling websites where content is hidden behind multiple linked pages or documents (not just straightforward HTML text)
- Extracting consistent metadata (date, category) from heterogeneous layouts and formats
- Building scraper logic that is resilient to changes in page structure
- Ingesting and aligning data from disparate sources into a unified schema
How the Technical Challenges Were Solved
- Built dynamic layout detection to switch scraper strategies (direct text vs. document link)
- Created metadata extraction modules that parse comments, categories, and nested links
- Developed schema normalization logic to unify fields across sources
- Used Supabase API to automate ingestion, allowing structured data to be ready for embedding workflows
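One way the automated Supabase ingestion could be wired up, using the `supabase-py` client; the table name `regulations` and batch size are assumptions, not confirmed details of the client setup:

```python
def chunk(records: list, size: int = 100) -> list:
    """Split records into batches so one oversized insert cannot fail the whole run."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def ingest(records: list, url: str, key: str, table: str = "regulations") -> int:
    """Insert normalized records into a Supabase table in batches; returns rows sent."""
    from supabase import create_client  # lazy import; requires `pip install supabase`
    client = create_client(url, key)
    sent = 0
    for batch in chunk(records):
        client.table(table).insert(batch).execute()
        sent += len(batch)
    return sent
```

Batching keeps individual API calls small and makes partial failures easy to retry, which matters once ingestion runs unattended across 7+ sources.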
Business Impact
- Enabled the foundation for a RAG agent targeting regulatory content, accelerating model readiness
- Reduced manual data gathering effort and improved reliability of source ingestion
- Created structured, clean data sets which enhance downstream embedding quality and retrieval accuracy
- Positioned the client to scale regulatory coverage (7+ sources) without manual bottlenecks