This project involves developing an AI-driven web scraper that extracts and structures information from websites based on user inputs. The scraper's Natural Language Processing (NLP) capabilities come from a LLaMA 3.1 model fine-tuned with Supervised Fine-Tuning (SFT) using LoRA and the Unsloth library. The application features a user-friendly Streamlit interface that lets users enter website URLs, specify content extraction parameters, and receive structured data outputs.
Python: Main programming language used for the entire project.
Selenium: For automated browsing and scraping of web pages.
BeautifulSoup: For parsing HTML and extracting content.
Streamlit: To build the interactive web application interface.
LLaMA 3.1 Model: For parsing and processing scraped content using NLP.
LoRA (Low-Rank Adaptation): Fine-tuning technique applied to the LLaMA model.
Unsloth Library: Used to fine-tune the LLaMA 3.1 model efficiently.
LangChain-Ollama: For structured data extraction and NLP parsing.
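A rough sketch of how the scraping components fit together: in the running app, Selenium's `webdriver` supplies the live page source, which BeautifulSoup then cleans into plain text. The Selenium calls are shown only in comments here (they require a browser driver), and a static HTML snippet stands in for a real page; the function name `extract_body_text` is illustrative, not from the project code.

```python
from bs4 import BeautifulSoup

def extract_body_text(page_source: str) -> str:
    """Parse raw HTML and return cleaned body text, dropping scripts and styles."""
    soup = BeautifulSoup(page_source, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove non-content elements before text extraction
    # Collapse whitespace and drop empty lines
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)

# In the full app, Selenium would fetch the page first, e.g.:
#   driver = webdriver.Chrome()
#   driver.get(url)
#   page_source = driver.page_source
# A static snippet stands in for the live page here:
sample_html = """
<html><head><style>body { color: red; }</style></head>
<body><h1>Products</h1><p>Widget A - $10</p><script>var x = 1;</script></body></html>
"""
print(extract_body_text(sample_html))
```

The cleaned text from this step is what gets handed to the fine-tuned model for structured extraction.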
Automated Web Scraping: Enter a URL, and the application automatically scrapes data from the website.
NLP-Driven Content Parsing: Extract and structure specific information using a fine-tuned LLaMA 3.1 model.
Streamlit UI: Simple and intuitive user interface for seamless interaction.
Customizable Data Extraction: Users can define parsing parameters to control the type of data extracted.
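Because an LLM's context window is limited, scraped text is typically split into chunks before being sent to the model along with the user's extraction parameters. A minimal chunking sketch is below; the langchain-ollama call is shown only in comments since it assumes a locally running Ollama server, and the model name and prompt wording are illustrative, not taken from the project.

```python
def chunk_text(text: str, max_chars: int = 6000) -> list[str]:
    """Split scraped text into pieces small enough for the model's context window."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Each chunk would then be parsed with langchain-ollama, e.g. (assumes a local
# Ollama server is running; model name and prompt are illustrative):
#   from langchain_ollama import OllamaLLM
#   llm = OllamaLLM(model="llama3.1")
#   results = [
#       llm.invoke(f"Extract only: {user_query}\n\nContent:\n{chunk}")
#       for chunk in chunk_text(scraped_text)
#   ]

# 13,000 characters split at 6,000 per chunk -> chunk sizes 6000, 6000, 1000
print([len(c) for c in chunk_text("a" * 13000, max_chars=6000)])
```

The per-chunk results would then be concatenated (or merged by the model in a final pass) to produce the structured output shown in the Streamlit UI.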
Fig. 1 Streamlit Interface