# 🖥️ Backend Documentation

## 📌 Overview

The backend of the Web Scraping Project is responsible for:

- Managing web scraping tasks using Requests, BeautifulSoup, or Selenium.
- Processing and cleaning scraped data.
- Providing an API to serve the scraped data to the frontend.
- Handling database operations for storing and retrieving scraped data.
## 🚀 Features

- ✅ Scrapes static websites using Requests and BeautifulSoup
- ✅ Scrapes dynamic websites using Selenium
- ✅ Cleans scraped data
- ✅ Previews or saves the scraped data as a `.txt` file
- ✅ Reliable and secure REST API with JWT tokens
## Installation
This guide will walk you through setting up and running the backend of the project.
### 1️⃣ Create a Virtual Environment (Recommended)
Using a virtual environment ensures dependencies are managed properly and avoids conflicts.
Run the following command to create one:
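A typical command for this step (the environment name `venv` is an assumption):

```bash
# Create a virtual environment named "venv" (name is an assumption)
python -m venv venv
```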
Then activate it:
Windows:
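A typical activation command, assuming the environment is named `venv`:

```bash
venv\Scripts\activate
```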
Mac/Linux:
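Likewise, assuming the environment is named `venv`:

```bash
source venv/bin/activate
```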
### 2️⃣ Install the Required Python Version
Ensure you have the correct Python version installed.
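You can check which interpreter is active with, for example:

```bash
python --version
```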
### 3️⃣ Install Dependencies
After activating the virtual environment, install all required dependencies:
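Assuming the project provides a `requirements.txt` file (an assumption, as the original command isn't shown here), a typical command is:

```bash
pip install -r requirements.txt
```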
### 4️⃣ Code Quality & Standards
To maintain code quality and consistency, install and configure the following tools:
### 5️⃣ Navigate to the Backend Directory
Move into the backend project folder:
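For example, if the folder is named `backend` (the actual folder name may differ):

```bash
cd backend
```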
### 6️⃣ Run the Backend Server
Start the Flask application:
Windows:
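A typical command, assuming the entry point is `app.py`:

```bash
python app.py
```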
Mac/Linux:
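Likewise, assuming the entry point is `app.py`:

```bash
python3 app.py
```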
## Code

### 1. `app.py` (API Endpoints)

#### 🔍 Scrape a Website (`/scrape`)

**📌 Overview**

This route scrapes a website using Requests or BeautifulSoup (for static pages) or Selenium (for dynamic pages) and returns the extracted HTML.

The route decorator defines the URL path and the allowed HTTP method. The view function then calls `scrape_with_requests`, `scrape_with_bs4`, or `scrape_with_selenium`, depending on the requested scraping method, and returns the scraped data as HTML.
**📩 Request Parameters**

JSON Body (Required Fields):

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ Yes | The URL of the website to scrape. |
| `scraping_method` | string | ✅ Yes | The scraping method: `"requests"`, `"bs4"`, or `"selenium"`. |
| `clean_data` | boolean | ❌ No (default: `false`) | Whether to clean the scraped data. |
| `company_name` | string | ✅ Yes (for `"selenium"`) | The name of the company (used for Selenium-based scraping). |

Headers (Optional):

| Header | Type | Required | Description |
|---|---|---|---|
| `Authorization` | string | ❌ No | JWT token (required for storing scraping history). |
**🔄 Processing Steps**

- **Retrieve JSON data:** extracts the URL, scraping method, and optional parameters from the request.
- **Validation:**
  - Ensures the `url` is provided and starts with `https://`; if it starts with `www.`, `https://` is prepended.
  - Requires `scraping_method` to be one of `"requests"`, `"bs4"`, or `"selenium"`.
  - If using `"selenium"`, `company_name` is required.
- **Call the appropriate scraping function:**
  - `requests` → `scrape_with_requests(url)`
  - `bs4` → `scrape_with_bs4(url, clean=clean_data)`
  - `selenium` → `scrape_with_selenium(url, company_name, clean=clean_data)`
- **Store scraping history** (if a JWT token is provided).
- **Return a JSON response** with the scraped data (a minimal route sketch is shown after this list).
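A minimal sketch of how this route could look, based on the steps above (the history-saving step is omitted, and the real implementation may differ in details):

```python
from flask import Flask, request, jsonify

# scraper.py provides the three scraping helpers documented below
from scraper import scrape_with_requests, scrape_with_bs4, scrape_with_selenium

app = Flask(__name__)

@app.route("/scrape", methods=["POST"])   # route URL and allowed HTTP method
def scrape():
    data = request.get_json() or {}
    url = data.get("url")
    method = data.get("scraping_method")
    clean_data = data.get("clean_data", False)
    company_name = data.get("company_name")

    # Validation, mirroring the steps above
    if not url:
        return jsonify({"error": "URL is required"}), 400
    if url.startswith("www."):
        url = "https://" + url                      # prepend https:// to bare www. URLs
    if not method:
        return jsonify({"error": "Scraping method is required"}), 400
    if method == "selenium" and not company_name:
        return jsonify({"error": "Company name is required for Selenium"}), 400

    # Dispatch to the matching scraping function
    if method == "requests":
        result = scrape_with_requests(url)
    elif method == "bs4":
        result = scrape_with_bs4(url, clean=clean_data)
    elif method == "selenium":
        result = scrape_with_selenium(url, company_name, clean=clean_data)
    else:
        return jsonify({"error": "Invalid scraping method"}), 400

    # (Storing scraping history for authenticated users is omitted in this sketch.)
    return jsonify({
        "message": f"URL Scraped with {method} and content saved",
        "status": 1,
        "scrape_result": result,
    }), 201
```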
**📤 Response**

✅ Success Response

HTTP Status Code: `201 Created`

```json
{
  "message": "URL Scraped with selenium and content saved",
  "status": 1,
  "scrape_result": "<scraped HTML data>"
}
```
❌ Error Responses

The `/scrape` endpoint may return the following error responses:

| HTTP Status Code | Error Message | Description |
|---|---|---|
| 400 | `"error": "URL is required"` | The `url` field is missing from the request body. |
| 400 | `"error": "Scraping method is required"` | The `scraping_method` field is missing from the request body. |
| 400 | `"error": "Company name is required for Selenium"` | The `company_name` field is required when using `"selenium"` as the `scraping_method`. |
| 400 | `"error": "Invalid scraping method"` | The provided `scraping_method` is not recognized (must be `"requests"`, `"bs4"`, or `"selenium"`). |
| 401 | `"error": "Invalid or missing token"` | The request is missing an authorization token or contains an invalid one. |
| 500 | `"error": "Internal Server Error"` | An unexpected server error occurred. |

📌 **Note:** Ensure all required fields are provided in the JSON request body to avoid errors.
**Why `jsonify`?**

Returning JSON to the frontend is preferable because JSON is lightweight, structured, and natively supported by the TypeScript frontend.
#### 🔐 Verify Authentication (`/auth`)

**Method:** `GET`
**Description:** Checks if the provided JWT token is valid.

**🔹 Request Headers**

| Header | Type | Required | Description |
|---|---|---|---|
| Authorization | String | ✅ Yes | Bearer token required for authentication. |
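For example, a client could verify its token like this (the base URL and token value are placeholders):

```python
import requests

# Hypothetical client-side check of a previously obtained JWT
response = requests.get(
    "http://localhost:5000/auth",                          # placeholder base URL
    headers={"Authorization": "Bearer <your-jwt-token>"},  # placeholder token
)
print(response.status_code, response.json())
```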
**🔹 Responses**

✅ Success (`200 OK`)
#### 🔑 Login (`/login`)

**Method:** `POST`
**Description:** Authenticates a user and returns a JWT token.

**🔹 Request Body (JSON)**

| Parameter | Type | Required | Description |
|---|---|---|---|
| `email` | String | ✅ Yes | User email. |
| `password` | String | ✅ Yes | User password. |

**🔹 Example Request**
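For example (placeholder values):

```json
{
  "email": "user@example.com",
  "password": "password123"
}
```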
**🔹 Responses**

✅ Success (`200 OK`)

❌ Errors

| HTTP Code | Message |
|---|---|
| 400 | `"error": "Email does not exist"` |
| 400 | `"error": "Incorrect password, try again"` |
#### 🔒 JWT Token Authentication

In this API, JSON Web Tokens (JWT) are used for user authentication and authorization. JWTs allow secure communication between the client and the server without storing session data.

**How JWT Works**

1. **User logs in**
   - The user submits their email and password.
   - If the credentials are valid, a JWT is generated:

   ```python
   token = jwt.encode(
       {
           "user": email,                                       # store the user email in the token
           "user_id": user.id,                                  # store the user ID in the token
           "exp": datetime.utcnow() + timedelta(seconds=1000),  # expiration time
       },
       app.config["SECRET_KEY"],  # secret key for encoding
       algorithm="HS256",
   )
   ```

   - The token is then sent to the client in the response.

2. **Client sends the token in requests**
   - The client includes the token in the `Authorization` header (e.g. `Authorization: Bearer <token>`).
   - The server verifies the token before allowing access.

3. **Server validates the token**
   - When a request is received, the token is decoded and verified:

   ```python
   decoded_token = jwt.decode(token, app.config["SECRET_KEY"], algorithms=["HS256"])
   user_id = decoded_token["user_id"]  # extract the user ID
   ```

   - If the token is valid, the request is processed.
   - If the token is expired or invalid, an error is returned.

**Token Expiration & Security**

- **Expiration (`exp`)** ensures that tokens are only valid for a limited time (e.g., 1000 seconds).
- **Secret key (`SECRET_KEY`)** is used for signing and verifying the token to prevent tampering.
- **Bearer authentication** is used to send the token securely.
#### 📝 Sign Up (`/sign-up`)

**Method:** `POST`
**Description:** Registers a new user.

**🔹 Request Body (JSON)**

| Parameter | Type | Required | Description |
|---|---|---|---|
| `email` | String | ✅ Yes | User email. |
| `username` | String | ✅ Yes | User name. |
| `password` | String | ✅ Yes | User password. |
| `repeat_password` | String | ✅ Yes | Must match the password. |

**🔹 Example Request**

```json
{
  "email": "newuser@example.com",
  "userName": "newuser",
  "password": "password123",
  "repeat_password": "password123"
}
```

**🔹 Responses**

✅ Success Response (`201 Created`)

❌ Errors

| HTTP Code | Message |
|---|---|
| 400 | `"error": "Email already exists"` |
| 400 | `"error": "Passwords don't match"` |
| 400 | `"error": "Password must be at least 7 characters"` |
#### 🔑 Password Hashing

In this API, user passwords are securely stored using PBKDF2 (Password-Based Key Derivation Function 2) with SHA-256 hashing. This ensures that passwords are not stored in plain text, enhancing security.

**How It Works** (a sketch follows this list)

- When a user signs up, the password is hashed.
- The hashed password is stored in the database instead of the plain password.
- During login, the entered password is hashed again and compared with the stored hash.
- If the hashes match, authentication is successful.
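A minimal sketch of this flow, assuming the project uses Werkzeug's password helpers (the exact helpers used are an assumption):

```python
from werkzeug.security import generate_password_hash, check_password_hash

# On sign-up: hash the password with PBKDF2-SHA256 before storing it
hashed_password = generate_password_hash("password123", method="pbkdf2:sha256")

# On login: hash the submitted password again and compare with the stored hash
if check_password_hash(hashed_password, "password123"):
    print("Authentication successful")
else:
    print("Incorrect password, try again")
```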
#### 📜 View Scraping History (`/history`)

**Method:** `GET`
**Description:** Retrieves the scraping history of the logged-in user.
**Authentication:** ✅ Requires a valid JWT token in the `Authorization` header.

**🔹 Request Headers**

| Header | Type | Required | Description |
|---|---|---|---|
| Authorization | String | ✅ Yes | Bearer token for authentication |

**🔹 Responses**

✅ Success (`200 OK`)

```json
[
  {
    "url": "https://example.com",
    "scraped_data": "<html>...</html>",
    "date": "2024-03-10 15:30:00"
  }
]
```
### 2. `scraper.py`

#### `scrape_with_requests`

**Description:** Scrapes a static website and returns the raw HTML.

**Parameters**

| Parameter | Type | Required | Description |
|---|---|---|---|
| url | str | ✅ | The URL to scrape |

**Request Format**

Send a POST request with a JSON body:
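For example (placeholder URL):

```json
{
  "url": "https://example.com",
  "scraping_method": "requests"
}
```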
**Returns**

- The raw HTML of the page.
- If an error occurs, a JSON object with an error message.
#### `scrape_with_bs4`

**Description:** Uses BeautifulSoup to parse and prettify the HTML content. It can also clean the HTML to return readable text.

**Parameters**

| Parameter | Type | Required | Description |
|---|---|---|---|
| url | str | ✅ | The URL to scrape |
| clean | bool | ❌ | If True, extracts only readable text |

**Request Format**

Send a POST request with a JSON body:
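For example (placeholder URL):

```json
{
  "url": "https://example.com",
  "scraping_method": "bs4",
  "clean_data": true
}
```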
**Returns**

- Prettified HTML if `clean=False`
- Readable formatted text if `clean=True`
- An error message if scraping fails
#### `scrape_with_selenium`

**Description:** Uses Selenium WebDriver to scrape dynamic web pages that rely on JavaScript execution. It can also clean the HTML to return readable text.

**Parameters**

| Parameter | Type | Required | Description |
|---|---|---|---|
| url | str | ✅ | The URL of the page to scrape |
| company_name | str | ✅ | The company name to search for |
| clean | bool | ❌ | If True, extracts only readable text |

**Environment Variables**

- `CHROME_PATH`: Path to the Chrome WebDriver executable.

**Returns**

- The scraped page title
- Prettified HTML or clean text, depending on the `clean` flag
#### ⚙️ How the Scraper Functions Work

**1️⃣ Scraping with Requests (`scrape_with_requests`)**

| Step | Description |
|---|---|
| 1️⃣ Send HTTP Request | The function sends a request to the URL using `requests.get(url)`. |
| 2️⃣ Check Response Status | If the response is not `200 OK`, an error is returned. |
| 3️⃣ Extract Raw HTML | The function retrieves and returns the raw HTML using `response.text`. |
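A minimal sketch of these steps (the timeout value and exact error strings are assumptions):

```python
import requests

def scrape_with_requests(url):
    """Fetch a page and return its raw HTML, or an error object."""
    try:
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            return {"error": "Failed to retrieve content"}
        return response.text
    except requests.RequestException as exc:
        return {"error": f"An error occurred: {exc}"}
```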
**2️⃣ Scraping with BeautifulSoup (`scrape_with_bs4`)**

| Step | Description |
|---|---|
| 1️⃣ Send HTTP Request | Requests the webpage's HTML content using `requests.get(url)`. |
| 2️⃣ Check Response Status | If the response isn't `200 OK`, an error is returned. |
| 3️⃣ Parse HTML | The function parses the HTML with `BeautifulSoup()`. |
| 4️⃣ Clean & Prettify Output | Returns cleaned text or formatted HTML using `soup.prettify()`. |
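A minimal sketch of these steps (assuming the built-in `html.parser` and a plain `get_text()` call for the cleaning step):

```python
import requests
from bs4 import BeautifulSoup

def scrape_with_bs4(url, clean=False):
    """Parse a page and return prettified HTML or readable text."""
    try:
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            return {"error": "Failed to retrieve content"}
        soup = BeautifulSoup(response.text, "html.parser")
        if clean:
            # Extract only readable text, one line per text block
            return soup.get_text(separator="\n", strip=True)
        return soup.prettify()
    except requests.RequestException as exc:
        return {"error": f"An error occurred: {exc}"}
```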
**3️⃣ Scraping with Selenium (`scrape_with_selenium`)**

| Step | Description |
|---|---|
| 1️⃣ Set Up Selenium WebDriver | Configures Chrome WebDriver for headless browsing. |
| 2️⃣ Load Webpage | Opens the webpage using `driver.get(url)`. |
| 3️⃣ Handle JavaScript & Dynamic Content | Waits for JavaScript-rendered elements to load. |
| 4️⃣ Extract Page Source | Retrieves the HTML content using `driver.page_source`. |
| 5️⃣ Clean & Return Data | Parses the HTML with BeautifulSoup or returns raw HTML. |
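A minimal sketch of these steps using the Selenium 4 API (the wait strategy is simplified, and the `company_name` search and page-title handling from the real function are omitted):

```python
import os
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def scrape_with_selenium(url, company_name, clean=False):
    """Load a JavaScript-heavy page headlessly and return its HTML or text."""
    options = Options()
    options.add_argument("--headless=new")        # run Chrome without a visible window
    service = Service(os.getenv("CHROME_PATH"))   # path to the ChromeDriver executable
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        time.sleep(2)                  # crude wait for JS-rendered content (a real implementation might use WebDriverWait)
        html = driver.page_source
    finally:
        driver.quit()
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True) if clean else soup.prettify()
```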
#### ⚠️ Error Handling

The scraper functions handle errors and return structured JSON responses.

| Error Type | Status Code | Example Response |
|---|---|---|
| Missing URL | `400 Bad Request` | `{"error": "URL is required"}` |
| Invalid URL | `400 Bad Request` | `{"error": "Failed to retrieve content"}` |
| Request Timeout | `500 Server Error` | `{"error": "An error occurred: timeout"}` |
| Parsing Error | `500 Server Error` | `{"error": "Failed to parse HTML"}` |
| Selenium WebDriver Error | `500 Server Error` | `{"error": "Selenium WebDriver error"}` |

✅ If an error occurs, the function returns a structured JSON error message instead of crashing.
#### 📊 Comparison of Scraping Methods

| Method | Best For | Pros | Cons |
|---|---|---|---|
| Requests | Static websites | ✅ Fast, ✅ lightweight | ❌ No JavaScript support |
| BeautifulSoup | Cleaning & parsing HTML | ✅ Easy to use, ✅ lightweight | ❌ Needs Requests first |
| Selenium | JavaScript-heavy pages | ✅ Handles dynamic content | ❌ Slower, ❌ requires WebDriver |
### 3. `config.py`

#### 📌 Overview

This module configures the Flask application by:

- Loading environment variables from a `.env` file.
- Setting up the database connection using SQLAlchemy.
- Enabling CORS to allow cross-origin requests.
- Managing security settings with a secret key.

#### 🌍 Flask App Initialization

- CORS (Cross-Origin Resource Sharing) is enabled to allow requests from different domains.
- `supports_credentials=True` allows cookies and authentication headers in requests.
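A minimal sketch of this initialization, assuming the flask-cors extension is what the project uses:

```python
from flask import Flask
from flask_cors import CORS

app = Flask(__name__)

# Allow cross-origin requests and let browsers send cookies / auth headers
CORS(app, supports_credentials=True)
```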
#### 📂 Environment Variables (`.env`)

- This module loads sensitive configuration from a `.env` file using `dotenv`.
- The `.env` file is not included in version control (Git) to protect sensitive information.

```python
load_dotenv()  # Load environment variables from .env

app.config["SECRET_KEY"] = os.getenv("SECRET_KEY")
```

- `SECRET_KEY` → used for secure sessions and JWT authentication.
#### 🗄️ Database Configuration

```python
database_uri = os.getenv("DATABASE_URI")
app.config["SQLALCHEMY_DATABASE_URI"] = database_uri
app.config["SQLALCHEMY_TRACK_MODIFICATIONS"] = False

db = SQLAlchemy(app)  # Create a database instance
```

- `SQLALCHEMY_DATABASE_URI` → defines the database connection.
- `SQLALCHEMY_TRACK_MODIFICATIONS = False` → disables unnecessary modification tracking to improve performance.