diff --git a/README.md b/README.md
index e69de29..774dbf3 100644
--- a/README.md
+++ b/README.md
@@ -0,0 +1,120 @@
+# Ethix
+
+A sustainability platform that helps users detect and report greenwashing - misleading environmental claims made by companies about their products. Powered by AI analysis using Google Gemini and Ollama.
+
+## About
+
+Ethix enables users to:
+
+- **Scan Products**: Use your camera to detect brand logos and analyze environmental claims
+- **Report Greenwashing**: Submit incidents with evidence (images or PDF reports)
+- **Browse Reports**: Explore company sustainability reports and user-submitted incidents
+- **Chat with AI**: Get answers about sustainability, recycling, and eco-friendly alternatives
+
+## Technology Overview
+
+### Frontend
+
+- SvelteKit with Svelte 5
+- TypeScript
+- TailwindCSS 4
+- Tauri (desktop builds)
+- Three.js (3D effects)
+
+### Backend
+
+- Flask (Python)
+- Google Gemini AI
+- Ollama (embeddings and vision)
+- ChromaDB (vector database)
+- MongoDB (document storage)
+
+## Quick Start
+
+### Prerequisites
+
+- Node.js 18+ or Bun
+- Python 3.10+
+- MongoDB instance
+- Google API Key
+
+### Backend Setup
+
+```bash
+cd backend
+python -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+python app.py
+```
+
+### Frontend Setup
+
+```bash
+cd frontend
+bun install # or npm install
+bun run dev # or npm run dev
+```
+
+### Using Docker Compose
+
+```bash
+docker-compose up
+```
+
+## Services
+
+| Service | Port | Description |
+|---------|------|-------------|
+| Frontend | 5173 | SvelteKit development server |
+| Backend | 5000 | Flask API server |
+
+## External Dependencies
+
+The application requires access to:
+
+| Service | Purpose |
+|---------|---------|
+| ChromaDB | Vector storage for RAG |
+| Ollama | Embeddings and vision models |
+| MongoDB | Incident and metadata storage |
+| Google Gemini | AI analysis and chat |
+
+## Key Features
+
+### Greenwashing Detection
+
+Submit product photos or company PDF reports for AI-powered analysis. The system:
+
+1. Detects brand logos using vision AI
+2. Extracts text from documents
+3. Searches for relevant context in the database
+4. Provides structured verdicts with confidence levels
+
+### RAG-Powered Chat
+
+The AI assistant uses Retrieval-Augmented Generation to provide accurate, context-aware responses about sustainability topics.
+
+### Catalogue System
+
+Browse and search:
+
+- Company sustainability reports with semantic search
+- User-submitted greenwashing incidents
+- Category filters and detailed per-item analysis
+
+## Environment Variables
+
+### Backend (.env)
+
+```env
+GOOGLE_API_KEY=your_google_api_key
+MONGO_URI=your_mongodb_connection_string
+CHROMA_HOST=http://your-chromadb-host
+OLLAMA_HOST=https://your-ollama-host
+```
+
+## Documentation
+
+- [Backend Documentation](./backend/README.md)
+- [Frontend Documentation](./frontend/README.md)
\ No newline at end of file
diff --git a/backend/README.md b/backend/README.md
new file mode 100644
index 0000000..a6b86bd
--- /dev/null
+++ b/backend/README.md
@@ -0,0 +1,151 @@
+# Ethix Backend
+
+A Flask-based API server for the Ethix greenwashing detection platform. This backend provides AI-powered analysis of products and companies to identify misleading environmental claims.
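+
+For a quick smoke test once the server is running, you can call the chat endpoint directly. A minimal sketch (the route and JSON shape come from `src/routes/gemini.py`; the prompt is illustrative):
+
+```python
+import requests
+
+# POST /api/gemini/ask accepts {"prompt": str, "context": optional str}
+# and returns {"status": "success", "reply": "..."} on success.
+resp = requests.post(
+    "http://localhost:5000/api/gemini/ask",
+    json={"prompt": "What counts as greenwashing on a product label?"},
+    timeout=60,
+)
+resp.raise_for_status()
+print(resp.json()["reply"])
+```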
+
+## Technology Stack
+
+| Component | Technology |
+|-----------|------------|
+| Framework | Flask |
+| AI/LLM | Google Gemini, Ollama |
+| Vector Database | ChromaDB |
+| Document Store | MongoDB |
+| Embeddings | Ollama (nomic-embed-text) |
+| Vision AI | Ollama (ministral-3) |
+| Computer Vision | OpenCV, Ultralytics (YOLO) |
+| Document Processing | PyPDF, openpyxl, pandas |
+
+## Prerequisites
+
+- Python 3.10+
+- MongoDB instance
+- Access to ChromaDB server
+- Access to Ollama server
+- Google API Key (for Gemini)
+
+## Environment Variables
+
+Create a `.env` file in the backend directory:
+
+```env
+GOOGLE_API_KEY=your_google_api_key
+MONGO_URI=your_mongodb_connection_string
+CHROMA_HOST=http://your-chromadb-host
+OLLAMA_HOST=https://your-ollama-host
+```
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `GOOGLE_API_KEY` | Google Gemini API key | (required) |
+| `MONGO_URI` | MongoDB connection string | (required) |
+| `CHROMA_HOST` | ChromaDB server URL | `http://chroma.sirblob.co` |
+| `OLLAMA_HOST` | Ollama server URL | `https://ollama.sirblob.co` |
+
+## Installation
+
+1. Create and activate a virtual environment:
+
+```bash
+python -m venv venv
+source venv/bin/activate # On Windows: venv\Scripts\activate
+```
+
+2. Install dependencies:
+
+```bash
+pip install -r requirements.txt
+```
+
+## Running the Server
+
+### Development
+
+```bash
+python app.py
+```
+
+The server will start on `http://localhost:5000`.
+
+### Production
+
+```bash
+gunicorn -w 4 -b 0.0.0.0:5000 app:app
+```
+
+## API Endpoints
+
+### Gemini AI
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| POST | `/api/gemini/ask` | Chat with AI using RAG context |
+| POST | `/api/gemini/rag` | Query with category filtering |
+| POST | `/api/gemini/vision` | Vision analysis (not implemented) |
+
+### Incidents
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| POST | `/api/incidents/submit` | Submit a greenwashing report |
+| GET | `/api/incidents/list` | Get all confirmed incidents |
+| GET | `/api/incidents/<incident_id>` | Get specific incident details |
+
+### Reports
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/reports/` | List all company reports |
+| POST | `/api/reports/search` | Semantic search for reports |
+| GET | `/api/reports/view/<filename>` | Download a report file |
+
+### RAG
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| POST | `/api/rag/ingest` | Ingest document chunks |
+| POST | `/api/rag/search` | Search vector database |
+
+## External Services
+
+The backend integrates with the following external services:
+
+| Service | URL | Purpose |
+|---------|-----|---------|
+| ChromaDB | `http://chroma.sirblob.co` | Vector storage and similarity search |
+| Ollama | `https://ollama.sirblob.co` | Embeddings and vision analysis |
+
+## Docker
+
+Build and run using Docker:
+
+```bash
+docker build -t ethix-backend .
+docker run -p 5000:5000 --env-file .env ethix-backend
+```
+
+Or use Docker Compose from the project root:
+
+```bash
+docker-compose up backend
+```
+
+## Core Features
+
+### Greenwashing Detection
+
+The incident submission pipeline:
+
+1. User uploads a product image or company PDF
+2. Vision model detects brand logos (for products)
+3. PDF text extraction (for company reports)
+4. Embedding generation for semantic search
+5. RAG context retrieval from ChromaDB
+6. Gemini analysis with structured output
+7. Results stored in MongoDB and ChromaDB
+
+### RAG (Retrieval-Augmented Generation)
+
+- Supports CSV, PDF, TXT, and XLSX file ingestion
+- Documents are chunked and batched for embedding
+- Prevents duplicate ingestion of processed files
+- Semantic search using cosine similarity
diff --git a/backend/scripts/populate_db.py b/backend/scripts/populate_db.py
index 73491b1..8c57db9 100644
--- a/backend/scripts/populate_db.py
+++ b/backend/scripts/populate_db.py
@@ -12,7 +12,7 @@
 from src .rag .ingest import process_file
 from src .rag .store import ingest_documents
 from src .mongo .metadata import is_file_processed ,log_processed_file

-def populate_from_dataset (dataset_dir ,category =None ):
+def populate_from_dataset (dataset_dir ,category =None ,force =False ):
     dataset_path =Path (dataset_dir )
     if not dataset_path .exists ():
         print (f"Dataset directory not found: {dataset_dir }")
@@ -27,9 +27,9 @@
     for file_path in dataset_path .glob ('*'):
         if file_path .is_file ()and file_path .suffix .lower ()in ['.csv','.pdf','.txt','.xlsx']:
-            if is_file_processed (file_path .name ):
+            if not force and is_file_processed (file_path .name ):
                 print (f"Skipping {file_path .name } (already processed)")
-                continue
+                continue

             print (f"Processing {file_path .name }...")
             try :
@@ -48,15 +48,23 @@
     print (f"\nFinished! Processed {files_processed } files. Total chunks ingested: {total_chunks }")

-if __name__ =="__main__":
-    parser =argparse .ArgumentParser (description ="Populate vector database from dataset files")
-    parser .add_argument ("--category","-c",type =str ,help ="Category to assign to ingested documents")
-    parser .add_argument ("--dir","-d",type =str ,default =None ,help ="Dataset directory path")
-    args =parser .parse_args ()
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Populate vector database from dataset files")
+    parser.add_argument("--category", "-c", type=str, help="Category to assign to ingested documents")
+    parser.add_argument("--dir", "-d", type=str, default=None, help="Dataset directory path")
+    parser.add_argument("--force", "-f", action="store_true", help="Force re-processing of files even if marked as processed")
+    args = parser.parse_args()

-    if args .dir :
-        dataset_dir =args .dir
-    else :
-        dataset_dir =os .path .join (os .path .dirname (__file__ ),'../dataset')
+    # Check vector store mode
+    use_atlas = os.environ.get("ATLAS_VECTORS", "false").lower() == "true"
+    store_name = "MongoDB Atlas Vector Search" if use_atlas else "ChromaDB"
+    print(f"--- Vector Store Mode: {store_name} ---")

-    populate_from_dataset (dataset_dir ,category =args .category )
+    if args.dir:
+        dataset_dir = args.dir
+    else:
+        dataset_dir = os.path.join(os.path.dirname(__file__), '../dataset')
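+
+    # Example invocations (paths and category names are illustrative):
+    #   python scripts/populate_db.py --dir ./dataset --category energy
+    #   python scripts/populate_db.py --force    # re-ingest files already marked as processed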
+ populate_from_dataset(dataset_dir, category=args.category, force=args.force) diff --git a/backend/src/chroma/chroma_store.py b/backend/src/chroma/chroma_store.py new file mode 100644 index 0000000..e03c2ef --- /dev/null +++ b/backend/src/chroma/chroma_store.py @@ -0,0 +1,81 @@ +import chromadb +import os + +CHROMA_HOST = os.environ.get("CHROMA_HOST", "http://chroma.sirblob.co") +COLLECTION_NAME = "rag_documents" + +_client =None + +def get_chroma_client (): + global _client + if _client is None : + _client =chromadb .HttpClient (host =CHROMA_HOST ) + return _client + +def get_collection (collection_name =COLLECTION_NAME ): + client =get_chroma_client () + return client .get_or_create_collection (name =collection_name ) + +def insert_documents (texts ,embeddings ,collection_name =COLLECTION_NAME ,metadata_list =None ): + collection =get_collection (collection_name ) + + ids =[f"doc_{i }_{hash (text )}"for i ,text in enumerate (texts )] + + if metadata_list : + collection .add ( + ids =ids , + embeddings =embeddings , + documents =texts , + metadatas =metadata_list + ) + else : + collection .add ( + ids =ids , + embeddings =embeddings , + documents =texts + ) + + return len (texts ) + +def search_documents (query_embedding ,collection_name =COLLECTION_NAME ,num_results =5 ,filter_metadata =None ): + collection =get_collection (collection_name ) + + query_params ={ + "query_embeddings":[query_embedding ], + "n_results":num_results + } + + if filter_metadata : + query_params ["where"]=filter_metadata + + results =collection .query (**query_params ) + + output =[] + if results and results ["documents"]: + for i ,doc in enumerate (results ["documents"][0 ]): + score =results ["distances"][0 ][i ]if "distances"in results else None + meta =results ["metadatas"][0 ][i ]if "metadatas"in results else {} + output .append ({ + "text":doc , + "score":score , + "metadata":meta + }) + + return output + +def delete_documents_by_source (source_file ,collection_name =COLLECTION_NAME ): + collection =get_collection (collection_name ) + results =collection .get (where ={"source":source_file }) + if results ["ids"]: + collection .delete (ids =results ["ids"]) + return len (results ["ids"]) + return 0 + +def get_all_metadatas (collection_name =COLLECTION_NAME ,limit =None ): + collection =get_collection (collection_name ) + + if limit : + results =collection .get (include =["metadatas"],limit =limit ) + else : + results =collection .get (include =["metadatas"]) + return results ["metadatas"]if results and "metadatas"in results else [] diff --git a/backend/src/chroma/vector_store.py b/backend/src/chroma/vector_store.py index 5b0f24b..998ccc0 100644 --- a/backend/src/chroma/vector_store.py +++ b/backend/src/chroma/vector_store.py @@ -1,80 +1,35 @@ -import chromadb +import os +from . 
import chroma_store +from ..mongo import vector_store as mongo_store -CHROMA_HOST ="http://chroma.sirblob.co" -COLLECTION_NAME ="rag_documents" +def _use_atlas(): + return os.environ.get("ATLAS_VECTORS", "false").lower() == "true" -_client =None +def get_chroma_client(): + # Only used by chroma-specific things if external + return chroma_store.get_chroma_client() -def get_chroma_client (): - global _client - if _client is None : - _client =chromadb .HttpClient (host =CHROMA_HOST ) - return _client +def get_collection(collection_name=chroma_store.COLLECTION_NAME): + if _use_atlas(): + return mongo_store.get_collection(collection_name) + return chroma_store.get_collection(collection_name) -def get_collection (collection_name =COLLECTION_NAME ): - client =get_chroma_client () - return client .get_or_create_collection (name =collection_name ) +def insert_documents(texts, embeddings, collection_name=chroma_store.COLLECTION_NAME, metadata_list=None): + if _use_atlas(): + return mongo_store.insert_documents(texts, embeddings, collection_name, metadata_list) + return chroma_store.insert_documents(texts, embeddings, collection_name, metadata_list) -def insert_documents (texts ,embeddings ,collection_name =COLLECTION_NAME ,metadata_list =None ): - collection =get_collection (collection_name ) +def search_documents(query_embedding, collection_name=chroma_store.COLLECTION_NAME, num_results=5, filter_metadata=None): + if _use_atlas(): + return mongo_store.search_documents(query_embedding, collection_name, num_results, filter_metadata) + return chroma_store.search_documents(query_embedding, collection_name, num_results, filter_metadata) - ids =[f"doc_{i }_{hash (text )}"for i ,text in enumerate (texts )] +def delete_documents_by_source(source_file, collection_name=chroma_store.COLLECTION_NAME): + if _use_atlas(): + return mongo_store.delete_documents_by_source(source_file, collection_name) + return chroma_store.delete_documents_by_source(source_file, collection_name) - if metadata_list : - collection .add ( - ids =ids , - embeddings =embeddings , - documents =texts , - metadatas =metadata_list - ) - else : - collection .add ( - ids =ids , - embeddings =embeddings , - documents =texts - ) - - return len (texts ) - -def search_documents (query_embedding ,collection_name =COLLECTION_NAME ,num_results =5 ,filter_metadata =None ): - collection =get_collection (collection_name ) - - query_params ={ - "query_embeddings":[query_embedding ], - "n_results":num_results - } - - if filter_metadata : - query_params ["where"]=filter_metadata - - results =collection .query (**query_params ) - - output =[] - if results and results ["documents"]: - for i ,doc in enumerate (results ["documents"][0 ]): - score =results ["distances"][0 ][i ]if "distances"in results else None - meta =results ["metadatas"][0 ][i ]if "metadatas"in results else {} - output .append ({ - "text":doc , - "score":score , - "metadata":meta - }) - - return output - -def delete_documents_by_source (source_file ,collection_name =COLLECTION_NAME ): - collection =get_collection (collection_name ) - results =collection .get (where ={"source":source_file }) - if results ["ids"]: - collection .delete (ids =results ["ids"]) - return len (results ["ids"]) - return 0 - -def get_all_metadatas (collection_name =COLLECTION_NAME ,limit =None ): - collection =get_collection (collection_name ) - - if limit : - results =collection .get (include =["metadatas"],limit =limit ) - else : - results =collection .get (include =["metadatas"]) - return results ["metadatas"]if 
results and "metadatas"in results else [] +def get_all_metadatas(collection_name=chroma_store.COLLECTION_NAME, limit=None): + if _use_atlas(): + return mongo_store.get_all_metadatas(collection_name, limit) + return chroma_store.get_all_metadatas(collection_name, limit) diff --git a/backend/src/mongo/vector_store.py b/backend/src/mongo/vector_store.py index 2919678..3343f1f 100644 --- a/backend/src/mongo/vector_store.py +++ b/backend/src/mongo/vector_store.py @@ -1,49 +1,142 @@ -from .connection import get_mongo_client +import os +from .connection import get_mongo_client -def insert_rag_documents (documents ,collection_name ="rag_documents",db_name ="vectors_db"): - client =get_mongo_client () - db =client .get_database (db_name ) - collection =db [collection_name ] +DB_NAME = "ethix_vectors" +# Default collection name to match Chroma's default +DEFAULT_COLLECTION_NAME = "rag_documents" - if documents : - result =collection .insert_many (documents ) - return len (result .inserted_ids ) - return 0 +def get_collection(collection_name=DEFAULT_COLLECTION_NAME): + client = get_mongo_client() + db = client[DB_NAME] + return db[collection_name] -def search_rag_documents (query_embedding ,collection_name ="rag_documents",db_name ="vectors_db",num_results =5 ): - client =get_mongo_client () - db =client .get_database (db_name ) - collection =db [collection_name ] +def insert_documents(texts, embeddings, collection_name=DEFAULT_COLLECTION_NAME, metadata_list=None): + collection = get_collection(collection_name) + + docs = [] + for i, text in enumerate(texts): + doc = { + "text": text, + "embedding": embeddings[i], + "metadata": metadata_list[i] if metadata_list else {} + } + # Flatten metadata for easier filtering if needed, or keep in 'metadata' subdoc + # Atlas Vector Search usually suggests keeping it cleaner, but 'metadata' field is fine + # provided index is on metadata.field + + # Add source file to top level for easier deletion + if metadata_list and "source" in metadata_list[i]: + doc["source"] = metadata_list[i]["source"] + + docs.append(doc) + + if docs: + result = collection.insert_many(docs) + return len(result.inserted_ids) + return 0 - pipeline =[ - { - "$vectorSearch":{ - "index":"vector_index", - "path":"embedding", - "queryVector":query_embedding , - "numCandidates":num_results *10 , - "limit":num_results - } - }, - { - "$project":{ - "_id":0 , - "text":1 , - "score":{"$meta":"vectorSearchScore"} - } - } +def search_documents(query_embedding, collection_name=DEFAULT_COLLECTION_NAME, num_results=5, filter_metadata=None): + collection = get_collection(collection_name) + + # Construct Atlas Search Aggregation + # Note: 'index' name depends on what user created. Default is often 'default' or 'vector_index'. + # We'll use 'vector_index' as a convention. + + # Filter construction + pre_filter = None + if filter_metadata: + # Atlas Search filter syntax is different from simple match + # It requires MQL (Mongo Query Language) match or compound operators + # For simplicity in $vectorSearch, strict filters are passed in 'filter' field (if serverless/v2) + # or separate match stage. + + # Modern Atlas Vector Search ($vectorSearch) supports 'filter' inside the operator + + must_clauses = [] + for k, v in filter_metadata.items(): + # Assuming filter_metadata matches metadata fields inside 'metadata' object + # or we map specific known fields. + # In Chroma, filters are flat. In insert above, we put them in 'metadata'. 
+ must_clauses.append({ + "term": { + "path": f"metadata.{k}", + "value": v + } + }) + + if must_clauses: + if len(must_clauses) == 1: + pre_filter = must_clauses[0] + else: + pre_filter = { + "compound": { + "filter": must_clauses + } + } + + pipeline = [ + { + "$vectorSearch": { + "index": "vector_index", + "path": "embedding", + "queryVector": query_embedding, + "numCandidates": num_results * 10, + "limit": num_results + } + } ] + + # Apply filter if exists + if pre_filter: + pipeline[0]["$vectorSearch"]["filter"] = pre_filter - return list (collection .aggregate (pipeline )) + pipeline.append({ + "$project": { + "_id": 0, + "text": 1, + "metadata": 1, + "score": {"$meta": "vectorSearchScore"} + } + }) -def is_file_processed (filename ,log_collection ="ingested_files",db_name ="vectors_db"): - client =get_mongo_client () - db =client .get_database (db_name ) - collection =db [log_collection ] - return collection .find_one ({"filename":filename })is not None + try: + results = list(collection.aggregate(pipeline)) + + # Format to match Chroma output style + # Chroma: [{'text': ..., 'score': ..., 'metadata': ...}] + output = [] + for doc in results: + output.append({ + "text": doc.get("text", ""), + "score": doc.get("score", 0), + "metadata": doc.get("metadata", {}) + }) + return output + + except Exception as e: + print(f"Atlas Vector Search Error: {e}") + # Fallback to empty if index doesn't exist etc + return [] -def log_processed_file (filename ,log_collection ="ingested_files",db_name ="vectors_db"): - client =get_mongo_client () - db =client .get_database (db_name ) - collection =db [log_collection ] - collection .insert_one ({"filename":filename ,"processed_at":1 }) +def delete_documents_by_source(source_file, collection_name=DEFAULT_COLLECTION_NAME): + collection = get_collection(collection_name) + # We added 'source' to top level in insert_documents for exactly this efficiency + result = collection.delete_many({"source": source_file}) + return result.deleted_count + +def get_all_metadatas(collection_name=DEFAULT_COLLECTION_NAME, limit=None): + collection = get_collection(collection_name) + + cursor = collection.find({}, {"metadata": 1, "source": 1, "_id": 0}) + if limit: + cursor = cursor.limit(limit) + + metadatas = [] + for doc in cursor: + meta = doc.get("metadata", {}) + # Ensure 'source' is in metadata if we pulled it out + if "source" in doc and "source" not in meta: + meta["source"] = doc["source"] + metadatas.append(meta) + + return metadatas diff --git a/backend/src/ollama/detector.py b/backend/src/ollama/detector.py index 33105e5..aedb90a 100644 --- a/backend/src/ollama/detector.py +++ b/backend/src/ollama/detector.py @@ -1,5 +1,6 @@ import base64 import json +import os import re from pathlib import Path from typing import Dict ,List ,Optional ,Union @@ -11,8 +12,8 @@ except ImportError : OLLAMA_AVAILABLE =False print ("Ollama not installed. Run: pip install ollama") -DEFAULT_HOST ="https://ollama.sirblob.co" -DEFAULT_MODEL ="ministral-3:latest" +DEFAULT_HOST = os.environ.get("OLLAMA_HOST", "https://ollama.sirblob.co") +DEFAULT_MODEL = "ministral-3:latest" DEFAULT_PROMPT ="""Analyze this image and identify ALL logos, brand names, and company names visible. 
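The next hunk applies the same environment-based host override to the Ollama client used for embeddings. A minimal usage sketch (assuming `get_embedding` returns the raw embedding vector as a list of floats, and that a local Ollama instance is running):

```python
import os

# embeddings.py creates its client at import time, so set the host first.
# The localhost URL is an assumption for local development.
os.environ["OLLAMA_HOST"] = "http://localhost:11434"

from src.rag.embeddings import get_embedding

vec = get_embedding("recycled PET packaging")
print(len(vec))  # nomic-embed-text typically produces 768-dimensional vectors
```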
diff --git a/backend/src/rag/embeddings.py b/backend/src/rag/embeddings.py
index 77cb6b6..13875be 100644
--- a/backend/src/rag/embeddings.py
+++ b/backend/src/rag/embeddings.py
@@ -1,8 +1,9 @@
 import ollama
 import os

-client =ollama .Client (host ="https://ollama.sirblob.co")
-DEFAULT_MODEL ="nomic-embed-text:latest"
+OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "https://ollama.sirblob.co")
+client = ollama.Client(host=OLLAMA_HOST)
+DEFAULT_MODEL = "nomic-embed-text:latest"

 def get_embedding (text ,model =DEFAULT_MODEL ):
     try :
diff --git a/backend/src/routes/gemini.py b/backend/src/routes/gemini.py
index 9f66a40..a1eddf9 100644
--- a/backend/src/routes/gemini.py
+++ b/backend/src/routes/gemini.py
@@ -1,37 +1,69 @@
-from flask import Blueprint ,request ,jsonify
-from src .rag .gemeni import GeminiClient
-from src .gemini import ask_gemini_with_rag
+from flask import Blueprint, request, jsonify
+from src.rag.gemeni import GeminiClient
+from src.gemini import ask_gemini_with_rag
+from src.chroma.vector_store import search_documents
+from src.rag.embeddings import get_embedding

-gemini_bp =Blueprint ('gemini',__name__ )
-brain =None
+gemini_bp = Blueprint('gemini', __name__)
+brain = None

-def get_brain ():
-    global brain
-    if brain is None :
-        brain =GeminiClient ()
-    return brain
+def get_brain():
+    global brain
+    if brain is None:
+        brain = GeminiClient()
+    return brain

-@gemini_bp .route ('/ask',methods =['POST'])
-def ask ():
-    data =request .json
-    prompt =data .get ("prompt")
-    context =data .get ("context","")
+@gemini_bp.route('/ask', methods=['POST'])
+def ask():
+    data = request.json
+    prompt = data.get("prompt")
+    context = data.get("context", "")

-    if not prompt :
-        return jsonify ({"error":"No prompt provided"}),400
+    if not prompt:
+        return jsonify({"error": "No prompt provided"}), 400

-    try :
-        client =get_brain ()
-        response =client .ask (prompt ,context )
-        return jsonify ({
-        "status":"success",
-        "reply":response
+    try:
+        # Step 1: Retrieve relevant context from the vector store (RAG)
+        print(f"Generating embedding for prompt: {prompt}")
+        query_embedding = get_embedding(prompt)
+
+        print("Searching vector store for context...")
+        search_results = search_documents(query_embedding, num_results=15)
+
+        retrieved_context = ""
+        if search_results:
+            print(f"Found {len(search_results)} documents.")
+            retrieved_context = "RELEVANT INFORMATION FROM DATABASE:\n"
+            for res in search_results:
+                # Include metadata if useful, e.g.
brand name or date + meta = res.get('metadata', {}) + source_info = f"[Source: {meta.get('type', 'doc')} - {meta.get('product_name', 'Unknown')}]" + retrieved_context += f"{source_info}\n{res['text']}\n\n" + else: + print("No relevant documents found.") + + # Step 2: Combine manual context (if any) with retrieved context + full_context = context + if retrieved_context: + if full_context: + full_context += "\n\n" + retrieved_context + else: + full_context = retrieved_context + + # Step 3: Ask Gemini + client = get_brain() + response = client.ask(prompt, full_context) + + return jsonify({ + "status": "success", + "reply": response }) - except Exception as e : - return jsonify ({ - "status":"error", - "message":str (e ) - }),500 + except Exception as e: + print(f"Error in RAG flow: {e}") + return jsonify({ + "status": "error", + "message": str(e) + }), 500 @gemini_bp .route ('/rag',methods =['POST']) def rag (): diff --git a/backend/src/routes/incidents.py b/backend/src/routes/incidents.py index b307df4..191840b 100644 --- a/backend/src/routes/incidents.py +++ b/backend/src/routes/incidents.py @@ -306,7 +306,7 @@ This incident has been documented for future reference and to help inform sustai metadata_list =[metadata ] ) - print (f"✓ Incident #{incident_id } saved to ChromaDB for AI chat context") + print (f"✓ Incident #{incident_id } saved to Vector Store for AI chat context") diff --git a/backend/src/routes/reports.py b/backend/src/routes/reports.py index bdb22b7..5f9f080 100644 --- a/backend/src/routes/reports.py +++ b/backend/src/routes/reports.py @@ -1,14 +1,220 @@ from flask import Blueprint ,jsonify ,request -from src .chroma .vector_store import get_all_metadatas ,search_documents -from src .rag .embeddings import get_embedding +from src .chroma .vector_store import get_all_metadatas ,search_documents, insert_documents +from src .rag .embeddings import get_embedding, get_embeddings_batch +from src.rag.ingest import load_pdf, chunk_text +import os +import base64 +import re +from google import genai reports_bp =Blueprint ('reports',__name__ ) +# Validation prompt for checking if PDF is a legitimate environmental report +REPORT_VALIDATION_PROMPT = """You are an expert document classifier. Analyze the following text extracted from a PDF and determine if it is a legitimate corporate environmental/sustainability report. + +A legitimate report should contain: +1. Company name and branding +2. Environmental metrics (emissions, energy usage, waste, water) +3. Sustainability goals or targets +4. Time periods/years covered +5. 
Professional report structure + +Analyze this text and respond in JSON format: +{ + "is_valid_report": true/false, + "confidence": "high"/"medium"/"low", + "company_name": "extracted company name or null", + "report_year": "extracted year or null", + "report_type": "sustainability/environmental/impact/ESG/unknown", + "sector": "Tech/Energy/Automotive/Aerospace/Retail/Manufacturing/Other", + "key_topics": ["list of main topics covered"], + "reasoning": "brief explanation of your assessment" +} + +TEXT TO ANALYZE: +""" + +def validate_report_with_ai(text_content): + """Use Gemini to validate if the PDF is a legitimate environmental report.""" + try: + # Take first 15000 chars for validation (enough to assess legitimacy) + sample_text = text_content[:15000] if len(text_content) > 15000 else text_content + + client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY")) + response = client.models.generate_content( + model="gemini-2.0-flash", + contents=f"{REPORT_VALIDATION_PROMPT}\n{sample_text}" + ) + + # Parse the JSON response + response_text = response.text + + # Extract JSON from response + import json + json_match = re.search(r'\{[\s\S]*\}', response_text) + if json_match: + return json.loads(json_match.group()) + + return {"is_valid_report": False, "reasoning": "Could not parse AI response"} + except Exception as e: + print(f"AI validation error: {e}") + return {"is_valid_report": False, "reasoning": str(e)} + + +@reports_bp.route('/upload', methods=['POST']) +def upload_report(): + """Upload and verify a company environmental report PDF.""" + try: + data = request.json + pdf_data = data.get('pdf_data') + company_name = data.get('company_name', '').strip() + + if not pdf_data: + return jsonify({ + 'status': 'error', + 'step': 'validation', + 'message': 'No PDF data provided' + }), 400 + + # Step 1: Decode PDF + try: + # Remove data URL prefix if present + if ',' in pdf_data: + pdf_data = pdf_data.split(',')[1] + pdf_bytes = base64.b64decode(pdf_data) + except Exception as e: + return jsonify({ + 'status': 'error', + 'step': 'decode', + 'message': f'Failed to decode PDF: {str(e)}' + }), 400 + + # Step 2: Extract text from PDF + import io + from pypdf import PdfReader + + try: + pdf_file = io.BytesIO(pdf_bytes) + reader = PdfReader(pdf_file) + + all_text = "" + page_count = len(reader.pages) + + for page in reader.pages: + text = page.extract_text() + if text: + all_text += text + "\n\n" + + if len(all_text.strip()) < 500: + return jsonify({ + 'status': 'error', + 'step': 'extraction', + 'message': 'PDF contains insufficient text content. It may be image-based or corrupted.' 
+ }), 400 + + except Exception as e: + return jsonify({ + 'status': 'error', + 'step': 'extraction', + 'message': f'Failed to extract text from PDF: {str(e)}' + }), 400 + + # Step 3: Validate with AI + validation = validate_report_with_ai(all_text) + + if not validation.get('is_valid_report', False): + return jsonify({ + 'status': 'rejected', + 'step': 'validation', + 'message': 'This does not appear to be a legitimate environmental/sustainability report.', + 'validation': validation + }), 400 + + # Step 4: Generate filename + detected_company = validation.get('company_name') or company_name or 'Unknown' + detected_year = validation.get('report_year') or '2024' + report_type = validation.get('report_type', 'sustainability') + sector = validation.get('sector', 'Other') + + # Clean company name for filename + safe_company = re.sub(r'[^a-zA-Z0-9]', '-', detected_company).lower() + filename = f"{safe_company}-{detected_year}-{report_type}-report.pdf" + + # Step 5: Save PDF to dataset folder + current_dir = os.path.dirname(os.path.abspath(__file__)) + dataset_dir = os.path.join(current_dir, '..', '..', 'dataset') + os.makedirs(dataset_dir, exist_ok=True) + + file_path = os.path.join(dataset_dir, filename) + + # Check if file already exists + if os.path.exists(file_path): + # Add timestamp to make unique + import time + timestamp = int(time.time()) + filename = f"{safe_company}-{detected_year}-{report_type}-report-{timestamp}.pdf" + file_path = os.path.join(dataset_dir, filename) + + with open(file_path, 'wb') as f: + f.write(pdf_bytes) + + # Step 6: Chunk and embed for RAG + chunks = chunk_text(all_text, target_length=2000, overlap=100) + + if not chunks: + chunks = [all_text[:4000]] + + # Get embeddings in batches + embeddings = get_embeddings_batch(chunks) + + # Create metadata for each chunk + metadata_list = [] + for i, chunk in enumerate(chunks): + metadata_list.append({ + 'source': filename, + 'company_name': detected_company, + 'year': detected_year, + 'sector': sector, + 'report_type': report_type, + 'chunk_index': i, + 'total_chunks': len(chunks), + 'page_count': page_count, + 'type': 'company_report' + }) + + # Insert into ChromaDB + insert_documents(chunks, embeddings, metadata_list=metadata_list) + + return jsonify({ + 'status': 'success', + 'message': 'Report verified and uploaded successfully', + 'filename': filename, + 'validation': validation, + 'stats': { + 'page_count': page_count, + 'text_length': len(all_text), + 'chunks_created': len(chunks), + 'company_name': detected_company, + 'year': detected_year, + 'sector': sector, + 'report_type': report_type + } + }) + + except Exception as e: + print(f"Upload error: {e}") + import traceback + traceback.print_exc() + return jsonify({ + 'status': 'error', + 'step': 'unknown', + 'message': str(e) + }), 500 + + @reports_bp .route ('/',methods =['GET']) def get_reports (): try : - - metadatas =get_all_metadatas () unique_reports ={} @@ -24,12 +230,6 @@ def get_reports (): if filename not in unique_reports : - - - - - - company_name ="Unknown" year ="N/A" sector ="Other" @@ -106,15 +306,10 @@ def search_reports (): try : import re - - query_embedding =get_embedding (query ) + results = search_documents (query_embedding ,num_results =50 ) - - results =search_documents (query_embedding ,num_results =50 ) - - query_lower =query .lower () - + query_lower = query .lower () def extract_company_info (filename ): company_name ="Unknown" @@ -224,10 +419,55 @@ def view_report_file (filename ): import os from flask import send_from_directory - - - 
current_dir =os .path .dirname (os .path .abspath (__file__ )) dataset_dir =os .path .join (current_dir ,'..','..','dataset') return send_from_directory (dataset_dir ,filename ) + + +@reports_bp.route('/stats', methods=['GET']) +def get_stats(): + """Get real statistics for the landing page""" + try: + # 1. Count Verified Reports (Company Reports) + metadatas = get_all_metadatas() + unique_reports = set() + unique_sectors = set() + + for meta in metadatas: + filename = meta.get('source') or meta.get('filename') + # Check if it's a company report (not incident) + if filename and not filename.startswith('incident_') and meta.get('type') != 'incident_report': + unique_reports.add(filename) + + # Count sectors + sector = meta.get('sector') + if sector and sector != 'Other': + unique_sectors.add(sector) + + verified_reports_count = len(unique_reports) + active_sectors_count = len(unique_sectors) + + # Fallback if no specific sectors found but we have reports + if active_sectors_count == 0 and verified_reports_count > 0: + active_sectors_count = 1 + + # 2. Count Products Scanned / Incidents + from src.mongo.connection import get_mongo_client + client = get_mongo_client() + db = client["ethix"] + # Count all incidents (users scanning products) + incidents_count = db["incidents"].count_documents({}) + + return jsonify({ + "verified_reports": verified_reports_count, + "incidents_reported": incidents_count, + "active_sectors": active_sectors_count + }) + except Exception as e: + print(f"Error fetching stats: {e}") + return jsonify({ + "verified_reports": 0, + "incidents_reported": 0, + "active_sectors": 0 + }) diff --git a/backend/src/scripts/sync_incidents.py b/backend/src/scripts/sync_incidents.py new file mode 100644 index 0000000..1667819 --- /dev/null +++ b/backend/src/scripts/sync_incidents.py @@ -0,0 +1,51 @@ + +import os +import sys +from datetime import datetime + +# Add the backend directory to sys.path so we can import src modules +sys.path.append(os.path.join(os.path.dirname(__file__), '../../')) + +from src.mongo.connection import get_mongo_client +from src.routes.incidents import save_to_chromadb + +def sync_incidents(): + print("Starting sync of incidents from MongoDB to ChromaDB...") + + client = get_mongo_client() + db = client["ethix"] + collection = db["incidents"] + + # Find all confirmed greenwashing incidents + cursor = collection.find({"is_greenwashing": True}) + + count = 0 + synced = 0 + errors = 0 + + for incident in cursor: + count += 1 + incident_id = str(incident["_id"]) + product_name = incident.get("product_name", "Unknown") + print(f"Processing Incident #{incident_id} ({product_name})...") + + try: + # Convert ObjectId to string for the function if needed, + # but save_to_chromadb expects the dict and the id string. + # Convert _id in dict to string just in case + incident["_id"] = incident_id + + save_to_chromadb(incident, incident_id) + synced += 1 + print(f" -> Successfully synced.") + except Exception as e: + errors += 1 + print(f" -> FAILED to sync: {e}") + + print(f"\nSync Complete.") + print(f"Total processed: {count}") + print(f"Successfully synced: {synced}") + print(f"Errors: {errors}") + +if __name__ == "__main__": + sync_incidents() diff --git a/frontend/README.md b/frontend/README.md index 858d179..361b78a 100644 --- a/frontend/README.md +++ b/frontend/README.md @@ -1,7 +1,112 @@ -# Tauri + SvelteKit + TypeScript +# Ethix Frontend -This template should help get you started developing with Tauri, SvelteKit and TypeScript in Vite. 
+A SvelteKit web application for the Ethix greenwashing detection platform. Scan products, report misleading environmental claims, and chat with an AI assistant about sustainability. -## Recommended IDE Setup +## Technology Stack -[VS Code](https://code.visualstudio.com/) + [Svelte](https://marketplace.visualstudio.com/items?itemName=svelte.svelte-vscode) + [Tauri](https://marketplace.visualstudio.com/items?itemName=tauri-apps.tauri-vscode) + [rust-analyzer](https://marketplace.visualstudio.com/items?itemName=rust-lang.rust-analyzer). +| Component | Technology | +|-----------|------------| +| Framework | SvelteKit | +| UI Library | Svelte 5 | +| Language | TypeScript | +| Styling | TailwindCSS 4 | +| Build Tool | Vite | +| Desktop App | Tauri | +| Icons | Iconify | +| 3D Graphics | Three.js | +| Markdown | marked | + +## Prerequisites + +- Node.js 18+ or Bun +- Backend server running on `http://localhost:5000` + +## Installation + +Using Bun (recommended): + +```bash +bun install +``` + +Or using npm: + +```bash +npm install +``` + +## Development + +Start the development server: + +```bash +bun run dev +# or +npm run dev +``` + +The application will be available at `http://localhost:5173`. + +## Building + +### Web Build + +```bash +bun run build +# or +npm run build +``` + +Preview the production build: + +```bash +bun run preview +# or +npm run preview +``` + +### Desktop Build (Tauri) + +```bash +bun run tauri build +# or +npm run tauri build +``` + +## Features + +### Home Page + +- Responsive design with separate mobile and web layouts +- Animated parallax landscape background +- Quick access to all main features + +### AI Chat Assistant + +- Powered by Google Gemini with RAG context +- Real-time conversation interface +- Sustainability and greenwashing expertise +- Message history within session + +### Greenwashing Report Submission + +- Two report types: Product Incident or Company Report +- Image upload for product evidence +- PDF upload for company sustainability reports +- Real-time analysis progress indicator +- Structured verdict display with confidence levels + +### Catalogue Browser + +- Browse company sustainability reports +- View user-submitted incidents +- Category filtering (Tech, Energy, Automotive, etc.) +- Semantic search functionality +- Pagination for large datasets +- Detailed modal views for reports and incidents + +### Product Scanner + +- Camera integration for scanning product labels +- Brand/logo detection via AI vision +- Direct report submission from scan results diff --git a/frontend/src/app.html b/frontend/src/app.html index 1ccbb4a..b1c561b 100644 --- a/frontend/src/app.html +++ b/frontend/src/app.html @@ -3,7 +3,6 @@ - %sveltekit.head% diff --git a/frontend/src/lib/components/CustomTabBar.svelte b/frontend/src/lib/components/CustomTabBar.svelte index 1ff890d..178731d 100644 --- a/frontend/src/lib/components/CustomTabBar.svelte +++ b/frontend/src/lib/components/CustomTabBar.svelte @@ -11,7 +11,7 @@ const tabs = [ { name: "Goals", - route: "/community", + route: "/goal", icon: "ri:flag-fill", activeColor: "#34d399", }, diff --git a/frontend/src/lib/components/HomeScreen.svelte b/frontend/src/lib/components/HomeScreen.svelte index ad209a6..5aa464f 100644 --- a/frontend/src/lib/components/HomeScreen.svelte +++ b/frontend/src/lib/components/HomeScreen.svelte @@ -43,7 +43,11 @@
- + Ethix Logo

Ethix

@@ -213,13 +217,17 @@ .logo-icon { width: 48px; height: 48px; - background: linear-gradient(135deg, #4ade80, #22c55e); border-radius: 16px; display: flex; align-items: center; justify-content: center; - color: #052e16; - box-shadow: 0 4px 16px rgba(74, 222, 128, 0.4); + overflow: hidden; + } + + .logo-img { + width: 100%; + height: 100%; + object-fit: contain; } .app-name { diff --git a/frontend/src/lib/components/MobileHomePage.svelte b/frontend/src/lib/components/MobileHomePage.svelte index 0ed9c0f..d918d02 100644 --- a/frontend/src/lib/components/MobileHomePage.svelte +++ b/frontend/src/lib/components/MobileHomePage.svelte @@ -71,10 +71,10 @@
-
@@ -222,6 +222,13 @@ display: flex; align-items: center; justify-content: center; + overflow: hidden; + } + + .avatar-logo { + width: 36px; + height: 36px; + object-fit: contain; } .header-text { diff --git a/frontend/src/lib/components/WebHomePage.svelte b/frontend/src/lib/components/WebHomePage.svelte index ab54b30..b4fa084 100644 --- a/frontend/src/lib/components/WebHomePage.svelte +++ b/frontend/src/lib/components/WebHomePage.svelte @@ -1,650 +1,489 @@ - - -
- - -
-
-
- - See the real impact -
-

- Know What
You Buy. -

-

- Scan a product. See if it's actually good for the planet. Find - better alternatives if it's not. Simple as that. -

- -
- -
-
-
- {#key scoreIndex} -
- {scores[scoreIndex].label} -
- {/key} -
-
-
- -
- {#key scoreIndex} -
- -
-
- {scores[scoreIndex].label} -
-
- {scores[scoreIndex].score} -
-
-
- {/key} -
-
-
-
- -
- -
-
- {#each stats as stat} -
-
{stat.value}
-
{stat.label}
-
- {/each} -
-
- -
-
-

How It Works

-

Tools to help you shop smarter.

-
-
- {#each features as feature} -
-
- -
-

{feature.title}

-

{feature.desc}

-
- {/each} -
-
- -
-
-

Latest News

-

- Updates from the world of sustainability. -

-
-
- {#each news as item (item.id)} - - {/each} -
-
- - -
- - + + +
+ + +
+
+ + + Sustainability Made Simple + + +

+ Know What
You're Buying +

+ +

+ Scan products, detect greenwashing, and make informed choices. + Ethix helps you see through misleading environmental claims. +

+ + +
+
+ +
+ +
+
+ {#each stats as stat} +
+ {stat.value} + {stat.label} +
+ {/each} +
+
+ +
+

How It Works

+

Simple tools to help you shop smarter

+ +
+ {#each features as feature} +
+
+ +
+

{feature.title}

+

{feature.desc}

+
+ {/each} +
+
+ +
+
+
+ + What is Greenwashing? +
+

+ Greenwashing is when companies make misleading claims about how + environmentally friendly their products are. Common tactics + include vague terms like "eco-friendly" without proof, + highlighting minor green features while hiding major harms, and + using green imagery without substance. +

+ + Report suspicious claims + + +
+
+ +
+

Ready to start?

+

+ Every scan helps build a more transparent marketplace. +

+ +
+ + +
+ + diff --git a/frontend/src/lib/components/catalogue/CatalogueHeader.svelte b/frontend/src/lib/components/catalogue/CatalogueHeader.svelte index db19c4b..d460187 100644 --- a/frontend/src/lib/components/catalogue/CatalogueHeader.svelte +++ b/frontend/src/lib/components/catalogue/CatalogueHeader.svelte @@ -24,16 +24,15 @@
-

Sustainability Database

-

+

{#if viewMode === "company"} Search within verified company reports and impact assessments {:else} @@ -42,19 +41,18 @@

-
Company
(selectedCategory = category)} > {category} diff --git a/frontend/src/lib/components/catalogue/ReportCard.svelte b/frontend/src/lib/components/catalogue/ReportCard.svelte index 6b8ebf8..c734cef 100644 --- a/frontend/src/lib/components/catalogue/ReportCard.svelte +++ b/frontend/src/lib/components/catalogue/ReportCard.svelte @@ -43,57 +43,63 @@ +
+ {:else if analysisResult.is_greenwashing} +
Greenwashing Detected! - {:else} -
- -
-

Report Analyzed

- {/if} -
-
- Verdict: - {analysisResult.analysis?.verdict || "N/A"} -
-
- Confidence: - - {analysisResult.analysis?.confidence || "unknown"} - -
- {#if analysisResult.is_greenwashing} +
+
+ Verdict: + {analysisResult.analysis?.verdict || + "N/A"} +
+
+ Confidence: + + {analysisResult.analysis?.confidence || + "unknown"} + +
Severity:
- {/if} -
- Analysis: -

- {analysisResult.analysis?.reasoning || - "No details available"} -

+
+ Analysis: +

+ {analysisResult.analysis?.reasoning || + "No details available"} +

+
+ {#if analysisResult.detected_brand && analysisResult.detected_brand !== "Unknown"} +
+ Detected Brand: + {analysisResult.detected_brand} +
+ {/if}
- {#if analysisResult.detected_brand && analysisResult.detected_brand !== "Unknown"} + + + {:else} + +
+ +
+

Report Analyzed

+ +
- Detected Brand: + Verdict: {analysisResult.detected_brand}{analysisResult.analysis?.verdict || + "N/A"}
- {/if} -
+
+ Confidence: + + {analysisResult.analysis?.confidence || + "unknown"} + +
+
+ Analysis: +

+ {analysisResult.analysis?.reasoning || + "No details available"} +

+
+ {#if analysisResult.detected_brand && analysisResult.detected_brand !== "Unknown"} +
+ Detected Brand: + {analysisResult.detected_brand} +
+ {/if} +
- + + {/if}
{:else}
@@ -263,23 +483,25 @@
-
- -
- - + {#if reportType === "product"} +
+ +
+ + +
-
+ {/if}
@@ -366,7 +588,11 @@ width="24" class="pulse" > - Analyzing Report + + {reportType === "product" + ? "Analyzing Report" + : "Verifying & Uploading Report"} +
{#each progressSteps as step} @@ -417,9 +643,19 @@ onclick={handleSubmit} type="button" > - - Analyze for Greenwashing + {#if reportType === "product"} + + Analyze for Greenwashing + {:else} + + Upload & Verify Report + {/if} {/if}
@@ -443,7 +679,7 @@ .content-container { position: relative; z-index: 10; - padding: 80px 20px 40px; + padding: 120px 20px 40px; max-width: 500px; margin: 0 auto; } @@ -821,6 +1057,51 @@ background: rgba(255, 255, 255, 0.2); } + .action-buttons { + display: flex; + gap: 12px; + margin-top: 24px; + flex-wrap: wrap; + justify-content: center; + } + + .btn-primary { + background: #22c55e; + color: #052e16; + padding: 12px 24px; + border-radius: 50px; + font-weight: 700; + cursor: pointer; + display: flex; + align-items: center; + gap: 8px; + transition: all 0.2s; + text-decoration: none; + border: none; + } + + .btn-primary:hover { + background: #16a34a; + transform: translateY(-2px); + } + + .topics-list { + display: flex; + flex-wrap: wrap; + gap: 8px; + margin-top: 8px; + } + + .topic-tag { + background: rgba(34, 197, 94, 0.1); + color: #4ade80; + padding: 4px 12px; + border-radius: 20px; + font-size: 12px; + font-weight: 600; + border: 1px solid rgba(34, 197, 94, 0.2); + } + .error-message { display: flex; align-items: center; diff --git a/frontend/tsconfig.json b/frontend/tsconfig.json index f4d0a0e..4344710 100644 --- a/frontend/tsconfig.json +++ b/frontend/tsconfig.json @@ -11,9 +11,4 @@ "strict": true, "moduleResolution": "bundler" } - // Path aliases are handled by https://svelte.dev/docs/kit/configuration#alias - // except $lib which is handled by https://svelte.dev/docs/kit/configuration#files - // - // If you want to overwrite includes/excludes, make sure to copy over the relevant includes/excludes - // from the referenced tsconfig.json - TypeScript does not merge them in }
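Taken together, the new `/api/reports/upload` route in `backend/src/routes/reports.py` can be exercised end to end with a small client. A hedged sketch (the file path and company name are illustrative; the JSON shape mirrors the route above):

```python
import base64
import requests

# Read a local PDF and send it base64-encoded, as upload_report() expects.
with open("acme-2024-sustainability.pdf", "rb") as f:  # illustrative path
    pdf_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "http://localhost:5000/api/reports/upload",
    json={"pdf_data": pdf_b64, "company_name": "Acme"},
    timeout=300,  # AI validation plus embedding can take a while
)
body = resp.json()
print(body["status"], body.get("filename"), body.get("validation", {}).get("confidence"))
```

On success the route writes the PDF into `backend/dataset/`, chunks and embeds the text, and inserts it into the active vector store, so the report becomes searchable in the catalogue immediately.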