AgenticFlow

Q: Can it scrape a URL or two for its knowledge base before performing each task? I ask because the storage seems low.

I'm just trying to figure out if there are ways to increase the knowledge base without filling up in-app storage.

jsamplesjr (Jun 3, 2025)

SeanP_AgenticFlowAI, Founder Team (Jun 4, 2025)

A: Hey Jsamplesjr,

That's a really smart question about managing knowledge and storage, and you've hit on a key concept!

1. Understanding AgenticFlow Knowledge Storage (It's Not Just File Size):

You're right, the storage limits (e.g., 100MB on Tier 1/2, up to 2GB on Tier 4) might seem modest if you're thinking purely in terms of raw PDF or DOCX file sizes.

However, our "Knowledge Storage" refers to the space taken up by the vectorized embeddings of your content. When you upload a document or provide a URL, we process the text, break it into meaningful chunks, and then convert those chunks into these special numerical representations (embeddings) that the AI uses for fast, semantic searching (this is the RAG part).
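If it helps to picture that indexing step, here's a minimal, generic sketch of the chunk-and-embed process. This illustrates the general RAG indexing idea only, not AgenticFlow's internal code; the OpenAI client, model name, and chunk sizes are assumptions for the example.

```python
# Illustrative RAG indexing sketch: split text into chunks, embed each chunk,
# and keep the vectors for semantic search. Not AgenticFlow internals.
from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Naive character-based chunking with a little overlap between chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Turn each text chunk into a 1536-dimensional embedding vector."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in resp.data]

vectors = embed_chunks(chunk_text(open("my_doc.txt").read()))
# Each vector is then stored in a vector index alongside its source chunk,
# so the agent can retrieve the most relevant chunks at question time.
```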

A single embedding for a text chunk is quite small (e.g., around 6KB for 1536 dimensions). This means 1GB of our "Knowledge Storage" can hold a massive amount of textual information – think tens of thousands, or even hundreds of thousands, of text chunks. So that 2GB on Tier 4 really is a lot of actual, usable knowledge for your agents.
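As a quick back-of-the-envelope check on those numbers (assuming roughly 4 bytes per dimension, i.e. about 6KB per 1536-dimension embedding):

```python
# Rough capacity estimate: how many ~6KB embeddings fit in 1GB / 2GB.
bytes_per_embedding = 1536 * 4              # 1536 float32 dims ≈ 6,144 bytes
chunks_per_gb = (1024 ** 3) // bytes_per_embedding
print(chunks_per_gb)        # ~174,000 chunks per GB
print(2 * chunks_per_gb)    # ~349,000 chunks in the 2GB Tier 4 allowance
```

Real-world capacity will be somewhat lower once stored chunk text and metadata are counted, but it shows why the limits go much further than raw file sizes suggest.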

2. Dynamically Scraping URLs Before Each Task (Your Excellent Idea):

Yes, you can absolutely design your AgenticFlow agents or workflows to scrape a URL (or a couple of URLs) for fresh context before performing each task, rather than relying solely on pre-loaded, static knowledge. This is a great way to work with dynamic information or to augment a smaller persistent knowledge base.

Here's how (a minimal code sketch of the same pattern follows these steps):

Workflow/Agent Step 1: Web Scraping:
When a task starts (e.g., user asks the agent a question), the first step can be to use our Web Scraping node or a more robust MCP like Firecrawl (https://agenticflow.ai/mcp/firecrawl) or Apify (https://agenticflow.ai/mcp/apify) to fetch live content from the specific URL(s) relevant to that task.

Workflow/Agent Step 2: AI Processing:
The scraped text from these URLs is then passed as dynamic, just-in-time context to a subsequent LLM node along with the user's original query or the main task input.

The LLM uses this freshly scraped information (plus any information it retrieves from your persistent vectorized Knowledge Base, if you've also configured one) to generate its response or complete the task.
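Outside the visual builder, the same two-step pattern looks roughly like this in plain Python. The requests, BeautifulSoup, and OpenAI libraries here are stand-ins for AgenticFlow's Web Scraping/MCP and LLM nodes, purely for illustration; the model name and prompt are assumptions.

```python
# Stand-alone sketch of the "scrape first, then ask the LLM" pattern.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()

def scrape_page(url: str) -> str:
    """Step 1: fetch the page and strip it down to plain text."""
    html = requests.get(url, timeout=15).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

def answer_with_live_context(question: str, urls: list[str]) -> str:
    """Step 2: pass the freshly scraped text to the LLM as just-in-time context."""
    context = "\n\n".join(scrape_page(u) for u in urls)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer_with_live_context("What changed in the latest release?",
                               ["https://example.com/changelog"]))
```

In AgenticFlow itself you'd simply chain a scraping node into an LLM node and map the scraped text into the prompt, so no code is required; the sketch is just to make the data flow concrete.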

Advantages of This "Just-in-Time" Scraping:
- Always Fresh Info: The agent uses the most up-to-date content from the web for that specific task.
- Optimizes Persistent Storage: You reserve your persistent Knowledge Storage (the 100MB-2GB) for core, foundational, or less volatile information, while highly dynamic info is fetched on demand.
- Targeted Knowledge: You scrape only the pages most relevant to the immediate task, providing highly focused context to the LLM.

Considerations:
- Scraping Time: Each live scrape adds a little to the task execution time.
- Reliability: Success depends on the target site's accessibility and structure (robust scrapers like Apify/Firecrawl help here).
- Credit Usage: Each web scraping step and LLM processing step will consume AgenticFlow credits.

This dynamic scraping approach is a very powerful way to keep your agents informed with the latest data without necessarily filling up all your persistent vectorized storage with content that changes daily. You're thinking exactly right!

— Sean
