It seems good, but each HTML document is 2MB up to 6MB of data. So can RabbitMQ store, for example, 10,000 (or 20,000) messages? (15,000 * 3MB = 45GB of data!) In other words, I think messages this huge could seriously hurt delivery speed. It might take about 1 minute to deliver each message! Maybe I'm wrong.
Store your HTML payload in storage like S3 and add only the key (filename) to the RabbitMQ message
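Something like this claim-check pattern, as a minimal sketch. The bucket name, queue name, and helper function are made up for illustration; it assumes boto3 with configured AWS credentials and a local RabbitMQ reachable via pika:

```python
# Sketch of the claim-check pattern: put the large HTML in S3,
# publish only the small object key on RabbitMQ.
import uuid

import boto3
import pika

s3 = boto3.client("s3")
BUCKET = "scraper-pages"  # assumed bucket name, purely illustrative

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="pages", durable=True)

def publish_page(html: str) -> str:
    key = f"pages/{uuid.uuid4()}.html"
    # Store the multi-MB payload in object storage...
    s3.put_object(Bucket=BUCKET, Key=key, Body=html.encode("utf-8"))
    # ...and send only the key (a few bytes) through the queue.
    channel.basic_publish(exchange="", routing_key="pages", body=key)
    return key
```

Consumers then fetch the HTML from S3 by key, so the broker only ever moves tiny messages.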
do you really need to send these files?! why? why can't you just request them on demand?
Hmm, 🤔 I should give this a try. The only thing I'm unsure about is performance. But I should test it. Thanks.
I'm building a web scraper, so I have to separate downloading from parsing. The best choice I have is a simple microservice setup: service A is the primary and sends tasks to services B and C; the downloader is service B, the parser is service C (each service is a separate application).
but why separate it? 🤔
For better performance and maintainability. For example, if I have 30,000 pages to scrape, my primary service passes each page URL as a message to my downloader service. The downloader manages the download process and related concerns, like not opening more than 4 connections to the same website, or handling failed downloads, and so on; it focuses on downloading well. Then it returns the downloaded HTML document to my primary service, and the primary service has to pass the downloaded HTML to my parser service, and so on.
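For the per-host connection limit the downloader would enforce, here is a minimal sketch using aiohttp, which can cap simultaneous connections per host. The limit of 4 comes from the message above; everything else (function names, URLs) is illustrative:

```python
# Sketch: download many URLs while keeping at most 4 open
# connections to any single host.
import asyncio

import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()

async def download_all(urls: list[str]) -> list[str]:
    # limit_per_host=4 enforces the "no more than 4 connections per site" rule
    connector = aiohttp.TCPConnector(limit_per_host=4)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# htmls = asyncio.run(download_all(["https://example.com/a", "https://example.com/b"]))
```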
why not just use object storage and pass IDs around?
This would be ideal
What about NoSQL databases like MongoDB, saving each HTML page as a MongoDB document? I realized my HTML documents are not 3MB; they are 250KB up to 1.2MB (sometimes 3MB, at most).
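Size-wise that fits: MongoDB's per-document limit is 16MB, so pages up to ~3MB are fine. A minimal sketch with pymongo, where the database and collection names are assumptions for illustration:

```python
# Sketch: store each scraped page as one MongoDB document.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
pages = client["scraper"]["pages"]  # assumed db/collection names

def save_page(url: str, html: str) -> None:
    pages.insert_one({
        "url": url,
        "html": html,
        "fetched_at": datetime.now(timezone.utc),
    })
```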
Are you okay with losing a document, or a few documents, randomly, without explanation or warning? If yes, then Mongo is for you
It's important, I think.
Try wget to create a proof of concept for the performance issues