It seems good, but each HTML document is 2MB up to 6MB of data. So can RabbitMQ store, for example, 10,000 (or 20,000) messages? (15,000 * 3MB = 45GB of data!) In other words, I think messages this huge could seriously hurt delivery speed. It might take about 1 minute to deliver each message! Maybe I'm wrong.
Store your HTML payload in storage like S3 and add only the key (filename) to the RabbitMQ message
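Something like this claim-check pattern, as a minimal sketch. The bucket name, queue name, and helper function are made up for illustration; it assumes boto3 with configured AWS credentials and a local RabbitMQ reachable via pika:

```python
# Sketch of the claim-check pattern: put the large HTML in S3,
# publish only the small object key on RabbitMQ.
import uuid

import boto3
import pika

s3 = boto3.client("s3")
BUCKET = "scraper-pages"  # assumed bucket name, purely illustrative

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="pages", durable=True)

def publish_page(html: str) -> str:
    key = f"pages/{uuid.uuid4()}.html"
    # Store the multi-MB payload in object storage...
    s3.put_object(Bucket=BUCKET, Key=key, Body=html.encode("utf-8"))
    # ...and send only the key (a few bytes) through the queue.
    channel.basic_publish(exchange="", routing_key="pages", body=key)
    return key
```

Consumers then fetch the HTML from S3 by key, so the broker only ever moves tiny messages.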
do you really need to send these files?! why? why can't you just request them on demand?
Hmm, 🤔 I should give this a try. The only thing I'm unsure about is performance. But I should test it. Thanks.
I'm building a web scraper, so I have to separate downloading from parsing. The best choice I have is a simple microservice setup: service A is the primary and sends tasks to services B and C; the downloader is service B, the parser is service C (each service is a separate application).
but why separate it? 🤔
For better performance and maintainability. For example, if I have 30,000 pages to scrape, my primary service passes each page URL as a message to my downloader service. The downloader manages the download process and related concerns, like not opening more than 4 connections to the same website, or handling failed downloads, and so on; it focuses on downloading well. Then it returns the downloaded HTML document to my primary service, and the primary service has to pass the downloaded HTML to my parser service, and so on.
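For the per-host connection limit the downloader would enforce, here is a minimal sketch using aiohttp, which can cap simultaneous connections per host. The limit of 4 comes from the message above; everything else (function names, URLs) is illustrative:

```python
# Sketch: download many URLs while keeping at most 4 open
# connections to any single host.
import asyncio

import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()

async def download_all(urls: list[str]) -> list[str]:
    # limit_per_host=4 enforces the "no more than 4 connections per site" rule
    connector = aiohttp.TCPConnector(limit_per_host=4)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# htmls = asyncio.run(download_all(["https://example.com/a", "https://example.com/b"]))
```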
why not just use object storage and pass IDs around?
This would be ideal
What about NoSQL databases like MongoDB, saving each HTML page as a MongoDB document? I realized my HTML documents are not 3MB; they are 250KB up to 1.2MB (sometimes 3MB, at most).
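Size-wise that fits: MongoDB's per-document limit is 16MB, so pages up to ~3MB are fine. A minimal sketch with pymongo, where the database and collection names are assumptions for illustration:

```python
# Sketch: store each scraped page as one MongoDB document.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
pages = client["scraper"]["pages"]  # assumed db/collection names

def save_page(url: str, html: str) -> None:
    pages.insert_one({
        "url": url,
        "html": html,
        "fetched_at": datetime.now(timezone.utc),
    })
```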
Are you okay with losing a document, or a few documents, randomly, without explanation or warning? If yes, then Mongo is for you
It's important, I think.
Try wget to create a proof of concept for the performance issues