this format: input – <single_image>, output – <entire_text_on_the_image> (not character by character)?
Ideally:
- No bboxes
- No additional input text
- Fast
- Accurate
- Ideally easy to fine-tune with a dataset sample in an instruction
- No need to specify boxes in a dataset for training
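
To make the dataset requirement concrete, here is a rough sketch of what a bbox-free training sample could look like: just an image path paired with the full text on that image. The field names `image` and `text`, the file paths, and the helpers `to_jsonl` / `from_jsonl` are all hypothetical, picked only for illustration and not taken from any particular model or library.

```python
import json

# Hypothetical sample layout: one image path paired with the full text on it.
# The field names "image" and "text" are assumptions, not any library's schema.
samples = [
    {"image": "receipts/0001.png", "text": "TOTAL 12.50\nThank you!"},
    {"image": "receipts/0002.png", "text": "Invoice #42"},
]


def to_jsonl(records):
    """Serialize records as JSON Lines, one sample per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(blob):
    """Parse JSON Lines back into a list of dicts."""
    return [json.loads(line) for line in blob.splitlines() if line.strip()]


jsonl = to_jsonl(samples)
print(jsonl)
```

No boxes or extra prompt text appear anywhere in a record, which is the whole point of the format asked about here.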
Everything I've tried so far is either very slow or very inaccurate.