
I am scraping a website. Some pages are plain UTF-8.

Others have run their already valid UTF-8 through utf8_encode a second time. The result is still valid UTF-8 at the byte level, but the text itself is garbled (double-encoded).

I want to detect which case each page is in, so that I call utf8_decode only when it is needed.

How would you do that?

Example:
$a = "Año"; // "Año"
$b = utf8_encode($a); // "AÃ±o"

echo mb_detect_encoding($a, ['ASCII', 'UTF-8', 'ISO-8859-1'], false);
echo mb_detect_encoding($b, ['ASCII', 'UTF-8', 'ISO-8859-1'], false);

Both calls print the same result: "UTF-8".

Fabian-Pastor (question author)

mb_detect_encoding is broken by design. Any other way?
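
One possible heuristic, sketched here under stated assumptions rather than offered as a definitive answer: utf8_encode treats its input as ISO-8859-1, so a double-encoded string only contains code points up to U+00FF, and undoing one layer yields bytes that are again valid UTF-8. Correctly encoded text such as "Año" fails that second check. The helper names looksDoubleEncoded and fixDoubleEncoding are hypothetical, and mb_convert_encoding is used in place of utf8_encode/utf8_decode because those two are deprecated as of PHP 8.2.

```php
<?php
// Heuristic sketch: looksDoubleEncoded() and fixDoubleEncoding() are
// hypothetical helper names, not part of PHP or mbstring.

function looksDoubleEncoded(string $s): bool
{
    // The input must be valid UTF-8 to begin with.
    if (!mb_check_encoding($s, 'UTF-8')) {
        return false;
    }
    // Pure ASCII is unchanged by utf8_encode, so there is nothing to undo.
    if (!preg_match('/[^\x00-\x7F]/', $s)) {
        return false;
    }
    // utf8_encode only ever produces code points U+0000..U+00FF.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $s)) {
        return false;
    }
    // Undo one layer (same mapping as utf8_decode) and test the bytes again.
    $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    return mb_check_encoding($decoded, 'UTF-8');
}

function fixDoubleEncoding(string $s): string
{
    return looksDoubleEncoded($s)
        ? mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8')
        : $s;
}

$a = "A\u{00F1}o";                                    // "Año", correct UTF-8
$b = mb_convert_encoding($a, 'UTF-8', 'ISO-8859-1');  // same effect as utf8_encode($a)

var_dump(looksDoubleEncoded($a));       // bool(false)
var_dump(looksDoubleEncoded($b));       // bool(true)
var_dump(fixDoubleEncoding($b) === $a); // bool(true)
```

This is only a heuristic. A page that legitimately contains the literal text "Ã±" would be misclassified, and pages that were double-encoded through Windows-1252 instead of ISO-8859-1 would not be caught.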

What tool or library do you use for scraping?

Fabian-Pastor (question author)
DOMDocument + DOMXPath

I assume it's a small job? Or are you also proxying your requests in some way?

Fabian-Pastor (question author), in reply to Dinosaar Dogg:
"What tool or library do you use for scraping?"

It turned out to be DOMDocument: it does not know when an HTML document is UTF-8 and assumes ISO-8859-1.
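
A common way around this, shown as a minimal sketch and not necessarily what was done in this thread, is to convert every non-ASCII character to a numeric entity before calling loadHTML, so the parser's charset guess no longer matters. It assumes the fetched HTML string is already valid UTF-8. loadUtf8Html is a hypothetical helper name.

```php
<?php
// Sketch only: loadUtf8Html() is a hypothetical helper, and it assumes
// that $html is already a valid UTF-8 string.

function loadUtf8Html(string $html): DOMDocument
{
    // Turn every non-ASCII character into a numeric entity (e.g. &#241;),
    // so the markup handed to the parser is pure ASCII.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    // Real-world HTML is rarely clean; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc   = loadUtf8Html("<html><body><p>A\u{00F1}o</p></body></html>");
$xpath = new DOMXPath($doc);
$text  = $xpath->query('//p')->item(0)->textContent;

var_dump($text === "A\u{00F1}o"); // bool(true)
```

DOM string properties such as textContent are returned as UTF-8, so no utf8_decode call is needed afterwards. One trade-off: entities written into script or style elements are not decoded there, which is usually acceptable when only visible text is being scraped.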

PHP is not well suited to scraping data from websites. May I suggest HtmlAgilityPack and C#? It is really powerful, and if you need dynamic content that is loaded via Ajax, you can use Selenium.
