I am scraping a website where some pages are plain UTF-8,

and others have had their already valid UTF-8 re-encoded with utf8_encode, so the result is still technically valid UTF-8 but the text is garbled.

I want to detect which pages are which, so I can call utf8_decode only when needed.

How would you do that?

Example:
$a = "Año"; // "Año"
$b = utf8_encode($a); // "AÃ±o"

echo mb_detect_encoding($a, ['ASCII', 'UTF-8', 'ISO-8859-1'], false);
echo mb_detect_encoding($b, ['ASCII', 'UTF-8', 'ISO-8859-1'], false);

Both give the same result: "UTF-8".

Fabian-Pastor (question author)

mb_detect_encoding is broken by design. Any other way?
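
One possible approach, offered as a sketch rather than a definitive answer: because the double-encoded string is still valid UTF-8, undo one layer of encoding and check whether what remains is itself valid UTF-8 with multibyte characters in it. `looksDoubleEncoded` is a hypothetical helper name, and `mb_convert_encoding` is used here in place of `utf8_encode`/`utf8_decode`, which are deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: true when $s looks like UTF-8 that was passed
// through utf8_encode() (ISO-8859-1 -> UTF-8) one extra time.
function looksDoubleEncoded(string $s): bool
{
    // Must be valid UTF-8 and contain at least one non-ASCII byte;
    // pure ASCII is identical before and after utf8_encode().
    if (!mb_check_encoding($s, 'UTF-8') || !preg_match('/[\x80-\xFF]/', $s)) {
        return false;
    }
    // utf8_encode() can only emit code points U+0000..U+00FF, so anything
    // above that range means the string was not produced by it.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $s)) {
        return false;
    }
    // Undo one layer (what utf8_decode() would do) and inspect the result.
    $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    return mb_check_encoding($decoded, 'UTF-8')
        && preg_match('/[\x80-\xFF]/', $decoded) === 1;
}

$a = "A\xC3\xB1o";                                   // "Año" as UTF-8 bytes
$b = mb_convert_encoding($a, 'UTF-8', 'ISO-8859-1'); // same effect as utf8_encode($a)

var_dump(looksDoubleEncoded($a)); // bool(false)
var_dump(looksDoubleEncoded($b)); // bool(true)
```

When the check returns true, `mb_convert_encoding($b, 'ISO-8859-1', 'UTF-8')` gives back the original bytes. This is only a heuristic: a page that legitimately contains text such as "Ã±" would be misreported, and pages mangled through Windows-1252 instead of ISO-8859-1 would need a different conversion.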

What tool or library do you use for scraping?

Fabian-Pastor (question author)
DOMDocument + DOMXPath

I assume it's a small job, or are you also proxying your queries somehow?

Fabian-Pastor (question author)
Dinosaar Dogg
What tool or library do you use for scraping?

It turned out to be DOMDocument: it does not know when the HTML is UTF-8 and assumes ISO-8859-1.
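
For the DOMDocument side, one common workaround (a sketch, assuming the input is already known to be UTF-8) is to turn every non-ASCII character into a numeric entity before calling `loadHTML`, so the parser's ISO-8859-1 assumption no longer matters. `loadUtf8Html` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: parse UTF-8 HTML without DOMDocument misreading it.
function loadUtf8Html(string $html): DOMDocument
{
    // Map code points U+0080..U+10FFFF to &#NNN; entities; ASCII is untouched.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep warnings about messy markup quiet
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

$doc = loadUtf8Html("<html><body><p>A\xC3\xB1o</p></body></html>");
$xpath = new DOMXPath($doc);
$text = $xpath->query('//p')->item(0)->textContent; // UTF-8 string from the DOM
echo $text, "\n";
```

DOM methods return strings in UTF-8, so no `utf8_decode` call is needed afterwards. One caveat: entities are not decoded inside `<script>` or `<style>` elements, so this sketch is aimed at pages where only the visible text is of interest.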

PHP is not well suited to scraping data from websites. May I suggest Html Agility Pack and C#? It is really powerful, and if you need dynamic content that is loaded via Ajax, you can use Selenium.
