and others re-encode already valid UTF-8 with utf8_encode, so the result is still technically valid UTF-8 but the text is double-encoded.
I want to detect which case each page is in, so I can call utf8_decode only when it's needed.
How would you do that?
Example:
$a = "Año"; //"Año"
$b = utf8_encode($a); // "Año"
echo mb_detect_encoding($a, ['ASCII', 'UTF-8', 'ISO-8859-1'], false);
echo mb_detect_encoding($b, ['ASCII', 'UTF-8', 'ISO-8859-1'], false);
Both give the same result: "UTF-8".
mb_detect_encoding is broken by design. Any other way?
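One possible way around this, as a rough sketch rather than a definitive answer: utf8_encode only ever produces code points up to U+00FF, so you can try to undo one layer and check whether what comes out is itself valid multi-byte UTF-8. The function name looksDoubleEncoded is made up for this example, it needs the mbstring extension, and it is a heuristic: text that genuinely contains sequences like "Ã±" would be misdetected.

```php
<?php
// Hypothetical helper (not a built-in): guesses whether $s is UTF-8 that
// was run through utf8_encode() one extra time.
function looksDoubleEncoded(string $s): bool
{
    // Not valid UTF-8 at all, so it cannot be double-encoded UTF-8.
    if (!mb_check_encoding($s, 'UTF-8')) {
        return false;
    }
    // utf8_encode() maps each input byte to U+0000..U+00FF, so any
    // code point above U+00FF rules out double encoding.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $s) === 1) {
        return false;
    }
    // Undo one layer: UTF-8 -> ISO-8859-1 (what utf8_decode() does).
    $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    // Double-encoded only if the result still has non-ASCII bytes
    // and those bytes form valid UTF-8 on their own.
    return preg_match('/[\x80-\xFF]/', $decoded) === 1
        && mb_check_encoding($decoded, 'UTF-8');
}

$a = "A\xC3\xB1o";                                       // "Año" as UTF-8 bytes
$b = mb_convert_encoding($a, 'UTF-8', 'ISO-8859-1');     // same as utf8_encode($a)

var_dump(looksDoubleEncoded($a)); // bool(false)
var_dump(looksDoubleEncoded($b)); // bool(true)
```

When it returns true, one pass of mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8') (or utf8_decode on older PHP) should give back the original text.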
What tool or library do you use for scraping?
DOMDocument + DOMXPath
I assume it's a small job, or are you also proxying your requests somehow?
just direct connection, no proxies :D
It turned out to be DOMDocument: it doesn't know when an HTML page is UTF-8 and assumes ISO-8859-1.
oh glad you figured it out
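For anyone hitting the same DOMDocument issue, a commonly used workaround is to prepend an XML encoding declaration so the parser treats the input as UTF-8. This is only a sketch: loadUtf8Html is a made-up helper name, and it assumes the HTML really is UTF-8 already.

```php
<?php
// Hypothetical helper: parse HTML that is known to be UTF-8.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely clean; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // The encoding hint stops the parser from assuming ISO-8859-1.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

$html  = "<html><body><p>A\xC3\xB1o</p></body></html>"; // "Año" as UTF-8 bytes
$xpath = new DOMXPath(loadUtf8Html($html));
$text  = $xpath->query('//p')->item(0)->textContent;

var_dump($text === "A\xC3\xB1o"); // true when the text survived intact
```

With that in place the extracted text stays single-encoded, so no utf8_decode pass should be needed afterwards.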
PHP isn't well suited to scraping websites. May I suggest HtmlAgilityPack with C#? It's really powerful, and if you need dynamic content that is loaded via Ajax, you can use Selenium.