I am scraping a website and some pages have utf8

Question

I am scraping a website. Some pages are plain UTF-8,

but others have run their already valid UTF-8 through utf8_encode a second time. The result is still well-formed UTF-8, but the text itself is double-encoded.

I want to detect which pages are double-encoded so that I can call utf8_decode only when it is needed.

How would you do that?

Example:
$a = "Año"; // "Año"
$b = utf8_encode($a); // "AÃ±o"

echo mb_detect_encoding($a, ['ASCII', 'UTF-8', 'ISO-8859-1'], false);
echo mb_detect_encoding($b, ['ASCII', 'UTF-8', 'ISO-8859-1'], false);

Both give the same result: "UTF-8".
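
One possible way to tell the two cases apart, offered as a rough sketch and not as a guaranteed detector: undo one layer of encoding and check whether the bytes underneath are still well-formed UTF-8. The helper names `looksDoubleEncoded` and `undoDoubleEncoding` are made up for this illustration, and the code assumes the mbstring extension is available. It uses `mb_convert_encoding` in place of `utf8_encode`/`utf8_decode`, which does the same ISO-8859-1 conversion here. Genuine text that happens to contain a sequence such as "Ã±" would be misreported, so treat the result as a heuristic.

```php
<?php
// Hypothetical helper: true when $s looks like UTF-8 that was encoded a second time.
function looksDoubleEncoded(string $s): bool
{
    // A re-encoded string is well-formed UTF-8 whose code points are all
    // in U+0000..U+00FF. With /u, malformed UTF-8 makes preg_match() fail.
    if (!preg_match('/^[\x{00}-\x{FF}]*$/u', $s)) {
        return false;
    }
    // Pure ASCII is unchanged by re-encoding, so there is nothing to undo.
    if (!preg_match('/[^\x00-\x7F]/', $s)) {
        return false;
    }
    // Undo one layer: every code point becomes the single byte of the same value.
    $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    // If those bytes are themselves well-formed UTF-8, the text was probably encoded twice.
    return mb_check_encoding($decoded, 'UTF-8');
}

// Hypothetical helper: remove the extra layer only when it appears to be present.
function undoDoubleEncoding(string $s): string
{
    return looksDoubleEncoded($s)
        ? mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8')
        : $s;
}

$good   = "A\u{F1}o";                                          // "Año" as UTF-8
$double = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1');   // same effect as utf8_encode($good)

var_dump(looksDoubleEncoded($good));   // bool(false)
var_dump(looksDoubleEncoded($double)); // bool(true)
```

The check works on the example above because a lone byte 0xF1 followed by "o" is not valid UTF-8, while the bytes recovered from the double-encoded string are.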

0

20.02.2022

8 answers

37 views

Fabian Pastor · Question author

mb_detect_encoding is broken by design. Any other way?

0

20.02.2022

Fabian Pastor · Question author

Dinosaar Dogg
What tool or library do you use for scraping?

DOMDocument + DOMXPath

0

20.02.2022

Dinosaar Dogg

Fabian Pastor
DOMDocument + DOMXPath

I assume it's a small job, or are you also proxying your requests somehow?

0

20.02.2022

Fabian Pastor · Question author

Dinosaar Dogg
I assume it's a small job or are you using some wa...

Just a direct connection, no proxies :D

0

20.02.2022

Fabian Pastor · Question author

Dinosaar Dogg
What tool or library do you use for scraping?

It turned out to be DOMDocument: it could not tell that the HTML was UTF-8 and assumed ISO-8859-1.
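
For anyone hitting the same thing, here is one possible workaround, sketched under the assumption that the fetched HTML really is UTF-8: turn every non-ASCII character into a numeric entity before handing the markup to `DOMDocument::loadHTML`, so the parser no longer has to guess a charset. The helper name `loadUtf8Html` is invented for this example, and the mbstring and dom extensions are assumed to be installed.

```php
<?php
// Hypothetical helper: parse UTF-8 HTML without relying on DOMDocument guessing the charset.
function loadUtf8Html(string $html): DOMDocument
{
    // Rewrite every character from U+0080 upward as a numeric entity such as &#241;
    // so the markup passed to the parser is pure ASCII.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc   = loadUtf8Html("<html><body><p>A\u{F1}o</p></body></html>");
$xpath = new DOMXPath($doc);
$text  = $xpath->query('//p')->item(0)->textContent;

// DOM hands strings back as UTF-8, so this compares against "Año" in UTF-8.
var_dump($text === "A\u{F1}o");
```

With the input normalised this way, the values read through DOMXPath should not need any utf8_decode pass afterwards.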

0

20.02.2022

Dinosaar Dogg

Fabian Pastor
it was DOMDocument not knowing when an html is utf...

Oh, glad you figured it out.

0

20.02.2022

Marco Abagnale

PHP is not well suited to scraping data from websites. May I suggest HtmlAgilityPack and C#? It is really powerful, and if you need dynamic content that is loaded via Ajax, you can use Selenium.

0

20.02.2022

Dinosaar Dogg · Accepted Answer

Dinosaar Dogg

What tool or library do you use for scraping?

0

20.02.2022
