Under the hood, Cheerio Scraper is built using the CheerioCrawler class from Crawlee. If you'd like to learn more about the inner workings of the scraper, see the respective documentation.

Content types

By default, Cheerio Scraper only processes web pages with the text/html, application/json, application/xml, and application/xhtml+xml MIME content types (as reported by the Content-Type HTTP header) and skips pages with other content types. If you want the crawler to process other content types, use the Additional MIME types (additionalMimeTypes) input option.

Note that while the default Accept HTTP header allows any content type to be received, HTML and XML are preferred over JSON and other types. If you allow additional MIME types and still receive invalid responses, be sure to override the Accept HTTP header in the requests from the scraper, either in Start URLs, Pseudo URLs, or in the Prepare request function.

Web pages with different content types are parsed differently, so the context parameter of the Page function will have different values:

Content types: text/html, application/xhtml+xml, application/xml

The Content-Type HTTP header of the web page is parsed, and the result is stored in the contentType object.
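As a rough illustration of the content-type handling described above, a Page function could branch on the parsed content type along the following lines. This is only a sketch under stated assumptions. The context fields used here (contentType.type, body, json, and a Cheerio-style $ function) are not spelled out in the text above. The helper name describeResponse, the example URL, the text/csv type, and the headers field on the start URL are hypothetical placeholders.

```javascript
// MIME types the text lists as markup (HTML and XML).
const MARKUP_TYPES = ['text/html', 'application/xhtml+xml', 'application/xml'];

// Hypothetical helper: summarize a response based on its content type.
// Assumed context fields: contentType.type, body, json, $.
function describeResponse(context) {
    const { contentType, body, json, $ } = context;
    const type = contentType ? contentType.type : undefined;

    if (MARKUP_TYPES.includes(type)) {
        // HTML and XML: query the parsed document with the selector function.
        return { type, title: $('title').text() };
    }
    if (type === 'application/json') {
        // JSON: work with the parsed object.
        return { type, keys: Object.keys(json) };
    }
    // Any other type allowed through additionalMimeTypes: use the raw body only.
    return { type, size: body ? body.length : 0 };
}

// The Page function itself simply delegates to the helper.
async function pageFunction(context) {
    return describeResponse(context);
}

// Hypothetical input fragment: allow one extra MIME type and override the
// Accept header on a start URL (the field layout is an assumption).
const input = {
    startUrls: [{ url: 'https://example.com/data.csv', headers: { Accept: 'text/csv' } }],
    additionalMimeTypes: ['text/csv'],
};
```

Keeping the branching in a plain synchronous helper makes it easy to exercise with mock context objects before running it inside the scraper.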