web

This module contains the WebTarget class.

Classes

Web Target

class dysdera.web.WebTarget(session: aiohttp.ClientSession, url: URL, timeout: int, refer: URL | None = None, if_modified_since: datetime | None = None)

Class for every page, HTML or not, encountered during crawling.

It represents a web page that can be either HTML or non-HTML. It provides methods for extracting information from the page, such as links, titles, text, and metadata. Currently, link extraction is supported only for HTML pages.

Methods:
  • lxml_is_html: Checks if the page is HTML based on its content.

  • extract_links: Extracts links from the page.

  • canonical_url: Gets the canonical URL of the page.

  • extract_titles: Extracts titles from the page.

  • extract_page_title: Extracts the page title.

  • extract_text: Extracts text from the page.

  • extract_figcaptions: Extracts figcaptions from the page.

  • extract_metadata: Extracts metadata from the page.
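To make the idea behind extract_links and canonical_url concrete, here is a minimal standalone sketch. It does not use WebTarget and is not the library's implementation: the method names above suggest the class relies on lxml, whereas this sketch swaps in the standard-library html.parser so it runs without extra dependencies. The LinkCollector class and its attributes are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects <a href> targets and the canonical URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # absolute URLs found in <a href="...">
        self.canonical = None  # URL from <link rel="canonical">, if present

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        if tag == "a":
            # Resolve relative links against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))
        elif tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.canonical = urljoin(self.base_url, href)


collector = LinkCollector("https://example.com/docs/")
collector.feed(
    '<html><head><link rel="canonical" href="/docs/index.html"></head>'
    '<body><a href="page.html">Page</a> <a href="https://other.org/x">X</a>'
    '<a name="anchor">no href</a></body></html>'
)
print(collector.links)
print(collector.canonical)
```

Anchors without an href are skipped, and relative links are resolved against the page URL, which a crawler generally needs before queueing them.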

property request_header: dict

Returns the headers for the HTTP request.
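
The constructor's refer and if_modified_since arguments suggest that these headers may include Referer and If-Modified-Since, although the source does not say so. The sketch below shows one way such a dictionary could be built under that assumption; build_request_headers is a hypothetical function, not part of dysdera.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def build_request_headers(refer=None, if_modified_since=None):
    """Hypothetical helper: build conditional-request headers as a dict."""
    headers = {}
    if refer is not None:
        headers["Referer"] = str(refer)
    if if_modified_since is not None:
        dt = if_modified_since
        if dt.tzinfo is None:
            # Assumption: naive datetimes are treated as UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        # HTTP dates are expressed in GMT, e.g. "Mon, 01 Jan 2024 12:00:00 GMT".
        headers["If-Modified-Since"] = format_datetime(
            dt.astimezone(timezone.utc), usegmt=True
        )
    return headers


print(build_request_headers(
    refer="https://example.com/",
    if_modified_since=datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
))
```

With If-Modified-Since set, a server that supports conditional requests can answer 304 Not Modified instead of resending an unchanged page, which saves bandwidth when recrawling.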