web
This module contains the WebTarget class.
Classes
Web Target
- class dysdera.web.WebTarget(session: aiohttp.ClientSession, url: URL, timeout: int, refer: URL | None = None, if_modified_since: datetime | None = None)
Class for every page, HTML or not, encountered during crawling.
It represents a web page that can be either HTML or non-HTML. It provides methods for extracting information from the page, such as links, titles, text, and metadata. Currently, link extraction is supported only on HTML pages.
- Methods:
lxml_is_html: Checks if the page is HTML based on its content.
extract_links: Extracts links from the page.
canonical_url: Gets the canonical URL of the page.
extract_titles: Extracts titles from the page.
extract_page_title: Extracts the page title.
extract_text: Extracts text from the page.
extract_figcaptions: Extracts figcaptions from the page.
extract_metadata: Extracts metadata from the page.
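As a rough illustration of what link and canonical-URL extraction on an HTML page involves, here is a minimal sketch. It is not the WebTarget implementation: it swaps in the standard library's html.parser in place of lxml so that it runs without dependencies, and the names LinkCollector and extract_links_sketch are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical stand-in (not part of dysdera): collects <a href> targets
    and the <link rel="canonical"> URL, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            # Tags without a usable href carry no link to follow.
            return
        if tag == "a":
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))
        elif tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.canonical = urljoin(self.base_url, href)


def extract_links_sketch(html: str, base_url: str):
    """Return (list of absolute links, canonical URL or None)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links, parser.canonical
```

Resolving every href against the page URL gives the crawler absolute URLs to enqueue, and the canonical URL, when present, can be used to avoid storing the same page twice.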
- property request_header: dict
Returns the headers for the HTTP request.
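The constructor's refer and if_modified_since arguments suggest that the headers carry a Referer and a conditional If-Modified-Since field. The sketch below shows how such a dictionary could be assembled. It is an assumption about the behaviour, not the actual property body, and build_request_header is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Optional


def build_request_header(refer: Optional[str] = None,
                         if_modified_since: Optional[datetime] = None) -> dict:
    """Hypothetical helper (not part of dysdera): build request headers
    from an optional referring URL and an optional timestamp."""
    headers = {}
    if refer is not None:
        # The HTTP field name is spelled "Referer" (a historical misspelling).
        headers["Referer"] = str(refer)
    if if_modified_since is not None:
        # Assumption: naive datetimes are treated as UTC.
        if if_modified_since.tzinfo is None:
            if_modified_since = if_modified_since.replace(tzinfo=timezone.utc)
        # HTTP dates must be expressed in GMT.
        headers["If-Modified-Since"] = format_datetime(
            if_modified_since.astimezone(timezone.utc), usegmt=True
        )
    return headers
```

With If-Modified-Since set, a server that has not changed the page can answer 304 Not Modified, which lets a crawler skip downloading and re-parsing the body.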