Create a RobotsMatcher with the default matching strategy. The default matching strategy is longest-match, as opposed to the former internet draft, which specified a first-match strategy. Analysis shows that longest-match, while more restrictive for crawlers, is what webmasters assume when writing directives: in case of conflicting matches (both Allow and Disallow), the longest match is the one the user wants. For example, given a robots.txt file with the rules
  Allow: /
  Disallow: /cgi-bin
it is clear what the webmaster wants: allow crawling of every URI except /cgi-bin. Under the expired internet draft, however, crawlers would be allowed to crawl everything with such rules, because the first matching rule, Allow: /, would win.
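As an illustration of that resolution rule only (not the library's matching code, which also handles '*' and '$' wildcards), a standalone sketch using plain prefix matching:

    #include <iostream>
    #include <string>

    // Length of the prefix of 'path' matched by 'pattern', or -1 if there is
    // no match. Deliberately simplified: real robots.txt patterns also support
    // the '*' and '$' wildcards.
    int MatchLength(const std::string& pattern, const std::string& path) {
      return path.compare(0, pattern.size(), pattern) == 0
                 ? static_cast<int>(pattern.size())
                 : -1;
    }

    int main() {
      const std::string path = "/cgi-bin/script";
      const int allow = MatchLength("/", path);            // 1 character
      const int disallow = MatchLength("/cgi-bin", path);  // 8 characters
      // Longest match wins: Disallow (8) beats Allow (1), so the path is
      // blocked. Under first-match, "Allow: /" listed first would have won.
      std::cout << (disallow > allow ? "disallowed" : "allowed") << "\n";
    }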
Returns true iff 'url' is allowed to be fetched by any member of the "user_agents" vector. 'url' must be %-encoded according to RFC3986.
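A usage sketch for this check, assuming it is exposed as a method named AllowedByRobots on googlebot::RobotsMatcher that takes the robots.txt body, a pointer to the user-agent vector, and the URL; the header path, method name, and signature are assumptions, not stated above.

    #include <string>
    #include <vector>

    #include "robots.h"  // assumed header declaring googlebot::RobotsMatcher

    // Checks whether any of the listed (hypothetical) crawlers may fetch the URL.
    bool AnyCrawlerAllowed() {
      const std::string robots_txt =
          "User-agent: *\n"
          "Disallow: /private/\n";
      const std::vector<std::string> user_agents = {"FooBot", "BarBot"};
      googlebot::RobotsMatcher matcher;
      // The URL must already be %-encoded per RFC 3986.
      return matcher.AllowedByRobots(robots_txt, &user_agents,
                                     "https://example.com/private/data");
    }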
Parse callbacks. Protected because they are used in the unit tests. Never override RobotsMatcher; implement googlebot::RobotsParseHandler instead.
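For custom handling of directives, the guidance above is to implement googlebot::RobotsParseHandler rather than subclass RobotsMatcher. A sketch of such a handler, assuming the callback interface has roughly the shape shown below; the virtual method names and signatures are assumptions, not guaranteed by this section.

    #include <iostream>

    #include "absl/strings/string_view.h"
    #include "robots.h"  // assumed header declaring googlebot::RobotsParseHandler

    // Collects Sitemap directives and ignores everything else. The overridden
    // signatures are assumptions about the parse-handler interface.
    class SitemapCollector : public googlebot::RobotsParseHandler {
     public:
      void HandleRobotsStart() override {}
      void HandleRobotsEnd() override {}
      void HandleUserAgent(int line_num, absl::string_view value) override {}
      void HandleAllow(int line_num, absl::string_view value) override {}
      void HandleDisallow(int line_num, absl::string_view value) override {}
      void HandleSitemap(int line_num, absl::string_view value) override {
        std::cout << "sitemap on line " << line_num << ": " << value << "\n";
      }
      void HandleUnknownAction(int line_num, absl::string_view action,
                               absl::string_view value) override {}
    };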
Initialize next path and user-agents to check. Path must contain only the path, params, and query (if any) of the url and must start with a '/'.
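A sketch of what "only the path, params, and query" means for a full URL; the helper below is hypothetical and not the library's own extraction code.

    #include <string>

    // Hypothetical helper: keeps the path, params, and query of 'url' and drops
    // the scheme, authority, and fragment, e.g.
    //   "https://example.com/dir/page?x=1#frag" -> "/dir/page?x=1"
    // Assumes 'url' is already %-encoded.
    std::string PathParamsQuery(const std::string& url) {
      std::size_t start = 0;
      const std::size_t scheme_end = url.find("://");
      if (scheme_end != std::string::npos) {
        start = url.find('/', scheme_end + 3);       // skip "scheme://authority"
        if (start == std::string::npos) return "/";  // URL has no path
      }
      std::size_t end = url.find('#', start);  // strip the fragment
      if (end == std::string::npos) end = url.size();
      std::string path = url.substr(start, end - start);
      if (path.empty() || path[0] != '/') path = "/" + path;
      return path;
    }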
Do robots check for 'url' when there is only one user agent. 'url' must be %-encoded according to RFC3986.
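A short usage sketch for the single-agent variant, assuming it is exposed as OneAgentAllowedByRobots; as before, the name, header, and signature are assumptions.

    #include <string>

    #include "robots.h"  // assumed header declaring googlebot::RobotsMatcher

    // Convenience check for a single (hypothetical) user agent.
    bool FooBotMayFetch(const std::string& robots_txt, const std::string& url) {
      googlebot::RobotsMatcher matcher;
      return matcher.OneAgentAllowedByRobots(robots_txt, "FooBot", url);
    }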
Returns true if we are disallowed from crawling a matching URI.
Returns true if we are disallowed from crawling a matching URI. Ignores any rules specified for the default user agent, and bases its results only on the specified user agents.
Returns true iff, when AllowedByRobots() was called, the robots file referred explicitly to one of the specified user agents.
Returns the number of the line that matched, or 0 if none matched.
Returns true if any user-agent was seen.
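These accessors describe the state of the most recent check. A sketch of querying them after a call to the multi-agent check sketched earlier; the accessor names disallow_ignore_global, ever_seen_specific_agent, and matching_line are assumptions about how the getters above are declared.

    #include <iostream>
    #include <string>
    #include <vector>

    #include "robots.h"  // assumed header declaring googlebot::RobotsMatcher

    // Runs one check and reports the per-check accessors documented above.
    void ReportCheck(const std::string& robots_txt, const std::string& url) {
      const std::vector<std::string> agents = {"FooBot"};  // hypothetical agent
      googlebot::RobotsMatcher matcher;
      const bool allowed = matcher.AllowedByRobots(robots_txt, &agents, url);
      std::cout << "allowed: " << allowed
                << ", disallowed ignoring '*' rules: "
                << matcher.disallow_ignore_global()
                << ", file mentioned FooBot: "
                << matcher.ever_seen_specific_agent()
                << ", matching line: " << matcher.matching_line() << "\n";
    }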
Verifies that the given user agent is valid to be matched against robots.txt. Valid user agent strings only contain the characters [a-zA-Z_-].
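A standalone sketch of that character check (illustrative only, not the library's implementation). Note that a product token such as "FooBot/1.2" fails, because '/', '.', and digits are outside [a-zA-Z_-].

    #include <string>

    // Returns true if 'user_agent' is non-empty and contains only characters
    // from [a-zA-Z_-], per the rule documented above. Illustrative sketch.
    bool LooksLikeValidUserAgent(const std::string& user_agent) {
      if (user_agent.empty()) return false;
      for (char c : user_agent) {
        const bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                        c == '_' || c == '-';
        if (!ok) return false;
      }
      return true;
    }

    // LooksLikeValidUserAgent("FooBot")     -> true
    // LooksLikeValidUserAgent("FooBot/1.2") -> false ('/', '.', digits excluded)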
Instead of just maintaining a Boolean indicating whether a given line has matched, we maintain a count of the maximum number of characters matched by that pattern.
For each of the directives within user-agents, we keep global and specific match scores.
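A minimal sketch of that bookkeeping; the names are made up here, and supporting the matching-line accessor above would additionally require storing the line number of each match.

    #include <algorithm>

    // Tracks the longest match seen so far for one directive type, instead of
    // a plain "matched" Boolean.
    struct MatchScore {
      int priority = -1;  // -1 means "no match yet"; 0 is a valid empty match.
      void Update(int num_chars_matched) {
        priority = std::max(priority, num_chars_matched);
      }
    };

    // For each directive (Allow, Disallow) we keep a global score, fed by
    // 'User-agent: *' groups, and a specific score for the queried agent.
    struct MatchHierarchySketch {
      MatchScore global;    // match from the global '*' groups
      MatchScore specific;  // match from the queried agent's groups
    };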
Characters of 'url' matching Allow.
Characters of 'url' matching Disallow.
True if we ever saw a block for our agent.
The path we want to pattern match. Not owned and only a valid pointer during the lifetime of *AllowedByRobots calls.
True if processing global agent rules.
True if saw any key: value pair.
True if processing our specific agent.
Any other unrecognized name/value pairs.
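Taken together, the member comments above describe per-check state that could be sketched roughly as follows, reusing MatchHierarchySketch from the earlier sketch; all names here are illustrative, not the actual members.

    #include <string>
    #include <vector>

    // Illustrative bundle of the per-check state described above.
    struct MatcherStateSketch {
      MatchHierarchySketch allow;     // characters of 'url' matching Allow
      MatchHierarchySketch disallow;  // characters of 'url' matching Disallow

      bool seen_global_agent = false;         // inside a 'User-agent: *' group
      bool seen_specific_agent = false;       // inside our agent's group
      bool ever_seen_specific_agent = false;  // file had a group for our agent
      bool seen_separator = false;            // saw at least one key: value pair

      // The path (plus params and query) being matched; valid only for the
      // duration of a single *AllowedByRobots call, hence non-owning.
      const char* path = nullptr;
    };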