RobotsMatcher

Matches robots.txt rules against URLs using the default longest-match strategy (see the constructor notes below).

Constructors

this
this()

Create a RobotsMatcher with the default matching strategy. The default is longest-match, as opposed to the first-match strategy prescribed by the earlier, now-expired internet draft. Analysis shows that longest-match, while more restrictive for crawlers, is what webmasters assume when writing directives: when both an Allow and a Disallow rule match, the longest match is the one the user wants. For example, given a robots.txt file with the rules

    Allow: /
    Disallow: /cgi-bin

the webmaster's intent is clear: allow crawling of every URI except those under /cgi-bin. Under the expired internet draft, however, crawlers would have been allowed to crawl everything, because the shorter Allow rule matches first.
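
A minimal sketch of that behavior using OneAgentAllowedByRobots (documented below); the module name robots is an assumption, not taken from this page:

---
import robots : RobotsMatcher; // assumed module name

unittest
{
    auto matcher = new RobotsMatcher(); // default longest-match strategy
    string robotsBody =
        "User-agent: *\n" ~
        "Allow: /\n" ~
        "Disallow: /cgi-bin\n";

    // "/cgi-bin" (8 characters) out-matches "/" (1 character), so the
    // Disallow rule wins under /cgi-bin and loses everywhere else.
    assert(matcher.OneAgentAllowedByRobots(robotsBody, "FooBot",
        "http://example.com/index.html"));
    assert(!matcher.OneAgentAllowedByRobots(robotsBody, "FooBot",
        "http://example.com/cgi-bin/run"));
}
---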

Members

Functions

AllowedByRobots
bool AllowedByRobots(string robots_body, const(string[]) user_agents, string url)

Returns true iff 'url' is allowed to be fetched by any member of the user_agents array. 'url' must be %-encoded according to RFC 3986.
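
A hedged usage sketch (module name assumed, as above):

---
import robots : RobotsMatcher; // assumed module name

unittest
{
    auto matcher = new RobotsMatcher();
    string robotsBody =
        "User-agent: FooBot\n" ~
        "User-agent: BarBot\n" ~
        "Disallow: /private\n";

    // Both listed agents fall under the same group, so /private is off
    // limits for the pair while other paths stay crawlable.
    assert(!matcher.AllowedByRobots(robotsBody, ["FooBot", "BarBot"],
        "http://example.com/private/data"));
    assert(matcher.AllowedByRobots(robotsBody, ["FooBot", "BarBot"],
        "http://example.com/public"));
}
---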

HandleAllow
void HandleAllow(int line_num, string value)

Parse callback for an 'Allow' line; records the match score of the rule's pattern against the current path.

HandleDisallow
void HandleDisallow(int line_num, string value)

Parse callback for a 'Disallow' line; records the match score of the rule's pattern against the current path.

HandleRobotsEnd
void HandleRobotsEnd()

Parse callback invoked once when parsing of the robots.txt body finishes.

HandleRobotsStart
void HandleRobotsStart()

Parse callback invoked once at the start of parsing a robots.txt body. The parse callbacks are protected because they are used in unittests. Never subclass RobotsMatcher; implement googlebot::RobotsParseHandler instead.
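
A hedged sketch of a custom handler built on this interface, collecting only Sitemap entries. The module name is an assumption, and the upstream C++ library drives these callbacks through a ParseRobotsTxt(robots_body, handler) entry point that the D port presumably mirrors:

---
import robots : RobotsParseHandler; // assumed module name

// Collects Sitemap URLs and ignores all other directives.
class SitemapCollector : RobotsParseHandler
{
    string[] sitemaps;

    override void HandleRobotsStart() {}
    override void HandleRobotsEnd() {}
    override void HandleUserAgent(int line_num, string value) {}
    override void HandleAllow(int line_num, string value) {}
    override void HandleDisallow(int line_num, string value) {}
    override void HandleUnknownAction(int line_num, string action, string value) {}

    override void HandleSitemap(int line_num, string value)
    {
        sitemaps ~= value; // 'value' is the sitemap URL from the line
    }
}
---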

HandleSitemap
void HandleSitemap(int line_num, string value)

Parse callback for a 'Sitemap' line. Sitemaps are not used when computing allow/disallow results.

HandleUnknownAction
void HandleUnknownAction(int line_num, string action, string value)

Parse callback for any unrecognized name/value pair; ignored when computing allow/disallow results.

HandleUserAgent
void HandleUserAgent(int line_num, string user_agent)

Parse callback for a 'User-agent' line; decides whether the rules that follow apply to the global agent ('*') or to one of the user agents being checked.

InitUserAgentsAndPath
void InitUserAgentsAndPath(const(string)[] user_agents, string path)

Initialize the path and user agents to check next. The path must contain only the path, params, and query (if any) of the URL, and must start with a '/'.
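
A minimal helper sketch, not part of the library, showing the path form this function expects for a given absolute URL:

---
import std.string : indexOf;

// Hypothetical helper: derive the "path + params + query" portion that
// InitUserAgentsAndPath expects from an absolute URL.
string pathParamsQuery(string url)
{
    auto schemeEnd = url.indexOf("://");
    size_t start = schemeEnd >= 0 ? schemeEnd + 3 : 0;
    auto slash = url[start .. $].indexOf('/');
    if (slash < 0)
        return "/"; // no path component at all
    auto path = url[start + slash .. $];
    auto hash = path.indexOf('#');
    return hash >= 0 ? path[0 .. hash] : path; // fragment is dropped
}

unittest
{
    assert(pathParamsQuery("http://example.com/search?q=dlang#top")
        == "/search?q=dlang");
}
---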

OneAgentAllowedByRobots
bool OneAgentAllowedByRobots(string robots_txt, string user_agent, string url)

Perform the robots check for 'url' when there is only one user agent. 'url' must be %-encoded according to RFC 3986.
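
Another hedged sketch (module name assumed): a group naming a specific user agent takes precedence over the global '*' group:

---
import robots : RobotsMatcher; // assumed module name

unittest
{
    auto matcher = new RobotsMatcher();
    string robotsBody =
        "User-agent: *\n" ~
        "Disallow: /\n" ~
        "User-agent: FooBot\n" ~
        "Allow: /\n";

    // FooBot has its own group, so the global "Disallow: /" does not apply.
    assert(matcher.OneAgentAllowedByRobots(robotsBody, "FooBot",
        "http://example.com/anything"));
}
---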

disallow
bool disallow()

Returns true if we are disallowed from crawling a matching URI.

disallow_ignore_global
bool disallow_ignore_global()

Returns true if we are disallowed from crawling a matching URI. Ignores any rules specified for the default (global '*') user agent, and bases its results only on the specified user agents.

ever_seen_specific_agent
bool ever_seen_specific_agent()

Returns true iff, when AllowedByRobots() was called, the robots file referred explicitly to one of the specified user agents.

matching_line
int matching_line()

Returns the line number that matched, or 0 if none matched.

seen_any_agent
bool seen_any_agent()

Returns true if any user-agent was seen.

Static functions

IsValidUserAgentToObey
bool IsValidUserAgentToObey(string user_agent)

Verifies that the given user agent is valid to be matched against robots.txt. Valid user-agent strings contain only the characters [a-zA-Z_-].
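
For instance (module name assumed, as above):

---
import robots : RobotsMatcher; // assumed module name

unittest
{
    assert(RobotsMatcher.IsValidUserAgentToObey("FooBot"));
    assert(RobotsMatcher.IsValidUserAgentToObey("Foo_Bot-Beta"));
    // '/', '.', digits, and spaces are not in [a-zA-Z_-]:
    assert(!RobotsMatcher.IsValidUserAgentToObey("FooBot/1.0"));
    assert(!RobotsMatcher.IsValidUserAgentToObey(""));
}
---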

Structs

Match
struct Match

Instead of just maintaining a Boolean indicating whether a given line has matched, we maintain a count of the maximum number of characters matched by that pattern. This count is what implements the longest-match strategy: the directive whose pattern matched the most characters wins.

MatchHierarchy
struct MatchHierarchy

For each directive (Allow and Disallow), we keep both a global match score (from 'User-agent: *' blocks) and a specific match score (from blocks naming one of the user agents being checked).
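
An illustrative sketch of how these scores resolve a conflict; the field and function names here are hypothetical, not taken from the source:

---
// Hypothetical shapes, for illustration only.
struct Match
{
    int priority = -1; // characters of the path matched; -1 means no match
    int line = 0;      // robots.txt line that produced the match
}

struct MatchHierarchy
{
    Match global;   // best score from "User-agent: *" blocks
    Match specific; // best score from blocks naming one of our agents
}

// Simplified longest-match resolution: prefer the specific agent's scores
// when any were recorded, and let the directive whose pattern matched more
// characters win; ties go to Allow.
bool disallowedByLongestMatch(MatchHierarchy allow, MatchHierarchy disallow)
{
    bool haveSpecific =
        allow.specific.priority >= 0 || disallow.specific.priority >= 0;
    Match a = haveSpecific ? allow.specific : allow.global;
    Match d = haveSpecific ? disallow.specific : disallow.global;
    return d.priority > a.priority;
}
---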

Variables

allow_
MatchHierarchy allow_;

Characters of 'url' matching Allow.

disallow_
MatchHierarchy disallow_;

Characters of 'url' matching Disallow.

ever_seen_specific_agent_
bool ever_seen_specific_agent_;

True if we ever saw a block for our agent.

match_strategy_
RobotsMatchStrategy match_strategy_;

The strategy used to match Allow/Disallow patterns against the path; the default is longest-match (see the constructor).

path_
string path_;

The path we want to pattern-match. Not owned, and only valid for the duration of the *AllowedByRobots calls.

seen_global_agent_
bool seen_global_agent_;

True if processing global agent rules.

seen_separator_
bool seen_separator_;

True if we saw any key: value pair.

seen_specific_agent_
bool seen_specific_agent_;

True if processing our specific agent.

user_agents_
const(string)[] user_agents_;

The user agents to check, as passed to AllowedByRobots / InitUserAgentsAndPath.

Inherited Members

From RobotsParseHandler

HandleRobotsStart
void HandleRobotsStart()

Called once at the start of parsing a robots.txt body.

HandleRobotsEnd
void HandleRobotsEnd()

Called once when parsing of the robots.txt body finishes.

HandleUserAgent
void HandleUserAgent(int line_num, string value)

Called for each 'User-agent' line.

HandleAllow
void HandleAllow(int line_num, string value)

Called for each 'Allow' line.

HandleDisallow
void HandleDisallow(int line_num, string value)

Called for each 'Disallow' line.

HandleSitemap
void HandleSitemap(int line_num, string value)

Called for each 'Sitemap' line.
HandleUnknownAction
void HandleUnknownAction(int line_num, string action, string value)

Any other unrecognized name/value pairs.
