Create a RobotsMatcher with the default matching strategy. The default matching strategy is longest-match, as opposed to the former internet draft, which specified a first-match strategy. Analysis shows that longest-match, while more restrictive for crawlers, is what webmasters assume when writing directives: in case of conflicting matches (both Allow and Disallow), the longest match is the one the user wants. For example, given a robots.txt file with the rules
  Allow: /
  Disallow: /cgi-bin
it is clear what the webmaster wants: allow crawling of every URI except /cgi-bin. Under the expired internet draft, however, crawlers would be allowed to crawl everything with such rules, because the first matching rule, Allow: /, would win.
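As an illustration of that resolution rule only (not the library's matching code, which also handles '*' and '$' wildcards), a standalone sketch using plain prefix matching:

    #include <iostream>
    #include <string>

    // Length of the prefix of 'path' matched by 'pattern', or -1 if there is
    // no match. Deliberately simplified: real robots.txt patterns also support
    // the '*' and '$' wildcards.
    int MatchLength(const std::string& pattern, const std::string& path) {
      return path.compare(0, pattern.size(), pattern) == 0
                 ? static_cast<int>(pattern.size())
                 : -1;
    }

    int main() {
      const std::string path = "/cgi-bin/script";
      const int allow = MatchLength("/", path);            // 1 character
      const int disallow = MatchLength("/cgi-bin", path);  // 8 characters
      // Longest match wins: Disallow (8) beats Allow (1), so the path is
      // blocked. Under first-match, "Allow: /" listed first would have won.
      std::cout << (disallow > allow ? "disallowed" : "allowed") << "\n";
    }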
Returns true iff 'url' is allowed to be fetched by any member of the "user_agents" vector. 'url' must be %-encoded according to RFC3986.
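A usage sketch for this check, assuming it is exposed as a method named AllowedByRobots on googlebot::RobotsMatcher that takes the robots.txt body, a pointer to the user-agent vector, and the URL; the header path, method name, and signature are assumptions, not stated above.

    #include <string>
    #include <vector>

    #include "robots.h"  // assumed header declaring googlebot::RobotsMatcher

    // Checks whether any of the listed (hypothetical) crawlers may fetch the URL.
    bool AnyCrawlerAllowed() {
      const std::string robots_txt =
          "User-agent: *\n"
          "Disallow: /private/\n";
      const std::vector<std::string> user_agents = {"FooBot", "BarBot"};
      googlebot::RobotsMatcher matcher;
      // The URL must already be %-encoded per RFC 3986.
      return matcher.AllowedByRobots(robots_txt, &user_agents,
                                     "https://example.com/private/data");
    }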
Parse callbacks. Protected because they are used in the unit tests. Never override RobotsMatcher; implement googlebot::RobotsParseHandler instead.
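For custom handling of directives, the guidance above is to implement googlebot::RobotsParseHandler rather than subclass RobotsMatcher. A sketch of such a handler, assuming the callback interface has roughly the shape shown below; the virtual method names and signatures are assumptions, not guaranteed by this section.

    #include <iostream>

    #include "absl/strings/string_view.h"
    #include "robots.h"  // assumed header declaring googlebot::RobotsParseHandler

    // Collects Sitemap directives and ignores everything else. The overridden
    // signatures are assumptions about the parse-handler interface.
    class SitemapCollector : public googlebot::RobotsParseHandler {
     public:
      void HandleRobotsStart() override {}
      void HandleRobotsEnd() override {}
      void HandleUserAgent(int line_num, absl::string_view value) override {}
      void HandleAllow(int line_num, absl::string_view value) override {}
      void HandleDisallow(int line_num, absl::string_view value) override {}
      void HandleSitemap(int line_num, absl::string_view value) override {
        std::cout << "sitemap on line " << line_num << ": " << value << "\n";
      }
      void HandleUnknownAction(int line_num, absl::string_view action,
                               absl::string_view value) override {}
    };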
Initialize next path and user-agents to check. Path must contain only the path, params, and query (if any) of the url and must start with a '/'.
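A sketch of what "only the path, params, and query" means for a full URL; the helper below is hypothetical and not the library's own extraction code.

    #include <string>

    // Hypothetical helper: keeps the path, params, and query of 'url' and drops
    // the scheme, authority, and fragment, e.g.
    //   "https://example.com/dir/page?x=1#frag" -> "/dir/page?x=1"
    // Assumes 'url' is already %-encoded.
    std::string PathParamsQuery(const std::string& url) {
      std::size_t start = 0;
      const std::size_t scheme_end = url.find("://");
      if (scheme_end != std::string::npos) {
        start = url.find('/', scheme_end + 3);       // skip "scheme://authority"
        if (start == std::string::npos) return "/";  // URL has no path
      }
      std::size_t end = url.find('#', start);  // strip the fragment
      if (end == std::string::npos) end = url.size();
      std::string path = url.substr(start, end - start);
      if (path.empty() || path[0] != '/') path = "/" + path;
      return path;
    }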
Do robots check for 'url' when there is only one user agent. 'url' must be %-encoded according to RFC3986.
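A short usage sketch for the single-agent variant, assuming it is exposed as OneAgentAllowedByRobots; as before, the name, header, and signature are assumptions.

    #include <string>

    #include "robots.h"  // assumed header declaring googlebot::RobotsMatcher

    // Convenience check for a single (hypothetical) user agent.
    bool FooBotMayFetch(const std::string& robots_txt, const std::string& url) {
      googlebot::RobotsMatcher matcher;
      return matcher.OneAgentAllowedByRobots(robots_txt, "FooBot", url);
    }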
Returns true if we are disallowed from crawling a matching URI.
Returns true if we are disallowed from crawling a matching URI. Ignores any rules specified for the default user agent, and bases its results only on the specified user agents.
Returns true iff, when AllowedByRobots() was called, the robots file referred explicitly to one of the specified user agents.
Returns the number of the line that matched, or 0 if none matched.
Returns true if any user-agent was seen.
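These accessors describe the state of the most recent check. A sketch of querying them after a call to the multi-agent check sketched earlier; the accessor names disallow_ignore_global, ever_seen_specific_agent, and matching_line are assumptions about how the getters above are declared.

    #include <iostream>
    #include <string>
    #include <vector>

    #include "robots.h"  // assumed header declaring googlebot::RobotsMatcher

    // Runs one check and reports the per-check accessors documented above.
    void ReportCheck(const std::string& robots_txt, const std::string& url) {
      const std::vector<std::string> agents = {"FooBot"};  // hypothetical agent
      googlebot::RobotsMatcher matcher;
      const bool allowed = matcher.AllowedByRobots(robots_txt, &agents, url);
      std::cout << "allowed: " << allowed
                << ", disallowed ignoring '*' rules: "
                << matcher.disallow_ignore_global()
                << ", file mentioned FooBot: "
                << matcher.ever_seen_specific_agent()
                << ", matching line: " << matcher.matching_line() << "\n";
    }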
Verifies that the given user agent is valid to be matched against robots.txt. Valid user agent strings only contain the characters [a-zA-Z_-].
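A standalone sketch of that character check (illustrative only, not the library's implementation). Note that a product token such as "FooBot/1.2" fails, because '/', '.', and digits are outside [a-zA-Z_-].

    #include <string>

    // Returns true if 'user_agent' is non-empty and contains only characters
    // from [a-zA-Z_-], per the rule documented above. Illustrative sketch.
    bool LooksLikeValidUserAgent(const std::string& user_agent) {
      if (user_agent.empty()) return false;
      for (char c : user_agent) {
        const bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                        c == '_' || c == '-';
        if (!ok) return false;
      }
      return true;
    }

    // LooksLikeValidUserAgent("FooBot")     -> true
    // LooksLikeValidUserAgent("FooBot/1.2") -> false ('/', '.', digits excluded)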
Instead of just maintaining a Boolean indicating whether a given line has matched, we maintain a count of the maximum number of characters matched by that pattern.
For each of the directives within user-agents, we keep global and specific match scores.
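A minimal sketch of that bookkeeping; the names are made up here, and supporting the matching-line accessor above would additionally require storing the line number of each match.

    #include <algorithm>

    // Tracks the longest match seen so far for one directive type, instead of
    // a plain "matched" Boolean.
    struct MatchScore {
      int priority = -1;  // -1 means "no match yet"; 0 is a valid empty match.
      void Update(int num_chars_matched) {
        priority = std::max(priority, num_chars_matched);
      }
    };

    // For each directive (Allow, Disallow) we keep a global score, fed by
    // 'User-agent: *' groups, and a specific score for the queried agent.
    struct MatchHierarchySketch {
      MatchScore global;    // match from the global '*' groups
      MatchScore specific;  // match from the queried agent's groups
    };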
Characters of 'url' matching Allow.
Characters of 'url' matching Disallow.
True if we ever saw a block for our agent.
The path we want to pattern match. Not owned and only a valid pointer during the lifetime of *AllowedByRobots calls.
True if processing global agent rules.
True if saw any key: value pair.
True if processing our specific agent.
Any other unrecognized name/value pairs.
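Taken together, the member comments above describe per-check state that could be sketched roughly as follows, reusing MatchHierarchySketch from the earlier sketch; all names here are illustrative, not the actual members.

    #include <string>
    #include <vector>

    // Illustrative bundle of the per-check state described above.
    struct MatcherStateSketch {
      MatchHierarchySketch allow;     // characters of 'url' matching Allow
      MatchHierarchySketch disallow;  // characters of 'url' matching Disallow

      bool seen_global_agent = false;         // inside a 'User-agent: *' group
      bool seen_specific_agent = false;       // inside our agent's group
      bool ever_seen_specific_agent = false;  // file had a group for our agent
      bool seen_separator = false;            // saw at least one key: value pair

      // The path (plus params and query) being matched; valid only for the
      // duration of a single *AllowedByRobots call, hence non-owning.
      const char* path = nullptr;
    };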