Automated web crawlers are important tools for crawling and indexing content on the internet. Webmasters can use them to their advantage by curating content in a way that benefits their brand and by keeping crawlers away from irrelevant content. Here, you will find standard ways to control the crawling and indexing of your website's content. The methods described are, for the most part, supported by all of the major search engines and web crawlers. Most websites have no default restrictions on crawling, indexing, or serving links in search results, so to start off you do not have to do anything with your content. If you would like every page on your website to be indexed, you do not have to modify anything. There is no need to create a robots.txt file if you are okay with all URLs on the site being crawled and indexed by search engines.
Starting Off
Search engines go through two important stages to make a website's content available to users in search results: crawling and indexing. Crawling is when the search engine's crawlers (bots) access a publicly available webpage. For the most part, this only means that the bot looks at the webpage and follows the links on the page the same way that a human would. Indexing is when information is gathered about the pages so that it can be displayed on a search results page. The difference between crawling and indexing is vital. Many people confuse the two, and that confusion can lead to a web page unexpectedly appearing or not appearing in search results. It is possible for a page to be crawled but not indexed, and only rarely is a page indexed but not crawled. Also, if you're aiming to prevent the indexing of a page, you will need to allow the URL to be crawled, or at least attempted to be crawled.
Here, you will find help on controlling aspects of crawling and indexing, so that you can best determine how you would prefer your own content to be accessed by the bots that crawl it, and how you would like your content to be presented to users in the search results.
It is possible that in a particular situation, you would not want a crawler to access a certain area of a server. This might be because of limited server resources, or because of problems with the URL or linking structure that produce a never-ending number of URLs, making it impossible for all of them to be crawled.
Other times, you'd want to control how your content is indexed, and how it is presented within search results. You might not want your pages indexed at all, or would like them to appear without a certain part of the content.
NOTE: Do not use these methods to control access to private content. Use a more secure method, such as proper authentication, to hide content that is not for the public.
ALSO: it is possible for a page to be indexed but never crawled—these processes are not dependent on one another. If a page has enough information available and is deemed relevant to users, a search engine might decide to index it in search results even if it was never crawled. That is why it is important to be able to have control over which content is crawled and indexed.
It is possible to control indexing one page at a time by using information that is delivered with each page as it is crawled by a bot. You may use a specific meta tag embedded at the top of the HTML page, or a specific HTTP header that is served with the content. Both of these methods give you some control over how your page is indexed.
Robots.txt
When using a robots.txt file, it must be located in the top-level directory of the host and must be accessible via the correct protocol and port number. The most widely accepted protocols for robots.txt are HTTP and HTTPS. Google will also accept robots.txt files over FTP, using an anonymous login. The directives listed within the file apply only to the host, protocol, and port number on which the file is hosted. Also know that the URLs for robots.txt files are case sensitive.
When a robots.txt file is fetched, the outcome will be full allow, full disallow, or conditional allow. A robots.txt file can be created using almost any text editor, as long as it can create standard ASCII or UTF-8 text files. Don't use a word processor, as they sometimes add unexpected characters that will break the file.
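Because the directives apply only to the host, protocol, and port on which the file is served, each such combination has its own robots.txt URL. The sketch below shows one way to derive that URL from any page URL with Python's standard library. The function name robots_txt_url and the example.com addresses are illustrative assumptions, not part of any crawler's API.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and netloc (host plus optional port);
    # drop the path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Different protocol or port means a different robots.txt file.
print(robots_txt_url("https://example.com:8443/shop/item?id=7"))
print(robots_txt_url("http://example.com/shop/"))
```

Note that the path is always the lowercase /robots.txt at the top level, whatever directory the page itself lives in.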
Not really sure what the robots.txt file looks like? Here are a few examples to get you familiar.
To allow all content to be crawled, you will see:
user-agent: *
disallow:
or you will see
user-agent: *
allow: /
While both of these entries are valid, if you do want all of your content to be crawled, you are not required to create a robots.txt file, and in fact it is recommended that you don't use one.
To disallow a whole website to be crawled, you will see:
user-agent: *
disallow: /
To disallow the crawling of specific parts of a website you will see something like:
user-agent: *
disallow: /junk/
disallow: /calendar/
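To see how rules like these are evaluated, here is a minimal sketch using Python's standard urllib.robotparser module. It is only one implementation of robots.txt matching, and individual search engines may interpret edge cases differently. The crawler name ExampleBot and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example above, held in a string for the demo.
RULES = """\
user-agent: *
disallow: /junk/
disallow: /calendar/
"""


def build_parser(robots_text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(RULES)
for path in ("/index.html", "/junk/old.html", "/calendar/2020/"):
    # can_fetch() answers: may this user-agent crawl this URL?
    print(path, parser.can_fetch("ExampleBot", "https://example.com" + path))
```

Only the paths that begin with /junk/ or /calendar/ are reported as blocked; everything else on the host remains crawlable.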
Note that if you want to block access to private content on the website, you should use proper authentication; do not rely on robots.txt for this. If you use robots.txt to block private content, the content could still be indexed without being crawled, and because the robots.txt file can be seen by anyone, it could expose the location of your private content.
To allow access for a single crawler, you will see:
user-agent: Googlebot-news
disallow:
user-agent: *
disallow: /
To allow access to every crawler except one, you will see:
user-agent: unnecessarybot
disallow: /
user-agent: *
disallow:
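The two per-crawler examples above can be checked the same way. This is a sketch under the assumption that Python's urllib.robotparser is an acceptable stand-in for a real crawler's matching logic. The helper name allowed, the crawler name OtherBot, and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Allow a single crawler, block everyone else.
ONLY_ONE = """\
user-agent: Googlebot-news
disallow:

user-agent: *
disallow: /
"""

# Block a single crawler, allow everyone else.
ALL_BUT_ONE = """\
user-agent: unnecessarybot
disallow: /

user-agent: *
disallow:
"""


def allowed(robots_text, user_agent, url="https://example.com/page.html"):
    """Return True if user_agent may crawl url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ONLY_ONE, "Googlebot-news"), allowed(ONLY_ONE, "OtherBot"))
print(allowed(ALL_BUT_ONE, "unnecessarybot"), allowed(ALL_BUT_ONE, "OtherBot"))
```

An empty disallow line blocks nothing, which is why the named crawler in the first file and the wildcard group in the second file are both allowed everywhere.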
Robots Meta Tag and X-Robots Tag
A robots meta tag may be added at the top of an HTML page, in the head section. It indicates whether or not a search engine may index that particular page on the website. By default this tag applies to all search engines, and you can control which search engines are allowed or blocked by specifying the name of a user-agent in place of “robots” within the code. This code would look like <meta name="robots" content="noindex" />.
Content that is not HTML (like a document file) may also be crawled and indexed by a search engine. In that case it is not possible to add a meta tag to the file itself; instead, you use an HTTP header in the response. The header cannot be seen by visitors of the website and is not part of the content. An X-Robots-Tag would be included in the header, for example X-Robots-Tag: noindex.
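As a rough illustration of how a crawler might read both signals, the sketch below checks an HTML string for a robots meta tag and a header mapping for an X-Robots-Tag. It uses only Python's standard library. The names RobotsMetaParser and is_noindex are hypothetical, and real crawlers handle many more directives and edge cases than this.

```python
from html.parser import HTMLParser


def split_directives(value):
    """Split a comma-separated directive string into lowercase tokens."""
    return [d.strip().lower() for d in value.split(",") if d.strip()]


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            self.directives += split_directives(attr.get("content") or "")


def is_noindex(html, headers):
    """Return True if the page or its response headers ask not to be indexed."""
    meta = RobotsMetaParser()
    meta.feed(html)
    header_directives = []
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-robots-tag":
            header_directives += split_directives(value)
    return "noindex" in meta.directives or "noindex" in header_directives


page = '<html><head><meta name="robots" content="noindex" /></head></html>'
print(is_noindex(page, {}))
print(is_noindex("", {"X-Robots-Tag": "noindex"}))
print(is_noindex("<p>hello</p>", {}))
```

The header route is the only option for files such as PDFs, since there is no HTML head in which to place a meta tag.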
APIs-Google User Agent
This user-agent is specific to Google and delivers push notification messages. App developers can request these notifications to avoid constantly polling servers to find out whether resources have changed. To make sure that nobody abuses this service, Google asks developers to prove that they own a domain before allowing them to register a URL on that domain as the place where they would like to receive messages. APIs-Google sends all push notifications using an HTTP POST request. If delivery fails because of something that may be temporary, APIs-Google will send the notification again. If this still does not work, APIs-Google will keep trying, sometimes for up to a few days.
APIs-Google accesses sites at a rate that varies with the number of push notification requests that have been created for your site's servers, the number of retries that are occurring, and how quickly the monitored resources are being updated. Because of this, the traffic patterns for APIs-Google may be consistent or they may be sporadic; it all just depends.
When you are the administrator of a domain with multiple subdomains that are owned or administered separately, one of those admins might have set up applications that send push notifications. If you'd like to block APIs-Google, it is advised that you first contact any administrators who could have set up such an application. Alternatively, you could use the normal robots.txt directives to block APIs-Google from accessing your website. If doing this, you will need to specify APIs-Google as the user-agent in the robots.txt file. It is possible to control APIs-Google separately from Googlebot because they follow different directives.
APIs-Google uses HTTPS to deliver push notifications, and it requires a website to have a valid SSL certificate. Invalid certificates include a self-signed certificate, a certificate that has been revoked, and a certificate that has been signed by a source that is not trusted. To avoid retry requests, the application should be well designed and should respond to a notification message within seconds.
Every now and then the IP addresses used by APIs-Google will change, and anyone can set their user-agent to whatever they'd like. The best way to make sure that Google is accessing the site is to use a reverse DNS lookup, which is similar to the way that you would verify that a bot trying to access your server is a true Googlebot. In this case, you will need to look in your logs for any IP address that is associated with the APIs-Google user-agent, and the lookup will identify the domain as “googlebot.com”.
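The lookup described above might be sketched as follows with Python's socket module. Two things here are assumptions beyond the text: the function name is_google_host is made up, and the extra forward lookup (resolving the returned hostname back to the same IP) is added as a common safeguard against spoofed reverse DNS records. The resolver arguments exist only so the logic can be exercised without network access.

```python
import socket


def is_google_host(ip, reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    """Check an IP from your logs via reverse DNS, then confirm it forward."""
    try:
        hostname = reverse(ip)[0]        # reverse DNS: IP -> hostname
    except OSError:
        return False                     # no reverse record at all
    if not hostname.endswith(".googlebot.com"):
        return False                     # not in the expected domain
    try:
        addresses = forward(hostname)[2]  # forward DNS: hostname -> IPs
    except OSError:
        return False
    # Assumption: require the forward lookup to map back to the same IP.
    return ip in addresses


# Offline demonstration with stand-in resolvers and a documentation-range IP.
def fake_reverse(ip):
    return ("crawl-host.googlebot.com", [], [ip])


def fake_forward(host):
    return (host, [], ["192.0.2.10"])


print(is_google_host("192.0.2.10", fake_reverse, fake_forward))
```

In real use you would call is_google_host(ip) with the default resolvers for each IP address that your logs associate with the APIs-Google user-agent.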
Google's Crawlers
Crawler is a generic term for any robot or spider program that automatically scans and discovers websites by following links from one web page to the next. Googlebot is Google's main crawler and is widely used. When there is more than one user-agent listed in a robots.txt file, Google will follow the one that is most specific. If you want all of Google to be able to crawl your web pages, you do not need a robots.txt file at all. If you aim to block Google's crawlers from, or allow them access to, any of your content, this can be done by specifying Googlebot as the user-agent.