Robots.txt is a text file that tells search engines which parts of your website should not be indexed.
A simple robots.txt file is built from three items:
- Comments: Lines that start with the # character are treated
as comments. You can write comments at the top of the file. See the
example below for further detail.
- User-agent: The User-agent line specifies the name of the
search engine robot. You can either name a specific search
engine robot or use the wildcard character "*" to address all robots.
- Disallow: The ‘Disallow’ directive specifies the folder or
file that you don’t want the specified robot to index. All paths should
start from the root of the website.
The example below shows a sample robots.txt:
# Robots.txt file for http://www.googlehelpline.com/
# Contact webmaster@googlehelpline.com
User-agent: *
Disallow: /myalbum
Disallow: /foo.html
User-agent: Googlebot
Disallow: /temp
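To make the three items concrete, here is a minimal Python sketch that splits a robots.txt text into its comments and its User-agent groups. The function name parse_groups is hypothetical, the Google group is written with the Googlebot token as an assumption, and the sketch ignores inline comments and any directive other than User-agent and Disallow.

```python
def parse_groups(text):
    """Split robots.txt text into (comments, {user_agent: [disallowed paths]})."""
    comments, groups, current_agents = [], {}, []
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Whole-line comment: keep the text without the leading '#'.
            comments.append(line.lstrip("#").strip())
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not last_was_agent:
                current_agents = []  # a new group starts here
            current_agents.append(value)
            groups.setdefault(value, [])
            last_was_agent = True
        elif field == "disallow":
            # A Disallow line belongs to every agent named in the current group.
            for agent in current_agents:
                groups[agent].append(value)
            last_was_agent = False
    return comments, groups


SAMPLE = """\
# Contact webmaster@googlehelpline.com
User-agent: *
Disallow: /myalbum
Disallow: /foo.html
User-agent: Googlebot
Disallow: /temp
"""

comments, groups = parse_groups(SAMPLE)
print(comments)
print(groups)
```

For the sample, the result has one comment and two groups: "*" with two disallowed paths and "Googlebot" with one.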
Steps to build robots.txt:
- Identify the content of your website that you would like
to exclude from search engine indexing.
- Create a blank text file named “robots.txt” (all lowercase).
- Copy the example format above into “robots.txt”.
- Modify the content of the file to suit your website.
- Upload the file to the root directory of your website and check that you can access it at www.yourwebsite.com/robots.txt
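If you prefer to generate the file rather than type it by hand, the steps above could be sketched in Python roughly as follows. The helper names build_robots_txt and write_robots_txt are hypothetical, and the rules passed in are only placeholders. Uploading the result to your web server is left out because it depends on your hosting.

```python
from pathlib import Path


def build_robots_txt(rules):
    """Return robots.txt text for a {user_agent: [disallowed paths]} mapping."""
    groups = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        # An empty list means nothing is disallowed for this agent.
        for path in paths or [""]:
            if path and not path.startswith("/"):
                raise ValueError(f"path must start with '/': {path!r}")
            lines.append(f"Disallow: {path}".rstrip())
        groups.append("\n".join(lines))
    # Separate groups with a blank line for readability.
    return "\n\n".join(groups) + "\n"


def write_robots_txt(rules, directory="."):
    """Write the generated text to <directory>/robots.txt and return its path."""
    target = Path(directory) / "robots.txt"
    target.write_text(build_robots_txt(rules), encoding="utf-8")
    return target


if __name__ == "__main__":
    print(build_robots_txt({"*": ["/temp/", "/images/"]}))
```

Rejecting paths without a leading slash is a design choice of this sketch, made because all paths in robots.txt should start from the root of the website.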
Examples:
- Allow all robots to visit all files:
User-agent: *
Disallow:
- Keep all robots out:
User-agent: *
Disallow: /
- Keep all robots away from the ‘temp’ and ‘images’ directories:
User-agent: *
Disallow: /temp/
Disallow: /images/
- Keep Googlebot away from all files on the server:
User-agent: googlebot
Disallow: /
- Keep Googlebot from indexing the test.htm file:
User-agent: googlebot
Disallow: /test.htm
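One way to check how rules like these behave before uploading them is Python's standard urllib.robotparser module. The sketch below is only an approximation: the robot name "AnyBot" and the URL paths are made up for illustration, and real search engine crawlers may interpret rules somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(rules, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_DIRS = "User-agent: *\nDisallow: /temp/\nDisallow: /images/\n"
BLOCK_GOOGLEBOT = "User-agent: googlebot\nDisallow: /\n"

# An empty Disallow line blocks nothing; "Disallow: /" blocks everything.
print(can_fetch(ALLOW_ALL, "AnyBot", "/index.html"))
print(can_fetch(BLOCK_ALL, "AnyBot", "/index.html"))
# Only paths under the listed directories are blocked.
print(can_fetch(BLOCK_DIRS, "AnyBot", "/temp/draft.html"))
print(can_fetch(BLOCK_DIRS, "AnyBot", "/index.html"))
# A group that names one robot does not affect the others.
print(can_fetch(BLOCK_GOOGLEBOT, "Googlebot", "/index.html"))
print(can_fetch(BLOCK_GOOGLEBOT, "AnyBot", "/index.html"))
```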
Robots.txt FAQs:
Question: Does a robots.txt file allow me to hide just part of a web page from search engine listings?
Answer: No. Rules in robots.txt apply to whole files or folders, not to parts of a page.
Question: How long does it take for robots.txt to start working?
Answer: Once you upload robots.txt to the root directory of your website, it takes effect immediately. The next time a
search engine crawler visits your website, it will read the robots.txt file first and only then go further.
Question: How do I remove non-HTML pages (e.g. Word documents) from the Google cache?
Answer: Sometimes your website contains non-HTML documents (e.g. Excel, Word, or PDF files) that you do not want indexed for search.
The problem with these documents is that you cannot use an HTML meta tag to stop search engines from indexing them.
Robots.txt is the best solution for this kind of problem. The rule below keeps all .doc files from being indexed.
User-agent: googlebot
Disallow: /*.doc
You can also specify the name of a single document that you don't want indexed.
User-agent: googlebot
Disallow: /test.xls
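To illustrate how a wildcard rule such as /*.doc can be matched against a URL path, here is a small Python sketch. The function name rule_matches is hypothetical. The trailing "$" (end of URL) anchor it handles is not used in the rules above; it is included as an assumption because Googlebot documents it alongside "*". Some simpler parsers only compare plain path prefixes, so wildcard support should be checked for each robot.

```python
import re


def rule_matches(pattern, path):
    """Return True if a Disallow pattern using '*' or a trailing '$' matches the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so an unanchored rule acts as a prefix match.
    return re.match(regex, path) is not None


print(rule_matches("/*.doc", "/files/report.doc"))
print(rule_matches("/*.doc", "/index.html"))
print(rule_matches("/test.xls", "/test.xls"))
```

Because rules are matched as prefixes, /*.doc also matches /files/report.docx in this sketch, while /*.doc$ would match only paths that end in .doc.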