Ads

Thursday, May 22, 2014

ROBOTS.TXT - How to Block Pages, Directories?


ROBOTS.TXT  RULES:



Here are the ROBOTS.TXT commands which can be used by creating a notepad file and place these codes if you would like to block a page, file, directory or whole site.

List of ROBOTS.TXT COMMANDS

  • To block the entire site, use a forward slash.
    Disallow: /
  • To block a directory and everything in it, follow the directory name with a forward slash.
    Disallow: /junk-directory/
  • To block a page, list the page.
    Disallow: /private_file.html
  • To remove a specific image from Google Images, add the following:
    User-agent: Googlebot-Image
    Disallow: /images/dogs.jpg 
  • To remove all images on your site from Google Images:
    User-agent: Googlebot-Image
    Disallow: / 
  • To block files of a specific file type (for example, .gif), use the following:
    User-agent: Googlebot
    Disallow: /*.gif$
  • To prevent pages on your site from being crawled, while still displaying AdSense ads on those pages, disallow all bots other than Mediapartners-Google. This keeps the pages from appearing in search results, but allows the Mediapartners-Google robot to analyze the pages to determine the ads to show. The Mediapartners-Google robot doesn't share pages with the other Google user-agents. For example:
    User-agent: *
    Disallow: /
    
    User-agent: Mediapartners-Google
    Allow: /
Note that directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore white-space (in particular empty lines)and unknown directives in the robots.txt.

Pattern matching

Googlebot (but not all search engines) respects some pattern matching.
  • To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
    User-agent: Googlebot
    Disallow: /private*/
  • To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
    User-agent: Googlebot
    Disallow: /*?
  • To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
    User-agent: Googlebot 
    Disallow: /*.xls$
    You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
    User-agent: *
    Allow: /*?$
    Disallow: /*?
    The Disallow: / *? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
    The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
Save your robots.txt file by downloading the file or copying the contents to a text file and saving as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.

REF: https://support.google.com/webmasters/answer/156449?hl=en 

No comments:

Google Latest Algorithm Updates 2013 | SEO News, SEO Updates EMD, Penguin, Panda Headline Animator