I discovered 2 months ago that a large site I work on had not been indexed by Google for A Long Time. When I say not indexed, I mean that, when I search for this site, pages on a certain subdomain had a Cached date from many months ago.
I’ve learned a lot while trying to fix this problem, so I thought I’d share it here.
If your cached date is significantly in the past, then one of two things is happening:
- Google is not crawling
- Google is crawling but not indexing
Google is not crawling
First, check your robots.txt. If you have it, it could have an error in it. Even if you do not have it, it could still be your problem.
Sign up for Google Webmaster Tools, verify your site, and see if Google reports any errors in Tools > Analyze robots.txt.
If it reports errors, then your path is obvious. If it does not, then look at the Status.
Network unreachable: robots.txt unreachable
Your ISP may be blocking the Googlebot. It can make so many crawling requests that an ISP may mistake them for ‘abuse’. Talk to your ISP or system administrator to see if they are blocking IPs in the 66.249. range. (You should research other IP blocks for the Googlebot, as they could change at any time.)
If your ISP swears this isn’t the case, but you don’t see the Googlebot in your server logs, keep asking them until they’re sure and extremely sick of you. On an Apache system, these logs are typically found in /var/log/httpd, though the exact location varies by distribution.
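To make the log check concrete, here is a minimal sketch that filters access log lines for anything that looks like the Googlebot. It assumes the Apache "combined" log format (client IP first, user agent somewhere in the line) and uses the 66.249. prefix mentioned above, which may be out of date. The function name and the two sample lines are made up for illustration; they are not taken from a real log.

```python
import re

# Assumptions: the user agent contains "Googlebot", and the crawler comes
# from the 66.249. range named in this article. Both can change over time.
GOOGLEBOT_UA = re.compile(r"googlebot", re.IGNORECASE)
GOOGLEBOT_IP_PREFIX = "66.249."


def googlebot_hits(log_lines):
    """Return the log lines that look like requests from the Googlebot."""
    hits = []
    for line in log_lines:
        # In the combined log format the client IP is the first field.
        client_ip = line.split(" ", 1)[0]
        if GOOGLEBOT_UA.search(line) or client_ip.startswith(GOOGLEBOT_IP_PREFIX):
            hits.append(line)
    return hits


# Invented sample lines, only here to show the expected input shape.
SAMPLE_LINES = [
    '66.249.66.1 - - [10/Oct/2008:13:55:36 -0700] "GET /robots.txt HTTP/1.1" '
    '302 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.7 - - [10/Oct/2008:13:56:01 -0700] "GET / HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0"',
]

for hit in googlebot_hits(SAMPLE_LINES):
    print(hit)
```

To run it against a real log, open the file and pass the file object in place of SAMPLE_LINES. If the result is empty over a period of several days, that supports the theory that something upstream is blocking the crawler.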
Please check back later.
If Webmaster Tools > Diagnostics > Web Crawl reports 0 errors as well, but you have URLs timing out, check whether it is the robots.txt that is timing out. If it is, and/or your httpd logs show the Googlebot hitting your server, then the Googlebot is not blocked, but it can’t access robots.txt.
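When Google requests robots.txt, it can only understand two things:
- The robots.txt file was found and read (a 200 response containing the file)
- The robots.txt file was not found (a 404 response)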
You may think you are covered if you do not have a robots.txt. That counts as not found, right? Only if it returns a 404 header.
If you have access to your *nix server’s command line, use wget to fetch the website in question. wget will access the website and let you see the information that the browser normally gets. It will then save the page’s content into a file.
wget www.yoursite.com/robots.txt
--13:59:51-- http://www.yoursite.com/robots.txt
Resolving www.yoursite.com...
Connecting to www.yoursite.com| |:80... connected.
HTTP request sent, awaiting response... 404 Not Found
A 404 Not Found means the file is not there. This page will either say 404 on it (you’ve seen them before, trust me) or it will be a custom error page with your branding and a friendly message.
A 404 is an appropriate response for a Googlebot looking for robots.txt. Strangely, my site did NOT return a 404 in this case. See below.
wget www.yoursite.com/robots.txt
--13:54:13-- www.yoursite.com
Resolving www.yoursite.com...
Connecting to www.yoursite.com| |:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: www.yoursite.com/error.html?errorpage=robots.txt
[following]
--13:54:14-- www.yoursite.com/error.html
Reusing existing connection to www.yoursite.com:80
HTTP request sent, awaiting response... 200 OK
A 302 is a temporary redirect, as denoted by the “Moved Temporarily” response. For instance, let’s say you have a store which is closed for a 2-week inventory process. You might use a temporary redirect to point www.yoursite.com/store to a message about the 2-week unavailability. After 2 weeks, you would remove it.
If your site has moved permanently, then you should use a 301 Permanent Redirect. The best example is when you move your site from www.domain.com/yoursite to www.yoursite.com. You want anyone going to www.domain.com/yoursite to be automatically transferred to its new home at www.yoursite.com for the foreseeable future.
Google treats 302s and 301s very differently, as you’ll see here:
302 (Moved temporarily) The server is currently responding to the request with a page from a different location, but the requestor should continue to use the original location for future requests. This code is similar to a 301 in that for a GET or HEAD request, it automatically forwards the requestor to a different location, but you shouldn’t use it to tell the Googlebot that a page or site has moved because Googlebot will continue to crawl and index the original location.
In my example above, the use of a 302 redirect to an error page that doesn’t even return a 404 is telling Google:
“Go here to find robots.txt, except it’s not here, but I say it is a success (200). By the way, don’t index it, either — use the old page (302).” This is an entirely unacceptable response. It doesn’t meet the criteria stated above:
The robots.txt file was not found, but there was no 404 returned. A 200 was returned, but this wasn’t the robots.txt, and the 302 means that Google should ignore that page anyway. The end result: every request for robots.txt ‘times out’ because the Googlebot can’t understand whether it has permission to index the site. It crawls, but it doesn’t save, and the site’s cache date sits for months. It turns out that our switch to this 302 redirect system was exactly timed with when Google stopped indexing.
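If you would rather script this check than read wget output by eye, the sketch below requests a URL without automatically following redirects, records the status of each hop, and then labels the chain. The labels reflect my reading of the behavior described in this article, not Google's documented algorithm, and both function names are hypothetical. The URL must include the scheme (http:// or https://).

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def fetch_status_chain(url, max_hops=5):
    """Follow redirects by hand and return [(status, url), ...] for each hop."""
    chain = []
    for _ in range(max_hops):
        parts = urlsplit(url)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
        else:
            conn = http.client.HTTPConnection(parts.netloc, timeout=10)
        try:
            target = parts.path or "/"
            if parts.query:
                target += "?" + parts.query
            conn.request("GET", target)
            response = conn.getresponse()
            response.read()  # drain the body; only the headers matter here
            status = response.status
            location = response.getheader("Location")
        finally:
            conn.close()
        chain.append((status, url))
        if status not in REDIRECT_STATUSES or not location:
            break
        url = urljoin(url, location)  # Location may be relative
    return chain


def classify_robots_chain(chain):
    """Label a status chain as this article reads it (an assumption)."""
    statuses = [status for status, _ in chain]
    final = statuses[-1]
    if final == 404:
        return "not found"
    if final == 200 and 302 in statuses:
        # The case described above: a temporary redirect to a page
        # that claims success but is not robots.txt.
        return "ambiguous"
    if final == 200:
        return "found"
    return "other"
```

Calling classify_robots_chain(fetch_status_chain("http://www.yoursite.com/robots.txt")) on a site that behaves like mine did would produce a 302 hop followed by a 200 hop, which this sketch labels "ambiguous".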
Your options are:
- Put up a blank robots.txt. This will count as found and read.
- Put up a non-blank robots.txt with whatever criteria you want. This will count as found and read.
- Change your error system to return a 404. You can still have a custom error page that returns a 404 rather than a 200.
- Change the redirect to be a 301. Google does understand a 301 pointing to a robots.txt, or a 301 pointing to a 404.
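The third option is the one people most often assume is impossible, so here is a small sketch of it: a server that shows a branded error page while still answering with a 404 status, with no redirect involved. It uses Python's built-in http.server purely for illustration. Your real site will run on something else, and the page contents, handler name, and make_server helper are all hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical site content: one real page and one branded error page.
PAGES = {"/": b"<html><body><h1>Welcome</h1></body></html>"}
ERROR_PAGE = b"<html><body><h1>Sorry, we could not find that page.</h1></body></html>"


class SiteHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in PAGES:
            self._send(200, PAGES[self.path])
        else:
            # Friendly error page, but the status line still says 404,
            # so a crawler asking for a missing robots.txt gets "not found".
            self._send(404, ERROR_PAGE)

    def _send(self, status, body):
        self.send_response(status)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=0):
    """Bind to localhost only; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), SiteHandler)
```

Calling make_server(8000).serve_forever() and then running wget against localhost:8000/robots.txt should report 404 Not Found, while a browser visiting the same address still shows the friendly message.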
As an aside, I would recommend, for PageRank purposes, that you be very discriminating in your use of 301 redirects vs 302 redirects. They are not interchangeable, particularly when Google has specifically stated that they will only index the original content, not the redirected content, for a 302.
I didn’t mention Google Sitemaps above on purpose. If you have the described robots.txt problem, creating a Sitemap with URLs to crawl will not help you. You will see an OK status, but 0 URLs crawled and 0 indexed.