Jeff
10-03-2005, 03:10 AM
Q. What is Google Sitemaps?
A. From http://www.google.com/webmasters/sitemaps/docs/en/about.html:
About Google Sitemaps
Search engines such as Google discover information about your site by employing software known as "spiders" to crawl the web. Once the spiders find a site, they follow links within the site to gather information about all the pages. The spiders periodically revisit sites to find new or changed content.
Google Sitemaps is an experiment in web crawling. By using Sitemaps to inform and direct our crawlers, we hope to expand our coverage of the web and speed up the discovery and addition of pages to our index.
If your site has dynamic content or pages that aren't easily discovered by following links, you can use a Sitemap file to provide information about the pages on your site. This helps the spiders know what URLs are available on your site and about how often they change.
A Sitemap provides an additional view into your site (just as your home page and HTML site map do). This program does not replace our normal methods of crawling the web. Google still searches and indexes your sites the same way it has done in the past whether or not you use this program. A Sitemap simply gives Google additional information that we may not otherwise discover. Sites are never penalized for using this service. This is a beta program, so we cannot make any predictions or guarantees about when or if your URLs will be crawled or added to our index. Over time, we expect both coverage and time-to-index to improve as we refine our processes and better understand webmasters' needs.
Also, you can submit updated Sitemaps as your URLs change, but you don't have to, as the spiders will periodically revisit your site (and will use the frequency information you provide in your Sitemap as one of the factors in how often they revisit) and look for new pages.
In fewer words, Google's Sitemaps program is a way to get updates for your website out to the Internet faster. The traditional way is to wait for the search engines' robots to crawl your website, checking for new or updated material. With Sitemaps, you can notify Google of updates immediately instead of waiting for the indexing spiders to crawl your site.
There is quite a bit of information on getting things set up, but fortunately there is very little work that needs to be done. This HOWTO will explain how to complete step 1 of the following (again, from http://www.google.com/webmasters/sitemaps/docs/en/about.html):
Participating is easy
You can participate in the Google Sitemaps program by following these basic steps:
1. Creating a Sitemap (http://www.google.com/webmasters/sitemaps/docs/en/overview.html) in a supported format.
2. Submitting that Sitemap (http://www.google.com/webmasters/sitemaps/docs/en/submit.html) to Google.
3. Updating your Sitemap (http://www.google.com/webmasters/sitemaps/docs/en/submit.html#ping) when your site changes.
There are multiple ways to create your sitemap. You can find information on doing so here: Creating a Sitemap (http://www.google.com/webmasters/sitemaps/docs/en/overview.html). The method we are going to focus on is the one outlined here: Google Sitemap Generator (http://www.google.com/webmasters/sitemaps/docs/en/sitemap-generator.html).
Google suggests connecting to your website via SSH (http://www.google.com/search?q=ssh) in order to run the setup file from the command line. However, there is an alternate option available to you via cPanel: cron (http://www.google.com/search?q=cron).
First, let's configure our config.xml file. Note: this assumes that you have already downloaded the Sitemap Generator program files (see here: http://www.google.com/webmasters/sitemaps/docs/en/sitemap-generator.html).
Here I am going to provide an example of a working config.xml so you can see just how many options aren't needed to get set up, and to show you an example of what the correct paths should resemble:
<?xml version="1.0" encoding="UTF-8"?>
<site
  base_url="http://www.testmyports.com/"
  store_into="/home/jeff/www/testmyports/sitemap.xml.gz"
  verbose="1"
  >
  <directory
    path="/home/jeff/www/testmyports"
    url="http://www.testmyports.com/"
    default_file="index.php"
  />
  <url href="http://www.testmyports.com/" />
  <filter action="drop" type="wildcard" pattern="*~" />
  <filter action="drop" type="regexp" pattern="/\.[^/]*" />
</site>
All you really need are the <site>, <directory>, and <url> XML tags. The <filter> tags are optional, but a good idea to leave in. The ones shown in the example above are unedited and were copied directly from the example config.xml that is provided in the package.
NOTE 1: Notice the trailing "/" at the end of every line that contains "www.testmyports.com" - it is required. The "http://" portion is also required.
NOTE 2: The example config.xml file above is for an Addon Domain called "testmyports.com". As such, the web root for that website is /home/myUserName/www/mySiteName. If you are not making a sitemap for an Addon Domain, then your web root is simply /home/yourUserName/www/.
NOTE 3: The above config.xml is very basic, and makes use of only the options required to generate a sitemap. More options are available to you in the sample config.xml file and are explained in depth here (http://www.google.com/webmasters/sitemaps/docs/en/protocol.html). They will not be covered here.
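If you want to double-check NOTE 1 before uploading, here is a minimal sketch in Python (run on your own machine) that reads a config.xml and complains when base_url or a <directory> url is missing the "http://" prefix or the trailing "/". The function name check_config_urls is made up for this example and is not part of the Sitemap Generator package; it only looks at those two attributes.

```python
import xml.etree.ElementTree as ET


def check_config_urls(config_bytes):
    """Return a list of complaints about URL attributes in a config.xml.

    Hypothetical helper for illustration; not part of sitemap_gen.py.
    """
    site = ET.fromstring(config_bytes)

    # Attributes that, per NOTE 1, need "http://" and a trailing "/".
    found = [("<site base_url>", site.get("base_url"))]
    for directory in site.findall("directory"):
        found.append(("<directory url>", directory.get("url")))

    problems = []
    for label, value in found:
        if value is None:
            problems.append(label + " is missing")
            continue
        if not value.startswith("http://"):
            problems.append(label + " does not start with http://")
        if not value.endswith("/"):
            problems.append(label + " does not end with /")
    return problems
```

To use it, read your file as bytes, for example check_config_urls(open("config.xml", "rb").read()). An empty list means neither attribute tripped the two checks; it does not prove the rest of the config is correct.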
After you have created your config.xml file, you will need to place it, along with the sitemap_gen.py file, on the server. You can upload these files via FTP. Make sure you are using ASCII mode to transfer the files - not binary. Do not use the cPanel File Manager, as it places the files on the server in binary format; if you created or edited config.xml in, say, Notepad, the copy on the server will keep its win32-style line endings. FrontPage has been known to cause similar issues. No matter how you transfer the files, the mode must be ASCII, not binary.
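If you are not sure whether a file picked up win32-style line endings, a quick local check is possible. The sketch below is only an illustration of the idea; the two function names are made up for this example, and it works on the raw bytes of a file rather than on anything specific to the Sitemap Generator.

```python
def has_windows_line_endings(data):
    """Return True if the bytes contain at least one CRLF pair."""
    return b"\r\n" in data


def to_unix_line_endings(data):
    """Convert CRLF (and any stray CR) line endings in bytes to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

For example, has_windows_line_endings(open("config.xml", "rb").read()) tells you whether the file on your machine was saved with CRLF endings. This does not replace the advice above about using ASCII mode for the transfer.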
Now, you need to generate the sitemap file by running sitemap_gen.py and passing the config.xml file as an argument. We are also going to pass the --testing argument until we are sure everything is working properly.
After uploading config.xml and sitemap_gen.py to the appropriate directory, and after logging into cPanel and clicking the "Cron jobs" link, you are ready to run your first practice test at generating a sitemap.
1. From the Cron jobs link, click "Standard".
2. Make sure your correct email address is listed in the box at the top.
3. Under "Minute(s)", "Hour(s)", "Month(s)", "Day(s)", and "Weekday(s)", make sure the top option is selected for each one (i.e., Every Minute, Every Hour, and so on).
4. In the box that says "Command to run:", enter the following:
python www/sitemap_gen.py --config=www/config.xml --testing
The above command assumes you have placed sitemap_gen.py and config.xml in your www/ directory.
5. Click "Save Crontab"
6. Wait 1 minute or less and you should receive an email with the output.
7. While you wait, click "Go Back"
8. Click "Standard"
9. After your email comes in, click "Delete"
Do NOT forget to delete your crontab!
If you forget to delete your crontab, you are going to be generating a sitemap file every minute. This will prevent your sitemap from working as expected. To recap: after creating the cron job, surf back to the cron jobs portion of cPanel, and be prepared to delete the cron job after your email comes in.
If everything goes well, your email should look similar to the following:
Reading configuration file: www/config.xml
Walking DIRECTORY "/home/jeff/www/testmyports/"
Sorting and normalizing collected URLs.
Writing Sitemap file "/home/jeff/www/testmyports/sitemap.xml.gz" with 13 URLs
Search engine notification is suppressed.
Count of file extensions on URLs:
1 (no extension)
1 .css
5 .html
1 .ini
2 .php
1 .png
2 /
Number of errors: 0
Number of warnings: 0
Note 1: The line that says "Search engine notification is suppressed." is due to passing the --testing argument. This is ok for now since we are doing just that - testing.
Notice the Number of errors and Number of warnings at the end of the output. Both should be 0.
I recommend downloading your sitemap.xml.gz file (notice its location in the "Writing Sitemap file" line above), uncompressing it, and viewing the contents. It will look similar to the following:
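If you save the email text and would rather check those two counts with a script than by eye, something like the following sketch would do. It assumes the output contains lines shaped exactly like the "Number of errors: 0" and "Number of warnings: 0" lines in the sample above; read_counts is a name made up for this example.

```python
import re


def read_counts(output_text):
    """Pull the error and warning counts out of saved sitemap_gen output.

    Returns a dict; a count is None if its line was not found.
    """
    counts = {}
    for kind in ("errors", "warnings"):
        match = re.search(r"Number of %s:\s*(\d+)" % kind, output_text)
        counts[kind] = int(match.group(1)) if match else None
    return counts
```

A result of {"errors": 0, "warnings": 0} matches the clean run shown in the sample email.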
<?xml version="1.0" encoding="UTF-8"?>
<urlset
  xmlns="http://www.google.com/schemas/sitemap/0.84"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84
  http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
  <url>
    <loc>http://www.testmyports.com/</loc>
    <lastmod>2005-08-14T16:47:25Z</lastmod>
    <priority>1.0000</priority>
  </url>
  <url>
    <loc>http://www.testmyports.com/advanced.html</loc>
    <lastmod>2005-08-14T16:37:21Z</lastmod>
    <priority>0.5000</priority>
  </url>
  <url>
    <loc>http://www.testmyports.com/cgi-bin/</loc>
    <lastmod>2005-10-02T02:43:42Z</lastmod>
    <priority>0.5000</priority>
  </url>
  <url>
  ...
</urlset>
As you can see, every file found under the www/testmyports/ directory has been placed into this file as a URL. This is controlled by the following section in the config.xml above:
<directory
  path="/home/jeff/www/testmyports"
  url="http://www.testmyports.com/"
  default_file="index.php"
/>
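If you would rather list the URLs in the downloaded sitemap.xml.gz than uncompress and read it by hand, here is a minimal Python sketch. It assumes the file uses the 0.84 namespace shown in the sample above; list_sitemap_urls is a name made up for this example, and the path you pass in is wherever you saved your copy.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute in the sample sitemap above.
SITEMAP_NS = "{http://www.google.com/schemas/sitemap/0.84}"


def list_sitemap_urls(path):
    """Return the text of every <loc> element in a gzipped sitemap file."""
    with gzip.open(path, "rb") as handle:
        root = ET.parse(handle).getroot()
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]
```

Printing the returned list, one URL per line, makes it easy to spot files you did not mean to expose (the .ini file counted in the sample output, for instance) and to add a <filter> for them.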
If you receive errors or warnings, fix them and try again. Once you have a working sitemap file, simply rerun the cron job one more time without the --testing option (wait for the email, and delete the cron job just like before). With --testing removed, you will no longer see "Search engine notification is suppressed." Instead, you will see the following in the output:
Notifying search engines.
Notifying: www.google.com
Congratulations, you have created your first sitemap! Now you are ready to move on to steps 2 and 3 of the following which are not covered here:
Participating is easy
You can participate in the Google Sitemaps program by following these basic steps:
1. Creating a Sitemap (http://www.google.com/webmasters/sitemaps/docs/en/overview.html) in a supported format.
2. Submitting that Sitemap (http://www.google.com/webmasters/sitemaps/docs/en/submit.html) to Google.
3. Updating your Sitemap (http://www.google.com/webmasters/sitemaps/docs/en/submit.html#ping) when your site changes.
Feel free to post questions, comments, suggestions, or any other type of feedback in this thread.