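I saw http://www.v2ex.com/t/162904
One way to deal with that situation is to set up a dedicated page for crawlers that returns junk data.

But I wouldn't do something that hurts others without benefiting me.

You could just as well use this behavior to build your own private lookup database.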

./robots.txt:
User-agent: *
Disallow: /shegongku/
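As a minimal sketch of what these two rules ask of a well-behaved crawler, the snippet below feeds them to Python's standard `urllib.robotparser`. The user agent name `ExampleBot` and the helper `is_allowed` are made up for illustration. It only shows how a compliant crawler would read the rules, not what any particular search engine actually does.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /shegongku/
"""


def is_allowed(path, user_agent="ExampleBot"):
    """Return True if a crawler that honors robots.txt may fetch `path`.

    `ExampleBot` is a hypothetical user agent used only for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(is_allowed("/shegongku/index.html"))  # inside the disallowed directory
    print(is_allowed("/index.html"))            # outside it
```

The whole scheme in the post relies on a crawler ignoring this answer and indexing the directory anyway.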

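./shegongku/index.html:
// Put the index of whatever you need to look up here, so the lookups don't use your own server's resources
// Encrypting it is recommended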

To use it:

Just go to http://www.haosou.com/s?q=site:{yourhost} inurl:shegongku {yourkeyword}
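The query URL above can be assembled with proper URL encoding, roughly as follows. This is a sketch only: the function name `build_site_query` and its parameters are hypothetical, and it assumes the search engine accepts the `site:` and `inurl:` operators the way the post describes.

```python
from urllib.parse import urlencode


def build_site_query(host, keyword, path_fragment="shegongku",
                     search_url="http://www.haosou.com/s"):
    """Build the search URL from the post for a given host and keyword.

    Assumes (as the post does) that the engine supports `site:` and `inurl:`.
    """
    query = f"site:{host} inurl:{path_fragment} {keyword}"
    # urlencode percent-encodes ':' and turns spaces into '+'.
    return f"{search_url}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(build_site_query("example.com", "foo"))
```

Encoding matters once the keyword contains spaces, `&`, or non-ASCII characters, which would otherwise break the query string.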

10 replies • 2015-01-20 13:31:05 +08:00

vibbow

2015-01-18 12:18:32 +08:00

And then your site gets banned from the search engine.

14ly

2015-01-18 12:20:56 +08:00

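@vibbow Let them ban it. My ./robots.txt already says access is not allowed, so getting dropped by a crawler that doesn't follow the rules is all the better.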

vibbow

2015-01-18 12:23:11 +08:00

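@14ly If I remember correctly, Disallow doesn't mean the spider stops crawling.
Google and the like will still crawl the page and analyze its outbound links; they just won't index the content.
(I think the title still gets indexed.)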

vibbow

2015-01-18 12:25:59 +08:00

However, robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant. If you wish to explicitly block a page from being indexed, you should instead use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, you should not disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed.
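To illustrate the X-Robots-Tag option from the quote, here is a minimal sketch using Python's standard `http.server`. The handler name, the page body, and the `demo` helper are made up for this example, and a real site would normally set the header in its web server configuration instead.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

ROBOTS_HEADER = ("X-Robots-Tag", "noindex, nofollow")


class NoIndexHandler(BaseHTTPRequestHandler):
    """Serve one placeholder page and ask crawlers not to index it."""

    def do_GET(self):
        body = b"<html><head><title>private</title></head><body>demo</body></html>"
        self.send_response(200)
        self.send_header(*ROBOTS_HEADER)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo():
    """Start the server on a free local port, fetch once, return the header."""
    server = HTTPServer(("127.0.0.1", 0), NoIndexHandler)
    thread = threading.Thread(target=server.handle_request, daemon=True)
    thread.start()
    host, port = server.server_address
    with urlopen(f"http://{host}:{port}/") as resp:
        value = resp.headers.get("X-Robots-Tag")
    thread.join()
    server.server_close()
    return value


if __name__ == "__main__":
    print(demo())
```

As the quote notes, the page must stay crawlable for this header to be seen, so it should not also be listed under Disallow in robots.txt.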

14ly

2015-01-18 12:31:33 +08:00

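@vibbow You're right, you need to add <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> or the X-Robots-Tag HTTP header. Anyone who wants to try this, take note.
Also, I can't find the append option anymore. I looked through my earlier posts and they clearly have appends.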