darren hobbs: distributed lucene

Interesting article by Mark Harwood here regarding distributed Lucene indexes. Using distributed indexes is how Google achieves its scalability, I believe, but they are a fairly special case. If scalability in the sense of concurrent users is the issue, I tend to favour multiple identical boxes, each holding a full copy of the index, behind a load balancer with an RPC frontend. The frontend can be as simple as a servlet, or you can use SOAP, XML-RPC, etc. (Possibly RMI, although I've never tried that across a load balancer.) Doing things this way is probably a lot simpler to manage than splitting your indexes across boxes, and it means that even if your queries are asymmetric (i.e. 85% of the queries are for the same thing), the load can still be spread fairly evenly, because any box can answer any query. Reliability comes for free as well: if a box dies, just stop sending requests to it. Given Lucene's performance (it has been used to index collections of more than 10 million documents), it's pretty unlikely that your dataset will get so large that sheer size starts to affect your query times. Unless, of course, you are Google :)
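To make the "identical boxes behind a load balancer" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any particular load balancer or Lucene frontend works. The class name `RoundRobinSearcher`, the box names, and the stub backends are all hypothetical; in a real setup each backend would be an RPC call to a servlet (or similar) that runs the query against its own full copy of the index.

```python
import itertools
from typing import Callable, Dict, List

# A backend is any callable that takes a query string and returns hits.
# In practice this would wrap an HTTP/XML-RPC call to one search box (assumption).
Backend = Callable[[str], List[str]]


class RoundRobinSearcher:
    """Rotate queries across identical search boxes, skipping dead ones."""

    def __init__(self, backends: Dict[str, Backend]):
        self._backends = backends
        self._names = list(backends)
        self._counter = itertools.count()
        self.dead = set()  # names of boxes we have stopped sending requests to

    def search(self, query: str) -> List[str]:
        # Each call starts at the next box in the rotation, so even identical
        # queries are spread across all live boxes.
        start = next(self._counter)
        for offset in range(len(self._names)):
            name = self._names[(start + offset) % len(self._names)]
            if name in self.dead:
                continue
            try:
                return self._backends[name](query)
            except ConnectionError:
                # The box died: mark it and try the next one.
                self.dead.add(name)
        raise RuntimeError("no live search boxes")


def make_box(name: str) -> Backend:
    """Stub backend standing in for a healthy search box (hypothetical)."""
    def run(query: str) -> List[str]:
        return [f"{name}:{query}"]
    return run


def broken_box(query: str) -> List[str]:
    """Stub backend standing in for a box that has gone down (hypothetical)."""
    raise ConnectionError("box is down")
```

For example, `RoundRobinSearcher({"a": make_box("a"), "b": broken_box, "c": make_box("c")})` would answer the same query from box `a`, then fall through from the dead box `b` to box `c`, and from then on never contact `b` again. A production balancer would also want health checks so that a recovered box can rejoin the rotation, which this sketch leaves out.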
