Collecting search results
To collect hits, the Lucene subsystem expects a collector that is initialized with an appropriate size. The amount of memory allocated is directly proportional to the number of documents to be collected:
Size = N * (Top + Skip)
where N is the size of an internal Lucene object. If Top and Skip are given, they define the amount of memory to be allocated. The size of the collector needs to be determined even if the Content Query does not define the TOP value (e.g. a children query). In this case the amount of memory to allocate should be defined by the number of documents in the Lucene index that match the query. This is hard to guess or expensive to find out, so a different approach is required here.
We could allocate a memory chunk large enough for every possible scenario, but if the web server is heavily used (lots of requests, a large amount of cached data etc.), allocating a large amount of memory is an expensive (sometimes impossible) operation. According to our measurements, when the web server is under heavy load, several queries with smaller collectors are faster than a single query with a big collector. Therefore, we use multi-round queries with a small but proportionally growing "Top" value. This performs well in most cases and can also be used when the result set is big (the expectation is that queries with big result sets are rare). These "Top" values are configured in the web.config:
<add key="DefaultTopAndGrowth" value="100,1000,10000,0"/>
This is also the default, so the values above are used if the configuration is missing. Notes on the configuration:
- Values are separated by commas.
- Every value is an integer.
- Every value must be greater than the preceding one, except for a zero in the last position.
- The last value can be zero.
If the last value is zero, it is substituted with the maximum value for "Top". The best setting depends on how the repository is used (the total number of content items, the average number of children, how often content queries are used etc.), so the default setting should be tuned in the test phase before live deployment.
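The configuration rules above can be illustrated with a small Python sketch. It is not the product's own parser: the function name `parse_top_and_growth` is made up for illustration, and `MAX_TOP` is an assumed placeholder because the text does not state the actual maximum value for "Top".

```python
# Hypothetical upper bound for "Top"; the real maximum is not given in the text.
MAX_TOP = 2**31 - 1

# Default taken from the web.config example above.
DEFAULT_TOP_AND_GROWTH = "100,1000,10000,0"


def parse_top_and_growth(raw=None):
    """Parse a DefaultTopAndGrowth-style value into a list of "Top" steps."""
    if raw is None or not raw.strip():
        raw = DEFAULT_TOP_AND_GROWTH  # a missing setting falls back to the default

    # Values are separated by commas and every value must be an integer.
    try:
        values = [int(part.strip()) for part in raw.split(",")]
    except ValueError as exc:
        raise ValueError(f"every value must be an integer: {raw!r}") from exc

    # Only the last value may be zero; it stands for the maximum "Top".
    if values[-1] == 0:
        values[-1] = MAX_TOP

    # Every value must be greater than the preceding one (and greater than zero).
    previous = 0
    for value in values:
        if value <= previous:
            raise ValueError(
                f"every value must be greater than the preceding one: {raw!r}"
            )
        previous = value
    return values
```

With the default setting, `parse_top_and_growth("100,1000,10000,0")` yields the three configured steps followed by `MAX_TOP`, while a value such as `"100,50"` is rejected.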
Even though the above algorithm optimizes query execution time and memory consumption for queries that do not include the TOP keyword, it is still strongly advised to use TOP whenever you can estimate the maximum or the necessary number of documents for a query.
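The multi-round approach for queries without TOP can be sketched roughly as follows. This is a simplified Python illustration, not the actual implementation: `run_query` is a hypothetical callable that returns at most `top` hits, and the stopping rule (stop as soon as a round returns fewer hits than its "Top" value) is an assumption, since the text does not spell it out.

```python
def collect_without_top(run_query, steps):
    """Re-run the query with growing "Top" values until the result set fits.

    run_query(top) is a hypothetical callable returning at most `top` hits.
    """
    hits = []
    for top in steps:
        hits = run_query(top)
        # Fewer hits than the collector could hold: nothing was cut off.
        if len(hits) < top:
            break
    return hits


# Toy stand-in for the index: 2500 matching documents.
matching = list(range(2500))
probed = []


def fake_query(top):
    probed.append(top)  # record which "Top" values were tried
    return matching[:top]


result = collect_without_top(fake_query, [100, 1000, 10000, 2**31 - 1])
print(len(result), probed)  # prints: 2500 [100, 1000, 10000]
```

In this toy run the first two rounds use small collectors and are discarded, and the third one is large enough to hold the whole result set. A query with fewer than 100 matches would finish in the first round.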
Collecting search results with permissions
It is important to note that a permission check is executed before the result set is returned. This means that if there are 50 documents in the results, but the current user has permission to view only 30 of them, the final result set will contain 30 documents. This can get tricky when TOP is given in the query: if we set TOP to 20 in the previous example and the query returns only 15 documents (this is possible if the current user can access only 15 of the first 20 documents), then the query will be re-executed with a higher TOP value. This re-execution is carried out using the algorithm described in the Collecting search results section above: the next TOP value defined in the web.config is probed until the demanded TOP number of documents is returned, or until all documents that the user has permission to see are returned.
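The example above (50 matching documents, 30 of them visible, TOP set to 20) can be replayed with a short Python sketch. The names `query_with_permissions`, `run_query` and `can_see` are hypothetical, and the sketch assumes that "the next TOP value" means the next configured value that is larger than the requested TOP.

```python
def query_with_permissions(run_query, can_see, top, steps):
    """Return up to `top` hits the user may see, re-running the query as needed.

    run_query(n) and can_see(doc) are hypothetical callables; `steps` are the
    configured "Top" values in ascending order.
    """
    # Probe the requested TOP first, then every configured value above it.
    probes = [top] + [s for s in steps if s > top]
    visible = []
    for probe in probes:
        raw_hits = run_query(probe)
        visible = [doc for doc in raw_hits if can_see(doc)]
        enough = len(visible) >= top
        exhausted = len(raw_hits) < probe  # the index has no more matches
        if enough or exhausted:
            break
    return visible[:top]


# 50 matches, 30 visible, but only 15 visible among the first 20.
docs = list(range(50))
allowed = set(range(15)) | set(range(20, 35))
rounds = []


def fake_query(n):
    rounds.append(n)  # record each probed TOP value
    return docs[:n]


result = query_with_permissions(
    fake_query, lambda doc: doc in allowed, 20, [100, 1000, 10000]
)
print(len(result), rounds)  # prints: 20 [20, 100]
```

The first round with TOP 20 yields only 15 visible documents, so the query is repeated with 100, which is enough to return the 20 demanded documents. If the user asked for 40, the second round would exhaust the index and return all 30 visible documents.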