LDN *CODE STORM* 09 - Portal Search
I think we can all agree that search is a badly needed feature. This is the best way to locate content, and it sure beats the alternative of clicking through each tab to find something.
My questions regard the implementation of such a search. I suspect the reason that Sungard doesn't offer this feature is that it isn't something that can be tackled quickly or easily.
The first thing to tackle, I suppose, is establishing the scope of the search. Getting channel titles is easy. There is no built-in mechanism for channel descriptions or tagging (though there are categories), so matching on a title alone is pretty hit and miss.
The next level of searching would involve looking inside the channels. Sure, we can query the contents of /opt/luminisshared/site/content/* to find particular pieces of data, but this would only work for a select few types of channels (namely the Targeted Content Channel). Other channel types and data pulled in after the page loads (via AJAX for example) would not be searched.
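As a rough sketch of that file-level approach, the Python below walks a content directory and returns the files that contain a search term. The function name and the plain substring match are assumptions made for illustration; nothing here is a Luminis API, and it only covers content that actually lives on disk.

```python
from pathlib import Path


def search_content_files(root, term):
    """Return files under `root` whose text contains `term` (case-insensitive)."""
    needle = term.lower()
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Ignore undecodable bytes so binary files do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it
        if needle in text.lower():
            matches.append(path)
    return matches
```

A call such as `search_content_files("/opt/luminisshared/site/content", "financial aid")` would scan every file on each search, which is why it only suits a small content tree.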
The next level would include all channels, and all content inside those channels. We can procedurally crawl the DOM after the channels all load, and start hiding divs (a la the old Netvibes search). This, however, would only return results for the channels the user has on the current tab. I suppose we could procedurally log in on behalf of a user, search all channels using "curl" or something similar, and return a list of matching channels that are not on the current tab. This involves added complexities, however, such as determining which channel parts a user has access to view.
As long as we are searching though, why stop at channels? We can search announcements, group studio, and course studio content. With this much data however, performance becomes a factor, and we may have to move to an indexed provider such as Sphinx.
Does anyone have any ideas on how to tackle such an ambitious search? Input is welcome!

Search Problems
Hi Ben,
We've been stumped on a rather narrow definition of your search types for a while (just searching portal content, but we have a considerable amount of this since we use Luminis as our intranet for regular pages, not just channel-based layouts).
Jon's channel search from a couple of years back is one of the better search implementations I've seen around, but it seems to be largely a manual process. I was hoping at one point to automatically crawl portal pages similar to the way most normal web page indexing works, but we've hit a couple of conceptual hurdles.
If you run a web crawler, you need one that can handle authentication. Not a big deal. So we figured we would create an account for the crawler. But what role should that account have? We thought it should basically be given everything so it could see everything. This leads to problems, of course: after you've indexed your site, how does the search engine know to show you only the search results that are valid for your (the end user's) role or combination of roles? Essentially, this means that the crawler needs to index role-access information along with the content it's retrieving. That information isn't available from a client perspective, which leaves me questioning whether the whole crawler paradigm is even possible. It's clear from this that it would be very difficult to use any off-the-shelf search technology, and that whatever search is developed would need to be tightly integrated into Luminis's core infrastructure. (Either that, or the engine would have to return crawler results regardless of role, and contain no descriptions along with the returned URLs, to maintain content access restrictions.)
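To make the role problem concrete, here is a minimal Python sketch of what a role-aware index would need: each indexed page carries the set of roles allowed to see it, and results are filtered against the searching user's roles. It assumes the role information can be obtained somewhere server-side, which is exactly the part a client-side crawler can't supply. `IndexedPage`, `allowed_roles`, and `search` are hypothetical names, not part of Luminis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedPage:
    url: str
    text: str
    allowed_roles: frozenset  # hypothetical: roles permitted to view this page


def search(index, term, user_roles):
    """Return URLs that match `term` and that at least one user role may view."""
    needle = term.lower()
    roles = set(user_roles)
    return [
        page.url
        for page in index
        if needle in page.text.lower() and roles & page.allowed_roles
    ]


index = [
    IndexedPage("/budget", "Annual budget report", frozenset({"finance"})),
    IndexedPage("/news", "Campus budget news", frozenset({"student", "employee"})),
]
student_results = search(index, "budget", {"student"})
```

With this sample data a user holding only the `student` role gets `/news` back and never learns that `/budget` exists.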
Brian
My Search Limitations and ideas for expansion
My search, like Brian mentioned, is a manual process you need to keep up with constantly, especially if you're adding new channels all the time.
I had an idea at one point, thinking that Luminis would have to be indexed like Google. Putting user permissions aside for right now, a process would spin through all the defined channels, figure out where the heck the content is, pull it in, and index it for searching.
At any rate, due to the way Luminis stores content, I don't think an on-demand search would work well (I may be wrong). I think it may be too slow.
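A toy version of that indexing pass might look like the Python below. It assumes the channel content has already been pulled into a dict of channel name to text (the hard part, and not shown), and it builds a plain inverted index. `build_index` and `lookup` are illustrative names only.

```python
import re
from collections import defaultdict


def build_index(channels):
    """Map each lowercase word to the set of channel names containing it."""
    index = defaultdict(set)
    for name, content in channels.items():
        for word in re.findall(r"[a-z0-9]+", content.lower()):
            index[word].add(name)
    return index


def lookup(index, query):
    """Return the channels that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

The index is built once by the background process, so a search becomes a few set lookups instead of a scan over every channel.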
Now toss in user permissions: somehow the results would have to be filtered based on what the user had permission to see. So I couldn't see some budget channel, or a file uploader we created for financial aid.
Maybe there's an exclusion list as well, since there are probably some channels we simply would not want the user to find. Thoughts?
-Jon
Some additional thoughts
First, I am pretty sure that the search is going to have to be in real time. The reason I say this is that the content in our Portal is ABSOLUTELY specific to the individual. We have AJAX requests, including RSS content, that are built and called on the fly based on a user's roles. Even worse, they build off the roles in an arbitrary fashion, for example "employees, but not staff". Therefore, the time and space it would take to build each combination of roles beforehand would be impractical. Also, these URLs are built procedurally, so they would be tough for a web crawler to detect.
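To put a number on that, here is a small Python sketch. `RoleRule` is a made-up way of expressing a rule like "employees, but not staff", evaluated for one user at request time, and `possible_role_sets` gives the upper bound on distinct role combinations if every subset of roles could occur (an assumption; a real portal will use far fewer).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleRule:
    required: frozenset                # roles the user must hold
    excluded: frozenset = frozenset()  # roles the user must not hold

    def matches(self, user_roles):
        roles = set(user_roles)
        return self.required <= roles and not (self.excluded & roles)


def possible_role_sets(role_count):
    """Upper bound on distinct role combinations: every subset of the roles."""
    return 2 ** role_count


employees_not_staff = RoleRule(frozenset({"employee"}), frozenset({"staff"}))
```

Checking one rule for one user is cheap, while precomputing results for every combination grows as 2^n: 10 roles already allow 1,024 combinations and 20 roles allow 1,048,576.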
Second, we have already started to tackle how to determine who has access to what as far as TCC channels go. This "adventure" is documented here. If we calculate the list of TCC channels a user has access to, we can scan the content for matching terms. I think determining permissions will be necessary no matter the approach taken.
Third, partial updates (AJAX, etc.) are going to be a BIG problem. Suppose I just have a TCC section that reads "new Ajax.Request('/somewhere_else.html');". There goes any useful searching, unless I actually follow that link and see what it holds. Now, suppose what the request returns is a bunch of other links to yet more pages. This is where a web crawler would come into play. We would have to be careful, however, to follow only non-destructive URL calls (i.e. GET methods). However, IE doesn't play nice with AJAX GET requests, so we have converted almost all methods over to POST. What's good and what's bad to follow may not be possible to determine. (Although I can't think of any destructive URLs that are not forms.)
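As a sketch of the "follow the link and see what it holds" step, the Python below pulls the URL out of each `new Ajax.Request('...')` call with a regular expression and follows the results breadth-first to a fixed depth. The `fetch` callable is a stand-in supplied by the caller; deciding which URLs are safe to request is left to it, since that is the unsolved part. The regex only catches URLs written as string literals, so procedurally built URLs would still be missed.

```python
import re

# Matches: new Ajax.Request('<url>'  or  new Ajax.Request("<url>"
AJAX_CALL = re.compile(r"""new\s+Ajax\.Request\(\s*(['"])(.+?)\1""")


def find_ajax_urls(content):
    """Return the literal URLs passed to Ajax.Request in `content`."""
    return [m.group(2) for m in AJAX_CALL.finditer(content)]


def crawl(start_content, fetch, max_depth=2):
    """Follow Ajax.Request URLs breadth-first; `fetch(url)` returns page text."""
    seen = set()
    pages = {}
    frontier = find_ajax_urls(start_content)
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            if url in seen:
                continue  # already fetched, avoid loops
            seen.add(url)
            text = fetch(url)
            pages[url] = text
            next_frontier.extend(find_ajax_urls(text))
        frontier = next_frontier
    return pages
```

The `seen` set and the depth limit keep pages that link back to each other from sending the crawler in circles.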