Sunday, October 20, 2019

Google your Desktop or Laptop

Google is synonymous with search today. Searching the Internet is simply called Googling!

Can the same be done with the files and contents on your own laptop, which are not on the Internet? Would you like a private Google-like engine to search your desktop or laptop, given that Windows, Mac and Linux do not provide satisfactory search features? They have improved a lot in recent times, yet they are still not as good as what Google does for the Internet.

Google Desktop Application

Google released a desktop application for Windows in 2004, and later released it for Mac and Linux systems. I used it between 2006 and 2010 on my Windows desktop and laptop. It was a productive application, with much better and faster search for my source code (than IDEs), emails (than Outlook), documents (than Windows search) and images (image file names only, at that point).

Search Features

The search was feature rich - it allowed many criteria, filters and time ranges. Google Desktop search returned a better result set than the default OS-specific search applications. I still use AND, OR and NOT conditions to match exactly and trim down search results. I used such conditions effectively to search my source code, at a time when IDEs were neither fast nor good at searching JS or Java files.
I could also search MS Word and MS Excel files more precisely, using filters like 'title:<text_in_title>' for the title of a document, 'name:' for the file name and content of the documents, 'subject:<text in subject>' for email search, 'type:java|javascript' for file types etc.
I could also search for files that were modified within the last 10 days, or created a year ago, using range ('after:', 'before:') queries.
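
To make the filter idea concrete, here is a minimal sketch in plain Python of how a query with 'field:value' filters could be split and applied. It is not how Google Desktop worked internally; the filter names, the parse_query and matches helpers, and the dictionary layout of a document are all assumptions made for illustration, and only AND semantics are handled (no OR or NOT).

```python
import re
from datetime import date

# Filter names are an assumption, modelled on the examples in this post.
FILTERS = {"title", "name", "subject", "type", "after", "before"}


def parse_query(query):
    """Split a query string into free-text terms and field filters."""
    terms, filters = [], {}
    # A token is either field:"quoted value" or a run of non-space characters.
    for token in re.findall(r'\w+:"[^"]*"|\S+', query):
        field, sep, value = token.partition(":")
        if sep and field in FILTERS:
            value = value.strip('"')
            if field in ("after", "before"):
                value = date.fromisoformat(value)  # expects YYYY-MM-DD
            filters[field] = value
        else:
            terms.append(token)
    return terms, filters


def matches(doc, terms, filters):
    """Return True if a document (a plain dict) satisfies every term and filter."""
    text = doc["content"].lower()
    if not all(term.lower() in text for term in terms):
        return False
    if "title" in filters and filters["title"].lower() not in doc["title"].lower():
        return False
    if "type" in filters and doc["type"] != filters["type"]:
        return False
    if "after" in filters and doc["modified"] <= filters["after"]:
        return False
    if "before" in filters and doc["modified"] >= filters["before"]:
        return False
    return True
```

A query such as budget title:"annual report" type:xlsx after:2019-01-01 would then yield one free-text term and three filters, all of which must hold for a document to match.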

Try searching for emails with 'after and before' filters in Gmail.

One can discover these features by Googling as well. Or try the Advanced Search section of Google.com.

Local Search

Since the Google Desktop application indexed your files on the laptop/desktop itself, the search happened entirely within your machine. The application used little CPU and RAM relative to the features and productivity it provided. However, disk usage would rise depending on which flags (such as storing a history of content) were enabled. I did not enable history storage, which would have let me get back content if a file was changed or deleted, as that would add to the storage size; Google Desktop did, however, provide a way to compress the content to reduce storage size.
The biggest benefit was that search results were super fast - faster than searching the Internet with Google, as the index data was stored locally.

How to Use?

The Google Desktop application is available to download from Informer.com. But be cautious: there is no support for any issues or security vulnerabilities it might have.
Lookeen.com provides a similar alternative, with a 14-day trial after which it has to be bought - I haven't tried it as of 19-Oct-2019.

How to Build one!?

Apache Lucene is a high-performance, full-featured text search engine library, written in Java in 1999. The library is also available for other languages - .NET (Lucene.NET) and Python (PyLucene). The support for C and Perl was named Apache Lucy, but it has since been retired. The core library is Free and Open Source Software (FOSS) and has been adopted by different products (ElasticSearch, Apache Solr, CrateDB, Elassandra etc.) for text search.

Lucene supports a rich feature set for search - wildcard, fuzzy and proximity text searches; range queries over a period; custom fields like title, fileType, time_of_edit etc. The search results can be sorted, paginated and ranked.

Use Lucene core library

One can write a small Java application that embeds the Lucene core library to index, store and query the Lucene data.
The TutorialsPoint.com tutorial seems precise and concise for a simple Java Lucene application: https://www.tutorialspoint.com/lucene/lucene_first_application.htm

One can feed an appropriate set of folders and files, or an entire drive on any OS, to the application to index and search. One needs to take care to read and index the appropriate attributes of a file (filename, type, time of change, metadata etc., apart from content), so that these attributes can be used as filters while searching.
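
The idea of indexing content together with file attributes can be sketched without Lucene at all. The following is a toy inverted index in plain Python, not a Lucene application: the build_index and search helpers and the attribute names are hypothetical, and only UTF-8 text files are read. A real Lucene application would store these attributes as document fields instead.

```python
import os
import re
from collections import defaultdict
from datetime import datetime


def build_index(root):
    """Walk `root` and return (inverted_index, metadata) for readable text files."""
    inverted = defaultdict(set)  # word -> set of file paths containing it
    metadata = {}                # file path -> attributes usable as filters
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    content = handle.read()
                modified = datetime.fromtimestamp(os.path.getmtime(path))
            except (OSError, UnicodeDecodeError):
                continue  # skip unreadable or binary files in this sketch
            metadata[path] = {
                "name": name,
                "type": os.path.splitext(name)[1].lstrip(".").lower(),
                "modified": modified,
            }
            for word in set(re.findall(r"\w+", content.lower())):
                inverted[word].add(path)
    return inverted, metadata


def search(inverted, metadata, word, file_type=None):
    """Return sorted paths containing `word`, optionally limited to one file type."""
    hits = inverted.get(word.lower(), set())
    return sorted(
        path for path in hits
        if file_type is None or metadata[path]["type"] == file_type
    )
```

Calling build_index on a folder and then search(inverted, metadata, "lucene", file_type="java") returns only the Java files that mention the word, which is the kind of attribute filter described above.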

The disadvantage would be maintaining, scaling and keeping such an application available over time. Taking such an application to production would be unwise, as these engineering problems are already solved by others - Apache Solr, ElasticSearch, CrateDB etc.!

Use Apache Solr

Explore the option of running Apache Solr (part of the Apache Lucene project), which runs as a server with Lucene embedded in it.
This is also an open source server, with code available on GitHub.com along with details to get started.

Index & Search

One needs a mechanism to feed the desktop folders and files to the Apache Solr server for indexing, with the appropriate attributes as fields so that they are searchable, and also has to write a tool/application to query Apache Solr.
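
As a rough sketch of such a feed-and-query tool, the snippet below builds the two HTTP requests involved, using Solr's JSON update handler and its standard select handler. The core name "desktop", the local address and the field names are assumptions; the fields would have to exist in the Solr schema, and nothing is sent until the request is passed to urllib.request.urlopen against a running server.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical local Solr core named "desktop".
SOLR = "http://localhost:8983/solr/desktop"


def index_request(docs):
    """Build a POST that adds `docs` (a list of dicts) and commits them."""
    return urllib.request.Request(
        SOLR + "/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_url(query, rows=10):
    """Build the URL for a query against Solr's /select handler."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return SOLR + "/select?" + params
```

For example, index_request([{"id": "notes.txt", "title": "notes"}]) prepares one document for indexing, and search_url("title:notes") gives the URL to fetch the matching results.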

Use ElasticSearch

ElasticSearch is similar to Apache Solr in using Lucene at its core, but it is built and maintained by Elastic.co and not by Apache. It is also an open source server that one can download and run. This server is used more in production than Apache Solr, due to its elastic nature of scaling horizontally and managing shards and replicas.

Index

Though the approach of indexing folders into the ElasticSearch service is the same as for Apache Solr, Elastic.co provides a set of tools - Elastic Beats - to stream data to ElasticSearch without any coding effort. Using FileBeat, one can stream folders, files, logs etc. to the ElasticSearch service with just configuration.
More Beats tools are available for streaming other types of data.

Search

Similarly, Elastic Kibana can be used to visualize and query ElasticSearch data. This tool can also be configured and run with no additional development cycle. Since it is a web-based application, it runs on any OS.

SQL Search

Elastic.co also provides a JDBC driver to query ElasticSearch with SQL, wrapping the HTTP-based query APIs provided by ElasticSearch. The SQL queries are well documented in SQL4ES, which is the initial version of Elastic.co's JDBC driver.
One can write a Java application that integrates this driver to query ElasticSearch using SQL-like queries. Or use the SQL client provided by Elastic.co itself, which seems better to me than Kibana or the JDBC driver, as one doesn't need to learn the ElasticSearch query format, which is not simple and straightforward and has a steep learning curve.
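
Besides JDBC and the SQL client, ElasticSearch also accepts SQL over its REST API, which gives a feel for how simple the SQL route is. The sketch below only builds such a request: the index name "files" and its columns are hypothetical, the address assumes a local server, and the "/_sql" path is the one used by ElasticSearch 7.x (6.x releases exposed it under "/_xpack/sql").

```python
import json
import urllib.request


def sql_request(sql, host="http://localhost:9200"):
    """Build a POST to the SQL REST endpoint; format=txt asks for a text table."""
    return urllib.request.Request(
        host + "/_sql?format=txt",
        data=json.dumps({"query": sql}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "files", "file_name", "extension" and "modified" are hypothetical names.
request = sql_request(
    "SELECT file_name, modified FROM files "
    "WHERE extension = 'java' ORDER BY modified DESC LIMIT 10"
)
```

Passing the request to urllib.request.urlopen against a running server would return the rows as plain text.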

Use CrateDB

CrateDB is built using Lucene, and uses the open source ElasticSearch for cluster management!
Crate.io has detailed the differentiators of CrateDB over ElasticSearch for clarity. The major items are built-in SQL, joins, blob storage, and insert and update by search (i.e. using 'where').

Index & Search

CrateDB has a SQL CLI called Crash for querying the data; however, that does not support Insert. One has to write an application using the client libraries provided by CrateDB to ingest data into CrateDB.
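
For the ingestion side, one option besides the client libraries is CrateDB's HTTP endpoint, which accepts a SQL statement with bulk arguments. The sketch below only builds such a request with the Python standard library; the "files" table, its columns and the local address are assumptions, and the table would need to be created first.

```python
import json
import urllib.request


def bulk_insert_request(rows, host="http://localhost:4200"):
    """Build a POST that inserts many rows into a hypothetical `files` table."""
    body = {
        "stmt": "INSERT INTO files (path, extension, content) VALUES (?, ?, ?)",
        "bulk_args": [list(row) for row in rows],  # one argument list per row
    }
    return urllib.request.Request(
        host + "/_sql",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Each tuple in rows becomes one inserted record, so a folder walk can collect rows in batches and send them in a single request.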
