Revamping it all!

Soon after we announced the preview release of our search platform in April of this year, we got busy adding more articles and more life sciences dictionaries, and in the process we put heavy stress on our infrastructure in every direction.  As we added more articles, the keyword index grew beyond the number of attributes supported by the underlying database!  As we added more dictionaries, search computation time increased many times over, resulting in occasional query timeouts.  To make matters worse, the tight coupling of our platform to the crawled and parsed data tied up too many compute and storage resources for each of our deployments, whether alpha, beta, or the internal development site. Incorporating any backup or disaster recovery plan would have further stressed and complicated our framework.

It soon became apparent that we needed to make some significant architectural changes.  So we did!

A New Document Store

Thanks to our DFS implementation, our corpus data has always been distributed evenly across our infrastructure. However, as mentioned above, our various deployments were tightly coupled to their underlying data, which led to redundant compute and storage instances and left us without any pragmatic backup or disaster recovery plan. Varun therefore designed, thoroughly tested, and deployed a new document store layer on top of the existing DFS, which now allows multiple applications to share or segregate the same data corpus through project namespaces.  It also supports encryption at the project level, so that the data can be securely accessed by various deployments using public/private key infrastructure. Similarly, it enables an active data backup at the project level that can be archived and restored as needed.
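To make the idea of project namespaces and project-level backups concrete, here is a minimal in-memory sketch. It is an illustration only: the class name DocumentStore and its methods are hypothetical and are not the actual document store SDK, and encryption and the DFS layer are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentStore:
    """Toy in-memory store (hypothetical); keys are (project, doc_id) pairs."""

    _docs: dict = field(default_factory=dict)

    def put(self, project: str, doc_id: str, content: str) -> None:
        # The project name acts as a namespace, so ids may repeat across projects.
        self._docs[(project, doc_id)] = content

    def get(self, project: str, doc_id: str) -> str:
        return self._docs[(project, doc_id)]

    def list_ids(self, project: str) -> list:
        return sorted(d for (p, d) in self._docs if p == project)

    def backup(self, project: str) -> dict:
        # Snapshot the documents of one project only.
        return {d: c for (p, d), c in self._docs.items() if p == project}

    def restore(self, project: str, snapshot: dict) -> None:
        # Drop the project's current documents, then reload the snapshot.
        for key in [k for k in self._docs if k[0] == project]:
            del self._docs[key]
        for doc_id, content in snapshot.items():
            self._docs[(project, doc_id)] = content
```

Because every operation is scoped by project, two deployments can share one store without seeing each other's documents, and a backup of one project can be restored without touching the others.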

Processing Pipeline (ETL)

Though we had a decoupled architecture from the very start, the new document store enabled us to greatly simplify and streamline our processing pipeline (crawler, word breaker, keyword index, NLP, classifiers, dictionaries, and search results computation). We broke the components down more discretely and let them run concurrently, on demand, at data ingestion time, which dramatically reduces search-time computation.  Under this new architecture, each processing component derives from a simple application base class provided by the document store pipeline, and processes newly uploaded or crawled documents and pages through a callback function implemented by the component developer. Here is an example of a simple component in our pipeline that prints the document id and its content when called with a document.

Sample Application for Document Store Pipeline

Each component can define the scope of documents it processes (see the “domain_selector” attribute in the code above) and declare the components it depends on (see the “module_dependencies” attribute in the code above), so that the pipeline ensures those dependencies are processed first.
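The pattern described here (a base class, a per-document callback, a domain scope, and declared dependencies) can be sketched roughly as follows. Only the attribute names domain_selector and module_dependencies come from the text; the base class PipelineApp, the process callback, the document layout, and run_pipeline are assumptions made for illustration, not the real SDK.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class PipelineApp:
    """Hypothetical base class; the real SDK base class is not shown in this post."""

    name = "base"
    domain_selector = None      # None means "every document"
    module_dependencies = ()    # names of components that must run first

    def accepts(self, doc: dict) -> bool:
        return self.domain_selector is None or doc["domain"] == self.domain_selector

    def process(self, doc: dict) -> None:
        # Callback implemented by the component developer.
        raise NotImplementedError


class WordBreaker(PipelineApp):
    name = "word_breaker"

    def process(self, doc):
        doc["tokens"] = doc["content"].lower().split()


class PrintDocument(PipelineApp):
    name = "print_document"
    domain_selector = "nih.gov"
    module_dependencies = ("word_breaker",)

    def process(self, doc):
        print(doc["id"], doc["content"])


def run_pipeline(components, doc):
    """Run components on one document, dependencies first; return the run order."""
    by_name = {c.name: c for c in components}
    graph = {c.name: set(c.module_dependencies) for c in components}
    ran = []
    for name in TopologicalSorter(graph).static_order():
        component = by_name[name]
        if component.accepts(doc):
            component.process(doc)
            ran.append(name)
    return ran
```

Calling run_pipeline([PrintDocument(), WordBreaker()], doc) runs the word breaker before the printing component regardless of the order in which they were registered, and skips PrintDocument for documents outside its domain.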

Processing Components

With the new pipeline architecture and document store SDK, writing components (both our own and those from our future collaborators) has become much simpler.  Varun was able to discard the majority of our past monolithic crawler, NLP, indexer, and storage components and rewrite them as new pipeline application modules in a short period of time.  Here is a brief peek under the covers, so to speak.

Here is how one uploads new content in the document store:

Sample Application to add documents into document store

Here is how we build the initial text index (see more on this below):

Code Snippet to create a keyword search index using document store pipeline

New Inverted Index

As stated above, one problem we encountered with growing data was the physical limit on the number of attributes we could store in the key-value database (Cassandra).  That forced us to rethink our approach and look more closely at other mature technologies for storing the inverted index (a set of words as keys, with a list of documents and their relevant attributes as values; thus the name inverted index!).  Apache Lucene's indexing classes provided a good guiding framework for Varun to implement the same internally, as demonstrated in the code snippet above (without Java or a large memory footprint, two of the reasons why we chose not to use Lucene itself).  Furthermore, it has enabled our infrastructure to scale linearly over DFS while keeping the initial search index relatively small and super fast.
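For readers new to the data structure, a bare-bones inverted index can be sketched in a few lines of Python. This is a generic textbook version held in memory, not the implementation described above; the function names and the simple tokenizer are assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each word to {doc_id: [word positions in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    postings = [set(index.get(word, {})) for word in words]
    return set.intersection(*postings)
```

The words are the keys and the documents are the values, which is the inversion the name refers to. A multi-word query becomes an intersection of posting sets rather than a scan over documents.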


While our claim to fame is our unsupervised learning algorithm which clusters results into meaningful concepts (topics), we can also leverage external dictionaries, catalogs or taxonomies to improve the quality and relevance of results we surface in any given domain or project.  Our new document store SDK can be used by our partners to easily add new dictionaries into the platform with a few simple lines of code.  Here is an example of how we integrated Protein Data Bank (PDB) identifiers as a dictionary in our document store (as a result of this integration, if a protein is identified by our search engine, we can return its attributes from PDB).

Sample Dictionary to extract PDB Identifiers using document store pipeline

Similarly, here is an example of how we integrated Wikipedia references in our document store.

Sample Dictionary to extract Wikipedia attributes using document store pipeline

We will publish many more sample dictionary interfaces (such as Name, Place, or Event classifiers) along with some of our internal helper services (such as the PDB and Wikipedia web services used in the scripts above) to help with extracting attributes from public or private dictionary sources.  The new interface not only made writing dictionaries simple but also greatly improved search-time performance. Previously, our search compute time increased as we added more dictionaries.  In our new framework, each dictionary maintains its own results per document, so the search engine simply queries matching terms from the relevant dictionaries and coalesces all attributes across them. As a result, a given protein may now have attributes derived from PDB, Wikipedia entries, and any other supplied dictionaries (e.g. STRING) or product catalogs (e.g. Millipore Sigma or Thermo Fisher Scientific).
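The coalescing step can be pictured with a small sketch. It assumes each dictionary has already stored its results as a plain mapping from term to attributes; the coalesce function, the mapping layout, and the placeholder values are all hypothetical and stand in for whatever the platform actually stores.

```python
def coalesce(term, dictionaries):
    """Merge the attributes of `term` from every dictionary that knows it.

    `dictionaries` maps a source name to {term: {attribute: value}}.
    The result maps each attribute to {source: value}, so attributes that
    several sources provide are kept side by side instead of overwritten.
    """
    merged = {}
    for source, entries in dictionaries.items():
        for attribute, value in entries.get(term, {}).items():
            merged.setdefault(attribute, {})[source] = value
    return merged


# Placeholder data only; these are not real PDB or Wikipedia records.
EXAMPLE_DICTIONARIES = {
    "pdb": {"hemoglobin": {"name": "placeholder PDB name", "id": "placeholder id"}},
    "wikipedia": {"hemoglobin": {"name": "placeholder article title"}},
}
```

Since each source is looked up independently, adding another dictionary adds one more cheap lookup at search time rather than another round of computation over the documents.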

What Next?

While you experiment with our new framework which is up and running at, stay tuned for more exciting news from NLPCORE.  Soon, we will publish our full set of APIs that allow both solution developers and algorithm developers to take our platform for a spin in the life sciences vertical and beyond, including with their own data sets!

Scripting our Search Experience

Lastly, in the true spirit of all hands on deck, I decided to write a few Python scripts to help our users look for specific search results without using our website. Thanks to the power of Python, in only a few lines of code I was able to write an extensible console app that provides commands to list the topics (e.g. proteins, genes, disease names) discovered automatically from a life sciences project, the instance names for a given topic, and their specific references (surrounding text) within the articles. You are welcome to play with and modify the script as you wish (reader’s exercise: find an instance name for a given topic where an instance reference contains a specific word, e.g. find a protein sequence that contains a ‘mutation’).

Sample Console Application to list topics and annotation references using search APIs

Note: all scripts and API references provided here are pre-release, provided AS-IS, and subject to change.
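One possible approach to the reader's exercise is sketched below. It does not call the search APIs; it assumes the topics, instance names, and references have already been fetched into a nested dictionary. That layout, the function names, and the sample snippets are invented for illustration.

```python
# Hypothetical shape of fetched results: topic -> instance name -> reference snippets.
SAMPLE_RESULTS = {
    "protein": {
        "p53": [
            "a p53 mutation was observed in the sample",
            "p53 regulates the cell cycle",
        ],
        "insulin": ["insulin levels were measured after fasting"],
    },
    "gene": {
        "BRCA1": ["BRCA1 mutation carriers were enrolled"],
    },
}


def list_topics(results):
    return sorted(results)


def list_instances(results, topic):
    return sorted(results.get(topic, {}))


def list_references(results, topic, instance):
    return list(results.get(topic, {}).get(instance, []))


def find_instances(results, topic, word):
    """Instance names under `topic` with at least one reference containing `word`."""
    word = word.lower()
    return sorted(
        name
        for name, references in results.get(topic, {}).items()
        if any(word in reference.lower().split() for reference in references)
    )
```

With the sample data, find_instances(SAMPLE_RESULTS, "protein", "mutation") returns only the protein whose surrounding text mentions the word. Matching on whole whitespace-separated words is a deliberate simplification; a real script would probably want punctuation-aware matching.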

User Documentation 0.2

User Manual

Version 0.2 (Preview Release)


This User Manual provides operating instructions for new users of our search and collaboration portal. The portal is designed for life science researchers, healthcare professionals, and biologists, who can use it to quickly identify candidate items for their experiments, be they proteins, genes, cell lines, or reagents. They can also provide feedback on the quality of search results, both for relevance and for accuracy.

Terminology and Data Model

All search results are identified broadly as “Concepts”. A Concept can be as raw as a URL or a document in its native opaque format, or it can be a fully qualified, well-defined typed entity, such as a Protein with all its corresponding attributes from, say, the UniProt knowledgebase along with its references across various articles. Concepts are broadly categorized as People (users, researchers), Places (including Institutions, Vendors, and Universities), Events (Dates, Times, Specific Events), Products (Bioentities, Reagents, Equipment and Materials), Documents (Articles, URLs), or simply Concepts (dynamically discovered or user-defined concepts based on their frequent occurrence in the search space, such as ‘disease’, ‘transformation’, or ‘test subject’).


An instance of an entity may have one or more Concept types. For example, a keyword may be a proper noun, a place, an organization, or all of these at the same time. It may also have a different degree of confidence associated with each type. Additionally, it may belong to one or more subject domains, such as Biology, Chemistry, News, Current Events, Music, Sports, or Politics.
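As an informal illustration of this part of the data model, an entity instance with several typed interpretations might be represented as below. The Entity class, its field names, and the 0-to-1 confidence scale are assumptions for the sketch and do not describe the portal's internal schema.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One entity instance with several possible Concept types (hypothetical model)."""

    text: str
    concept_types: dict = field(default_factory=dict)  # type name -> confidence in [0, 1]
    domains: set = field(default_factory=set)           # e.g. {"Biology", "News"}

    def add_type(self, concept_type: str, confidence: float) -> None:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        self.concept_types[concept_type] = confidence

    def best_type(self):
        # The most confident interpretation, or None if nothing has been assigned.
        if not self.concept_types:
            return None
        return max(self.concept_types, key=self.concept_types.get)

    def types_above(self, threshold: float) -> list:
        return sorted(t for t, c in self.concept_types.items() if c >= threshold)
```

Keeping every candidate type with its own confidence, rather than a single label, lets a view decide later how strict to be, for example by showing only types above a threshold.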


Projects, at a high level, are the various verticals where we may wish to apply our search optimizations, learning algorithms, or trained models. To begin with, we have focused on the Biology / Life Sciences domain, and specifically on NIH open access articles. In the near future we will allow users to add their own projects and subscribe to one or more projects, and we will make sure that search results are scoped within their specified projects.


We have incorporated the notion of Dictionaries. For example, there are known sets of Bioentities and Reagents that we can incorporate as dictionary terms. We will allow users to add their own custom dictionaries as well. These dictionaries not only help our search engine improve identification of known entities and concepts, but also help improve lookup of new concepts in the neighborhood of these known items. As we add more such dictionaries and build more trained sets, our search engine surfaces more results across various verticals. The goal of the underlying platform and data model is to continue processing raw data through the entire pipeline without disruption as more models in one or more domains are added. We will be continually adding new dictionaries and trained models through our interim releases.

User Interface

The user interface is based upon a classic, clean Search model. Users begin with a simple search interface where they can type any keyword to initiate search processing.

The results are retrieved in a two-pane results view. The left pane lists the various concept types, categorized into domain-neutral entities (people/orgs, places, events, products) and domain-specific entities (for Life Sciences, products such as Proteins, Genes, Cells, and Reagents). The right pane provides various result views: documents, time-series, concept graph, geographical, and so on.


Search and Filter Pane

This pane allows users to perform fresh searches, or subsequent searches as a funnel (increasingly narrow search criteria that combine the original keywords with one or more search results).

Task Pane

This pane is view-specific. Based upon the currently selected view in the Results Pane (see next), view-specific tasks are shown in this pane, aligned to the right edge.

Results Pane

The Results pane is the main client area and is used to display search results in a variety of views: a graph of concepts and their inter-relationships, a list of documents, a tag cloud, people, geography, or time-series views.

Graph View

Graph view is the default view. It shows color-coded concepts of various types and their relationships to other concepts. The view is designed to let users easily interact with the original search results, expanding or narrowing down to a specific set of concepts and their relationships. They can then perform subsequent funnel searches to get to very precise results.

Once you select one or more concept types from the top filter in the filter pane, you can see the results in a graphical display. Concepts are displayed in the color of their concept type, and the thickness of each edge indicates the strength of the relationship between the nodes it connects.
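The manual does not say how relationship strength is computed. Purely as an illustration, the sketch below assumes strength is the number of documents in which two concepts appear together, and scales that count linearly into an edge width. The function names and the width range are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(doc_concepts):
    """Count, for each pair of concepts, how many documents mention both.

    `doc_concepts` is a list with one collection of concept names per document.
    """
    edges = Counter()
    for concepts in doc_concepts:
        # Sort so that each pair has one canonical (a, b) ordering.
        for a, b in combinations(sorted(set(concepts)), 2):
            edges[(a, b)] += 1
    return edges


def edge_widths(edges, min_width=1.0, max_width=6.0):
    """Scale each edge count linearly so the strongest edge gets `max_width`."""
    if not edges:
        return {}
    strongest = max(edges.values())
    return {
        pair: min_width + (max_width - min_width) * count / strongest
        for pair, count in edges.items()
    }
```

Under this assumption, a pair of concepts that co-occurs in many articles is drawn with a thicker edge than a pair that co-occurs only once.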

Interacting with Results

You can interact with the graph by clicking with both the left and right mouse buttons, with the following behavior for both nodes (concepts) and edges (relations).

  • Right-Click

Right-click provides a pop-up menu with the following actionable items.

    • Properties

The Properties dialog provides the attributes of the selected Concept or relationship and a list of all the references found across various articles.

You can click the toggle button > to the left of each reference to reveal the title of the article from which this reference was obtained.


You can left-click on the title to see the text of the article, color-coded with all the concepts highlighted, along with any significant text fragments (highlighted in yellow) that were used as part of search construction.