Revamping it all!

Soon after we announced the preview release of our search platform back in April of this year, we got busy adding more articles and more life sciences dictionaries to our platform, and in the process we put heavy stress on our infrastructure in all directions. As we added more articles, the keyword index grew beyond the number of attributes supported by the underlying database! As we added more dictionaries, the search computation time increased many-fold, resulting in occasional timeouts of search queries. To make matters worse, the tight coupling of our platform with the crawled and parsed data allocated too many resources (both compute and storage) to each of our deployments, be it alpha, beta, or an internal development site. Incorporating any backup or disaster recovery plan would have further stressed and complicated our framework.

It soon became apparent that we needed to make some significant architectural changes.  So we did!

A New Document Store

Thanks to our DFS implementation, our corpus data has always been distributed evenly across our infrastructure. However, as mentioned above, our various deployments were tightly coupled with their underlying data, leading to redundant compute and storage instances as well as the absence of any pragmatic backup or disaster recovery plan. Varun therefore designed, thoroughly tested, and deployed a new document store layer on top of the existing DFS, which now allows multiple applications to share or segregate the same data corpus through project namespaces. It also allows for encryption at the project level, so that data can be securely accessed by various deployments using public/private key infrastructure. Similarly, it enables creating an active data backup at the project level that can be archived and restored as needed.

Processing Pipeline (ETL)

Though we had a decoupled architecture from the very start, the new document store enabled us to greatly simplify and streamline our processing pipeline (from crawler, word breaker, keyword index, NLP, classifiers, and dictionaries through to search results computation), to break components down more discretely, and to let them run concurrently on demand at data ingestion time, thereby reducing search-time computations dramatically. Under this new architecture, each processing component derives from a simple application base class provided by the document store pipeline, which processes newly uploaded or crawled documents/pages through a callback function implemented by the component developer. Here is an example of a simple component in our pipeline that prints the document id and its content when called with a document.

Sample Application for Document Store Pipeline
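In rough outline (the base class and callback names below are placeholders rather than the real SDK interface; only the domain_selector and module_dependencies attributes referenced below are the actual attribute names), such a component boils down to:

```python
# Illustrative sketch only: the real document store SDK class and callback names differ.
# The pipeline invokes on_document() for every newly uploaded or crawled document.

class PipelineApplication:
    """Stand-in for the SDK's application base class."""
    domain_selector = "*"          # which documents this component should see
    module_dependencies = []       # components the pipeline must run first

    def on_document(self, doc_id, content):
        raise NotImplementedError


class PrintDocument(PipelineApplication):
    """Minimal component: print the document id and its content."""
    domain_selector = "lifesciences/*"   # placeholder scope
    module_dependencies = []

    def on_document(self, doc_id, content):
        print(doc_id)
        print(content)


if __name__ == "__main__":
    # Simulate the pipeline calling the component with one document.
    PrintDocument().on_document("doc-0001", "Example article text ...")
```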

Each component can define the scope of documents to process (see the “domain_selector” attribute in the code above) and can also declare its dependent components (see the “module_dependencies” attribute in the code above), so that the pipeline ensures dependent components are processed first.

Processing Components

With the new pipeline architecture and document store SDK, writing components (both our own and those from our future collaborators) has become much simpler. Varun was able to discard the majority of our past monolithic crawler, NLP, indexer, and storage components and rewrite them as new pipeline application modules in a short period of time. Here are a few brief snippets to peek under the covers, so to speak.

Here is how one uploads new content into the document store:

Sample Application to add documents into document store
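In rough outline (the endpoint URL, project namespace path, and authentication header here are placeholders rather than the actual document store API), uploading a document boils down to a single call:

```python
# Illustrative sketch: endpoint, namespace path, and auth scheme are assumptions.
import requests

DOCSTORE = "https://docstore.example.nlpcore.com"   # placeholder endpoint
PROJECT = "lifesciences-beta"                        # placeholder project namespace

def upload_document(doc_id: str, text: str, token: str) -> None:
    """Add one document into the project's corpus; the pipeline picks it up from there."""
    resp = requests.put(
        f"{DOCSTORE}/projects/{PROJECT}/documents/{doc_id}",
        data=text.encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "text/plain; charset=utf-8"},
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    upload_document("doc-0001", "Example article text ...", token="<access token>")
```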

Here is how we build the initial text index (see more on this below):

Code Snippet to create a keyword search index using document store pipeline
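Conceptually, the indexing component is just a word breaker feeding an inverted index; the simplified sketch below keeps the postings in a local dictionary purely for illustration, whereas the real component persists them through the document store:

```python
# Illustrative sketch of a keyword-indexing step: tokenize each document and
# record (document id, position) postings per word.
import re
from collections import defaultdict

inverted_index = defaultdict(list)   # word -> [(doc_id, position), ...]

def index_document(doc_id: str, content: str) -> None:
    for position, word in enumerate(re.findall(r"[a-z0-9]+", content.lower())):
        inverted_index[word].append((doc_id, position))

if __name__ == "__main__":
    index_document("doc-0001", "BRCA1 protein interacts with BARD1 protein")
    print(inverted_index["protein"])   # [('doc-0001', 1), ('doc-0001', 5)]
```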

New Inverted Index

As stated above, one problem we encountered with growing data was the physical limit on the number of attributes we could store in the key-value database (Cassandra). That forced us to rethink and look more closely into other mature technologies for storing the inverted index (a set of words as keys, with the documents and their relevant attributes as values, hence the name inverted index!). Apache Lucene's indexing classes provided a good guiding framework for Varun to implement the same internally (without Java or a large memory footprint, two of the reasons why we chose not to use Lucene itself), as demonstrated in the code snippet above. Furthermore, it has enabled our infrastructure to scale linearly over DFS while keeping the initial search index relatively small and super fast.

Dictionaries

While our claim to fame is our unsupervised learning algorithm which clusters results into meaningful concepts (topics), we can also leverage external dictionaries, catalogs or taxonomies to improve the quality and relevance of results we surface in any given domain or project.  Our new document store SDK can be used by our partners to easily add new dictionaries into the platform with a few simple lines of code.  Here is an example of how we integrated Protein Data Bank (PDB) identifiers as a dictionary in our document store (as a result of this integration, if a protein is identified by our search engine, we can return its attributes from PDB).

Sample Dictionary to extract PDB Identifiers using document store pipeline
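In essence (function and variable names here are illustrative, and the call to our PDB helper web service is omitted), the dictionary scans each document for PDB-style identifiers and records them as per-document attributes:

```python
# Illustrative sketch: extract PDB identifiers (four characters starting with a
# digit, e.g. "1TUP") from document text and record them as per-document attributes.
import re

PDB_ID = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")

def extract_pdb_ids(doc_id: str, content: str) -> dict:
    ids = sorted(set(match.group(0).upper() for match in PDB_ID.finditer(content)))
    # In the real pipeline these would be stored against the document and enriched
    # with attributes returned by the PDB helper web service mentioned in the post.
    return {"doc_id": doc_id, "pdb_ids": ids}

if __name__ == "__main__":
    print(extract_pdb_ids("doc-0001", "The p53 core domain structure 1TUP shows ..."))
```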

Similarly, here is an example of how we integrated Wikipedia references in our document store.

Sample Dictionary to extract Wikipedia attributes using document store pipeline
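In a similarly simplified form, with the public Wikipedia REST summary endpoint standing in for our internal Wikipedia helper service and illustrative function names:

```python
# Illustrative sketch: given a detected term, fetch a short Wikipedia summary
# as its dictionary attributes.
import requests

def wikipedia_attributes(term: str) -> dict:
    resp = requests.get(
        "https://en.wikipedia.org/api/rest_v1/page/summary/" + term.replace(" ", "_"),
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    return {"title": data.get("title"),
            "summary": data.get("extract"),
            "url": data.get("content_urls", {}).get("desktop", {}).get("page")}

if __name__ == "__main__":
    print(wikipedia_attributes("BRCA1"))
```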

We will publish many more sample dictionary interfaces (such as name, place, or event classifiers) along with some of our internal helper services (such as the PDB and Wikipedia web services used in the scripts above) to help with extracting attributes from public or private dictionary sources. The new interface not only makes writing dictionaries simple but also greatly improves search-time performance. Before, our search compute time increased as we added more dictionaries. In the new framework each dictionary maintains its own results per document, so the search engine simply queries matching terms from the relevant dictionaries and coalesces all attributes across them. As an example, a given protein may now have its attributes derived from PDB, Wikipedia entries, and any other supplied dictionaries (e.g. STRING) or product catalogs (e.g. Millipore Sigma or Thermo Fisher Scientific).
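To make that coalescing step concrete, here is a minimal illustrative sketch (the data shapes and names are assumptions rather than our actual API):

```python
# Illustrative sketch of search-time coalescing: each dictionary keeps its own
# per-document results, so the engine simply merges attributes per matched term.
def coalesce(term: str, dictionary_results: dict) -> dict:
    """dictionary_results maps dictionary name -> {term -> attributes}."""
    merged = {}
    for source, results in dictionary_results.items():
        for key, value in results.get(term, {}).items():
            merged[f"{source}.{key}"] = value
    return merged

if __name__ == "__main__":
    results = {
        "pdb": {"p53": {"structure": "1TUP"}},
        "wikipedia": {"p53": {"summary": "p53 is a tumor suppressor protein ..."}},
    }
    print(coalesce("p53", results))
    # {'pdb.structure': '1TUP', 'wikipedia.summary': 'p53 is a tumor suppressor protein ...'}
```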

What Next?

While you experiment with our new framework, which is up and running at http://beta.nlpcore.com, stay tuned for more exciting news from NLPCORE. Soon, we will publish our full set of APIs that allow both solution developers and algorithm developers to take our platform for a spin in the life sciences vertical and beyond, including with their own data sets!

Scripting our Search Experience

Lastly, in the true spirit of all hands on deck, I decided to write a few Python scripts to help our users look for specific search results without using our website. Thanks to the power of Python, in only a few lines of code I was able to write an extensible console app that provides commands to list topics (e.g. proteins, genes, disease names) discovered automatically from a life sciences project, the instance names for a given topic, and their specific references (surrounding text) within the articles. You are welcome to play with and modify the script as you wish (reader’s exercise: find an instance name for a given topic where the instance reference contains a specific word, e.g. find a protein sequence that contains a ‘mutation’).

Sample Console Application to list topics and annotation references using search APIs
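In condensed, illustrative form (the API base URL, routes, and response fields below are placeholders and will differ from the actual pre-release search APIs), the console app looks roughly like this:

```python
# Illustrative sketch of the console app; endpoints and response fields are assumptions.
import cmd
import requests

API = "http://beta.nlpcore.com/api"   # placeholder base URL
PROJECT = "lifesciences"               # placeholder project name

def get(path, **params):
    resp = requests.get(f"{API}/{path}", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

class SearchConsole(cmd.Cmd):
    prompt = "nlpcore> "

    def do_topics(self, _):
        """List topics (proteins, genes, disease names, ...) discovered in the project."""
        for topic in get(f"projects/{PROJECT}/topics"):
            print(topic["name"])

    def do_instances(self, topic):
        """instances <topic>: list instance names discovered for a topic."""
        for instance in get(f"projects/{PROJECT}/topics/{topic}/instances"):
            print(instance["name"])

    def do_references(self, line):
        """references <topic> <instance>: show surrounding text within the articles."""
        topic, instance = line.split(maxsplit=1)
        for ref in get(f"projects/{PROJECT}/topics/{topic}/instances/{instance}/references"):
            print(f'{ref["document"]}: {ref["text"]}')

    def do_quit(self, _):
        """Exit the console."""
        return True

if __name__ == "__main__":
    SearchConsole().cmdloop()
```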

Note that all scripts and API references provided here are pre-release, AS-IS, and subject to change.

In-House Distributed File System for faster IO and optimized Compute

NLPCORE DFS 1.0

NLPCORE DFS is a native implementation of the WebDAV protocol over a relatively new but stable distributed file system, glusterfs, built primarily with the goal of storing large blobs on a cluster in an efficient manner. Our current design involves Cassandra, which is essentially a key-value store. Though Cassandra is a very well optimized distributed storage technology, a relatively poor design on top of it can quickly take up a lot of resources in maintaining the Cassandra cluster.

# Data Description & Problem
The data we handle is typically in the form of a graph, logically representing nodes and edges. Most well established graph databases that use Cassandra as their back end implement a number of sophisticated methods, such as hashed keys, to ensure data localization in a cluster. But since our models are still very experimental and change very often, a small trade-off between speed and convenience makes a huge difference in saving R&D time.

# Bottleneck in the process
Graph traversals are one of the major activities the platform performs, periodically and on every user query. While one might imagine a fast matrix-based algorithm to quickly discover interacting nodes in a given neighborhood, our graphs are mostly extremely sparse, and creating a large entity map per document to simulate a dense matrix in memory requires creating another blob per document. These blobs are not problematic on their own, but they accumulate into dense key maps. One could argue that technologies like Cassandra should optimize blob data and keep only indexes to file references in memory; that is exactly the behavior we are trying to achieve with this distributed file system implementation.

While key-value stores provide advanced query capabilities and sometimes the ability to filter, such capabilities might not be required for blobs. Instead, a smartly articulated key in the form of a path can do the magic, and only the bare minimum metadata needs to be saved in the key-value store.
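As a minimal illustration of this path-as-key approach over the WebDAV layer (host name, path scheme, and helper names below are assumptions):

```python
# Illustrative sketch: store a blob through the WebDAV layer using a path-structured
# key, keeping only minimal metadata (the path itself) in the key-value store.
import requests

DFS = "http://dfs.example.nlpcore.internal"   # placeholder WebDAV front end

def put_blob(project: str, doc_id: str, name: str, payload: bytes) -> str:
    """Write a blob under a path that encodes project/document/artifact."""
    path = f"/{project}/{doc_id}/{name}"
    requests.put(DFS + path, data=payload, timeout=30).raise_for_status()
    return path   # only this path plus bare-minimum metadata goes into the key-value store

def get_blob(path: str) -> bytes:
    resp = requests.get(DFS + path, timeout=30)
    resp.raise_for_status()
    return resp.content

if __name__ == "__main__":
    key = put_blob("lifesciences", "doc-0001", "entity_map.bin", b"\x00\x01...")
    print(get_blob(key)[:8])
```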

# Configuring underlying file system
There are a large number of distributed file systems available, and other than some minor configuration differences, most provide similar capabilities. Our implementation uses Nginx’s WebDAV module as the interaction layer. To interact with the file system, it is mounted onto a directory, which is then used as the web root directory.

Glusterfs can be configured in various modes. The default is distributed mode, in which each server maintains a part of the file system as discrete files, and these are distributed evenly so that no individual node has all the data. Another way of setting up the cluster is striping mode, in which the data is striped across nodes; this works best if file sizes vary a lot in the cluster, since each server ends up holding a portion of every file. Finally, the cluster can be set up to replicate data across different nodes. These configurations can also be combined, much like most RAID setups.

For example, data can be striped and replicated at the same time to make reads and writes faster.

# Conclusion
Though we have not yet done a systematic study comparing the two implementations, one positive sign in the new system is increased CPU utilization, which is a good indication that the CPU is spending more time on computation. This could also be due to the new driver we have written, so as part of a future update to this blog we plan to publish profiling results for both platforms.

Search Quality – Validating Protein-Protein and Host Gene-Virus Interactions using NLPCORE

As we round out our product features and improve our platform and core text mining / entity extraction algorithms, we have also focused on validating the quality of search results. Thanks to our collaborators at the University of Washington and the Center for Infectious Disease Research (CIDR), we chose two representative life sciences data sets: one the most commonly used protein-protein interactions, and the other an experimentally discovered set of host gene-virus interactions. Using these sets, we were able not only to validate a high recall rate but also good precision, by identifying interactions that were not otherwise mentioned in the experimentally discovered set.


Here is a link to our Application Note (DRAFT), which we intend to revise shortly in the new year with our latest iteration of core algorithms. And here is the link to its Supplementary Information, with more details on our methods and the data sets used in our tests.

Building solutions with Modular and Scalable REST Web Services powered by Discrete Docker Containers

REST’ing on Containers…

A lot has been written and said about web services over the years, and Docker containers are the latest buzz. My first experience building and using web services came at Microsoft, on its Dynamics CRM 3.0 engineering team, where we built, consumed, and exposed a set of entity (contacts, tasks, leads...) web services as an abstraction over traditional relational database tables. I still fondly recall some of our tongue-in-cheek conversations about the number of times a particular call from the web user interface to the middle tier to the database had to go through Object to XML, back to Object, then back to XML and so on, eventually down to a SQL SELECT, UPDATE, or DELETE statement for SQL Server (the eventual bottleneck for growing beyond enterprise to web scale). As for containers, back in 2008 I had an engaging discussion with Internet Explorer developers about making use of Windows Virtual PC technology to offer backward compatibility while freely building groundbreaking new features.

Fast forwarding to the present day with our startup NLPCORE, it is a completely different ball game altogether. See my earlier post (http://www.nlpcore.io/?p=36) for a brief overview of what we are about and our hardware experiments (we have, by the way, added massive computing and storage capacity of our own with a couple of blade servers since then, and yes, the liquid-cooled CPU/GPU monster is still happily crunching all computations thrown at it continuously!). In this post I will focus on our implementation stack and how our early conviction on web services and Docker containers has paid rich dividends after a few iterations, trials, tribulations, and eventual triumphs.

Architecture

Here is a brief lowdown on how we have stacked our components together. Each component is encapsulated in a virtual container that can interact with others through its well-defined endpoints. That allows us to independently revise, upgrade, or replace each component, or even maintain multiple versions of it concurrently.

Containers help us encapsulate all dependencies at their tried and tested versions, configure their settings, and deploy just the required component. Our components expose and consume well-defined and versioned web services to communicate with each other as well as with third parties, as long as they have proper authentication and access tokens provided by our identity management system (using OAuth protocols that do not require us to create or maintain user IDs or email addresses at our end).
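For a flavor of what a component-to-component call looks like, here is a minimal sketch (the service host, version prefix, and route are illustrative assumptions rather than our actual API surface):

```python
# Illustrative sketch: one container calling another's versioned web service
# with an OAuth bearer token obtained from the identity management system.
import requests

def call_service(base_url: str, token: str, query: str) -> dict:
    resp = requests.get(
        f"{base_url}/v1/search",
        params={"q": query},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    results = call_service("http://search-api:8080", token="<access token>", query="BRCA1")
    print(results)
```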

Putting it all together

All our code is written in Python, and we recently integrated Swagger, which has helped us tremendously with API references, documentation, and samples. We host our own source code maintenance platform (gitlab) to ensure we can maintain source code with proper versions and enable multiple dev teams to independently check out code, make changes, and merge their check-ins.

Our build process is also fully automated and follows a CI/CD model. As part of code development, developers write check-in test scripts that our build engine executes after a successful compilation; if they pass, it continues to build a complete Docker container image, with the appropriate dependencies automatically pulled down and baked into the image. Thanks to Docker's incremental imaging ability, any subsequent builds after code changes only require a delta image to be created. A developer can therefore build our entire platform from scratch and continue to work on any one portion, with the rest of the container(s) remaining unchanged and accessible for their testing and verification.

Besides our development stack, Varun has gone ahead and put together our blog/documentation site at http://nlpcore.io, running a self-hosted WordPress site (as, what else, another Docker-contained instance).

Our entire customer-facing, internal, and partner-facing enterprise (internet, intranet, and extranet) is therefore running on a number of Docker instances that are properly isolated, connected through a gigabit switch, the internet, and virtual private networks where appropriate, with each component protected by authenticated access. Besides hosting it all ourselves on a set of blade/custom-built servers, we subscribe to an off-the-shelf cloud storage service to ensure our data is safely backed up somewhere else.

Frankly, coming from an enterprise into a startup with minimal resources, I am super impressed and amazed by what our brilliant CTO Varun Mittal has assembled while managing his work and studies. The entire Docker infrastructure that Varun put together not only helped UW unblock their labs and earned him a well-deserved Research Assistantship, but is now also a published paper (we’ll add references at our site soon)!

Putting platform to test!

As Varun made progress on putting together our hardware infrastructure, refactoring and rewriting the majority of the existing code as discrete components hosted by Docker instances, and improving our core NLP/ML algorithms (more on this in a future post), he showed us the new results and APIs at work, very impressively. However, we needed a real-world test case, and at the same time we were really hard pressed to find solid engineering support for him to revamp our own web user interface. After a few failed attempts through internship offers, and my own half-hearted attempts at taking up UX development (I haven’t given it up just yet! Thanks to Coursera, I have sped through quite a few Python and HTML5/CSS courses, so I will be at it again soon!), we finally decided to go all out and find a serious third-party partner who could take this on end to end. And I was fortunate to reconnect with one of my old friends, who happened to be just the partner we needed!

I wrote an engineering requirements document that heavily leveraged our existing proof-of-concept implementation at http://nlpcore.com and our planned web services (exposing a clean interface across each component, as depicted in the architecture diagram above) and handed this off to our new partners to get started on rebuilding our interface from the ground up! They recommended and chose vaadin, a server-side Java-based framework for web apps, which was new for us to learn and play with!

After a couple of weeks of ramp-up, we are thoroughly pleased to report that our bets have paid off! Having a third party develop our own user interface at arm’s length, using (supposedly) well-defined web services, is a true test of our decoupled architecture and one of the best ways to eat our own dogfood (our cloud search platform).

Before

We implemented the majority of the interface below using JavaScript and the Python/Django framework, but the mix of Python on the server, Django variables, and JavaScript unfortunately makes it too unwieldy in the long run, both to maintain and to extend.

After

The new user interface (still a work in progress) is completely data-driven and decoupled from our platform (even the notion of entity types, such as genes, proteins, and people, is completely abstracted; another future post!). Thanks to vaadin, we can not only easily maintain and extend the server-side Java modules, but also apply different CSS styles to morph the look and feel of the interface without changing any code (written in Java and maintained by the application server).

The magic of CI/CD helps us take regular drops (as often and as early as possible) and deploy them on another Docker instance that we surface to our pilot users at http://beta.nlpcore.com.

Plan Ahead

We are very excited and optimistic about completing our planned features across the search platform web services as well as our life sciences search, collaboration, and procurement solution in the next couple of months. We already have a number of pilot customers identified and will be circling back with them to get them to try out our life sciences solution, provide feedback (collection of which is baked right into our solution as part of its collaboration features), and help us get deployed deeply in the biotech community.

Furthermore, we will be documenting and writing samples describing our web services platform, which we envision a wide spectrum of life sciences researchers, coders, enterprise search consumers, developers, and add-on developers (proprietary data formats, data stores…) will find super attractive and super easy to work with. We will ourselves provide working components (including a couple of search algorithms that can be plugged right into our interface) as samples to jump-start this community.