Wrapping up 2016 and wishing all of you a happy 2017!

As the year 2016 comes to a close, we at NLPCORE would like to wish you and your loved ones a very happy holiday season and start of 2017. We also take this opportunity to thank you for your support, feedback and contributions in helping us refine our thinking and improve our solutions stack. Here is a brief update on NLPCORE through its trials, tribulations and eventual triumphs during 2016, and the plans on the anvil for a bright future ahead in 2017!

We began 2016 with a number of hardware experiments with a goal to find an optimized hardware stack across CPU, GPU, RAM, Storage and network latency that helps us minimize cost without compromising on performance. If you haven’t already, please read about our experiments here.

While getting the infrastructure on the right course, Varun also focused on core algorithms. Working with Alexander Ratushny (Fred Hutch and now Celgene), we validated our search engine against manually verified gene-virus interactions. Not only did we fare well on Precision/Recall numbers, we were also able to identify interactions buried deep in research papers that had not been manually verified in an otherwise comprehensive study. Read more about our application note draft here. We will be revising it next year with our latest algorithms and hardware infrastructure.

With improved infrastructure and core algorithms, and great feedback from our collaborators including you, we put together our engineering requirements specification. With a few missteps and failed internships behind us, we are very fortunate to have partnered with our friends at dmmd.net, who took our requirements document, our proof of concept implementation at nlpcore.com and our new APIs as raw materials and have been busy converting them into a polished product (soon to be launched through http://beta.nlpcore.com). This hands-off, third party development approach has also helped us refine our platform so that it is not only well-defined for re-use and customization but also responsive and scalable without compromise! You may read more on our engineering process innovations here.

With 2016 coming to a close, we have outlined the following priorities for the year ahead.

Firstly, we are getting our Lifesciences Search, Collaboration and Procurement solution pilot ready. Our site will give users the ability to explore bioentities and reagents and to see their linkages and interaction references into various articles – through both document and graph views. We also intend to let users define their own categories or entity types (such as methods, test subjects, and gender).

We will be documenting all our platform APIs and writing a number of samples up and down the stack: how to add a new document or web link for search, add a new technique to identify an entity (e.g. DNA sequences), add and expose optimization parameters for a search technique, or add another view in our user interface. We hope to attract developers with a wide range of skills – math, Java, Python, front-end, infrastructure, algorithms…

We will also be doing performance analysis for our search results and entity extractions and revising our draft application note based on new findings. Our new algorithms start with the exhaustive data set and therefore we hope to improve both on precision and recall fronts.

Last but not least, our focus in the latter half of the year will be on securing customer deployments, improving page footprints and generating revenue. In doing so, we aim for 2018 to be our transition from an early pre-revenue startup to a viable business enterprise.

In-House Distributed File System for faster IO and optimized Compute


NLPCORE DFS is a native implementation of the WebDAV protocol over a relatively new but stable distributed file system, glusterfs, built primarily to store large blobs on a cluster in an efficient manner. Our current design involves Cassandra, which is essentially a key-value store. Cassandra is a very well optimized distributed storage technology, but a relatively poor design on top of it can quickly consume a lot of resources just to maintain the Cassandra cluster.

# Data Description & Problem
The data that we handle is typically in the form of a graph, logically represented as nodes and edges. Most well established graph databases that use Cassandra as their back-end implement a number of sophisticated methods, such as hashed keys, to ensure data localization in a cluster. But since our models are still very experimental and change very often, a small trade-off of speed for convenience makes a huge difference in saving R&D time.

# Bottleneck in the process
Graph traversals are one of the major activities that the platform performs on each user query request. While one might imagine a fast matrix-based algorithm to quickly discover interacting nodes in a given neighborhood, our graphs are mostly extremely sparse, and creating a large entity map for each document to simulate a dense matrix requires creating another blob per document. These blobs are not often problematic individually, but they accumulate into dense key maps. One would expect technologies like Cassandra to optimize blob storage and keep only indexes to file references in memory; this is exactly the behavior we are trying to achieve with this Distributed File System implementation.
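To illustrate why a sparse representation suits this kind of traversal, here is a minimal sketch that stores only the edges that exist and walks a neighborhood breadth-first. The function name, the hop limit and the entity names are assumptions made for illustration; this is not the platform's actual traversal code.

```python
from collections import deque


def neighborhood(adjacency, start, max_hops):
    """Return {node: hop distance} for nodes within max_hops edges of start.

    adjacency is a dict mapping each node to the set of its neighbors,
    so memory grows with the number of edges rather than nodes squared.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen


# Hypothetical entity graph, for illustration only.
graph = {
    "geneA": {"virusX", "proteinP"},
    "virusX": {"geneA", "geneB"},
    "proteinP": {"geneA"},
    "geneB": {"virusX", "proteinQ"},
    "proteinQ": {"geneB"},
}
result = neighborhood(graph, "geneA", max_hops=2)
```

With a two-hop limit, `proteinQ` (three hops from `geneA`) is left out, while a dense matrix for the same five nodes would hold 25 cells for 8 directed edges.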

While key-value stores provide advanced query capabilities and sometimes the ability to filter, such capabilities might not be required for blobs. Instead, a smartly constructed key in the form of a path can do the magic, and only the bare minimum of metadata needs to be saved in the key-value store.
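As a rough sketch of what a path-shaped key could look like, the snippet below derives a blob path from a corpus name, a document ID and a blob kind. The layout (two hash-prefix directory levels) and all names are assumptions for illustration, not the scheme NLPCORE DFS actually uses.

```python
import hashlib
from pathlib import PurePosixPath


def blob_path(corpus, document_id, kind):
    """Build a deterministic path key for a blob.

    Two directory levels taken from a hash of the document ID keep any
    single directory from growing too large. Assumes document_id
    contains no '/' characters.
    """
    digest = hashlib.sha1(document_id.encode("utf-8")).hexdigest()
    return PurePosixPath("/", corpus, digest[:2], digest[2:4],
                         f"{document_id}.{kind}")


# Hypothetical corpus and document names.
path = blob_path("pubmed", "PMC123", "entitymap")

# Only minimal metadata would then need to live in the key-value store.
metadata = {"document_id": "PMC123", "blob": str(path)}
```

Because the path is computed from the ID alone, a reader can locate a blob without first querying the key-value store.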

# Configuring underlying file system
There are a large number of distributed file systems available and, other than some minor configuration differences, most provide similar capabilities. Our implementation uses Nginx’s WebDAV module as the interaction layer. To interact with the file system, it is mounted onto a directory, which is then used as the web root directory.
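Since WebDAV uploads and downloads are plain HTTP `PUT` and `GET` requests, a client can be sketched with the Python standard library alone. The host name and blob path below are hypothetical, and the requests are only constructed here, not sent.

```python
import urllib.request

# Hypothetical address of the Nginx WebDAV front end.
BASE_URL = "http://dfs.example.internal"


def put_request(path, payload):
    """Build an HTTP PUT that would store payload (bytes) at path."""
    return urllib.request.Request(
        BASE_URL + path,
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


def get_request(path):
    """Build an HTTP GET that would read the blob at path back."""
    return urllib.request.Request(BASE_URL + path, method="GET")


upload = put_request("/pubmed/ab/cd/PMC123.entitymap", b"payload")
download = get_request("/pubmed/ab/cd/PMC123.entitymap")

# Against a live server, urllib.request.urlopen(upload) would send it.
```

Any HTTP client in any language can talk to the store this way, which is much of the appeal of putting WebDAV in front of the mounted file system.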

Glusterfs can be configured in various modes. The default is the distributed mode, in which each server holds a part of the file system as discrete, whole files, and files are distributed evenly so that no individual node has all the data. Another way of setting up the cluster is striping mode, in which the data is striped across nodes so that each server ends up holding a portion of every file; it works best if file sizes vary a lot across the cluster. Finally, the cluster can be set up to replicate data across different nodes. These configurations can also be combined, just like most RAID setups.

For example, data can be striped and replicated at the same time to make sure reads and writes are faster.
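The difference between the three modes can be pictured with a toy placement model. This is only an illustration of the ideas (whole files per node, chunks across nodes, full copies on every node); it is not how glusterfs itself places data, and the node names and chunk size are made up.

```python
import zlib


def distribute(name, data, bricks):
    """Distributed mode: the whole file lands on exactly one brick."""
    target = bricks[zlib.crc32(name.encode("utf-8")) % len(bricks)]
    return {brick: ([data] if brick == target else []) for brick in bricks}


def stripe(data, bricks, chunk_size):
    """Striping mode: fixed-size chunks are dealt out round-robin."""
    placement = {brick: [] for brick in bricks}
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        brick = bricks[index % len(bricks)]
        placement[brick].append(data[offset:offset + chunk_size])
    return placement


def replicate(data, bricks):
    """Replica mode: every brick keeps a full copy."""
    return {brick: [data] for brick in bricks}


bricks = ["node1", "node2"]  # hypothetical storage nodes
payload = b"abcdefgh"
distributed = distribute("doc.blob", payload, bricks)
striped = stripe(payload, bricks, chunk_size=2)
replicated = replicate(payload, bricks)
```

In this model `striped` gives `node1` the chunks `ab` and `ef` and `node2` the chunks `cd` and `gh`, while `replicated` stores all eight bytes on both nodes.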

# Conclusion
Though we have not yet done a systematic study comparing the two implementations, one positive sign in the new system is increased CPU utilization, which is a good indication that the CPU is spending more time on computation. This could also be due to the new driver that we have written, however, so as part of a future update to this blog we plan to publish profiling results for both platforms.

Search Quality – Validating Protein-Protein and Host Gene-Virus Interactions using NLPCORE

As we round out our product features and improve our platform and core text mining / entity extraction algorithms, we have also focused on validating the quality of search results. Thanks to our collaborators at University of Washington and Center for Infectious Disease Research (CIDR), we chose two representative life sciences data sets: the commonly used Protein-Protein interactions, and an experimentally discovered set of Host Gene-Virus interactions. Using these sets, we were able to validate not only a high recall rate but also good precision, by identifying interactions that were not mentioned in the experimentally discovered set.
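For readers less familiar with the two measures, here is a minimal sketch of how precision and recall can be computed for a set of predicted interaction pairs against a reference set. The entity names and pairs are invented for illustration and are not taken from our data sets.

```python
def pair(a, b):
    """Represent an undirected interaction so that (a, b) equals (b, a)."""
    return frozenset((a, b))


def precision_recall(predicted, reference):
    """Return (precision, recall) of predicted pairs against reference pairs."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference (manually verified) and predicted interactions.
reference = {pair("geneA", "virusX"), pair("geneB", "virusX"),
             pair("geneC", "virusY")}
predicted = {pair("virusX", "geneA"), pair("geneB", "virusX"),
             pair("geneD", "virusY"), pair("geneE", "virusZ")}

precision, recall = precision_recall(predicted, reference)
```

Note that a predicted pair missing from the reference set counts against precision in this calculation, even if it later turns out to be a genuine interaction that the reference set simply did not include.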


Here is a link to our Application Note (DRAFT) that we intend to revise shortly in the new year with our latest iteration of core algorithms. And here is the link for its Supplementary Information with more details on our methods and the data sets used in our tests.

Building solutions with Modular and Scalable REST Web Services powered by Discrete Docker Containers

REST’ing on Containers…

A lot has been written and talked about Web Services over the years, and Docker Containers are the latest buzz. My first experience with building and using web services came at Microsoft, on its Dynamics CRM 3.0 engineering team, where we built, consumed and exposed a set of entity (contacts, tasks, Leads…) web services as an abstraction over traditional relational database tables. I still fondly recall some of our tongue-in-cheek conversations about the number of times a particular call from its web user interface to the middle tier to the database had to go from Object to XML, back to Object, then back to XML and so on, eventually down to a SQL SELECT, UPDATE or DELETE statement for the SQL Server (the eventual bottleneck for growing beyond enterprise to web scale). As for containers, back in 2008 I had an engaging discussion with Internet Explorer developers about making use of Windows Virtual PC technology to offer backward compatibility and freely build groundbreaking new features.

However, fast forwarding to the present day with our startup NLPCORE, it is a completely different ball game altogether. See my earlier post (http://www.nlpcore.io/?p=36) for a brief overview of what we are about and our hardware experiments (we have, by the way, added massive computing and storage capacity of our own with a couple of blade servers since then, and yes, the liquid cooled CPU/GPU monster is still happily crunching all computations thrown at it continuously!). In this post I will focus on our implementation stack and how our early conviction on Web Services and Docker Containers has paid us rich dividends after a few iterations, trials, tribulations and eventual triumphs.


Here is a brief lowdown on how we have stacked together our components. Each component is encapsulated in a virtual container that can interact with others using its well defined end points. That allows us to independently revise/upgrade or even replace or concurrently maintain multiple versions of each component.

Containers help us encapsulate all dependencies at their tried and tested versions, configure their settings and deploy just the required component. Our components expose and consume well-defined and versioned web services to communicate with each other as well as with third parties, as long as they have proper authentication and access tokens provided by our identity management system (using OAuth protocols that do not require us to create or maintain user IDs or email addresses at our end).
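The combination of versioned endpoints and token-based access can be sketched as a small dispatch function. Everything here is a simplified assumption for illustration: the paths, the handler table and the in-memory token set stand in for real services and for an identity provider that would actually validate OAuth tokens.

```python
def handle(path, headers, handlers, valid_tokens):
    """Route a request to a versioned handler after checking its bearer token.

    Returns a (status code, body) tuple.
    """
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer" or token not in valid_tokens:
        return 401, "unauthorized"
    parts = path.strip("/").split("/", 1)
    if len(parts) != 2 or (parts[0], parts[1]) not in handlers:
        return 404, "not found"
    return 200, handlers[(parts[0], parts[1])]()


# Hypothetical handlers: two versions of the same resource live side by side.
handlers = {
    ("v1", "entities"): lambda: ["gene", "protein"],
    ("v2", "entities"): lambda: {"types": ["gene", "protein"]},
}
tokens = {"abc123"}  # stand-in for tokens issued by an identity provider

ok = handle("/v1/entities", {"Authorization": "Bearer abc123"}, handlers, tokens)
denied = handle("/v1/entities", {}, handlers, tokens)
missing = handle("/v3/entities", {"Authorization": "Bearer abc123"}, handlers, tokens)
```

Keeping the version in the path is one simple way to let an old and a new implementation of a component be served concurrently while clients migrate.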

Putting it all together

All our code is written in Python, and we recently integrated Swagger, which helps us tremendously with API references, documentation and samples. We host our own source code management platform (gitlab) so that we can maintain properly versioned source code and enable multiple dev teams to independently check out code, make changes, and merge and check in those changes.

Our build process is also fully automated and follows a CI/CD model. As part of code development, developers write check-in test scripts that our build engine executes after successful compilation and, if approved, it goes on to build a complete Docker container image with the appropriate dependencies automatically pulled down and baked into the image. Thanks to Docker’s incremental imaging ability, any subsequent builds after code changes only require a delta image to be created. A developer can therefore build our entire platform from scratch and continue to work on any one portion, with the rest of the container(s) remaining unchanged and accessible for their testing and verification.
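The gating step described above (run the check-in tests, and build the image only if they pass) can be sketched as follows. The commands and image tag are assumptions for illustration; our actual build engine and its configuration are not shown here.

```python
import subprocess


def build_if_tests_pass(test_cmd, build_cmd, run=subprocess.run):
    """Run the test command; run the build command only if the tests succeed.

    run is injectable so the gate can be exercised without Docker installed.
    Returns True only when both steps exit with status 0.
    """
    if run(test_cmd).returncode != 0:
        return False  # tests failed, so no image is built
    return run(build_cmd).returncode == 0


# Hypothetical commands; the image tag is made up.
TEST_CMD = ["python", "-m", "unittest", "discover"]
BUILD_CMD = ["docker", "build", "-t", "nlpcore/search:dev", "."]

# On a machine with Docker available:
# build_if_tests_pass(TEST_CMD, BUILD_CMD)
```

Because Docker caches unchanged image layers, a rebuild after a small code change typically only redoes the layers from the first changed instruction onward.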

Besides our development stack, Varun has gone ahead and put together our blog/documentation site at http://nlpcore.io running a self-hosted WordPress site (as what else – another Docker contained instance).

Our entire customer facing, internal and partner facing enterprise (internet, intranet and extranet) is therefore running on a number of Docker instances that are properly isolated and connected through a gigabit switch, the internet and virtual private networks where appropriate, with each component protected by authenticated access. Besides hosting it all on a set of blade/custom built servers ourselves, we subscribe to an off the shelf storage cloud service to ensure we have our data backed up somewhere else safely.

Frankly, coming from an enterprise into a startup with minimal resources, I am super impressed and amazed with what our brilliant CTO Varun Mittal has assembled while managing his work and studies. The entire Docker infrastructure that Varun put together not only helped UW unblock their labs and earned him a well-deserved Research Assistantship, it is now also a published paper (we’ll add references at our site soon)!

Putting platform to test!

As Varun made progress on putting together our hardware infrastructure, refactored and rewrote the majority of the existing code into discrete components hosted by Docker instances, and improved the core NLP/ML algorithms (more on this in a future post), he showed us the new results and APIs at work very impressively. However, we needed a real world test case, and at the same time we were really hard pressed to find solid engineering support for him in revamping our own web user interface. After a few failed attempts through internship offers, and my own half-hearted attempts at taking up UX development (I haven’t given it up just yet! Thanks to Coursera, I have sped through quite a few Python, HTML5/CSS courses – so will be at it soon again!), we finally decided to go all out and find a serious third party partner who could take this on end to end. And I was fortunate to reconnect with one of my old friends who happened to be just the partner we needed!

I wrote down an Engineering requirements document that heavily leveraged our existing proof of concept implementation at http://nlpcore.com and our planned Web Services (exposing a clean interface across each component as depicted in the architecture diagram above) and handed this off to our new partners to get started on rebuilding our new interface from the ground up! They recommended and chose Vaadin, a server side Java based framework for web apps – something that was new for us to learn and play with!

After a couple of weeks of ramp-up, we are thoroughly pleased to report that our bets have paid off! Having a third party develop our own user interface at an arm’s length using (supposedly) well-defined web services is a true test of our decoupled architecture and one of the best ways to eat our own dogfood (our cloud search platform).


We implemented the majority of the interface below using JavaScript and the Python/Django framework, but the mix of Python on the server, Django variables and JavaScript unfortunately makes it too unwieldy to maintain and extend in the long run.


The new user interface (still a work in progress) is completely data-driven and decoupled from our platform (even the notion of entity types – genes, proteins, people… is completely abstracted – another future post!). Thanks to Vaadin, we can not only easily maintain and extend the server side Java modules, but also apply different CSS styles to change the look and feel of the interface without changing any code (written in Java and maintained by the application server).

The magic of CI/CD helps us take regular drops (as often and as early as possible) and deploy them on another Docker instance that we surface to our pilot users at http://beta.nlpcore.com.

Plan Ahead

We are very excited and optimistic about completing our planned features across the search platform web services as well as the life sciences search, collaboration and procurement solution in the next couple of months. We already have a number of pilot customers identified and will be circling back with them to get them to try out our life sciences solution, provide feedback (that is baked right into our solution as part of its collaboration features) and help us get deployed deeply in the biotech community.

Furthermore, we will be documenting our web services platform and writing samples for it. We envision that a wide spectrum of users – life sciences researchers, coders, enterprise search consumers, developers and add-on developers (proprietary data formats, data stores…) – will find it super attractive and super easy to work with. We will ourselves provide working components (including a couple of search algorithms that can be plugged right into our interface) as samples to jump start this community.


NLPCORE – Unplugging the public cloud!

Originally published at LinkedIn on March 16th, 2016…

Last Tuesday we finally hit the delete button and terminated all our front-end and back-end instances at Google Cloud before switching our portal (http://nlpcore.com) over to our private cloud stack sitting in a locked closet in the backyard. It was quite a momentous occasion for us, as for the past few months we have been experimenting with various options to find an optimal balance of price and performance for our continuously high compute requirements.
