Let's get coding!

It has been quite a year: we rewrote our user interface (read at nlpcore.io/user-documentation), revamped our core search algorithms (read at nlpcore.io/search-engine-improvements-in-preview-release), and revised our core infrastructure for scale (read at nlpcore.io/revamping-it-all). To round it all off, we now invite you to preview our development platform, complete with secure web service APIs and code samples, all hosted on widely used cloud services.

We have built our APIs using the Swagger (swagger.io) framework for RESTful web services and host them on the Tyk (tyk.io) API Gateway and Management Platform, both of which are well trusted and widely used in the developer community. Furthermore, we have integrated these services with our identity and authentication management services (built on FreeIPA, freeipa.org).

Developers Portal (developers.nlpcore.com)

Registration and Secure Access to Projects
To access our platform APIs, developers register at our portal via the registration page at developers.nlpcore.com/portal/register/. Once registered, they can log in to the portal and request an API key at developers.nlpcore.com/portal/apis to get started (the key only needs to be requested once). They can join any publicly available project and/or create new projects, to which they can then upload their own documents or web pages for search.

To facilitate this secure access, both for developers and for our web search UI, we maintain identities in our directory service over the LDAP protocol. This service can also authenticate users through third-party login services such as Google+ and Facebook. Once authenticated, users are added to their selected projects and assigned a policy granting them access to our APIs.

API Catalog
developers.nlpcore.com/portal/apis/59d413b82b6b6d0001c9bae3/documentation
The API catalog is maintained and documented using Swagger. We have implemented complete versioning support, including multiple versions running concurrently (for example, our beta and alpha sites run different API versions but share the same extracted data), on our newly revised infrastructure (read more at nlpcore.io/revamping-it-all). The Swagger-maintained site also lets you test any of our APIs; to do so, click the Authorize button on the page (on the right, above the API list) and enter your developer key.
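The same key works from any HTTP client, not just the Swagger UI. Here is a minimal sketch using Python's requests library; the endpoint path and the auth header name are our own assumptions for illustration, so check the Swagger catalog for the documented contract.

import requests

API_BASE = "https://developers.nlpcore.com/api"  # hypothetical base URL
API_KEY = "your-developer-key"                   # requested once via the portal

# Hypothetical call: list the projects visible to this key.
# The path and header name are assumptions; the Swagger catalog is authoritative.
response = requests.get(
    f"{API_BASE}/projects",
    headers={"Authorization": API_KEY},
    timeout=30,
)
response.raise_for_status()
for project in response.json():
    print(project)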

Object Model
Our object model consists of the following components (a minimal code sketch follows the list).
Project – Projects provide a scoping boundary both for security and for document collection. Search results can span projects, but only if the user has access to each of those projects.
Document – A document is a body of text (and in the future numeric data, image, audio, video, or any other media type), a file or a web page, that is identified, stored, indexed, searched, and retrieved as a whole or in part through its containing topics, entities, or references (see below).
Topic – A topic, category, or concept is a named collection of terms relevant to the search keywords, extracted from a collection of documents. For example, in life sciences these are Proteins, Genes, and Cell Lines, or more generically biomolecules. We plan to provide People, Organizations, Places, Events, Products, Documents, and Concepts as broad general topics across all domains.
Entity – An entity is a term extracted from a document. An entity may be grouped into one or more topics and is considered related to other entities through annotations or descriptions in the documents where they occur within a certain distance of one another.
Annotation Reference – An entity reference, or annotation reference, is the surrounding text containing an entity together with its related entities. Any two entities may have multiple references within or across documents, and the count of those references signifies the relative strength of their relationship.
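To make the relationships concrete, here is a sketch of the model as plain Python dataclasses. The class and field names are our own illustration of the concepts above, not the API's actual wire format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Project:
    """Scoping boundary for security and document collection."""
    project_id: str
    name: str

@dataclass
class Document:
    """A file or web page, indexed and retrieved whole or in part."""
    document_id: str
    project_id: str
    uri: str

@dataclass
class Topic:
    """A named collection of extracted terms, e.g. Proteins or Genes."""
    name: str
    entity_ids: List[str] = field(default_factory=list)

@dataclass
class Entity:
    """A term extracted from a document, grouped into one or more topics."""
    entity_id: str
    text: str
    topics: List[str] = field(default_factory=list)

@dataclass
class AnnotationReference:
    """Surrounding text containing an entity and its related entities.
    The count of references between two entities signals how strongly
    they are related."""
    document_id: str
    entity_ids: List[str]
    snippet: str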

API
Our APIs are grouped into the namespaces below; a short end-to-end sketch follows the list.

Project – The Project APIs provide the list of available projects. For security reasons we have chosen not to expose create or delete APIs for projects; instead, developers must do so manually through our developer portal, as described above.

Document – These APIs allow developers to extract various attributes: metadata, the word graph within and across documents, and related documents.

Search – These APIs allow developers to get documents, search term suggestions, and topics, as well as the relationship graph of extracted entities (topic instances).

Entities – These APIs allow developers to get the surrounding text (annotation reference) for an entity extracted by our search platform.

Feedback – These APIs allow developers to record user feedback in our search platform. Users can save search results in arbitrary collections to recall later, mark any extracted result as relevant (or not), and rename its topic to their own liking. We record this feedback to improve subsequent search results, both individually and collectively, over time.
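Here is a hedged sketch of how the namespaces compose into one workflow: search for entities, pull the annotation references for one of them, then record feedback. The endpoint paths, parameter names, and response fields are assumptions for illustration; the Swagger catalog documents the real contract.

import requests

API_BASE = "https://developers.nlpcore.com/api"    # hypothetical base URL
HEADERS = {"Authorization": "your-developer-key"}  # assumed auth header

# 1. Search: fetch extracted entities for a query within a project.
search = requests.get(
    f"{API_BASE}/search/entities",                 # hypothetical path
    params={"project": "life-sciences-demo", "q": "p53"},
    headers=HEADERS, timeout=30,
).json()

# 2. Entities: pull the annotation references (surrounding text)
#    for the first entity in the result set (response shape assumed).
entity_id = search["entities"][0]["id"]
refs = requests.get(
    f"{API_BASE}/entities/{entity_id}/references", # hypothetical path
    headers=HEADERS, timeout=30,
).json()

# 3. Feedback: mark the entity as relevant so future searches improve.
requests.post(
    f"{API_BASE}/feedback",                        # hypothetical path
    json={"entity": entity_id, "relevant": True},
    headers=HEADERS, timeout=30,
)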

Code Samples
jupyter.nlpcore.com:8888/notebooks/work/nlpcore-console-app.ipynb?token=f3c21d64f36597bc96e0fbf560696c17014ace3c2c3f6404

Besides Swagger, where a developer can test a single API in isolation, we have also put together a few working samples that demonstrate specific life sciences use cases. For these we use the Jupyter Notebook (jupyter.org) interactive computing environment. Developers can modify the Python code to their liking (anything from simple parameter values to entire code blocks) right on the website, run it to see the results, or copy it for their own use.
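A cell in the spirit of those samples might look like the sketch below: parameters at the top that a developer tweaks before re-running the cell. The endpoint path and response shape here are our own assumptions, not the actual notebook contents.

# Tweak the parameters below and re-run the cell to see new results.
import requests

PROJECT = "life-sciences-demo"   # swap in your own project
QUERY = "BRCA1"                  # change the search term and re-run

resp = requests.get(
    "https://developers.nlpcore.com/api/search/documents",  # hypothetical path
    params={"project": PROJECT, "q": QUERY},
    headers={"Authorization": "your-developer-key"},
    timeout=30,
)
for doc in resp.json()[:5]:      # assumes the response is a JSON list
    print(doc)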

While this is still a work in progress, we have now completed most of the building blocks we envisioned (and discovered!) along the way, and we have exciting things in store for 2018! We invite you to try out our life sciences web search UI, try out our APIs, and give us your feedback (feedback@nlpcore.com). Have a great season and start to 2018!

Building solutions with Modular and Scalable REST Web Services powered by Discrete Docker Containers

REST’ing on Containers…

A lot has been written and said about web services over the years, and Docker containers are the latest buzz. My first experience building and using web services came at Microsoft, on the Dynamics CRM 3.0 engineering team, where we built, consumed, and exposed a set of entity web services (contacts, tasks, leads…) as an abstraction over traditional relational database tables. I still fondly recall some of our tongue-in-cheek conversations about the number of times a single call from the web user interface, through the middle tier, to the database had to go from object to XML, back to object, then back to XML, and so on, eventually landing as a SQL SELECT, UPDATE, or DELETE statement on SQL Server (the eventual bottleneck for growing beyond enterprise to web scale). As for containers: back in 2008 I had an engaging discussion with Internet Explorer developers about using Windows Virtual PC technology to offer backward compatibility while freely building groundbreaking new features.

Fast forward to the present day with our startup NLPCORE, and it is a completely different ball game altogether. See my earlier post (http://www.nlpcore.io/?p=36) for a brief overview of what we are about and our hardware experiments (we have, by the way, added massive computing and storage capacity of our own with a couple of blade servers since then, and yes, the liquid-cooled CPU/GPU monster is still happily crunching every computation thrown at it!). In this post I will focus on our implementation stack and how our early conviction about web services and Docker containers has paid rich dividends after a few iterations, trials, tribulations, and eventual triumphs.

Architecture

Here is a brief lowdown on how we have stacked our components together. Each component is encapsulated in a virtual container that interacts with the others through well-defined endpoints. This allows us to independently revise, upgrade, or replace each component, or even maintain multiple versions of it concurrently.

Containers help us encapsulate all dependencies at their tried-and-tested versions, configure their settings, and deploy just the required component. Our components expose and consume well-defined, versioned web services to communicate with each other, and with third parties, as long as callers present the proper authentication and access tokens issued by our identity management system (which uses OAuth protocols, so we do not need to create or maintain user IDs or email addresses on our end).
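To illustrate what "versioned" means in practice: a client pins the API version in the request path and presents a bearer token from the identity provider, so two component versions can serve traffic side by side. The host, paths, and token value below are assumptions for illustration only.

import requests

# Token issued by the identity provider via OAuth; we never see a password.
TOKEN = "oauth-access-token"  # placeholder, obtained out of band

# Two versions of the same component can run concurrently;
# clients choose one by pinning the version in the URL.
for version in ("v1", "v2"):
    resp = requests.get(
        f"https://api.nlpcore.com/{version}/documents",  # hypothetical host/path
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    print(version, resp.status_code)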

Putting it all together

All our code is written in Python, and we recently integrated Swagger, which has helped us tremendously with API references, documentation, and samples. We host our own source control platform (GitLab) to ensure we can maintain properly versioned source code and enable multiple dev teams to independently check out code, make changes, and merge their check-ins.

Our build process is also fully automated and follows a CI/CD model. As part of development, developers write check-in test scripts that our build engine executes after a successful compilation; if the tests pass, it builds a complete Docker container image with the appropriate dependencies automatically pulled down and baked into the image. Thanks to Docker's incremental imaging ability, subsequent builds after code changes only require a delta image. A developer can therefore build our entire platform from scratch and continue to work on any one portion, with the rest of the containers remaining unchanged and available for testing and verification.
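A check-in test script in this flow can be as simple as a small pytest module that the build engine runs before producing an image. This sketch, including the service URL and endpoints, is illustrative rather than our actual gate.

# Check-in test sketch (pytest); run by the build engine before an image
# is produced. The service URL and endpoints are assumptions.
import requests

SERVICE = "http://localhost:8000"  # component under test, assumed port

def test_health_endpoint_responds():
    # A freshly built component should answer its health check.
    resp = requests.get(f"{SERVICE}/health", timeout=5)
    assert resp.status_code == 200

def test_api_version_is_reported():
    # Versioned services should advertise which API version they serve.
    resp = requests.get(f"{SERVICE}/version", timeout=5)
    assert resp.json().get("version")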

Besides our development stack, Varun has gone ahead and put together our blog/documentation site at http://nlpcore.io, running self-hosted WordPress (as, what else, another Docker-contained instance).

Our entire customer-facing, internal, and partner-facing enterprise (internet, intranet, and extranet) therefore runs on a number of Docker instances that are properly isolated, connected through a gigabit switch, the internet, and virtual private networks where appropriate, with each component protected by authenticated access. Besides hosting it all ourselves on a set of blade and custom-built servers, we subscribe to an off-the-shelf cloud storage service to ensure our data is safely backed up elsewhere.

Frankly, coming from an enterprise into a startup with minimal resources, I am super impressed and amazed by what our brilliant CTO Varun Mittal has assembled while managing his work and studies. The Docker infrastructure Varun put together not only helped UW unblock their labs and earned him a well-deserved research assistantship; it is now also a published paper (we will add references at our site soon)!

Putting the platform to the test!

As Varun made progress on our hardware infrastructure, refactored and rewrote the majority of the existing code into discrete components hosted in Docker instances, and improved our core NLP/ML algorithms (more on this in a future post), he showed us the new results and APIs at work, very impressively. However, we needed a real-world test case, and at the same time we were hard pressed to find solid engineering support for revamping our own web user interface. After a few failed attempts through internship offers, and my own half-hearted attempts at taking up UX development (I haven't given it up just yet! Thanks to Coursera, I have sped through quite a few Python and HTML5/CSS courses, so I will be at it again soon!), we finally decided to go all out and find a serious third-party partner who could take this on end to end. I was fortunate to reconnect with an old friend who happened to be just the partner we needed!

I wrote an engineering requirements document that leveraged heavily our existing proof-of-concept implementation at http://nlpcore.com and our planned web services (exposing a clean interface across each component, as depicted in the architecture diagram above), and handed it off to our new partners so they could start rebuilding our interface from the ground up! They recommended and chose Vaadin, a server-side Java-based framework for web apps, something new for us to learn and play with!

After a couple of weeks of ramp-up, we are thoroughly pleased to report that our bets have paid off! Having a third party develop our own user interface at arm's length, using (supposedly) well-defined web services, is a true test of our decoupled architecture and one of the best ways to eat our own dogfood (our cloud search platform).

Before

We implemented the majority of the interface below using JavaScript and the Python/Django framework, but the mix of Python on the server, Django template variables, and JavaScript unfortunately makes it too unwieldy in the long run, both to maintain and to extend.

After

The new user interface (still a work in progress) is completely data-driven and decoupled from our platform (even the notion of entity types, such as genes, proteins, and people, is completely abstracted; another future post!). Thanks to Vaadin, we can not only easily maintain and extend the server-side Java modules but also apply different CSS styles to change the look and feel of the interface without changing any code (which is written in Java and maintained by the application server).

The magic of CI/CD helps us take regular drops (as often and as early as possible) and deploy them on another Docker instance that we surface to our pilot users at http://beta.nlpcore.com.

Plan Ahead

We are very excited and optimistic about completing our planned features, across both the search platform web services and our life sciences search, collaboration, and procurement solution, in the next couple of months. We have already identified a number of pilot customers and will be circling back to get them to try our life sciences solution, provide feedback (a capability baked right into the solution as part of its collaboration features), and help us get deployed deeply in the biotech community.

Furthermore, we will be documenting and writing samples for our web services platform, which we envision a wide spectrum of users, from life sciences researchers and coders to enterprise search consumers, developers, and add-on developers (proprietary data formats, data stores…), will find super attractive and super easy to work with. We will ourselves provide working components (including a couple of search algorithms that can be plugged right into our interface) as samples to jump-start this community.