Let's get coding!

It has been quite a year: we rewrote our user interface (read at nlpcore.io/user-documentation), revamped our core search algorithms (read at nlpcore.io/search-engine-improvements-in-preview-release), and revised our core infrastructure for scale (read at nlpcore.io/revamping-it-all). To round it all off, we now invite you to preview our development platform, complete with secure web service APIs and code samples, all hosted on widely used cloud services.

We have built our APIs using the Swagger (swagger.io) framework for RESTful web services and have hosted these services on the Tyk (tyk.io) API Gateway and Management Platform – both well trusted and widely used tools in the developer community. Furthermore, we have integrated these with our identity and authentication management services (built on FreeIPA, freeipa.org).

Developers Portal (developers.nlpcore.com)

Registration and Secure Access to Projects
To access our platform APIs, developers register at our portal using the registration page (developers.nlpcore.com/portal/register/). Once registered, they can log in to the portal and request an API key (developers.nlpcore.com/portal/apis) to get started (the key only needs to be requested once). They can register for any publicly available projects and/or create new projects where they can then upload their own documents or web pages for search.

To facilitate this secure access, both for developers and for our web search UI, we maintain identities in our directory service over the LDAP protocol. Additionally, this service can authenticate a user's identity through third-party login services such as Google+ and Facebook. Once authenticated, users are added to their selected projects and assigned a policy granting them access to our APIs.

API Catalog
developers.nlpcore.com/portal/apis/59d413b82b6b6d0001c9bae3/documentation
The API catalog is maintained and documented using Swagger. We have implemented complete versioning support, including multiple versions running concurrently (for example, our beta and alpha sites are on different API versions but use the same extracted data), using our newly revised infrastructure (read more at nlpcore.io/revamping-it-all). This Swagger-maintained site also lets you test any of our APIs; to do so, click the Authorize button on the page (on the right, above the API list) and use your developer key.
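Outside the Swagger UI, a call with the developer key might look like the sketch below; the base URL, endpoint path, and Authorization header shown here are illustrative placeholders only – consult the catalog page for the actual definitions.

# Illustrative sketch of calling a platform API with a developer key.
# The base URL, endpoint path, and "Authorization" header name are
# placeholders, not the documented API -- see the Swagger catalog.
import requests

API_BASE = "https://developers.nlpcore.com/api/v1"   # placeholder base URL
API_KEY = "<your developer key>"

response = requests.get(
    f"{API_BASE}/projects",                          # placeholder endpoint
    headers={"Authorization": API_KEY},
)
response.raise_for_status()
print(response.json())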

Object Model
Our object model consists of the following components.
Project – Projects provide a scoping boundary, both for security and for document collections. Search results can span projects, but to do so the user must have access to those projects.
Document – A document is a body of text (numeric data, image, audio, video – any media type in the future) – a file or a web page that is identified, stored, indexed, searched, and retrieved as a whole or in part through its containing topics, entities, or references (see below).
Topic – A topic, category, or concept is a named collection of extracted terms, relevant to the search keywords, drawn from a collection of documents. For example, in Life Sciences these are Proteins, Genes, Cell Lines, or more generically bio-molecules. We plan to provide People, Organizations, Places, Events, Products, Documents, and Concepts as broad general topics across all domains.
Entity – An entity is a term extracted from a document. An entity may be grouped into one or more topics and is considered related to other entities through annotations or descriptions in the document where they occur within a certain distance of one another.
Annotation Reference – An entity reference, or annotation reference, is the surrounding text containing an entity together with its related entities. Any two entities may have multiple references within or across documents, and their count signifies the relative strength of their relationship.
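As an informal illustration, the object model can be sketched as plain Python data classes; the field names below are illustrative only and do not mirror the exact API schema.

# Informal sketch of the object model described above.
# Field names are illustrative and not the exact API schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Project:
    name: str                               # scoping boundary for security and documents
    document_ids: List[str] = field(default_factory=list)

@dataclass
class Document:
    doc_id: str
    content: str                            # body of text (other media types in future)
    project: str

@dataclass
class Topic:
    name: str                               # e.g. "Protein", "Gene", "Cell Line"
    entity_names: List[str] = field(default_factory=list)

@dataclass
class Entity:
    term: str                               # extracted term from a document
    topics: List[str] = field(default_factory=list)

@dataclass
class AnnotationReference:
    entity: str
    related_entity: str
    doc_id: str
    surrounding_text: str                   # text containing the entity and its relations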

API
Our APIs are conveniently grouped into the namespaces below.

Project – The Project APIs provide the list of available projects. For security reasons we have chosen not to expose any create or delete APIs for projects; instead, developers do so manually through our developers portal as described above.

Document – These APIs allow developers to extract various attributes – metadata, the word graph within and across documents, and related documents.

Search – These APIs allow developers to get documents, search term suggestions, topics, as well as the relationship graph of extracted entities (topic instances).

Entities – These APIs allow developers to get the surrounding text (annotation reference) for an entity extracted by our search platform.

Feedback – These APIs allow developers to record user feedback in our search platform. Users can save search results in arbitrary collections to recall later, as well as mark any extracted result as relevant (or not) and reassign its topic name to their own liking. We record this feedback to improve subsequent search results, both individually and collectively, over time.

Code Samples
jupyter.nlpcore.com:8888/notebooks/work/nlpcore-console-app.ipynb?token=f3c21d64f36597bc96e0fbf560696c17014ace3c2c3f6404

Besides Swagger, where a developer can test a single API in isolation, we have also put together a few working samples to demonstrate specific life sciences use cases. For this we have used the Jupyter Notebook (jupyter.org) interactive computing environment. Developers can modify the Python code to their liking (either simple parameter values or entire code blocks) in the notebook itself and run it to see the results, or copy it for their own use.

While this is still a work in progress, we have now completed most of the building blocks we envisioned (and discovered!) along the way and have exciting things in store for 2018! We invite you to try out our life sciences web search UI, try out our APIs, and give us your feedback (feedback@nlpcore.com). Have a great season and start of 2018!

Revamping it all!

Soon after we announced the preview release of our search platform back in April of this year, we got busy adding more articles and more life sciences dictionaries to our platform, and in the process we managed to put heavy stress on our infrastructure in all directions. As we added more articles, the keyword index grew beyond the number of attributes supported by the underlying database! As we added more dictionaries, search computation time increased manyfold, resulting in occasional timeouts of search queries. To make matters worse, tight coupling of our platform with the crawled and parsed data allocated too many resources (both compute and storage) for each of our deployments, be it the alpha, beta, or internal development site. Incorporating any backup or disaster recovery plan would have further stressed and complicated our framework.

It soon became apparent that we needed to make some significant architectural changes.  So we did!

A New Document Store

Thanks to our DFS implementation, our corpus data has always been distributed evenly across our infrastructure. However, as mentioned above, our various deployments were tightly coupled with their underlying data, leading to redundant compute and storage instances as well as the absence of any pragmatic backup or disaster recovery plan. Therefore Varun designed, thoroughly tested, and deployed a new document store layer on top of the existing DFS, which now allows multiple applications to share or segregate the same data corpus through project namespaces. It also allows for encryption at the project level, so that the data can be securely accessed by various deployments using public/private key infrastructure. Similarly, it enables creating an active data backup at the project level that can be archived and restored as needed.

Processing Pipeline (ETL)

Though we had a decoupled architecture from the very start, the new document store enabled us to greatly simplify and streamline our processing pipeline (from crawler, word breaker, keyword index, NLP, classifiers, and dictionaries to search result computation), to break components down more discretely, and to let them run concurrently on demand at data ingestion time, thereby reducing search-time computation dramatically. Under this new architecture, each processing component derives from a simple application base class provided by the document store pipeline, which processes newly uploaded/crawled documents/pages through a callback function implemented by the component developer. Here is an example of a simple component in our pipeline that prints the document id and its content when called with a document.

Sample Application for Document Store Pipeline
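An illustrative sketch of such a component follows; the PipelineApplication base class, its import path, and the callback name are placeholders rather than the actual SDK, while the domain_selector and module_dependencies attributes match the description below.

# Sketch of the simple print component described above.
# The base class name, import path, and callback signature are placeholders;
# only the "domain_selector" and "module_dependencies" attributes and the
# print-the-document behavior come from the post.
from docstore.pipeline import PipelineApplication   # placeholder SDK import

class PrintDocuments(PipelineApplication):
    # Scope of documents this component should process.
    domain_selector = {"project": "*"}
    # Components that must run before this one.
    module_dependencies = []

    def process_document(self, doc_id, content):
        # Callback invoked by the pipeline for each newly uploaded/crawled document.
        print(doc_id)
        print(content)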

Each component can define the scope of documents to process (see the “domain_selector” attribute in the code above) and can also declare its dependent components (see the “module_dependencies” attribute in the code above) so that the pipeline ensures dependent components are processed first.

Processing Components

With the new pipeline architecture and document store SDK, writing components (both our own and those from our future collaborators) has become much simpler. Varun was able to discard the majority of our past monolithic crawler, NLP, indexer, and storage components and rewrite them as new pipeline application modules in a short period of time. Here is a brief snippet to peek under the covers, so to speak.

Here is how one uploads new content into the document store:

Sample Application to add documents into document store
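For illustration, such an upload might look like the sketch below; the DocumentStore class, its constructor arguments, and the add_document method are placeholder names, not the actual SDK calls.

# Sketch of adding new content to the document store.
# DocumentStore and add_document() are placeholder names for the SDK calls.
from docstore import DocumentStore   # placeholder SDK import

store = DocumentStore(project="life-sciences-demo")   # project namespace

with open("article.txt", "r", encoding="utf-8") as f:
    content = f.read()

# Store the document; the pipeline then runs the registered components on it.
doc_id = store.add_document(name="article.txt", content=content)
print("stored document", doc_id)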

Here is how we build the initial text index (see more on this below):

Code Snippet to create a keyword search index using document store pipeline
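An illustrative sketch of such an indexing component, following the same pipeline pattern as above; the index API and the dependency name are placeholders.

# Sketch of a keyword indexing component: tokenize each document and add
# postings to an inverted index. Base class, index API, and dependency
# names are placeholders following the pipeline pattern above.
import re
from docstore.pipeline import PipelineApplication   # placeholder SDK import

class KeywordIndexer(PipelineApplication):
    domain_selector = {"project": "*"}
    module_dependencies = ["word_breaker"]           # placeholder dependency

    def process_document(self, doc_id, content):
        for position, word in enumerate(re.findall(r"\w+", content.lower())):
            # Record (document, position) under each word key.
            self.index.add_posting(word, doc_id, position)   # placeholder index API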

New Inverted Index

As stated above, one problem we encountered with growing data was the physical limit on the number of attributes we could store in the key-value database (Cassandra). That forced us to rethink and look more closely into other mature technologies for storing the inverted index (a list of documents and their relevant attributes as values and a set of words as keys – thus the name inverted index!). Apache Lucene's indexing classes provided a good guiding framework for Varun to implement the same internally (without Java or a large memory footprint – two of the reasons we chose not to use Lucene itself), as demonstrated in the code snippet above. Furthermore, it has enabled our infrastructure to scale linearly over DFS while keeping the initial search index relatively small and super fast.
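To illustrate the data structure itself (a conceptual toy, not our internal implementation):

# Toy illustration of an inverted index: words map to the documents
# (and positions) where they occur.
from collections import defaultdict

documents = {
    "doc1": "protein binds protein",
    "doc2": "gene encodes protein",
}

inverted_index = defaultdict(list)
for doc_id, text in documents.items():
    for position, word in enumerate(text.split()):
        inverted_index[word].append((doc_id, position))

print(inverted_index["protein"])
# [('doc1', 0), ('doc1', 2), ('doc2', 2)]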

Dictionaries

While our claim to fame is our unsupervised learning algorithm, which clusters results into meaningful concepts (topics), we can also leverage external dictionaries, catalogs, or taxonomies to improve the quality and relevance of the results we surface in any given domain or project. Our new document store SDK can be used by our partners to easily add new dictionaries to the platform with a few simple lines of code. Here is an example of how we integrated Protein Data Bank (PDB) identifiers as a dictionary in our document store (as a result of this integration, if a protein is identified by our search engine, we can return its attributes from PDB).

Sample Dictionary to extract PDB Identifiers using document store pipeline
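An illustrative sketch of such a dictionary component follows; the helper service and result-saving calls are placeholder names, and the regular expression reflects the classic four-character PDB identifier format.

# Sketch of a PDB dictionary component. Base class, helper service, and
# save_result() are placeholders; the regex matches the classic
# four-character PDB identifier (a digit followed by three alphanumerics).
import re
from docstore.pipeline import PipelineApplication   # placeholder SDK import
from helpers import pdb_service                     # placeholder internal helper

PDB_ID = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")

class PdbDictionary(PipelineApplication):
    domain_selector = {"project": "life-sciences"}
    module_dependencies = []

    def process_document(self, doc_id, content):
        for pdb_id in set(PDB_ID.findall(content)):
            # Fetch attributes from the helper service and store them as this
            # dictionary's per-document results.
            attributes = pdb_service.lookup(pdb_id)
            self.save_result(doc_id, term=pdb_id, attributes=attributes)   # placeholder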

Similarly, here is an example of how we integrated Wikipedia references into our document store.

Sample Dictionary to extract Wikipedia attributes using document store pipeline
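And an illustrative sketch of the Wikipedia counterpart, again with placeholder names throughout.

# Sketch of the Wikipedia dictionary component, following the same pattern:
# for each extracted entity, ask a helper service for its Wikipedia summary
# and store it as a per-document result. All names are placeholders.
from docstore.pipeline import PipelineApplication   # placeholder SDK import
from helpers import wikipedia_service               # placeholder internal helper

class WikipediaDictionary(PipelineApplication):
    domain_selector = {"project": "*"}
    module_dependencies = ["entity_extractor"]       # placeholder dependency

    def process_document(self, doc_id, content):
        for entity in self.extracted_entities(doc_id):   # placeholder accessor
            summary = wikipedia_service.summary(entity)
            if summary:
                self.save_result(doc_id, term=entity, attributes={"wikipedia": summary})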

We will publish many more sample dictionary interfaces (such as Name, Place, or Event classifiers) along with some of our internal helper services (such as the PDB and Wikipedia web services used in the scripts above) to help with extracting attributes from public or private dictionary sources. The new interface has not only made writing dictionaries simple but has also greatly improved search-time performance. Before, our search compute time increased as we added more dictionaries. In the new framework each dictionary maintains its own results per document, so the search engine simply queries matching terms from the relevant dictionaries and coalesces all attributes across them. As a result, a given protein may now have its attributes derived from PDB, Wikipedia entries, and any other supplied dictionaries (e.g. STRING) or product catalogs (e.g. Millipore Sigma or Thermo Fisher Scientific).

What Next?

While you experiment with our new framework, which is up and running at http://beta.nlpcore.com, stay tuned for more exciting news from NLPCORE. Soon, we will publish our full set of APIs, allowing both solution developers and algorithm developers to take our platform for a spin in the life sciences vertical and beyond, including with their own data sets!

Scripting our Search Experience

Lastly, in the true spirit of all hands on deck, I decided to write a few Python scripts to help our users look for specific search results without using our website. Thanks to the power of Python, in only a few lines of code I was able to write an extensible console app which provides commands to list topics (e.g. proteins, genes, disease names, etc.) discovered automatically from the life sciences project, instance names for a given topic, and their specific references (surrounding text) within the articles. You are welcome to play with and modify the script as you wish (reader's exercise: find an instance name for a given topic where the instance reference contains a specific word, e.g. find a protein sequence that contains a ‘mutation’).

Sample Console Application to list topics and annotation references using search APIs
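An illustrative sketch of such a console app follows; the endpoint paths, parameter names, and response fields are placeholders, so consult the API catalog for the real definitions.

# Sketch of a console app that lists topics, topic instances, and annotation
# references. Endpoint paths, parameter names, and response fields are
# placeholders only -- see the API catalog for the actual API.
import requests

API_BASE = "https://developers.nlpcore.com/api/v1"   # placeholder base URL
API_KEY = "<your developer key>"
HEADERS = {"Authorization": API_KEY}

def get(path, **params):
    response = requests.get(f"{API_BASE}{path}", headers=HEADERS, params=params)
    response.raise_for_status()
    return response.json()

def list_topics(project):
    return get("/search/topics", project=project)          # placeholder endpoint

def list_instances(project, topic):
    return get("/search/entities", project=project, topic=topic)   # placeholder endpoint

def list_references(project, entity):
    return get("/entities/references", project=project, entity=entity)   # placeholder endpoint

def main():
    project = "life-sciences"                               # placeholder project name
    topics = list_topics(project)
    print("topics:", topics)
    # Reader's exercise: find an instance whose reference contains "mutation".
    for topic in topics:
        for instance in list_instances(project, topic):
            for ref in list_references(project, instance):
                if "mutation" in ref.get("text", "").lower():
                    print(topic, instance, ref["text"])

if __name__ == "__main__":
    main()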

Note: all scripts and API references provided here are pre-release, AS-IS, and subject to change.

User Documentation 0.2


User Manual

Version 0.2 (Preview Release)

 

This User Manual provides operating instructions for new users of our search and collaboration portal. The portal is designed for life science researchers, healthcare professionals, and biologists, who can quickly identify candidate items – be they proteins, genes, cell lines, or reagents – for their experiments. They can also provide feedback on the quality of search results – both their relevance and their accuracy.

Terminology and Data Model

All search results are identified broadly as “Concepts,” which can be as raw as a URL or a document in its native opaque format, or a fully qualified, well-defined typed entity such as a Protein with all its attributes drawn from, say, the UniProt knowledgebase along with its references across various articles. Concepts are broadly categorized as People (users, researchers), Places (including Institutions, Vendors, and Universities), Events (Dates, Times, Specific Events), Products (Bioentities, Reagents, Equipment and Materials), Documents (Articles, URLs), or simply Concepts (dynamically discovered or user-defined concepts based on their (frequent) occurrences in the search space, such as ‘disease’, ‘transformation’, ‘test subject’, etc.).

 

An instance of an entity may have one or more Concept types – for example, a keyword may be a proper noun, a place, an organization, or all of these at the same time. It may also have a varying degree of confidence associated with each type. Additionally, it may belong to one or more subject domains – such as Biology, Chemistry, News, Current Events, Music, Sports, Politics, and more.

 

Projects, at a high level, are the various verticals where we may wish to apply our search optimizations and learning algorithms or trained models. To begin with, we have focused on the Biology / Life Sciences domain and specifically NIH open access articles. In the near future we will allow users to add their own projects and subscribe to one or more projects, and we will make sure that search results are scoped within their specified projects.

 

We have also incorporated the notion of Dictionaries. For example, there are known sets of Bioentities and Reagents that we can incorporate as dictionary terms. We will allow users to add their own custom dictionaries as well. These dictionaries not only help our search engine improve identification of known entities and concepts, they also improve lookup of new concepts in the neighborhood of these known items. As we add more such dictionaries and build more trained sets, our search engine surfaces more results across various verticals. The goal of the underlying platform and data model is to continue to process raw data through the entire pipeline without disruption as more models in one or more domains are added. We will be continually adding new dictionaries and trained models through our interim releases.

User Interface

The user interface is based on a classic, clean search model. Users begin with a simple search interface where they can type any keyword to initiate a search.

The results are retrieved in a two-pane results view: the left pane provides the various concept types – both domain-neutral (people/organizations, places, events, products) and domain-specific (products such as Proteins, Genes, Cells, Reagents, etc., specific to Life Sciences) – and the right pane provides the various result views – documents, time series, concept graph, geographical, and so on.

 

Search and Filter Pane

This pane allows users to perform fresh or subsequent searches as a funnel (increasingly narrow search criteria using the original keywords coupled with one or more search results).

Task Pane

This pane is view specific. Based on the currently selected view in the Results Pane (see next), view-specific tasks are shown in this pane, aligned to the right edge.

Results Pane

The Results Pane is the main client area and is used to display search results in a variety of views – a graph of concepts and their inter-relationships, a list of documents, tag cloud, people, geography, or time-series views.

Graph View

The Graph View is the default view; it shows color-coded concepts of various types and their relationships to other concepts. The view is designed to let users easily interact with the original search results – expanding or narrowing down to a specific set of concepts and their relationships. They can then perform subsequent funnel searches to get to very precise results.

Once you select one or more concept types from the top filter in the filter pane, the results appear in a graphical display, with each concept shown in its corresponding concept type color and edge thickness indicating the strength of the relationship between connected nodes.

Interacting with Results

You can interact with the graph by clicking with the left and right mouse buttons, with the following behavior for both nodes (concepts) and edges (relations).

  • Right-Click

Right-click provides a pop-up menu with the following actionable items.

    • Properties

The Properties dialog provides attributes of the selected Concept or its relationship and a list of all the references found across various articles.

You can click the toggle button > to the left of each reference to reveal the title of the article from which the reference was obtained.

 

You can click on the title (left-click) to see the text of the article color-coded with all the concepts highlighted, along with any significant text fragments (highlighted in yellow) that were used as part of the search construction.