amba-analysis-worker-percolator

The percolator component connects a discussion event, i.e., a tweet, to a publication or at least a DOI. Without this connection, the event cannot be processed further by the analytical components. The percolator is based on ideas developed by CrossRef, which also uses a percolator for linking events. The percolator is written in Python using the amba-event-stream package. It runs as a Docker container.

To improve linking throughput, three processes run simultaneously in the percolator, which reduces the time lost waiting on web responses. Because all processes connect to Kafka with the same consumer group ID, which identifies them as consumers of the same type, Kafka automatically distributes the events among the three processes. A DOI resolver class provides static functions that retrieve the DOI from event data.

One way of linking data is to use meta tags, which are HTML tags embedded in the source code of a webpage. These tags are not displayed by a browser; they provide information to automated systems that process the page. Traditionally, they have been used by search engines, and nowadays they also provide data for custom titles, descriptions and all kinds of crawler information. Each meta tag has a name attribute and a content attribute. The name identifies which metadata is stored, and the content holds the actual value. This makes it possible to filter the tags for relevant data, which can be identified easily without analyzing the content. The result is more reliable than a full-text analysis, which may produce wrong results because no context analysis is done. For example, a full-text analysis may find the DOI of a cited work and wrongly use it to link the event.
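The filtering by name attribute described above can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the percolator's code: the tag names in `DOI_META_NAMES` are assumptions (names commonly used by publishers), and `MetaTagCollector` and `extract_meta_values` are hypothetical helpers.

```python
from html.parser import HTMLParser

# Assumed list of meta tag names; the percolator's actual list may differ.
DOI_META_NAMES = {"citation_doi", "dc.identifier", "dc.identifier.doi"}


class MetaTagCollector(HTMLParser):
    """Collects the content of <meta> tags whose name is in a wanted set."""

    def __init__(self, wanted_names):
        super().__init__()
        self.wanted_names = {name.lower() for name in wanted_names}
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name in self.wanted_names and content:
            self.values.append(content.strip())


def extract_meta_values(html, wanted_names=DOI_META_NAMES):
    """Return the content values of all wanted meta tags, in page order."""
    collector = MetaTagCollector(wanted_names)
    collector.feed(html)
    return collector.values
```

Only the name attribute decides whether a value is kept, so a DOI that merely appears in a description or in the page body is ignored at this stage.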

Multiple extraction methods are used to find the DOI for a given discussion event. An overview of the process is shown in Figure 1. First, the tweet data is checked for URLs. If there are none, or if these URLs do not contain a DOI, all referenced tweets are checked as well. A reply may not contain a URL itself but reference the original tweet it responds to, so the URL of the original tweet is used for linking. Note that the Twitter API provides multiple URLs that differ in their characteristics: a short URL, an expanded URL and an unwound URL. Since the DOI can ideally be extracted from the URL itself, both the expanded and the unwound URL are checked. Sometimes only the expanded URL contains the DOI, and in other cases only the unwound URL does.
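The order in which URLs are considered can be sketched as follows. The dictionary layout is an assumption modelled loosely on the `entities.urls` entries of the Twitter API; the event format delivered by amba-event-stream may differ, and `urls_of` and `candidate_urls` are hypothetical helper names.

```python
def urls_of(tweet):
    """Return the expanded and unwound URLs of one tweet, without duplicates."""
    urls = []
    for entry in tweet.get("entities", {}).get("urls", []):
        # The short URL is skipped because it never contains the DOI itself.
        for key in ("expanded_url", "unwound_url"):
            value = entry.get(key)
            if value and value not in urls:
                urls.append(value)
    return urls


def candidate_urls(tweet, referenced_tweets=()):
    """URLs of the tweet itself first, then those of its referenced tweets."""
    candidates = urls_of(tweet)
    for referenced in referenced_tweets:
        for url in urls_of(referenced):
            if url not in candidates:
                candidates.append(url)
    return candidates
```

A caller would try to link each returned URL in order and stop at the first one that yields a DOI.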

Figure 1: Processing schema for linking a tweet with a publication DOI

The function that links a URL to a DOI is cached by a least recently used (LRU) cache holding up to 10,000 URLs. Caching this function requires very little storage, since both the function parameter and the result are small. The cache speeds up processing and reduces the number of requests. The LRU strategy keeps the most recently used entries in the cache. Since the link between a URL and a DOI is static and unlikely to change, cache entries do not need to expire over time.
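In Python, such a cache can be expressed with `functools.lru_cache` from the standard library. The sketch below assumes this decorator; whether the percolator uses it or another LRU implementation is not stated here. `resolve_uncached` is a hypothetical stand-in for the real linking logic and only records its calls so that the effect of the cache is visible.

```python
from functools import lru_cache

resolver_calls = []  # records every uncached lookup, for illustration only


def resolve_uncached(url):
    """Hypothetical placeholder for the expensive URL-to-DOI linking."""
    resolver_calls.append(url)
    return "10.1000/xyz123" if "10.1000/xyz123" in url else None


@lru_cache(maxsize=10000)
def link_url(url):
    """Cached wrapper: repeated URLs are answered without a new lookup."""
    return resolve_uncached(url)
```

Note that this form also caches a `None` result, so a URL that could not be linked is not retried while it remains in the cache.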

A URL to be linked is first checked against multiple regular expressions to extract potential DOIs. These regular expressions are based on those of CrossRef but extended to suit the publisher URLs registered in the system. Since the DOI specification is not very strict, a potential DOI needs to be verified.
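As an illustration of regex-based extraction, the sketch below uses a single pattern in the style of the one CrossRef recommends for modern DOIs. It is an assumption for demonstration purposes and does not reproduce the percolator's extended, publisher-specific patterns; `find_doi_candidates` is a hypothetical name.

```python
import re
from urllib.parse import unquote

# Pattern in the style of CrossRef's recommendation for modern DOIs.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:a-z0-9]+", re.IGNORECASE)


def find_doi_candidates(url):
    """Return potential DOIs found in a URL, in order of appearance."""
    decoded = unquote(url)  # DOIs in query strings are often percent-encoded
    candidates = []
    for match in DOI_PATTERN.finditer(decoded):
        # Drop trailing punctuation that is unlikely to belong to the DOI.
        candidate = match.group(0).rstrip("./;:")
        if "/" in candidate and candidate not in candidates:
            candidates.append(candidate)
    return candidates
```

For a URL such as `https://example.org/doi/full/10.1000/xyz123/figures`, the pattern returns `10.1000/xyz123/figures`, which is not the intended DOI. Cases like this are why each candidate still has to be verified.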

To confirm that a DOI exists, a request is sent to doi.org. By the DOI specification, an existing DOI links to the correct publication. If no DOI can be extracted from the URL using the regular expressions, a request is sent to the CrossRef Event API. This REST API allows checking whether CrossRef has already linked the URL to a publication. If still no DOI can be retrieved, the URL itself is requested. The response is then checked for a fixed list of meta tags. If one or more of these tags are found, their values are extracted and DOI linking is started on each of them. As a last resort, if all other methods fail, the full-text HTML is searched for a DOI.
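The fallback order described above can be sketched as a chain of strategies that is walked until one of them yields a verified DOI. All names here are hypothetical: each strategy is assumed to be a callable that takes a URL and returns a list of candidate DOIs (regex on the URL, CrossRef lookup, meta tags, full text), and `doi_exists` stands in for the check against doi.org.

```python
def link_url_to_doi(url, strategies, doi_exists):
    """Try each strategy in order and return the first verified DOI.

    strategies: callables mapping a URL to a list of candidate DOIs,
                ordered from cheapest to most expensive.
    doi_exists: callable returning True if a candidate DOI is registered.
    """
    for strategy in strategies:
        for candidate in strategy(url):
            if doi_exists(candidate):
                # Later, more expensive strategies are never invoked.
                return candidate
    return None
```

Passing the strategies and the verifier as arguments keeps the chain independent of any network access, so the order of the steps can be tested with simple stand-in functions.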

All publisher URLs are checked to ensure that they work in the system. However, there are articles and pages on publisher sites that are not registered with a DOI. The percolator cannot link tweets that point to these URLs.

Once a DOI for a tweet is found, the database is queried for the publication data. If the data can be retrieved, it is added to the event, and the event is set to linked and published. Otherwise, the event is set to unknown before it is sent to Kafka. If no DOI can be found for a tweet, the tweet is not processed further in the pipeline and its data is lost.
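The three outcomes can be summarized in a short sketch. The event is modelled as a plain dictionary, and the field names `state`, `doi` and `publication` as well as the `lookup_publication` callable are assumptions for illustration; they do not reflect the actual amba-event-stream event class or database API.

```python
def finalize_event(event, doi, lookup_publication):
    """Return the event to publish, or None if the tweet is dropped.

    lookup_publication: callable returning publication data for a DOI,
                        or None if the database has no entry for it.
    """
    if doi is None:
        return None  # no DOI found: the event leaves the pipeline

    result = dict(event)  # leave the incoming event unmodified
    result["doi"] = doi  # assumed field; kept so later stages know the DOI
    publication = lookup_publication(doi)
    if publication is not None:
        result["publication"] = publication
        result["state"] = "linked"
    else:
        result["state"] = "unknown"
    return result
```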