amba-analysis-worker-discussion

The discussion worker, more accurately the Twitter worker, is a component that transforms event data into processed data, most importantly the score used to rank and qualify a tweet. To that end, the features that allow scoring are extracted from the tweet, and a total score is calculated from all of them.

The scoring mechanism represents the impact of an event on the discussion of a publication and, therefore, on the publication itself. It is implemented in Python using the amba-event-streams package.

Hashtags, entities, author name and location, and the “top used tweet words” are extracted from a tweet but are not used to generate a score: this data is worth collecting but does not qualify the tweet. The author location is taken from the author data supplied by Twitter. It is a user-defined string that is geocoded using a free service[^1].

The total tweet score is a weighted sum of partial scores, multiplied by a factor determined by the tweet type. Although the type does not change the content of a tweet, it is a significant indicator of its impact. A retweet is the fastest way to tweet, but it adds no personal opinion and is much less likely to be seen by people in their timeline, so it receives the lowest factor. A quoted tweet and a reply have a better chance of being seen and add value to the discussion, so they receive higher factors. Finally, an original tweet has the highest factor since it is likely to start a discussion. The types and their respective factors are listed in Table 1.

The second-biggest influence on the impact of a tweet, and therefore the most heavily weighted partial score, is the tweet author. The number of followers who can see the tweet, whether the author is verified, and the result of the bot detection all contribute to the author score. Details of the individual scores can be seen in Tables 1 and 2.

| Type | Factor | Abstract Similarity | Score | Sentiment | Score | Followers |
|------------:|----:|--------------------------:|:----|----------------:|----:|-------------------------------------------:|
| quoted | 0.6 | \>0.9 | 3 | \>0.6 | 10 | $\log_2(\mathit{followers})$ |
| replied_to | 0.7 | \>0.8 | 5 | \>0.33 | 9 | |
| retweet | 0.1 | \>0.5 | 10 | \>0.1 | 7 | |
| tweet | 1 | \>0.2 | 3 | \<-0.1 | 2 | |
| | | else | 1 | \<-0.33 | 1 | |
| | | | | \<-0.6 | 0 | |
| | | | | else | 5 | |

Table 1: Meta Score Calculation
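The lookups in Table 1 can be sketched in Python as follows. This is a minimal illustration, not the worker's actual code: the function names are hypothetical, the sentiment thresholds are assumed to be checked from the most extreme value inwards on each side, and returning 0 for accounts without followers is an assumption to avoid taking the logarithm of zero.

```python
import math

# Type factors as listed in Table 1.
TYPE_FACTORS = {"quoted": 0.6, "replied_to": 0.7, "retweet": 0.1, "tweet": 1.0}


def abstract_similarity_score(similarity: float) -> int:
    """Map abstract similarity to a score, checking thresholds top to bottom."""
    if similarity > 0.9:
        return 3   # almost a copy of the abstract, adds little
    if similarity > 0.8:
        return 5
    if similarity > 0.5:
        return 10  # clearly about the publication, with own wording
    if similarity > 0.2:
        return 3
    return 1       # likely not about the publication


def sentiment_score(sentiment: float) -> int:
    """Map a sentiment in [-1, 1] to a score.

    Assumption: on each side the most extreme threshold is checked first.
    """
    if sentiment > 0.6:
        return 10
    if sentiment > 0.33:
        return 9
    if sentiment > 0.1:
        return 7
    if sentiment < -0.6:
        return 0
    if sentiment < -0.33:
        return 1
    if sentiment < -0.1:
        return 2
    return 5  # neutral


def followers_score(followers: int) -> float:
    """log2 of the follower count; 0 for accounts without followers (assumption)."""
    return math.log2(followers) if followers > 0 else 0.0
```

With these definitions, an account with 1024 followers gets a followers score of 10, and a neutral sentiment of 0 maps to 5.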

The tweet content is essential for scoring as well. To generate a content score, the length, the sentiment, and the similarity of the tweet to the publication abstract are considered. Sentiment and abstract similarity are calculated using the spaCy framework with eight language packages (de, es, en, fr, ja, it, ru, pl). For unknown languages, neutral values are returned. The text is preprocessed to improve results and performance: all stop words, words with fewer than three letters, URLs, and words that are not a noun (NOUN), proper noun (PROPN), or verb (VERB) are removed. The sentiment, which ranges from -1 (negative) to 1 (positive), is scored in buckets that disproportionately favor positive sentiment. The abstract similarity is more complicated to score: a high value is bad because the tweet adds nothing beyond the abstract, while a low value indicates that the tweet is likely not about the publication. The length is scored in three buckets: the first is just a link or a few words, the second is at most a sentence, and the third covers longer tweets. Exact values can be seen in Tables 1 and 2.
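The token filtering described above can be sketched as follows. The sketch relies only on the spaCy `Token` attributes `text`, `pos_`, `is_stop`, and `like_url`; the function names and the `stand_in` helper are hypothetical and exist only so that the example runs without a language model installed.

```python
from types import SimpleNamespace
from typing import Iterable, List

KEPT_POS = {"NOUN", "PROPN", "VERB"}  # parts of speech that are kept
MIN_WORD_LENGTH = 3                   # shorter words are dropped


def keep_token(token) -> bool:
    """Return True if the token survives the preprocessing rules."""
    return (
        not token.is_stop
        and not token.like_url
        and len(token.text) >= MIN_WORD_LENGTH
        and token.pos_ in KEPT_POS
    )


def preprocess(tokens: Iterable) -> List[str]:
    """Keep only the text of tokens that pass keep_token."""
    return [token.text for token in tokens if keep_token(token)]


def stand_in(text, pos, stop=False, url=False):
    """Hypothetical test double exposing the spaCy attributes used above."""
    return SimpleNamespace(text=text, pos_=pos, is_stop=stop, like_url=url)


sample = [
    stand_in("This", "PRON", stop=True),
    stand_in("paper", "NOUN"),
    stand_in("explains", "VERB"),
    stand_in("AI", "PROPN"),   # dropped: fewer than three letters
    stand_in("nicely", "ADV"),  # dropped: not a noun, proper noun, or verb
    stand_in("https://example.org/paper", "X", url=True),
]
print(preprocess(sample))  # ['paper', 'explains']
```

Since iterating over a spaCy `Doc` yields `Token` objects, the same `preprocess` function should also accept the result of a loaded pipeline, for example `preprocess(nlp(text))`.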

| Length | Score | Score Time | Bot | Score | Verified | Score |
|-------------:|----:|--------------------------------:|----------:|----:|---------------:|----:|
| \<50 | 3 | $(\log(X) / \log(1/7) + 3) \cdot 10$ | no Bot | 10 | verified | 10 |
| \<100 | 6 | score \<= 30 | Bot | 1 | not verified | 5 |
| else (\>=100) | 10 | score \>= 1 | | | | |

Table 2: Content Score Calculation
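The length, bot, and verified lookups from Table 2 can be sketched as follows. The function names are hypothetical, and measuring the length in characters is an assumption, since the unit of the thresholds is not stated here.

```python
def length_score(text: str) -> int:
    """Score the tweet length in three buckets.

    Assumption: the thresholds 50 and 100 refer to characters.
    """
    length = len(text)
    if length < 50:
        return 3   # a link or a few words
    if length < 100:
        return 6   # at most a sentence
    return 10      # longer tweets


def bot_score(is_bot: bool) -> int:
    """Accounts detected as bots score 1, all others 10."""
    return 1 if is_bot else 10


def verified_score(is_verified: bool) -> int:
    """Verified authors score 10, unverified authors 5."""
    return 10 if is_verified else 5
```

How these partial scores are combined into the author and content scores is not specified here, so the sketch stops at the individual lookups.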

Furthermore, a score is calculated from the time that has passed since the publication was published. It is based on studies showing that early sharing increases the future citation count, with the first week being the most crucial time range. The formula used can be seen in Table 2. The values are limited in both directions and always lie between 1 and 30. The score also helps to highlight new research. Its weight in the overall score is relatively low.
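Reading the Table 2 formula as (log(X) / log(1/7) + 3) · 10 with the result clamped to the range from 1 to 30, the time score can be sketched as follows. The function name is hypothetical, and two points are assumptions: that X is the number of days since publication, and that non-positive values receive the maximum score.

```python
import math

MIN_TIME_SCORE = 1.0
MAX_TIME_SCORE = 30.0


def time_score(days_since_publication: float) -> float:
    """Time score clamped to [1, 30].

    Assumption: X in the formula is the elapsed time in days.
    """
    if days_since_publication <= 0:
        # Assumption: tweets at (or before) publication get the maximum score.
        return MAX_TIME_SCORE
    raw = (math.log(days_since_publication) / math.log(1 / 7) + 3) * 10
    return max(MIN_TIME_SCORE, min(MAX_TIME_SCORE, raw))
```

Under this reading, the score is 30 after one day, about 20 after one week, and reaches the lower limit of 1 after roughly 7³ = 343 days, so each factor of seven in elapsed time costs 10 points.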

Finally, the total score of the tweet is calculated as the following weighted sum.

score = type_factor * (3 * score_time + 6 * score_user + 5 * score_content)
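Reading the formula as the type factor multiplied by the weighted sum, the final step can be sketched as follows. The function and constant names are hypothetical, and the three partial scores are taken as already computed inputs.

```python
# Type factors from Table 1 and weights from the formula above.
TYPE_FACTORS = {"quoted": 0.6, "replied_to": 0.7, "retweet": 0.1, "tweet": 1.0}
WEIGHT_TIME = 3
WEIGHT_USER = 6
WEIGHT_CONTENT = 5


def total_score(tweet_type: str, score_time: float,
                score_user: float, score_content: float) -> float:
    """Weighted sum of the partial scores, scaled by the tweet type factor.

    Raises KeyError for a tweet type that is not listed in TYPE_FACTORS.
    """
    weighted_sum = (
        WEIGHT_TIME * score_time
        + WEIGHT_USER * score_user
        + WEIGHT_CONTENT * score_content
    )
    return TYPE_FACTORS[tweet_type] * weighted_sum
```

For example, partial scores of 10 each give a weighted sum of 140, which an original tweet keeps in full while a retweet retains only a tenth of it.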

The score, the extracted data, and the features are then stored in the event, which is sent to Kafka after its status has been updated to processed.

[^1]: https://nominatim.openstreetmap.org; accessed 30-October-2021