NFT Datathon 2022 Goal Diggers Team


I. Business Understanding

Non-fungible tokens, better known as NFTs, are the latest phenomenon in the world of cryptocurrencies and are quickly gaining popularity. Simply put, NFTs turn digital works of art and other collectibles into one-of-a-kind, verifiable assets that can be traded on a blockchain, the digital ledger behind cryptocurrencies such as Bitcoin and Ethereum. Cryptocurrencies are interchangeable, which means that one unit can be traded or exchanged for another unit of the same value. NFTs, on the other hand, are unique and non-interchangeable, which means that no two NFTs are the same. Anyone can create an NFT. All you need is a digital wallet, some Ether (ETH) and a connection to an NFT marketplace, where you can upload your content and turn it into an NFT or crypto art.

The market is defined by uniqueness: unlike familiar currencies such as USD, BGN, EUR or Bitcoin, each NFT is one of a kind and has its own characteristics, which makes its price harder to determine. The same circumstances lead to high price volatility: for example, under the influence of a trend on social networks, demand can rise dramatically, which can lead to speculation on the value of the asset.

II. Data Understanding and Data Exploration

We are provided with an exciting and complex selection of data for the case. Three data sets are available:

  • Sales – containing information about NFT transactions for the period April 2021 – March 2022

Get to know the data: it contains 257,434 unique transactions recorded by block number and timestamp, divided into 8 collections. There are 73,795 unique senders (variable “from”) and 94,964 unique recipients (variable “to”), i.e. fewer distinct sellers than buyers, which may indicate the formation of larger merchants. Two currencies are available, ETH and USD; in the next stage we keep only the observations in ETH so that prices are comparable. In terms of prices, there are 25,495 unique values in the amount column and 201,565 in gasPrice, from which it can be concluded that many transactions take place at identical or similar amounts in ETH. Each contract address corresponds to one collection, for a total of 8 unique contract addresses.
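The counts above could be reproduced along the following lines. This is only a minimal Python sketch on a tiny invented sample, not the team's actual code: the column names from, to and amount follow the text, while the currency column name and all sample rows are assumptions made for illustration.

```python
# Toy rows standing in for the Sales data set (all values are invented).
sales = [
    {"from": "0xA", "to": "0xB", "amount": 1.5, "currency": "ETH"},
    {"from": "0xA", "to": "0xC", "amount": 1.5, "currency": "ETH"},
    {"from": "0xB", "to": "0xC", "amount": 3000.0, "currency": "USD"},
]


def unique_count(rows, column):
    """Number of distinct values found in one column."""
    return len({row[column] for row in rows})


def keep_currency(rows, currency="ETH"):
    """Keep only rows priced in the given currency so amounts are comparable."""
    return [row for row in rows if row["currency"] == currency]


eth_sales = keep_currency(sales)
summary = {
    "senders": unique_count(sales, "from"),
    "recipients": unique_count(sales, "to"),
    "eth_rows": len(eth_sales),
    "unique_eth_amounts": unique_count(eth_sales, "amount"),
}
print(summary)
```

Comparing the number of ETH rows with the number of unique ETH amounts gives a quick sense of how often the same price repeats.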

Collections: the collection names are spelled differently across the data sets and should be standardized. The date format in the timestamp column also needs to be adjusted.
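Both clean-up steps can be sketched in a few lines of Python. The sketch assumes that the spelling differences are limited to case, spaces and punctuation, and that the timestamp column holds Unix seconds; neither is confirmed by the data description, and the spelling variants shown are hypothetical.

```python
import re
from datetime import datetime, timezone


def normalize_collection(name):
    """Lower-case a collection name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def parse_timestamp(seconds):
    """Convert a Unix timestamp (seconds, an assumption) to a UTC datetime."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


# Hypothetical spelling variants of the same collection map to one key.
print(normalize_collection("Cool Cats"), normalize_collection("cool_cats"))
print(parse_timestamp(1617235200).isoformat())
```

The normalized name can then serve as the join key between the Sales, Traits and Tweets data sets.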

  • Traits – descriptive characteristics of the NFTs by collection

Get to know the data: there are 202,265 unique records distributed across 8 collections. Notably, there are 24,212 unique tokens, described by 55 different trait types that vary by collection.

Looking at the trait names (trait_type), we observe similar spellings that need to be grouped under common categories to simplify the analysis. The maximum token number (tokenID) in Traits is lower than in the Sales data set for the CloneX (18,895 versus 19,127) and Cool Cats (9,932 versus 9,940) collections. This may be due to differences in sampling periods.

  • Tweets – comments from the social network, with reactions

Get to know the data: the last data set contains 3,081 tweets from the creators of the 8 collections.

The columns userscreenname and username contain the names of the collections (8 unique values), but the column user has only 7 unique records. This suggests inconsistencies in the data, so that column is not suitable for use in the rest of the analysis. Here again, the collections are spelled differently, and the differences must be removed before the data sets can be merged.

III. Feature Engineering

    1. Unique transactions, used as parent transactions. We create a key that connects the seller (from), the buyer (to) and the collection (tokenName) to the specific time of the transaction. Based on that key, we derive additional variables as lead functions of a new key built on recipient and sender: from-to and to-from. This lets us identify related transactions and self-transactions, which create false volume even when the price (amount) is greater than 0. In this way, 3 types of transactions are distinguished: self-transactions, related transactions and actual sales. Only actual sales are included in the rest of the analysis and used to define the price. Here is an example of an artificially created price increase with self-transactions:
    2. Rarity index: we create 2 types of weights to assess the rarity of an NFT. We take the serial number of the NFT in the collection (tokenID) and relate it to the total number of NFTs in the collection known at the moment. The first weight is 1 divided by the total number of NFTs in the collection. The second weight is the serial number of the NFT within its collection divided by the total number of NFTs in that collection. Our goal is to test whether the serial number and weight of the NFT affect the price, and whether the size of the collection affects price formation. Here is an example of an exceptional outlier:
    3. Determining the tradable volume, using the two variables below:
    4. Total sales volume: shows the traded volume regardless of the type of transaction and of whether the traded price is 0.
    5. Real sales volume: zero-price transactions, self-transactions and related transactions are not counted. Only real sale transactions are used here.
    6. Additional analysis of the Traits data set to combine similar characteristics. We noticed that some trait types are similar to one another and decided to combine them into one category. Here are the initial features:
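Steps 1 and 3 to 5 above could be sketched roughly as follows. The write-up does not spell out the exact rules, so the definitions used here are assumptions: a self-transaction has the same sender and recipient, a related transaction is one whose reverse pair (to-from) also occurs in the same collection, and volume is measured as a count of transactions. The sample rows are invented.

```python
def classify(transactions):
    """Label each transaction as 'self', 'related' or 'actual'.

    Assumed rules (not confirmed by the write-up):
      self    - sender and recipient are the same address
      related - the reverse pair (to -> from) also occurs in the same collection
      actual  - everything else
    """
    pairs = {(t["tokenName"], t["from"], t["to"]) for t in transactions}
    labels = []
    for t in transactions:
        if t["from"] == t["to"]:
            labels.append("self")
        elif (t["tokenName"], t["to"], t["from"]) in pairs:
            labels.append("related")
        else:
            labels.append("actual")
    return labels


def volumes(transactions):
    """Return (total volume, real volume), both as transaction counts."""
    labels = classify(transactions)
    total = len(transactions)
    real = sum(
        1
        for t, label in zip(transactions, labels)
        if label == "actual" and t["amount"] > 0
    )
    return total, real


sample = [
    {"tokenName": "Meebits", "from": "A", "to": "A", "amount": 5.0},
    {"tokenName": "Meebits", "from": "A", "to": "B", "amount": 2.0},
    {"tokenName": "Meebits", "from": "B", "to": "A", "amount": 2.5},
    {"tokenName": "Meebits", "from": "C", "to": "D", "amount": 0.0},
    {"tokenName": "Meebits", "from": "C", "to": "E", "amount": 1.0},
]
print(classify(sample))
print(volumes(sample))
```

A stricter version would also restrict the reverse-pair check to a time window, in the spirit of the lead functions described in step 1.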

After reallocating them into new categories:
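A reallocation of this kind can be expressed as a simple lookup table. The sketch below is illustrative only: the raw trait names and the target categories are hypothetical and do not reproduce the team's actual grouping.

```python
# Hypothetical mapping from raw trait_type spellings to a common category.
TRAIT_GROUPS = {
    "hat": "headwear",
    "hats": "headwear",
    "headwear": "headwear",
    "eye": "eyes",
    "eyes": "eyes",
    "eyewear": "eyes",
}


def group_trait(trait_type):
    """Map a raw trait_type to its category; unknown traits keep their cleaned name."""
    key = trait_type.strip().lower()
    return TRAIT_GROUPS.get(key, key)


raw = ["Hat", " hats ", "Eyewear", "Background"]
print(sorted({group_trait(t) for t in raw}))
```

Keeping unmapped traits under their own cleaned name makes it easy to spot categories that still need a manual decision.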

    7. Next steps: evaluating crypto market volatility using external data sources.
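The two rarity weights from step 2 above reduce to two divisions, sketched here in Python. The function name is hypothetical, and the collection size of 20,000 used in the example is an assumed value for illustration; in the analysis it would be the number of NFTs known in the collection at the moment.

```python
def rarity_weights(token_id, collection_size):
    """Return (collection weight, token weight) for one NFT.

    collection weight = 1 / total number of NFTs in the collection
    token weight      = serial number (tokenID) / total number of NFTs
    """
    if collection_size <= 0:
        raise ValueError("collection_size must be positive")
    return 1 / collection_size, token_id / collection_size


# Example with an assumed collection size of 20,000 NFTs.
collection_weight, token_weight = rarity_weights(8319, 20000)
print(collection_weight, token_weight)
```

The first weight is identical for every NFT in a collection, so it captures collection size, while the second varies with the serial number.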

IV. Modelling and Evaluation

  1. Sentiment analysis of tweet data

We use several dictionaries to learn more about our tweet data:

- AFINN: assigns a weight to each word in the posts. The sum over all words is -1,460, which suggests that negative words (including obscene ones) predominate in the posts. The most positive words include: breathtaking, hurrah, superb, thrilled, outstanding.

- Bing: divides the words in the posts into positive and negative. There are 4,781 negative words in the posts and only 2,005 positive ones.

- NRC: extracts different emotions from the posts. Here again, negative words predominate, but individual emotions such as anger, disgust and sadness can now be distinguished.
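The first two dictionary approaches can be illustrated with a short Python sketch. The lexicons below are tiny toy stand-ins with illustrative values, not the real AFINN or Bing lists, and the tweets are invented. An NRC-style analysis would work the same way with a word-to-emotion mapping.

```python
import re
from collections import Counter

# Toy lexicons for illustration only; the real word lists are far larger.
AFINN_TOY = {"superb": 5, "thrilled": 5, "good": 3, "bad": -3, "scam": -2}
BING_TOY = {
    "superb": "positive",
    "thrilled": "positive",
    "good": "positive",
    "bad": "negative",
    "scam": "negative",
}


def tokenize(text):
    """Split a tweet into lower-case words."""
    return re.findall(r"[a-z']+", text.lower())


def afinn_score(tweets):
    """AFINN-style score: sum of the weights of all known words."""
    return sum(AFINN_TOY.get(w, 0) for t in tweets for w in tokenize(t))


def bing_counts(tweets):
    """Bing-style result: number of positive and negative words."""
    return Counter(BING_TOY[w] for t in tweets for w in tokenize(t) if w in BING_TOY)


tweets = ["Superb drop, thrilled!", "Bad bad scam"]
print(afinn_score(tweets))
print(bing_counts(tweets))
```

The two views can disagree: here negative words outnumber positive ones, yet the weighted sum stays positive because the positive words carry larger weights.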


  2. Linear regression and ARIMA models

Our attention is drawn to two main models: linear regression and ARIMA. Linear regression (LR) helps us understand the basic concepts and dependencies of NFT prices. Because of limited computational power, the LR is built for a single NFT, token_id 8319 of the Meebits collection. This NFT was chosen for its number of transactions (more than 700, of which 278 are real). The explanatory variables are the weight of the whole collection, the weight of the single unit, total volume and real volume; price is the response variable. We propose adding trait information per NFT to the model, because we expect the NFT's characteristics to have a strong influence. A score per influencer, obtained from the tweet data set, could then be added as well. The results are: multiple R-squared 0.6976, adjusted R-squared 0.692, and residual standard error 27.43 on 272 degrees of freedom.
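The reported figures can be checked against each other with the standard formula for adjusted R-squared, 1 - (1 - R²)(n - 1) / (residual degrees of freedom). The short Python sketch below assumes that the model was fitted on the n = 278 real transactions; that sample size is an assumption, not something the write-up states.

```python
def adjusted_r2(r2, n_obs, df_resid):
    """Adjusted R-squared from R-squared, sample size and residual degrees of freedom."""
    return 1 - (1 - r2) * (n_obs - 1) / df_resid


# Reported values: R-squared 0.6976 on 272 residual degrees of freedom.
# n_obs = 278 is an assumption (the number of real transactions).
adj = adjusted_r2(0.6976, n_obs=278, df_resid=272)
print(round(adj, 4))
```

Under that assumption the result agrees with the reported 0.692, and the gap between 278 observations and 272 residual degrees of freedom would correspond to six estimated coefficients.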

V. Further investigation

Our suggestions for further investigations are:

  1. Using tweet information: extract a score per user, which could be used in the main model.
  2. Based on the traits information, we could analyse the given characteristics. Feature engineering is a must because the NFTs' descriptions differ. New variables built from the tweet data should improve the model's results.
  3. Development of the main model

We recommend finalizing the feature engineering from the previous steps, then building a linear regression model with the presented variables to better understand the data. After that, an ARIMA model could be used to predict future prices, taking trends and seasonality into account and smoothing out the variance in the data.
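To make the ARIMA idea concrete, here is a deliberately simplified, hand-rolled sketch in plain Python. It is not a full ARIMA implementation: it covers only an ARIMA(1,1,0)-style model without a constant, fitted by least squares, with no seasonality and no diagnostics. A real analysis would rely on a statistics library. The price series is invented.

```python
def difference(series):
    """First-order differencing: the 'I' step with d = 1."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(diffs):
    """Least-squares AR(1) coefficient with no intercept: d[t] ~ phi * d[t-1]."""
    num = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    den = sum(prev * prev for prev in diffs[:-1])
    return num / den if den else 0.0


def forecast(series, steps):
    """Forecast `steps` values ahead with the simplified ARIMA(1,1,0)-style model."""
    diffs = difference(series)
    phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    out = []
    for _ in range(steps):
        last_diff = phi * last_diff  # AR(1) prediction of the next difference
        level = level + last_diff    # integrate back to the price level
        out.append(level)
    return out


prices = [10.0, 12.0, 13.0, 13.5, 13.75]  # invented ETH prices
print(forecast(prices, 3))
```

Differencing removes the trend before the autoregressive part is fitted, which is why the forecast is built on price changes and then added back to the last observed price.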


2 thoughts on “NFT Datathon 2022 Goal Diggers Team”

  1. In many cases data preparation is a key step, and in this case the data engineering, the data cleaning and the related analysis that you made are very important steps.
    In addition, I appreciate your attempt to use sentiment analysis.
    And finally, I like your well-structured work :).