Team “Non-Fungible Primates”
PART I: Predicting NFT prices
Business Understanding
The non-fungible token market is illiquid. Currently there is no solution for real-time pricing of collections and individual tokens. Nexo is positioned as a leading lending provider for digital assets. Until recently it relied on crypto collateral for the provision of a loan. However, it recently launched a product that allows NFTs (BAYC, CryptoPunks, etc.) to serve as collateral. To prevent losses in a shortfall event, the loan provider purchases put options on the NFTs, shifting the risk to the broker in exchange for the contract premium. This is done primarily to ensure a loss floor in the case of a liquidation of the NFT. By developing an analytical solution that provides real-time pricing for a specific NFT based on a rarity index, liquidity, and sentiment from social media such as Twitter, Nexo can further increase its profit.
Data Understanding
The data are focused on eight NFT collections: BAYC, MAYC, Azuki, CloneX, Doodles, Cool Cats, Meebits, and Pudgy Penguins. We are given the 8 contract addresses that correspond to the collections, as well as the public addresses of the sender/receiver of each transaction. For the sake of explanation and visualization, we will focus on a single collection, bearing in mind that the same pipeline can be applied to every collection. We look at the Bored Ape Yacht Club in the following section.
These graphs allow us to infer some basic properties of the variables in the dataset by looking for trend, seasonality, and correlation between features. We can observe seasonality in the Daily Traded Volume in USD, triggered by decreases in the price of Ethereum. It is difficult to infer anything about the Average Daily Price from its graph, because the series appears stationary, with values resembling draws from a normal distribution.
Data Preparation
We are provided with 3 datasets:
1) NFT Sales Dataset
2) NFT Traits Dataset
3) NFT Tweets Dataset
We add one more dataset which provides general market information about the price of Ethereum.
4) Historical Ethereum Price Dataset
We perform multiple operations on each of the datasets, then combine them into one single dataset which we use for training and evaluation.
First, we take the sales dataset and categorize each transaction as Mint, Transfer, or Completed Transaction. We keep only the Completed Transactions, disregarding Mints and Transfers. We could also remove transactions whose anomalous sell prices suggest wash trading, but we won't delve into such detail due to time limitations (it can be done with the Etherscan API). Then we narrow the set further by selecting only transactions for the BAYC collection.
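A minimal sketch of this step in pandas; the column names (`from_address`, `amountUsd`, `contract_address`) and the categorization heuristic are our assumptions, not the dataset's actual schema:

```python
import pandas as pd

# Hypothetical schema -- adjust column names to the actual sales dataset.
ZERO_ADDR = "0x0000000000000000000000000000000000000000"
BAYC_CONTRACT = "0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d"

sales = pd.read_csv("nft_sales.csv")

# Heuristic categorization: a mint originates from the zero address, a
# transfer carries no sale amount, and everything else with a positive
# price counts as a completed transaction.
def categorize(row):
    if row["from_address"] == ZERO_ADDR:
        return "Mint"
    if pd.isna(row["amountUsd"]) or row["amountUsd"] == 0:
        return "Transfer"
    return "Completed Transaction"

sales["tx_type"] = sales.apply(categorize, axis=1)

# Keep only completed transactions for the BAYC collection.
bayc_sales = sales[
    (sales["tx_type"] == "Completed Transaction")
    & (sales["contract_address"].str.lower() == BAYC_CONTRACT)
]
```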
Second, we take the preprocessed sales dataset and compute daily min, max, average, count and unique count of NFTs traded.
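Continuing the sketch, these daily aggregates reduce to a single groupby (`timestamp` and `token_id` are again assumed column names):

```python
# Daily min, max, and average price, trade count, and unique-token count.
daily = (
    bayc_sales
    .assign(date=pd.to_datetime(bayc_sales["timestamp"]).dt.floor("D"))
    .groupby("date")
    .agg(
        min_price=("amountUsd", "min"),
        max_price=("amountUsd", "max"),
        avg_price=("amountUsd", "mean"),
        n_trades=("amountUsd", "count"),
        n_unique_tokens=("token_id", "nunique"),
    )
)
```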
Third, we combine the dataset described in point two with the daily price, volatility, and change for Ethereum, so that we can see how the price of an NFT correlates with that of ETH.
Fourth, we take the average daily likes, comments, and tweets from the Twitter dataset so that we can see how media presence influences the price of the NFT collection.
Fifth, using the formula shown below, we compute a rarity index for each of the NFTs within the BAYC collection.
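In words, the score of a trait value is the reciprocal of that value's frequency within the collection (the reciprocal form the comments below refer to), and an NFT's rarity index sums the scores of its traits:

\[
\mathrm{RarityScore}(t) = \frac{1}{\operatorname{count}(t)/N},
\qquad
\mathrm{RarityIndex}(\mathrm{NFT}) = \sum_{t \in \mathrm{traits}(\mathrm{NFT})} \mathrm{RarityScore}(t),
\]

where $\operatorname{count}(t)$ is the number of NFTs carrying trait value $t$ and $N$ is the collection size, so rarer trait values contribute larger scores.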
Finally, we combine all the aforementioned features into a single dataset containing each transaction over a one-year period, together with features that capture media presence, the visual traits of an NFT, general market characteristics, and general collection characteristics. For each feature we account for time by taking an exponential moving average over the past thirty days, as there is empirical evidence in support of using averages of values that closely precede the transaction.
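As a sketch of the smoothing step (span=30 is our reading of "past thirty days"; the column names are illustrative):

```python
# 30-day exponential moving average of each daily feature, so that each
# transaction sees a smoothed view of the recent past rather than raw dailies.
feature_cols = ["avg_price", "n_trades", "volume_usd", "avg_likes"]  # illustrative
ema30 = daily[feature_cols].ewm(span=30, adjust=False).mean().add_suffix("_ema30")
```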
Further Data Exploration
Our variable of interest in the dataset is called amountUsd, because we want to estimate the price at which we can sell an NFT in dollars, the most liquid denomination. We take a look at how the different features correlate with the variable of interest:
This correlation column between amountUsd and all other variables provides useful insight into which features to use in the model. The 30-day exponential moving averages of the Daily Volume in USD, of the Average Price of the Collection, and of media metrics such as likes, comments, and retweets are closely correlated with the price. This corroborates some theoretical hypotheses regarding the influence of social media on NFTs, and how they are affected by liquidity and the prices of other assets within the collection.
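A minimal sketch of this step, assuming Pearson correlation (the pandas default) and using `combined` as a stand-in name for the merged dataset; the 0.30 cutoff anticipates the feature selection described in the Modeling section:

```python
# Correlation of every numeric feature with the target (Pearson by default).
corr = combined.corr(numeric_only=True)["amountUsd"].sort_values(ascending=False)

# Features retained for modeling: absolute correlation above 0.30.
selected = corr[corr.abs() > 0.30].index.drop("amountUsd")
```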
Modeling
1) For the sake of our analysis, we chose the exact USD amount of an NFT as the target variable. To estimate it, we experimented with an XGBoost Tree Regressor. After an 80%/20% train/test partition, we applied min-max normalization. We then used a correlation matrix to decide on the features for the model, selecting those with an absolute correlation above 0.30. Next, we performed parameter optimization with cross-validation to find the best value for the boosting-rounds parameter: after training the model on 9 folds, we used the left-out fold for prediction, scoring amountUsd against Prediction (amountUsd), and calculated the MSE across folds. Finally, we trained and scored the model on the full training dataset with the optimal number of boosting rounds.
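Outside of KNIME, the same workflow could be sketched with the `xgboost` package; parameter values such as `eta` and the 500-round search budget are illustrative, and `combined`/`selected` come from the sketch above:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = combined[selected].to_numpy()
y = combined["amountUsd"].to_numpy()

# 80%/20% train/test partition, then min-max normalization fitted on train only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
mm = MinMaxScaler().fit(X_train)
dtrain = xgb.DMatrix(mm.transform(X_train), label=y_train)

# 10-fold CV over the number of boosting rounds: train on 9 folds,
# score the held-out fold, and average the error across folds.
params = {"objective": "reg:squarederror", "eta": 0.1}
cv = xgb.cv(params, dtrain, num_boost_round=500, nfold=10,
            metrics="rmse", early_stopping_rounds=20)
best_rounds = len(cv)

# Retrain on the full training set with the optimal round count; report test MSE.
model = xgb.train(params, dtrain, num_boost_round=best_rounds)
pred = model.predict(xgb.DMatrix(mm.transform(X_test)))
mse = float(np.mean((pred - y_test) ** 2))
```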
2) The second model we used is a Tree Ensemble Regressor. We wanted to see whether an ensemble modelling technique would give us a higher R^2, using the selected aggregation mode in KNIME to aggregate the votes of the individual decision trees. After splitting into training and test sets, we applied Z-score normalization and removed the rows containing outliers. Finally, we used the Tree Ensemble Learner (Regression) and Predictor nodes to produce the numeric scores.
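KNIME's Tree Ensemble (Regression) learner behaves essentially like a random forest, so a rough scikit-learn analogue (not the exact KNIME computation), reusing the split from the previous sketch, is:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

# Z-score normalization, as in the KNIME workflow.
z = StandardScaler().fit(X_train)

# Each decision tree votes; the ensemble aggregates the votes (here: the mean).
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(z.transform(X_train), y_train)
print("R^2:", r2_score(y_test, rf.predict(z.transform(X_test))))
```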
The trialed regression models aim to predict the exact transaction price for a given NFT by modelling its time series as a function of all the devised numerical features from our combined dataset. Although this approach succeeds in capturing the general trend to some extent, it falls short in incorporating the volatility (variance) of price changes, caused by the extensive noise present in the time series of the major cryptocurrencies (ETH/BTC) and by discrepancies driven by exogenous factors such as related social-media activity and hype around certain visual traits.
The conclusion we came to, given our results and some research on ARMA and GARCH process modelling, is that a more elaborate solution would model the daily price change itself with a linear regression model with ARMA-GARCH errors, to account for the variance in price change at the current and preceding time steps.
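A minimal sketch of that proposal with the `arch` package (an AR(1) mean with GARCH(1,1) errors on daily percentage changes; the lag orders are illustrative):

```python
from arch import arch_model

# Model the daily price *change* rather than the level; `daily["avg_price"]`
# comes from the aggregation sketch above.
returns = daily["avg_price"].pct_change().dropna() * 100  # in percent

# AR mean with GARCH(1,1) conditional variance; exogenous regressors
# (media metrics, ETH features) could be passed via the `x=` argument.
am = arch_model(returns, mean="ARX", lags=1, vol="GARCH", p=1, q=1)
res = am.fit(disp="off")

# One-step-ahead forecast of both the expected change and its variance.
fc = res.forecast(horizon=1)
print(fc.mean.iloc[-1, 0], fc.variance.iloc[-1, 0])
```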
Evaluation
PART II: Image search for profile-picture NFTs by Twitter username
Preliminary solution
We have implemented functionalities performing the following pipeline stages:
- Extracting via Twitter API v2 the profile picture for a given Twitter username in original (or normal downscale) size;
…
- Extracting a small subset database of NFT images for a range of unique assets from the `NFT+Hackathon+Traits+Dataset.csv` dataset;
(Here we chose a range of ±50 token_ids around asset 4418 in the BAYC NFT collection.)
- Matching the target profile picture within the extracted BAYC NFT subset database via minimal cosine distance between AlexNet final-layer dense representations of the target and database NFT images (a minimal sketch follows this list);
(At scale we plan to optimize this further by first looking for the best-matching, i.e. smallest-distance, average dense representation of a whole collection, so as to identify the target collection to look up against, possibly in a distance-ordered similarity matrix.)
- Extracting the latest successful transaction for the given NFT via the Etherscan API, given the issuer contract and token_id of the matching-image NFT from step 3;
(This works to some extent but is not yet robust.)
- Additionally, we tested the extraction of NFT images from (re)tweet URLs extracted from `hackathon_tweets.csv`;
- We are still considering a similar NFT image recommendation system.
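A minimal sketch of the matching step with PyTorch/torchvision; the image paths and the `db_paths` list are hypothetical, and the "final-layer dense representation" is taken to mean AlexNet's last 4096-unit hidden layer:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained AlexNet with the 1000-way classification layer removed, so the
# network outputs the 4096-dimensional dense representation of an image.
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
embedder = torch.nn.Sequential(
    alexnet.features, alexnet.avgpool, torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:-1],
)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return embedder(x).squeeze(0)

def best_match(profile_path: str, db_paths: list[str]) -> str:
    # Maximal cosine similarity == minimal cosine distance.
    target = embed(profile_path)
    sims = [torch.nn.functional.cosine_similarity(target, embed(p), dim=0)
            for p in db_paths]
    return db_paths[int(torch.stack(sims).argmax())]
```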
Deployment (optional)
3 thoughts on “Team NFPs”
A few suggestions for your article/presentation:
nit: Rarity Score: Don’t use the reciprocal of a fraction
Further Data Exploration: what method did you use for finding the correlations?
Predicting uncertainty: How would you incorporate the uncertainty prediction in your score?
Results: Show how significant the error is compared to the NFT price.
Twitter Image Search: How would you incorporate this research into your price prediction algorithm?
Good job! Looking forward to your presentation!
– business case and data understanding – 6
– data exploration – 5.5
– methods – 5
– rarity score – 5
– image handling – 5
Thank you for the good introduction. Also, the data overview and exploratory analysis look good.
question: How did you split the data into train/test in terms of time periods? Did you take random points or a longer period for testing?
It would be nice to compare the reported results to a simple baseline, for a better perspective of the achieved results.
For image similarity it would be nice to present some results with a sample set of images.