NFT_Hackathon_RarityScore
Approach to the task. Steps
1. Tweet Data
Reviewing the sales data, we saw outliers in the transaction counts of some collections. The problem was solved by deleting zero-value transactions.
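As a minimal sketch of this cleaning step in pandas (the file name and the amountusd column name are assumptions based on the code further below):

import pandas as pd

# Assumed file and column names; adjust to the actual sales dataset.
sales = pd.read_csv('hackathon_sales.csv')
sales = sales[sales['amountusd'] > 0]   # drop the zero-value transactions that caused the outliers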
We check the correlation between the retweet sums (daily sum, cumulative sum, cumulative rolling sum) and the daily sum of the USD transaction amounts per collection. We concentrate on retweets because reposting shows greater engagement and has a bigger influence on spreading the word about NFTs.
There are 8 different users (studios) in the tweet data, which we link to the 8 NFT collections. MAYC (Mutant Ape Yacht Club) has no tweets of its own, or is mentioned together with Bored Ape Yacht Club.
Most tweets are for the clone_x collection (from RTFKT Studios).
import numpy as np
import pandas as pd

# Load the tweet data and count tweets per user (studio).
tweet = pd.read_csv('hackathon_tweets.csv')
# print(tweet)
userscreenname = tweet['userscreenname'].value_counts()
print(userscreenname)
RTFKT Studios                1393
Cool Cats                     649
MeebitsDAO (minting now!)     497
Bored Ape Yacht Club          270
doodles                       246
Pudgy Penguins                 13
Azuki                          12
Penguins                        1
Name: userscreenname, dtype: int64

Because we base our Twitter analysis on summed values, we choose to work with the collection 'clone_x' (1393 observations). We then obtain the sum of retweets per day, the cumulative sum of retweets, and the cumulative rolling sum over a 10-day horizon (R code):

clone_x_tweet_agg_sum <- aggregate(retweets ~ timestamp, data = clone_x_tweet, FUN = sum)
clone_x_tweet_agg_cum <- clone_x_tweet_agg_sum %>%
  mutate(cumsum = cumsum(retweets)) %>%
  mutate(cum_rolling_10 = rollapplyr(retweets, width = 10, FUN = sum, partial = TRUE))

The result is an object with 411 rows and 4 columns ('timestamp', 'sum of daily retweets', 'cumulative sum of retweets', and 'cumulative rolling sum over a 10-day horizon'). We then calculate the so-called 'daily sum of transactions' for the collection clone_x, which is the sum of all NFT transactions for each particular day (from the sales file). We get an object with 107 rows and 2 columns ('timestamp' and 'daily sum of transactions' in USD). Next we merge the two data sets by the column 'timestamp' and get a new data set that contains only the common days. This data set is of size 92x5 (the columns are 'timestamp', 'sum of daily retweets', 'cumulative sum of retweets', 'cumulative rolling sum over a 10-day horizon', and 'daily sum of transactions'). Finally, we calculate the correlation of the Twitter-related columns with the 'daily sum of transactions' for the collection:
#Corr. sum of daily retweets and daily sum of USD amount
cor(clon_x_mrd_sum$retweets, clon_x_mrd_sum$amountusd) # 0.6384357
cor(log(clon_x_mrd_sum$retweets), log(clon_x_mrd_sum$amountusd)) # 0.3564773
#Corr. cumsum retweets and daily sum of USD amount
cor(clon_x_mrd_sum$cumsum, clon_x_mrd_sum$amountusd) # -0.362008
cor(log(clon_x_mrd_sum$cumsum), log(clon_x_mrd_sum$amountusd)) # -0.5066663
#Corr. cum_rolling_10 retweets and daily sum of USD amount
cor(clon_x_mrd_sum$cum_rolling_10, clon_x_mrd_sum$amountusd) # 0.2243003
cor(log(clon_x_mrd_sum$cum_rolling_10), log(clon_x_mrd_sum$amountusd)) # 0.3191343
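For readers working in Python rather than R, a rough pandas sketch of the same aggregation, merge, and correlation steps. The dataframe names (tweet, sales) and the assumption that both are already restricted to the clone_x collection are ours; the column names (timestamp, retweets, amountusd) follow the R code above:

import numpy as np
import pandas as pd

# Assumed inputs: 'tweet' and 'sales' restricted to the clone_x collection,
# each with a daily 'timestamp' column.
retweets_daily = tweet.groupby('timestamp', as_index=False)['retweets'].sum()
retweets_daily['cumsum'] = retweets_daily['retweets'].cumsum()
retweets_daily['cum_rolling_10'] = (retweets_daily['retweets']
                                    .rolling(window=10, min_periods=1).sum())

sales_daily = sales.groupby('timestamp', as_index=False)['amountusd'].sum()

# Keep only the common days, as in the R merge.
merged = retweets_daily.merge(sales_daily, on='timestamp')

# Correlations of the Twitter-based columns with the daily sum of transactions.
for col in ['retweets', 'cumsum', 'cum_rolling_10']:
    print(col, np.corrcoef(merged[col], merged['amountusd'])[0, 1])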
A high correlation between the daily sum of retweets and the daily sum of transactions is detected: 0.6384357. We can now try to build a time-series model with an exogenous variable (the daily sum of retweets) that explains the daily sum of transactions for a collection (clone_x). To this end we first check the type of process that is thought to generate the daily sum of transactions, using the R function auto.arima(). It suggests an ARIMA(0, 1, 2) process. The output of the ARIMA(0, 1, 2) model with the exogenous variable is:

Coefficients:
          ma1      ma2   retweets
      -0.3704  -0.2593  4972.2122
s.e.   0.1107   0.1097   789.5406

The coefficients are statistically significant. However, we also try an alternative model, ARIMA(1, 1, 1) with the exogenous variable:

Coefficients:
         ar1      ma1   retweets
      0.5698  -0.9277  5158.4020
s.e.  0.1361   0.0763   684.7399

The coefficients are statistically significant. Based on the AIC and the RMSE we conclude that the alternative model is better (nft_tweets_R).
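The model above was fitted in R; as a rough Python equivalent, a minimal statsmodels sketch, assuming the merged 92-day data set is available as the dataframe merged with the column names used above:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# ARIMA(1, 1, 1) with the daily retweet sum as an exogenous regressor.
model = SARIMAX(merged['amountusd'], exog=merged[['retweets']], order=(1, 1, 1))
fit = model.fit(disp=False)
print(fit.summary())   # coefficient estimates and standard errors
print(fit.aic)         # for comparison with the ARIMA(0, 1, 2) alternative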
2. Calculating the Rarity Score:
To calculate NFT rarity, we use the following formula:
Rarity Score for a Trait = 1 / (Number of NFTs with that trait / Total number of NFTs in the collection)
To calculate the overall rarity score of an NFT, we use the following formula:
Overall Rarity Score = Σ (Rarity Score for trait i), summed over all traits i of the NFT
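For example, if 25 out of 10,000 NFTs in a collection have a given trait value, that trait's rarity score is 1 / (25 / 10,000) = 400; the NFT's overall score is the sum of such values over all of its traits.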
To calculate these scores, we go through the following steps:
1. Count the NFTs grouped by collection, trait_type, and value:
,count(*) over (partition by [collection],[trait_type],[value]) as [TotalTraitValue]
,count(*) over (partition by [collection],[trait_type]) as [TotalTraitType]
,count(*) over (partition by [collection]) as [TotalCollection]
2. Calculate the ratio between the number of NFTs with a specific trait and the number of all NFTs in the collection:
,(cast(count(*) over (partition by [collection],[trait_type],[value]) as float)/count(*) over (partition by [collection])) as [RarityByCollection]
3. Calculate Rarity Score for a trait = 1/ [RarityByCollection]
4. Overall Rarity Score = Sum(Rarity Score for a trait) over all traits of an NFT (a pandas sketch of this calculation is given below).
See the attached file.
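For illustration, a minimal pandas sketch of the same window-function logic. The column names collection, trait_type, and value follow the SQL above; token_id is an assumption about how individual NFTs are identified in the trait dataset:

import pandas as pd

traits = pd.read_csv('NFT+Hackathon+Traits+Dataset.csv')

# Counts mirroring the SQL window functions: rows per (collection, trait_type, value)
# and rows per collection.
trait_counts = traits.groupby(['collection', 'trait_type', 'value'])['value'].transform('count')
collection_counts = traits.groupby('collection')['value'].transform('count')

# Rarity Score for a trait = 1 / (share of the collection having that trait value).
traits['trait_rarity'] = 1 / (trait_counts / collection_counts)

# Overall Rarity Score = sum of the trait rarity scores over all traits of an NFT.
overall_rarity = traits.groupby(['collection', 'token_id'])['trait_rarity'].sum()
print(overall_rarity.sort_values(ascending=False).head())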
3. Creating an image mapping algorithm:
We extracted an image database from the given files (the trait data). Each image is given a name that is a combination of the Token ID and the collection name.
- When a specific image needs to be recognised, we search for it in the database and print its features: all transactions with all the available information for them, including the last one, so that we know who the last owner is (see the sketch after this list).
- Another test is performed to check the similarity of this image to the others in the collection (the whole database). This can be used to check the rarity of the image within the collection (database).
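A rough sketch of the lookup described in the first point, assuming the sales data is loaded into a dataframe whose columns include collection, token_id, timestamp, and buyer (the actual column names in the sales file may differ):

import pandas as pd

def show_transactions(sales, collection, token_id):
    # collection and token_id are read off the image name, which encodes both.
    history = sales[(sales['collection'] == collection) &
                    (sales['token_id'] == token_id)].sort_values('timestamp')
    print(history)                                       # all transactions with their details
    if not history.empty:
        print('Last owner:', history.iloc[-1]['buyer'])  # buyer of the most recent sale

The similarity check against the rest of the collection (the second point) is what the code below implements.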
import urllib.request, json
import numpy as np
import pandas as pd
import os
import requests
import cv2
from urllib.request import urlopen
from skimage.metrics import structural_similarity as compare_ssim

# Load the trait dataset and restrict it to the azuki collection.
df = pd.read_csv('NFT+Hackathon+Traits+Dataset.csv')
df1 = df[df['collection'] == 'azuki']
print(len(df), len(df1))
696008 78222
# Resolve the image URL for the first NFT via its metadata endpoint.
ex_url = df.iloc[0]['nft_url']
r = requests.get(ex_url)
print(r.json()['image'])
image_url = r.json()['image']

# Download the image with a browser-like User-Agent (some hosts reject bare requests).
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}
request = urllib.request.Request(image_url, None, headers)  # the assembled request
response = urllib.request.urlopen(request)
# resp = urlopen(image_url, timeout=10)

# Decode the downloaded bytes into an OpenCV image and display it.
image = np.asarray(bytearray(response.read()), dtype="uint8")
test_image = cv2.imdecode(image, cv2.IMREAD_COLOR)
img_resized = cv2.resize(test_image, (500, 500), interpolation=cv2.INTER_AREA)
cv2.imshow('img_resized', img_resized)
cv2.waitKey(0)
cv2.destroyAllWindows()
https://ikzttp.mypinata.cloud/ipfs/QmYDvPAXtiJg7s8JdRBSLWdgSphQdac8j1YuQNNxcGE1hg/0.png
# Resize the test image to 64x64 and compare it against the stored thumbnails
# (the images in ./images1 are assumed to be 64x64 as well).
test_image_resized = cv2.resize(test_image, (64, 64), interpolation=cv2.INTER_AREA)
ssim_list = []
for image_name in os.listdir('./images1'):
    img = cv2.imread('./images1/' + image_name)
    ssim = compare_ssim(img, test_image_resized, multichannel=True)
    ssim_list.append(ssim)
ssim_list.sort(reverse=True)
print(ssim_list[:50])
[0.9759417223134897, 0.38499271296822074, 0.38494061634080373, 0.37674030564340916, 0.37391522127442695, 0.34958107993950893, 0.3487488306952362, 0.34576506473255925, 0.34305969952772725, 0.3425184721544658, 0.3404009898042047, 0.34014015416439575, 0.34009286499866936, 0.33758452480510887, 0.334775938900204, 0.33411704356270455, 0.3283218758404782, 0.322909671930475, 0.32248149419204114, 0.3205361080304674, 0.3184180633385889, 0.3114298664479145, 0.3110674487453045, 0.31025982605534136, 0.3099693989290296, 0.3063020455610578, 0.3046642999127925, 0.30449549988138935, 0.30442072647967416, 0.303477472800421, 0.302761386555114, 0.29975184199458793, 0.29954256603522667, 0.29832597864486493, 0.2934228049870565, 0.2931483018273245, 0.291704707150598, 0.2868054373042173, 0.2851552988054587, 0.28421325809359654, 0.2825575114229519, 0.2812594364075998, 0.27978290576977377, 0.27864396791955687, 0.27834366010710343, 0.27688214560800967, 0.2754804142304711, 0.27511507930696805, 0.2739705351910143, 0.2728919098485376]
# Bucket the similarity scores into rough categories.
equal = 0
much_similar = 0
similar = 0
different = 0
much_different = 0
for i in ssim_list:
    if i > 0.9:
        equal += 1
    elif i > 0.6:
        much_similar += 1
    elif i > 0.3:
        similar += 1
    elif i > 0.1:
        different += 1
    else:
        much_different += 1
print('equal', equal)
print('much_similar', much_similar)
print('similar', similar)
print('different', different)
print('much_different', much_different)
print('all images', len(ssim_list))
equal 1
much_similar 0
similar 30
different 99
much_different 6
all images 136
4 thoughts on “JustForFun”
1. General comment: How would you link the rarity and image similarity scores into the price prediction?
– business case and data understanding – 5
– data exploration – 6
– methods – 4
– rarity score – 6
– image handling – 5
Overall: you had some nice ideas but probably ran out of time.
2. It would be nice to do some data analysis and show examples of the rarity score. It would really help us understand whether it provides any value for further use.
3. Nice idea to use compare_ssim, but again, it is not clear what this comparison leads to. The test in the article shows only one image; if there were more, that would validate the idea.
4. It is great that you tried to take the tweet data and the transactional data into account, and I believe that at the collection level this general trend can be used to support the forecast at the single-NFT level.