
Research in Public #08: Diamonds in the Rough (Analysis of Post Quality)¹
I previously demonstrated that higher posting costs incentivize people to post higher quality content, as measured by zaps received in the first 48 hours.
A downside of using zaps as a measure of quality is that they are influenced by a variety of factors external to the post itself, such as the time of day it was posted, whether it hit the top of the Hot rankings, and so on.
Thus, I wanted to repeat the exercise with a more objective measure of post quality, one that relies only on the content of the post itself: the number of words, the number of images and links, and the semantic content of the text.
To do that, I built a model to predict the number of sats a post would receive using only the post title and text as predictive inputs.

Feature Selection

The exact feature set I used was:
  • The total number of words in the post
  • The total number of images or links in the post text
  • A boolean for whether the post is a link post
  • Semantic embeddings for the post title and the post text
The semantic embeddings were retrieved using OpenAI's text-embedding-3-small embedding model, and then passed through a principal components analysis to extract the top 20 principal components. (Using all 1,536 original dimensions would likely have led to overfitting in the model.)
The resulting feature set has 43 dimensions: 1 each for the total number of words, the total number of images or links, and the link post indicator, then 20 dimensions representing the title of the post and 20 dimensions representing the text of the post.
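For concreteness, here is a minimal sketch of this feature construction in Python. The DataFrame `posts` and its column names (`title`, `text`, `is_link_post`) are hypothetical, the URL regex is a rough stand-in for counting images and links, and batching of the embedding requests is omitted:

```python
import numpy as np
import pandas as pd
from openai import OpenAI
from sklearn.decomposition import PCA

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Fetch 1,536-dim embeddings from text-embedding-3-small (batching omitted)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Count-based features; the regex is an approximation of "images or links".
posts["num_words"] = posts["text"].str.split().str.len().fillna(0)
posts["num_img_or_links"] = posts["text"].str.count(r"https?://\S+")

# Reduce each 1,536-dim embedding set to its top 20 principal components.
title_pca = PCA(n_components=20).fit_transform(embed(posts["title"].tolist()))
text_pca = PCA(n_components=20).fit_transform(embed(posts["text"].tolist()))

# 3 scalar features + 20 title components + 20 text components = 43 dims.
X = np.hstack([
    posts[["num_words", "num_img_or_links", "is_link_post"]].to_numpy(float),
    title_pca,
    text_pca,
])
```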

Model Training

I used these features as inputs to an XGBoost regression model estimating the log number of sats earned by a post in the first 48 hours.
Since the purpose of the model is not predictive accuracy, but rather to get a content-based measure of quality for each post, I did not bother splitting the sample into a training set and a test set. I just trained the model using the entire dataset of about 180,000 posts. (To stay consistent with my previous sample selection criteria, I excluded freebies, bios, saloon posts, and posts in the ~AMA and ~jobs territories. I also excluded deleted posts.)
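As a sketch, the fit looks roughly like the following. The hyperparameters are my own illustrative choices, not the exact ones used, and I use log(1 + sats) as one way to handle zero-sat posts; the `sats48` column name is assumed:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

y = np.log1p(posts["sats48"])  # log(1 + sats earned in first 48h); column name assumed

# Hyperparameters below are illustrative, not the exact ones used.
model = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X, y)  # trained on the full sample, no train/test split

pred = model.predict(X)
print("RMSE:", mean_squared_error(y, pred) ** 0.5)
print("R^2: ", r2_score(y, pred))
```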
The resulting model achieved an RMSE of 1.9 and an R^2 of 0.35. An R^2 of 0.35 means the model explains 35% of the variance in post zaps. By comparison, a simple linear regression would have achieved an RMSE of 2.1 and an R^2 of 0.20.

Feature Importance

In XGBoost, the importance of a feature measures the average improvement in model loss coming from decision nodes involving that feature. The relative importances of each of the features are shown below:
Feature             Importance
text_embeddings     0.450
title_embeddings    0.219
num_words           0.201
is_link_post        0.091
num_img_or_links    0.039
What this means is that the post's textual content is responsible for 45% of the model's predictive ability, making it the most important feature. The next most important feature is the post's title, which accounts for 22%, followed closely by the number of words at 20%. Whether the post is a link post accounts for 9%, and the number of images or links for only 4%.
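For reference, these grouped importances can be reproduced by pulling the per-feature gain importances from the fitted model and summing over the 20 dimensions in each embedding group; a minimal sketch, with the feature names being my own illustrative labels:

```python
import pandas as pd

# Illustrative feature names matching the 43-column layout described above.
names = (["num_words", "num_img_or_links", "is_link_post"]
         + [f"title_pca_{i}" for i in range(20)]
         + [f"text_pca_{i}" for i in range(20)])

# XGBRegressor's feature_importances_ defaults to gain-based importance.
imp = pd.Series(model.feature_importances_, index=names)

# Collapse the 20 title and 20 text components into one group each.
groups = imp.index.str.replace(r"(title|text)_pca_\d+", r"\1_embeddings", regex=True)
print(imp.groupby(groups).sum().sort_values(ascending=False))
```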
Which embedding dimensions seem to matter most? The following SHAP value plot sheds light on that. In the plot, each dot represents a post. The color represents whether the post has a high (red) or low (blue) value for the feature in question. The horizontal placement of the dot shows how the feature contributed to the model's prediction. A dot on the right means that the feature added sats to the model's prediction, and a dot on the left means that the feature subtracted sats from the model's prediction. Thus:
  • A red dot on the right means that a high value of the feature contributed to a higher sats prediction
  • A red dot on the left means that a high value of the feature contributed to a lower sats prediction
  • A blue dot on the right means that a low value of the feature contributed to a higher sats prediction
  • A blue dot on the left means that a low value of the feature contributed to a lower sats prediction
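A plot of this kind can be generated with the shap library; a minimal sketch, assuming the fitted model and 43-column feature matrix from the earlier sketches:

```python
import shap

# TreeExplainer computes exact SHAP values for tree ensembles like XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Beeswarm summary: one dot per post, color = feature value,
# x-position = that feature's contribution to the prediction.
shap.summary_plot(shap_values, X, feature_names=names)
```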
The plot shows that:
  • A larger number of words tends to contribute to a higher sats prediction
  • A higher value for text embedding dimension 5 contributes to a lower sats prediction
  • A modal value for text embedding dimension 0 contributes to a lower sats prediction, but a high value can contribute to either a higher or a lower sats prediction (which one depends on non-linear interactions with other features)
  • A higher number of images or links in the post usually contributes to a higher sats prediction, but sometimes contributes to a lower sats prediction
  • Being a link post contributes to a lower sats prediction
So, what are these embedding dimensions that seem to matter most?
Text embedding dimension 5, upon inspection, seems to correlate with links to news articles where the post text is a short one- or two-sentence summary of the article, without any personal reflection. These posts don't seem to attract many sats.
Text embedding dimension 0, which has a huge mass at one value that negatively predicts sats, is basically an indicator for a post with no text. That big mass of blue essentially represents posts with no text. (The empty string has its own embedding value.) So: posts with no text negatively predict sats, but for some posts, having text predicts even fewer sats.
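For the curious, one way to inspect dimensions like these, assuming the text_pca matrix from the earlier sketch, is simply to read the posts sitting at a dimension's extremes:

```python
import pandas as pd

# Posts at the extremes of text embedding dimension 5; an informal read,
# not a formal test. `posts` and `text_pca` come from the sketches above.
dim5 = pd.Series(text_pca[:, 5], index=posts.index)
print(posts.loc[dim5.nlargest(10).index, ["title", "text"]])
print(posts.loc[dim5.nsmallest(10).index, ["title", "text"]])
```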
I tried to find the embedding dimension that correlates most with AI slop, but nothing jumped out at me. I'm not sure how well the embeddings can capture AI slop in the first place, so my model may simply not be detecting it. I'd love to feed all the posts through an AI detector to get a direct measure of AI slop, but with this quantity of posts that's not something I want to pay money for.

Diamonds in the Rough

Using the model, we can get a predicted number of sats for each post based only on the post's content, compare it to the sats actually earned, and rank the posts by the difference. Posts with a high predicted sat value relative to actual sats earned are "diamonds in the rough": posts that the model indicates are high quality but that received very little attention/zaps. The top 5 diamonds in the rough according to the model are listed below. Please go read these posts and give their authors the zaps they deserve!
Item       Title                                            Territory
#220897    Bitcoin Tells an Unchanging Story                bitcoin
#132056    Tutanota, A Protonmail Alternative?              bitcoin
#1240569   Nunchuk (I): Fundamentals and Security Mindset   diy
#990187    Use of OP_RETURN in coinjoin transactions        bitdevs
#30127     "Crypto currencies are not currencies. They are not. They are not. They are not." -Christine Lagarde President of the European Central Bank   bitcoin
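The ranking itself is a simple residual sort; a minimal sketch, reusing the hypothetical names from the earlier sketches:

```python
import numpy as np

# "Diamonds" = posts whose content-predicted sats most exceed actual sats.
posts["log_sats48_pred"] = model.predict(X)
posts["residual"] = posts["log_sats48_pred"] - np.log1p(posts["sats48"])
diamonds = posts.sort_values("residual", ascending=False).head(5)
print(diamonds[["title", "territory"]])  # `territory` column name assumed
```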

Does posting cost lead to higher post quality, as measured by post content alone?

Lastly, we return to the original objective, which is to see if higher posting costs are correlated with better post quality, using a quality metric based entirely on the post's content. The answer is yes. We use the model's predicted number of sats as the quality measure, and we regress this on the posting cost, while also including week fixed effects and territory fixed effects to control for global time trends and baseline differences across territories. We find that the relationship between posting cost and post quality is positive and statistically significant.
===================================================================================================
                                                  Dependent variable:                              
                    -------------------------------------------------------------------------------
                                                    log_sats48_pred                                
                            (1)                 (2)                 (3)                 (4)        
---------------------------------------------------------------------------------------------------
log(Posting Cost)        0.244***            0.124***            0.087***            0.054***      
                          (0.002)             (0.002)             (0.004)             (0.004)      
                                                                                                   
Constant                 2.945***                                                                  
                          (0.005)                                                                  
                                                                                                   
---------------------------------------------------------------------------------------------------
Territory FE                 N                   Y                   Y                   Y         
Week FE                      N                   N                   Y                   Y         
Territory Owner FE           N                   N                   N                   Y         
Observations              175,857             175,857             175,857             175,857      
R2                         0.118               0.299               0.311               0.315       
Adjusted R2                0.118               0.298               0.310               0.314       
Residual Std. Error 1.127 (df = 175855) 1.005 (df = 175743) 0.997 (df = 175518) 0.994 (df = 175494)
===================================================================================================
Note:                                                                   *p<0.1; **p<0.05; ***p<0.01
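For anyone who wants to replicate a specification like column (3), here is a minimal sketch in Python with statsmodels; the `posting_cost`, `territory`, and `week` column names are assumptions on my part:

```python
import numpy as np
import statsmodels.formula.api as smf

# OLS of content-predicted log sats on log posting cost with territory
# and week fixed effects (roughly specification (3) above).
posts["log_cost"] = np.log(posts["posting_cost"])
fit = smf.ols("log_sats48_pred ~ log_cost + C(territory) + C(week)",
              data=posts).fit()
print(fit.params["log_cost"], fit.bse["log_cost"])
```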

That's all I have for today. Feel free to leave any comments below! Hope you found this interesting.

Footnotes

  1. In #1243188 I suggested to @k00b that we engage in a research project using SN data. The idea would be to use this data to study: A) how micropayments with real money affect internet discourse; and B) barriers to the adoption of self-custody. I also promised @Undisciplined that I'd carry out the research in public, since many people might not know what economics research looks like, and may be curious as to how the process plays out. You can follow all of the updates here and verify all the code here. ↩
What an interesting way to surface underrated content - how cool would it be if a 'diamonds in the rough' bot could be automated to run regularly on a fixed time frame! Nice work.
reply
30 sats \ 3 replies \ @grayruby 5h
Great work. Not a surprising result, as it seems intuitive that the higher the post cost, the more committed and thoughtful posters are going to be.
reply
Yes, it's good to have empirical confirmation of this theoretical idea. Next step is to show that because of this, SN posts are higher quality than r/bitcoin posts. Unfortunately, I sent an API request to Reddit and it might have just gone into the black hole. Don't know if I'll ever be able to access Reddit data to make a comparison. I read that they tightened up their API access in 2023.
reply
50 sats \ 1 reply \ @nerd2ninja 4h
Would you like assistance with non-API based page scraping? Captchas are just an extra thing you have to handle in code, you know.
reply
Yes, I'd like that.
My main concern would be getting around rate limiting / IP bans, since I'd probably be making a ton of requests.
reply
Love this kind of data. I'm generally too lazy to actually gamify things and make posts trying to optimize (also hate doing that with video game characters), but I like data and analysis like this.
I'm curious if your data answers one thing I've wondered about link posts: Is there a correlation in terms of success/stats between link posts that also have content from the creator vs just links with nothing other than a subject line?
reply