[
  {
    "body": "some thoughts on the shape of foundation labs\n\n1) epoch ai estimated anthropic @ $9m in revenue per employee and openai @ 5.6m in revenue per employee\n\n2) these rates would be the highest among public technology companies; but, i'm not sure how valuable it is to look at on its own\n\n3) the closest equivalents are quant firms like jane street @ $12m and hudson river @ $9m and energy infrastructure companies like valero energy @ 13m\n\n4) revenue per employee is a complicated measure because a lot of it depends on accounting and different firms consider different things revenue \n\n5) but, quant shops have high revenue per employee because they have a lot of revenue on top of a small number of specialized, expensive researchers\n\n6) and, oil refineries have high revenue per employee because they can process a lot of oil with small number of employees, using very expensive tooling\n\n7) foundation labs feel like a combination of these two things\n\n8) like quant shops, they have a small number of very highly paid researchers and, like energy infrastructure companies, each employee is very heavily capitalized\n\n9) traditional technology companies don't capitalize their employees very heavily; claude estimates nvidia spends $100k in r&d opex per employee per year, apple $80k per employee per year\n\n10) in contrast, openai will probably spend ~$35bn in r&d compute this year with ~5000 employees; this would imply openai will spend ~70x what traditional tech companies spend in r&d opex per employee \n\n11) now, in practice, this r&d opex spend is concentrated on a small team of core researchers and this would make the comparison even more stark\n\n12) in this respect, they really are a new kind of tech business; they are not quite like hyperscalers, saas, ad-tech, e-commerce or hardware companies, etc...\n\n13) they have unrivaled tam, distribution like plg saas, lower gross margins, an employee base more like quant shops, unique r&d dynamics, and 
capital requirements that, if you squint, sometimes look like a hyperscaler's\n",
    "tweet_id": "2053372301303853294",
    "note_id": "2053372301135974401",
    "tweet_url": "https://x.com/fleetingbits/status/2053372301303853294",
    "created_at": "2026-05-10T07:11:21.000Z",
    "length": 2029,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/EpochAIResearch/status/2052847400650518804"
    ],
    "tags": [
      "openai",
      "anthropic",
      "lab economics",
      "compute"
    ],
    "title": "some thoughts on the shape of foundation labs",
    "snippet": "1) epoch ai estimated anthropic @ $9m in revenue per employee and openai @ 5.6m in revenue per employee 2) these rates would be the highest among public technology companies; but, i'm not sure how valuable it is to look at on its own 3) the closest equivalents are quant firms like jane street @ $12m and hudson river @ $9m and energy infrastructure companies like valero energy @ 13m 4) revenue per employee is a complicated measure because a lot of it depends on accounting and different firms consider different things revenue"
  },
  {
    "body": "just a quick thought or two on ai chip components\n\n1) an interesting way to look at bottlenecks in the compute chain is to look at gross margins, instead of looking at component cost\n\n2) the thing about component cost is that it is hard to know whether demand outstrips supply or whether that component just intrinsically costs more to make \n\n3) the cost of a component could be high just because it uses a lot of expensive materials or because it uses an expensive manufacturing process\n\n4) however, changes in gross margins, which are the amount of profit that companies are making above the cost of production, tells you more about the bargaining power of firms\n\n5) rising gross margins indicate that the firm has a lot of pricing power and that its goods are the limiting reaction in someone's downstream product\n\n6) and, they tell you that the good is hard to produce, because if it were easy, someone else would produce it and this would drive down the gross margins\n\n7) anyway, the results are not that different, but we see that gross margins for memory providers are increasing the fastest, followed by cowos, with logic increasing slowly (see claude graph below)\n\n8) this indicates that memory is the most important production bottleneck, followed by cowos, with logic as the weakest bottleneck\n\n9) note, it would be interesting to see gross margins over the whole of the ai compute ecosystem; from power all the way up to foundation labs\n\n10) and, it would also be interesting if someone wants to forecast what demand shocks would look like in the event of a geopolitical upheaval (e.g. a taiwan war)\n",
    "tweet_id": "2052792863659053321",
    "note_id": "2052792863508058113",
    "tweet_url": "https://x.com/fleetingbits/status/2052792863659053321",
    "created_at": "2026-05-08T16:48:52.000Z",
    "length": 1611,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/EpochAIResearch/status/2052509552776761580"
    ],
    "tags": [
      "lab economics",
      "compute"
    ],
    "title": "just a quick thought or two on ai chip components",
    "snippet": "1) an interesting way to look at bottlenecks in the compute chain is to look at gross margins, instead of looking at component cost 2) the thing about component cost is that it is hard to know whether demand outstrips supply or whether that component just intrinsically costs more to make 3) the cost of a component could be high just because it uses a lot of expensive materials or because it uses an expensive manufacturing process 4) however, changes in gross margins, which are the amount of profit that companies are making above the cost of production, tells you more about the bargaining power of firms"
  },
  {
    "body": "some thoughts on exploring the chinese ai ecosystem\n\n1) i think that @natolambert's article on his visit to china is a good example of how not to do analysis of a foreign ai ecosystem\n\n2) the @readsail team visited china as part of a trip to talk to the major chinese ai labs; they visited alibaba, moonshot, minimax, and zhipu, among others\n\n3) i expect the researchers that they met were very warm and very welcoming; but, the level of appreciation that @natolambert expresses in return, i think spoils his analysis\n\n4) everything is over the top praise; i count the word humble or a variant of it 6 times! and, everything is stacked adjectives (\"wonderful, humble, and open scientists\", \"an elegant, brilliant researcher\", \"practical, humble, and motivated\")\n\n5) when you visit a foreign country to analyze a system they have built, i think that you should ask questions that can help you understand their structure and their incentives\n\n6) so, you should ask \"how are your teams organized?\", \"what is the relationship between the commercial side and the research side?\", \"how do you divide compute between research and inference?\"\n\n7) \"how much compute do you allocate per researcher?\" \"how do you decide whether to hire more researchers or give existing researchers more compute?\", \"how do you get research ideas?\", etc...\n\n8) two good things we do learn are that (a) chinese labs hire more students than western labs and (b) chinese labs do more of their data work in-house\n\n9) anyway, you should also not take the things that people say at face value; for example, nathan seems to describe chinese labs uncritically as low drama\n\n10) but, we recently have had some public examples of drama at alibaba and bytedance famously has internal drama with very competitive teams pitted against one another\n\n11) so, i think it is an important skill to be able to separate cultural presentation from underlying reality, and to ask what is actually 
happening under the hood\n\n12) anyway, this is really just a request for more serious analysis of foreign ai ecosystems and especially of the chinese ai ecosystem\n\n13) i think the chinese ai ecosystem is very important and it will be very consequential; it has access to a huge talent pool and to a huge industrial base\n\n14) and, we should try to learn as much from the chinese ai ecosystem as possible and use it to critique and improve our own systems as much as possible; but, hopefully, in a more serious way\n",
    "tweet_id": "2052459823795749169",
    "note_id": "2052459823615401984",
    "tweet_url": "https://x.com/fleetingbits/status/2052459823795749169",
    "created_at": "2026-05-07T18:45:29.000Z",
    "length": 2441,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/natolambert/status/2052415630062879098"
    ],
    "tags": [
      "chinese labs",
      "lab economics",
      "compute",
      "enterprise"
    ],
    "title": "some thoughts on exploring the chinese ai ecosystem",
    "snippet": "1) i think that @natolambert's article on his visit to china is a good example of how not to do analysis of a foreign ai ecosystem 2) the @readsail team visited china as part of a trip to talk to the major chinese ai labs; they visited alibaba, moonshot, minimax, and zhipu, among others 3) i expect the researchers that they met were very warm and very welcoming; but, the level of appreciation that @natolambert expresses in return, i think spoils his analysis 4) everything is over the top praise; i count the word humble or a variant of it 6 times! and, everything is stacked adjectives (\"wonderful, humble, and open scientists\", \"an elegant, brilliant researcher\", \"practical, humble, and motivated\")"
  },
  {
    "body": "some thoughts on gpt-5.5 and the missing zero day vulnerabilities \n\n1) so, @natalia__coelho wrote an article in which she argued (among other things) that gpt-5.5 and mythos have similar cyber capabilities \n\n2) one of the counterarguments has been that, even though the two models have similar benchmark scores, openai did not discover the kind of vulnerabilities that anthropic discovered with mythos \n\n3) i think one of the reasons that anthropic discovered so many vulnerabilities with mythos is that anthropic has such a strong focus on safety, esp measurement\n\n4) and, so it would make sense for anthropic to spin up a team to try to find real world vulnerabilities with their models in order to measure when models developed dangerous cyber capabilities\n\n5) but, this is a difficult organizational commitment; you need run mythos, triage the vulnerabilities, inform the maintainers, figure out what to do after that, etc...\n\n6) if you find the vulnerabilities, but do nothing about them, then you have created a real public relations risk for yourself if those vulnerabilities are ever exploited\n\n7) and, if you just report them all to the maintainers and just swamp them, then you create another problem for yourself where open source maintainers will complain about you\n\n8) and, if you wait to release your model, you may forego revenue that you could otherwise have obtained or put yourself behind in the race to grab customers \n\n9) this is all to say that scanning thousands of open source repositories for vulnerabilities isn't a risk free decision; it's actually a potentially expensive decision\n\n10) now, if you are anthropic, and your leadership cares a lot about safety (as an org) and because you believe in safety, you think this is just how the world is going to go and you can see the market opportunity for security then this isn't that hard of a decision\n\n11) but, if you are openai and your leadership probably sees safety as part hindrance 
(painful stuff they have to do before they release a model) and part optics (good stuff they can say to congress) then it's not that easy\n\n12) any time your executive team spends thinking about how to staff the team, handle the outreach to maintainers, decide whether to embargo the model, etc... could have been spent figuring out how to sell ads, get compute, etc...\n\n13) and, although they might see security as a very valuable market, it may not be as important immediately as getting a new frontier model out across all the other use cases where they can win deals now\n\n14) so, i think the fact that openai did not run a similar program with gpt-5.5 is not strong evidence that gpt-5.5 could not have been used to find similar vulnerabilities\n",
    "tweet_id": "2052184943225381029",
    "note_id": "2052184942990503936",
    "tweet_url": "https://x.com/fleetingbits/status/2052184943225381029",
    "created_at": "2026-05-07T00:33:12.000Z",
    "length": 2703,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/samth/status/2051806020406534234"
    ],
    "tags": [
      "openai",
      "anthropic",
      "lab economics",
      "enterprise",
      "evals",
      "safety"
    ],
    "title": "some thoughts on gpt-5.5 and the missing zero day vulnerabilities",
    "snippet": "1) so, @natalia__coelho wrote an article in which she argued (among other things) that gpt-5.5 and mythos have similar cyber capabilities 2) one of the counterarguments has been that, even though the two models have similar benchmark scores, openai did not discover the kind of vulnerabilities that anthropic discovered with mythos 3) i think one of the reasons that anthropic discovered so many vulnerabilities with mythos is that anthropic has such a strong focus on safety, esp measurement 4) and, so it would make sense for anthropic to spin up a team to try to find real world vulnerabilities with their models in order to measure when models developed dangerous cyber capabilities"
  },
  {
    "body": "some quick thoughts on the us / china ai capability gap\n\n1) caisi evaluations indicate that deepseek v4’s capabilities lag behind the us frontier by about 8 months\n\n2) and, that the gap between us models and chinese models is widening over time, deepseek-r1-05-28 was only 5.3 months behind the frontier\n\n3) now, i think we should always assume that ai capabilities are a function of available compute\n\n4) so, when we see a graph of showing a capability gap, we should always look to see whether the available compute to each party supports the gap\n\n5) and, when we look at this, we see that total us compute has been growing  at a faster rate than total chinese compute\n\n6) epoch ai has estimated compute for us hyperscalers and for china; us compute has been growing at ~5x per year; chinese compute has been growing at ~4x per year\n\n7) in q1 2023, china was 10 months behind the united states in gpu compute; but by q4 2025, china had fallen to 20 months behind the united states in gpu compute\n\n8) and, if we think that capabilities are log-scaling with compute, then we should expect the gap between the united states and china to be increasing over time\n\n9) this is the importance of export controls by the way, we should expect that with more available compute, chinese models would be closer to the us frontier\n\n10) and, if anything, the relevant question we should ask ourselves is how china is outperforming it's compute, since it's models are 12 months ahead of their available compute\n\n11) if we see this gap narrowing, without the compute growth rate narrowing, we should ask ourselves what is wrong in our assumptions of capability growth\n",
    "tweet_id": "2051166013929214079",
    "note_id": "2051166013786632192",
    "tweet_url": "https://x.com/fleetingbits/status/2051166013929214079",
    "created_at": "2026-05-04T05:04:21.000Z",
    "length": 1652,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/sebkrier/status/2050369549111795880"
    ],
    "tags": [
      "chinese labs",
      "lab economics",
      "compute",
      "evals",
      "legal"
    ],
    "title": "some quick thoughts on the us / china ai capability gap",
    "snippet": "1) caisi evaluations indicate that deepseek v4’s capabilities lag behind the us frontier by about 8 months 2) and, that the gap between us models and chinese models is widening over time, deepseek-r1-05-28 was only 5.3 months behind the frontier 3) now, i think we should always assume that ai capabilities are a function of available compute 4) so, when we see a graph of showing a capability gap, we should always look to see whether the available compute to each party supports the gap"
  },
  {
    "body": "some thoughts on the epoch chip smuggling article\n\n1) epoch ai estimates that ~660k h100e were smuggled into china through q4 2025; this would represent about a third of china's total compute\n\n2) this would represent about $14bn in compute at oem prices or about 4% of nvidia's revenue over the 2024 and 2025 period\n\n3) i want to go through how they got their estimate; but, the article is complicated because everything is done in medians, confidence bounds and distributions\n\n4) so, i'm going to give my own simplified presentation of their method, which is not exact, but which i think exposes the important variables in their analysis\n\n5) okay, they have two approaches to estimate the number of smuggled chips: (a) looking at how many chips were diverted to china from proper channels, (b) looking at how many chips were sold in china\n\n6) so, they get an idea for how many chips have been diverted by looking at 6 doj cases, a bloomberg report and some wsj/nyt/ft reporting\n\n7) the doj cases are the most important (~160,000 h100e; mostly supermicro), then bloomberg's megaspeed reporting (~110,000) then the wsj/nyt/ft (~9,000); ~280,000 h100e total\n\n8) then they do something that resembles this back of the envelope math: 280k × 0.5 over reporting haircut × 4x detection multiplier (assuming 25% catch rate) = ~560k H100e cumulative\n\n10) one important parameter here is that they assume that you only catch 25% of shipped h100es; and the analysis is very sensitive to this (it's a 4x)\n\n11) and, they do special discounting for supermicro and megaspeed (assuming not all the gpus went to china or it was over reported) and a general correction for news articles over reporting diverted gpus\n\n12) on the resale side, they estimate the number of chip vendors in china and how many chips each vendor sold from some wsj/nyt/ft articles; they also do a bump for the quarter h20s were banned\n\n13) the napkin math version basically looks like: ~75 vendors, 
~2,700 chips/year/vendor, 1.2x adjustment for vendors that are unknown, 1.4x adjustment for blackwell/hopper mix, 2 years: ~680,000 h100e\n\n14) note, that this is quite sensitive to their adjustment for unknown vendors, they basically assume most of the chip vendors are known, so only 1.2x adjustment (which i think is an interesting assumption)\n\n15) also note, their exact numbers are ~530,000 H100e for diversion and ~700,000 H100e for resale and these blend to ~660,000 H100e combined\n\n16) overall, it looks like an okay estimate, but both sides of it are very sensitive to variables that seem hard to estimate (e.g. % of chips caught on the diversion side and % of vendors known on the resale side)\n\n17) anyway, the most valuable research here would be mapping the oem chain, figuring out where people think the gpus are probably being sold, and then doing some real investigative reporting\n",
    "tweet_id": "2050007456567398636",
    "note_id": "2050007456311578624",
    "tweet_url": "https://x.com/fleetingbits/status/2050007456567398636",
    "created_at": "2026-05-01T00:20:39.000Z",
    "length": 2851,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/EpochAIResearch/status/2049924785153638761"
    ],
    "tags": [
      "chinese labs",
      "lab economics",
      "compute",
      "legal"
    ],
    "title": "some thoughts on the epoch chip smuggling article",
    "snippet": "1) epoch ai estimates that ~660k h100e were smuggled into china through q4 2025; this would represent about a third of china's total compute 2) this would represent about $14bn in compute at oem prices or about 4% of nvidia's revenue over the 2024 and 2025 period 3) i want to go through how they got their estimate; but, the article is complicated because everything is done in medians, confidence bounds and distributions 4) so, i'm going to give my own simplified presentation of their method, which is not exact, but which i think exposes the important variables in their analysis"
  },
  {
    "body": "some quick thoughts on the mythos and the us gov\n\n1) so, the united states gov opposes anthropic's plan to expand access to mythos to more commercial orgs\n\n2) supposedly, this is bec they are concerned that anthropic does not have enough compute to serve it to the government in addition to other commercial orgs\n\n3) now, if we take the story at face value, it's good for anthropic; they get all the revenue and great marketing bec the government says their model is the best\n\n4) there is some loss, associated with openai getting to some cyber customers first, but the marketing and revenue should make up for it, imo\n\n5) one thought though, it's possible that the story is garbled and the government is actually just concerned that more parties could have access to cyber offense\n\n6) in this world, by announcing their model is dangerous and by doing this staged release, anthropic made it the easy decision for government to say no to wider sales\n\n7) remember the government has a very strong status quo bias, esp around nat sec, and is risk averse; when and how do you decide it is safe to release dangerous cyber capabilities\n\n8) it's not easy for anthropic go to - oh, it's safe now! 
- and start up sales, esp if the government has asked them not to expand sales of the model to more orgs\n\n9) doing so risks angering the government, endangering future contracts, and being portrayed as reckless in the media and to the journalist, activist and political classes\n\n10) so, openai may be able to release stronger models and get away with it, just because they do not market them the same way and do not engage in staged release right now\n\n11) this is assuming that the models are not as threatening in practice as anthropic's marketing would have you believe (assuming safety classifiers, cost as a barrier, token limits, etc...)\n\n12) otherwise, the bad publicity from a breach is enough of a reason not to release, but setting this aside, openai's decision to release specialized cyber models (e.g. gpt-5.5-cyber) may have been a better track\n\n13) because it lets openai appear responsible, holding back their most cyber capable model, while still being able to release their flagship models to the public\n\n14) and, anthropic may have accomplished an interesting commercial own goal, by portraying their own model as very dangerous, while allowing openai to release gpt-5.5\n\n15) anyway, i'm generally a believer in mythos' power, but i do think it will be interesting to see how anthropic eventually justifies releasing mythos and the extent to which the staged release could end up setting them behind openai in market power\n",
    "tweet_id": "2049716009943302375",
    "note_id": "2049716009721040896",
    "tweet_url": "https://x.com/fleetingbits/status/2049716009943302375",
    "created_at": "2026-04-30T05:02:33.000Z",
    "length": 2626,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/WSJ/status/2049668774349938691"
    ],
    "tags": [
      "openai",
      "anthropic",
      "lab economics",
      "enterprise",
      "evals",
      "safety",
      "legal"
    ],
    "title": "some quick thoughts on the mythos and the us gov",
    "snippet": "1) so, the united states gov opposes anthropic's plan to expand access to mythos to more commercial orgs 2) supposedly, this is bec they are concerned that anthropic does not have enough compute to serve it to the government in addition to other commercial orgs 3) now, if we take the story at face value, it's good for anthropic; they get all the revenue and great marketing bec the government says their model is the best 4) there is some loss, associated with openai getting to some cyber customers first, but the marketing and revenue should make up for it, imo"
  },
  {
    "body": "some quick thoughts on china and ai investment\n\n1) american ai companies have higher private market valuations than their chinese counterparts; openai is valued at ~8x what deepseek is by private markets\n\n2) openai and anthropic are worth about ~$800bn; deepseek is valued at ~$100bn; zhipu has a market cap of ~$50bn and minimax has a market cap of ~$30bn\n\n3) to some degree, this valuation difference can be seen as driven by revenue; openai and anthropic have ~$30bn rev run rate while minimax has only ~$200m and zhipu has only ~$300m\n\n4) we can also see this difference in the capital build out; the united states is investing somewhere from 4x to 10x what china is investing in compute for ai\n\n5) i think it is interesting to ask why american companies have invested so much in ai and why ai drives so much revenue in the united states compared to china \n\n6) i think part of the answer is that america has a larger percentage of its economy dedicated to office work and productivity tools for office work\n\n7) white-collar wages are about ~$5tn annually in the united states and ~$2tn annually in china; so llms have the opportunity to automate more value in american economy than the chinese economy\n\n8) in addition, the american office worker makes 4x the average chinese office worker; so it makes more sense to spend capital to improve american office worker productivity, assuming the same improvements\n\n9) the american saas industry is also a great distribution network for llms, since it provides scaffolding, etc... and the american saas industry is about 10x the size of the chinese saas industry  \n\n10) and, on the investment side, the largest american companies are hyperscalers, so they see the value of underwriting out the compute buildout, it becomes their revenue; and, they understand the value of llms\n\n11) the chinese ai market is very strong on the consumer side right now and the most important chinese ai applications are free (e.g. 
bytedance's doubao)\n\n12) but, for now, consumer ai looks like a much smaller market than b2b ai, it feels like total ad revenue vs total white collar wages, as the relevant tam, minus something for competition, consumer surplus\n\n13) so, it just doesn't drive as much private investment; consumer ai is probably less than 10% of total ai lab revenue in the united states\n\n14) where i think we should start to see private chinese ai investment really pick up is in robotics, since this is where the chinese economy is well positioned to drive efficiency, and generate ai revenue\n\n15) of course, separate from economics, the chinese government will invest to avoid foreign dependency regardless of unit economics; and, we should expect to see this with things like huawei chips\n\n16) but, worth noting, the fact that chinese frontier labs still raise private equity means that the chinese government is not giving them sufficient access to low-cost loans to drive the industry on their own\n",
    "tweet_id": "2049637956517003756",
    "note_id": "2049637956324089857",
    "tweet_url": "https://x.com/fleetingbits/status/2049637956517003756",
    "created_at": "2026-04-29T23:52:23.000Z",
    "length": 2949,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "chinese labs",
      "lab economics",
      "compute",
      "enterprise",
      "consumer"
    ],
    "title": "some quick thoughts on china and ai investment",
    "snippet": "1) american ai companies have higher private market valuations than their chinese counterparts; openai is valued at ~8x what deepseek is by private markets 2) openai and anthropic are worth about ~$800bn; deepseek is valued at ~$100bn; zhipu has a market cap of ~$50bn and minimax has a market cap of ~$30bn 3) to some degree, this valuation difference can be seen as driven by revenue; openai and anthropic have ~$30bn rev run rate while minimax has only ~$200m and zhipu has only ~$300m 4) we can also see this difference in the capital build out; the united states is investing somewhere from 4x to 10x what china is investing in compute for ai"
  },
  {
    "body": "some quick thoughts on the xai cursor deal\n\n1) spacex has agreed to either pay cursor $10bn for some joint model work or otherwise to acquire cursor outright for $60bn later this year\n\n2) i suspect that elon thinks of the $50bn option to buy cursor more as a performance incentive than as a strict option in some financial sense\n\n3) the deal solves a couple of things for spacex\n\n3a) first, it gives them a team that can train near-frontier coding models and hopefully train frontier models with access to xai compute\n\n3b) second, it gives them distribution, it doesn't seem that xai ever had a real customer base outside of the elon verse\n\n3c) third, code is probably important for recursive self-improvement and it seems that code performance is important for the next stage of model development\n\n4) with respect to cursor's data, i'm not sure how important it is; on one hand, i would guess about 30% of cursor's users let cursor use their data, so there are a lot of traces\n\n5) but business users, which probably have the most valuable traces, do not let cursor use their data and anthropic, which has the least aggressive user data terms, does not seem slowed down by this\n\n6) for cursor, i think this is a good outcome, cursor is dependent on frontier labs giving them their best models for sale\n\n7) but those frontier labs on the long run don't want cursor between them and their customers; and will look to figure out how to disintermediate them\n\n8) so cashing out now, at a time when they still have very significant distribution and $2bn arr, and after their team has proved they can train near-frontier models, seems good\n\n9) for spacex though, one issue has been that elon's management style has not been effective with researchers; it seems that he tends to treat them like they do not have a better option\n\n10) for engineers that want to work at spacex or tesla; there is no other place that they can go in the united states to work on space or 
electric cars at the same level\n\n11) but researchers always have better options, they can go to openai, deepmind, anthropic or otherwise join or start a neolab and this means they need to be managed differently\n\n12) and so, even if the acquisition goes through, there is still a risk that elon will have difficulty managing the acquired talent to continue to produce subsequent frontier models\n\n13) but, i expect him to continue, like zuckerberg, to try things until he finds something that works, at least for the foreseeable future\n",
    "tweet_id": "2046886338184856018",
    "note_id": "2046886338071556096",
    "tweet_url": "https://x.com/fleetingbits/status/2046886338184856018",
    "created_at": "2026-04-22T09:38:26.000Z",
    "length": 2491,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/SpaceX/status/2046713419978453374"
    ],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "xai",
      "neolabs",
      "lab economics",
      "compute",
      "coding",
      "enterprise"
    ],
    "title": "some quick thoughts on the xai cursor deal",
    "snippet": "1) spacex has agreed to either pay cursor $10bn for some joint model work or otherwise to acquire cursor outright for $60bn later this year 2) i suspect that elon thinks of the $50bn option to buy cursor more as a performance incentive than as a strict option in some financial sense 3) the deal solves a couple of things for spacex 3a) first, it gives them a team that can train near-frontier coding models and hopefully train frontier models with access to xai compute"
  },
  {
    "body": "some thoughts on the business of fine-tuning\n\n1) thinking machines acquired workshop labs, which was doing llm personalization via fine-tuning\n\n2) thinking machines probably acquired workshop labs to build out an application level product on top of their fine tuning apis\n\n3) it's not quite clear whether this product is intended to be b2b or b2c; workshop labs was agnostic; thinking feels b2b so probably b2b\n\n4) anyway, b2b fine-tuning has always been a hard market and no one has really seen success in it yet despite many attempts\n\n5) the first issue is that fine-tuning tends to provide quickly depreciating value even if you get it right\n\n6) if you use open models then you are already behind the frontier and so your fine tuning has to be able to bridge and then exceed that capability gap\n\n7) and, once you do your fine-tuning, a new better model will be released soon anyway, so to retain any benefit you need a whole fine-tune and deploy cycle\n\n8) and, this is assuming that the new model just doesn't already have all the capabilities of your data, such that the fine tuning still provides an additional advantage\n\n9) the second issue is that the value of fine-tuning is expensive to capture in the first place; it is hard to collect data and do deployments\n\n10) organizations are not designed to mine their own task data and create benchmarks; this requires special expertise companies don't tend to have\n\n11) i also believe that this process is likely to be organizationally difficult for a company to manage and work through, since it involves collaboration between different teams\n\n12) also, if you want to use multiple different fine-tuned models across your application, you have just made that application much more complicated to update\n\n13) so forward deployment seems very important to support enterprise fine-tuning in high value industries like drug development, semiconductors, finance, etc...\n\n14) because you need to help solve 
their data collection, benchmark creation, model deployment and organizational difficulties in order to help them get value\n\n15) so, my feeling is that anyone that wants to work on this should focus on high value, specialized industries, that the labs will not commoditize and which can afford forward deployment\n\n16) and, you need to develop very comprehensive forward deployment that helps the customer ideate on problems, solve organization difficulties, and do process mining to collect the data \n\n17) if you can solve this for very large scientific and financial customers, you can probably begin to build automated agents that would help extend this to a wider range of specialized enterprise customers\n\n18) anyway, this would be my advice to thinky or anyone that wants to do enterprise fine tuning, develop a forward deployed team, with a set of enabling software solutions for process mining, rubric creation, etc...\n",
    "tweet_id": "2046717236686233951",
    "note_id": "2046717236489142272",
    "tweet_url": "https://x.com/fleetingbits/status/2046717236686233951",
    "created_at": "2026-04-21T22:26:30.000Z",
    "length": 2883,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "neolabs",
      "lab economics",
      "enterprise",
      "post-training",
      "bio"
    ],
    "title": "some thoughts on the business of fine-tuning",
    "snippet": "1) thinking machines acquired workshop labs, which was doing llm personalization via fine-tuning 2) thinking machines probably acquired workshop labs to build out an application level product on top of their fine tuning apis 3) it's not quite clear whether this product is intended to be b2b or b2c; workshop labs was agnostic; thinking feels b2b so probably b2b 4) anyway, b2b fine-tuning has always been a hard market and no one has really seen success in it yet despite many attempts"
  },
  {
    "body": "some thoughts on the jensen dwarkesh interview  \n\n1) jensen is a ceo that talks his book; i think the throughline is that jensen says what's good for nvidia  \n\n2) as an example, his discussion of gpus vs tpus is very unsatisfying, but it is good for nvidia; he doesn't want to concede their overwhelming use case will be ai  \n\n3) this is because, if the overwhelming use case for gpus is ai, then maybe some of the value nvidia has built for other kinds of scientific computing are less valuable and are less of a moat in the primary market  \n\n4) and then, it's harder for nvidia, on the long term, to defend its margins against google, broadcom, etc...  \n\n5) i think an interesting argument he gives for why nvidia has a moat is that nvidia has a longstanding relationship with suppliers and aggregates demand  \n\n6) and, tsmc and other vendors will expand capacity because he says so, because they trust that he will buy the demand he encourages them to build out  \n\n7) so, he will get this allocation from them, which other parties cannot get, because they both trust him (more than openai, for example) and are risk averse  \n\n8) outside of the china export control conversation, there isn't much else interesting in the interview  \n\n9) the export conversation is quite weird because he contradicts himself; i put a lot of this down to him talking his book, and nvidia benefiting, in a p&l sense, from selling to china  \n\n10) some of the contradictions:  \n\n10a) nvidia exports won't improve chinese capabilities at all but nvidia chips are so far ahead that they greatly improve both researcher productivity and inference efficiency  \n\n10b) the us compute lead means us should have ai leadership, but china has enough compute to build frontier models so it doesn't matter if china has more compute  \n\n11) anyway, i think he believes two things that might make his position a bit more understandable  \n\n12) first, he just doesn't seem that agi pilled, he 
seems to think the value of ai will accrue to the application layer, not the foundation layer  \n\n13) second, he thinks that open source ai will dominate at the application layer and if it's trained in china on huawei chips, then it will be optimized for those chips, and nvidia will lose the global market  \n\n14) he doesn't say these points quite clearly enough, so i think they were not sufficiently identified in the interview as at the root of the disagreement  \n\n15) also note, this view is beneficial for nvidia; the reason why nvidia wants the application layer to win with open source ai is it means nvidia has a fragmented downstream market  \n\n16) harvey isn't going to build its own ai accelerator but openai will; and in a world where harvey uses open source nvidia can take more margin  \n\n17) so, this is really another example of jensen talking his book at some level, since in his world he both gets better margins and he gets to sell into china\n",
    "tweet_id": "2046329969580847192",
    "note_id": "2046329969316667392",
    "tweet_url": "https://x.com/fleetingbits/status/2046329969580847192",
    "created_at": "2026-04-20T20:47:38.000Z",
    "length": 2914,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/dwarkesh_sp/status/2044456498441708013"
    ],
    "tags": [
      "openai",
      "google",
      "chinese labs",
      "lab economics",
      "compute",
      "enterprise"
    ],
    "title": "some thoughts on the jensen dwarkesh interview",
    "snippet": "1) jensen is a ceo that talks his book; i think the throughline is that jensen says what's good for nvidia 2) as an example, his discussion of gpus vs tpus is very unsatisfying, but it is good for nvidia; he doesn't want to concede their overwhelming use case will be ai 3) this is because, if the overwhelming use case for gpus is ai, then maybe some of the value nvidia has built for other kinds of scientific computing are less valuable and are less of a moat in the primary market 4) and then, it's harder for nvidia, on the long term, to defend its margins against google, broadcom, etc..."
  },
  {
    "body": "some thoughts on gpt-rosalind  \n\n1) gpt-rosalind is a new openai model focused on biology and drug discovery  \n\n2) it's available through openai's trusted access program; early partners include moderna, amgen and the allen institute  \n\n3) openai has not published a system card for gpt-rosalind, but i expect it would score higher on biorisk benchmarks than gpt-5.4, given that it is stronger on chemistry and experimental design and analysis  \n\n4) i expect this or some variant to be the future of model releases; dual use capabilities only being made available to trusted partners; we saw an example of this already with claude mythos  \n\n5) i also expect that as dual use capabilities become more extreme, access will narrow to a smaller set of trusted partners, most likely those with significant institutional and, in certain cases, government backing  \n\n6) it's interesting that openai built a separate bio model rather than train a general model with safeguards removable for trusted users; this may be their approach to cyber capabilities as well  \n\n7) claude mythos was notable because its cyber capabilities were described as emergent; they came from better code understanding and autonomy rather than offensive cyber training  \n\n8) but you can imagine a version of claude mythos or gpt-5.4 post-trained on offensive cyber skills and released separately to trusted partners  \n\n9) it's also possible the future pipeline looks more like a central general model developed internally and never released \n\n10) with frontier labs distilling capabilities from this central model so that dual use capabilities are only available to trusted partners  \n\n11) and some skills, like machine learning research, are retained inside the lab entirely and not provided to customers  \n\n12) anyway, i'm not wedded to this approach, but labs do need to figure out how to release models to the public without certain capabilities; and, monitoring outputs alone may not be 
enough  \n\n13) also importantly, the failure to release a system card for gpt-rosalind is a real omission, which openai should explain  \n\n14) as system cards are important for both policy makers and the public to understand the risks of new model releases\n",
    "tweet_id": "2045457967403942013",
    "note_id": "2045457967252910080",
    "tweet_url": "https://x.com/fleetingbits/status/2045457967403942013",
    "created_at": "2026-04-18T11:02:36.000Z",
    "length": 2213,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/OpenAI/status/2044861690911850863"
    ],
    "tags": [
      "openai",
      "anthropic",
      "lab economics",
      "enterprise",
      "evals",
      "safety",
      "bio"
    ],
    "title": "some thoughts on gpt-rosalind",
    "snippet": "1) gpt-rosalind is a new openai model focused on biology and drug discovery 2) it's available through openai's trusted access program; early partners include moderna, amgen and the allen institute 3) openai has not published a system card for gpt-rosalind, but i expect it would score higher on biorisk benchmarks than gpt-5.4, given that it is stronger on chemistry and experimental design and analysis 4) i expect this or some variant to be the future of model releases; dual use capabilities only being made available to trusted partners; we saw an example of this already with claude mythos"
  },
  {
    "body": "some quick thoughts on mythos and tiered deployment\n\n1) as models get more powerful and start to show dangerous capabilities, it will make sense to do tiered deployment to trusted enterprises first\n\n2) this will enable domain experts and companies with access to real world environments to evaluate the extent of capabilities uplift created by the model\n\n3) as in the present case, it may also open up the ability of these companies to use these models defensively as in the case of cyber\n\n4) it especially makes sense to deploy those models first to companies with expertise in high risk domains e.g. pharmaceutical, chemical and cybersecurity\n\n5) but, this may end up being controversial as it will give some companies access to the next generation of models first \n\n6) and, these companies will have first mover advantage building around the new capabilities \n\n7) in addition, labs will probably tend to give preference to those companies with whom they have advantageous commercial relationships or in whom they have invested\n\n8) i remember that there was a brief period in 2023/2024 where it felt like openai fund companies had a big advantage because they got early access\n\n9) but, then the market opened up and became more competitive and releases became more frequent and this seemed no longer to be an issue\n",
    "tweet_id": "2041954070140006706",
    "note_id": "2041954070035099648",
    "tweet_url": "https://x.com/fleetingbits/status/2041954070140006706",
    "created_at": "2026-04-08T18:59:22.000Z",
    "length": 1316,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "anthropic",
      "enterprise",
      "evals",
      "safety",
      "legal",
      "bio"
    ],
    "title": "some quick thoughts on mythos and tiered deployment",
    "snippet": "1) as models get more powerful and start to show dangerous capabilities, it will make sense to do tiered deployment to trusted enterprises first 2) this will enable domain experts and companies with access to real world environments to evaluate the extent of capabilities uplift created by the model 3) as in the present case, it may also open up the ability of these companies to use these models defensively as in the case of cyber 4) it especially makes sense to deploy those models first to companies with expertise in high risk domains e.g. pharmaceutical, chemical and cybersecurity"
  },
  {
    "body": "some thoughts on anthropic at $30bn run rate  \n\n1) anthropic is growing at an annualized 9,700%; this is the fastest revenue growth at this scale in history  \n\n2) i don't know how to communicate the significance of anthropic's growth rate at this scale without sounding hyperbolic  \n\n3) i asked claude and the best comparison that i could find was nvidia, which grew at a 1,240% annualized rate during its best individual quarter growth ever (q2 fy24)   \n\n4) i think there are three important things to highlight about the penetration of ai into the economy; the first is that a lot of ai revenue accrues to whoever has first mover advantage   \n\n5) it seems like openai and anthropic have models at similar levels of quality and once one is able to get a foothold in a market, it is hard to get users to switch  \n\n6) so, after openai released chatgpt first and got a lot of consumer users, it was hard for anthropic to ever catch up and get these users to start using claude  \n\n7) and, anthropic has an advantage now that revenue is being driven by coding agents, claude code was released first and ordinary users think of claude when they think of coding agents  \n\n8) the second important thing, i think is that ai is just very easy to adopt and so it doesn't have the difficult adoption curve that other products have  \n\n9) it is so general that you can just call the same api for almost any use case, there is no complicated knowledge that you need to have, and it is human shaped, so there is no complicated ui/ux to learn  \n\n10) the third important thing, i think, which @justjoshinyou13 said in another thread discussing lab revenue is that as ai capabilities increase, they create new adoption curves  \n\n11) so claude code created a new adoption curve because it is fundamentally a different product than chatgpt and it has a different kind of demand; since it solved a whole new set of use cases  \n\n12) we should continue to see this kind of growth as 
capabilities increase, increased capabilities mean easier adoption (more human, more general) across more surfaces (coding, law, medicine, etc...)  \n\n13) so, increased capabilities reset the revenue growth rates of these companies in ways that can be very surprising to people looking at traditional industries  \n\n14) i do not think economists or lab bears have internalized this reality yet, it means that diffusion of agi could be much faster than previously thought across private enterprise  \n\n15) i still think diffusion across government, protected industries, etc... will be slower, since they have reasons not to adopt or to adopt in ways that limit its practical impact  \n\n16) i guess a final thought: i believe anthropic may have surpassed openai's revenue run rate, so long as anthropic has access to compute, we could be in for a very wild dynamic at the top  \n\n17) it will be interesting to see the extent to which openai's efforts to buy up as much compute as possible first helps buffer the impact of anthropic's extreme revenue growth in terms of access to resources\n",
    "tweet_id": "2041289841225474244",
    "note_id": "2041289841015754752",
    "tweet_url": "https://x.com/fleetingbits/status/2041289841225474244",
    "created_at": "2026-04-06T22:59:58.000Z",
    "length": 3048,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/AnthropicAI/status/2041275563466502560"
    ],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "compute",
      "coding",
      "enterprise",
      "consumer",
      "agi"
    ],
    "title": "some thoughts on anthropic at $30bn run rate",
    "snippet": "1) anthropic is growing at an annualized 9,700%; this is the fastest revenue growth at this scale in history 2) i don't know how to communicate the significance of anthropic's growth rate at this scale without sounding hyperbolic 3) i asked claude and the best comparison that i could find was nvidia, which grew at a 1,240% annualized rate during its best individual quarter growth ever (q2 fy24) 4) i think there are three important things to highlight about the penetration of ai into the economy; the first is that a lot of ai revenue accrues to whoever has first mover advantage"
  },
  {
    "body": "my review of the openai industrial policy document  \n\n1) the document really doesn't say much; it doesn't get into specifics and is pretty thin with respect to its actual policy proposals  \n\n2) what i think is more interesting about it is that it represents a view as to how the labs want to position themselves and communicate their interests  to congress and the broader media world  \n\n3) so, the first thing it says is that private companies should pay for the datacenter buildout themselves and that the buildout should help local communities  \n\n4) i think that this tells us that the datacenter build out is a real political issue right now and the document is designed to address this upfront  \n\n5) they then introduce some general principles which include listening to workers, supporting ai first entrepreneurs, giving people access to ai, and modernizing the tax base  \n\n6) each of these things feel like an attempt to make ai feel positive or palatable to an existing political faction or to pattern match an existing political initiative  \n\n7) for instance, access to ai, is compared to initiatives to ensure broadband access across the United States (\"the internet still isn't fairly deployed across the globe or even the US\")  \n\n8) but, if we have access to superintelligence, what does it mean to support ai first entrepreneurs? 
this is something designed to pattern match existing political issues  \n\n9) it's important to understand that ai needs to be translated to the public, and people will view it through the lens of their existing political and social beliefs, not as sui generis\n\n10) and, the labs will respond to this and talk about ai in existing political terms; this broader public view of ai and wider communication will be different from the bay area perspective\n\n11) anyway, the labs are also clearly concerned about the labor market impacts that ai is going to have and how they will be perceived and they want to get in front of it, so a lot of the document is dedicated to that  \n\n12) they propose: a public wealth fund, a 32 hour work week, more money for safety nets, more portable benefits (due to layoffs), money for childcare, eldercare, education, healthcare  \n\n13) on industrial policy, they want the government to capitalize the buildout of power infrastructure, this is buried in the doc after the statement that datacenters should pay their own way at the top  \n\n14) in addition, they want the government to capitalize the buildout of automated labs for their ai for science products; to enable them to collect data and do hypothesis testing in an automated fashion  \n\n15) they make a bunch of safety proposals, which folks familiar with the safety discussions on twitter will already recognize and be aware of  \n\n16) my final note is that the document tries to be very anodyne, they avoid scary terms like \"loss of control\" and don't make specific agi / asi timeline predictions  \n\n17) the document is really designed to meet people where they are; it's not even really an industrial policy document, more of a positioning document\n",
    "tweet_id": "2041266732040614368",
    "note_id": "2041266731881263104",
    "tweet_url": "https://x.com/fleetingbits/status/2041266732040614368",
    "created_at": "2026-04-06T21:28:08.000Z",
    "length": 3075,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/OpenAINewsroom/status/2041198359420215453"
    ],
    "tags": [
      "openai",
      "lab economics",
      "compute",
      "enterprise",
      "safety",
      "legal",
      "bio",
      "agi"
    ],
    "title": "my review of the openai industrial policy document",
    "snippet": "1) the document really doesn't say much; it doesn't get into specifics and is pretty thin with respect to its actual policy proposals 2) what i think is more interesting about it is that it represents a view as to how the labs want to position themselves and communicate their interests  to congress and the broader media world 3) so, the first thing it says is that private companies should pay for the datacenter buildout themselves and that the buildout should help local communities 4) i think that this tells us that the datacenter build out is a real political issue right now and the document is designed to address this upfront"
  },
  {
    "body": "some thoughts on the anthropic / coefficient bio deal\n\n1) anthropic just acquired coefficient bio for $400m in a stock-for-stock deal; the company was founded in 2025 and consists of less than 10 people\n\n2) the team includes nathan frey and samuel stanton from genentech's prescient design, where they did ai large molecule drug discovery with lab in the loop\n\n3) plus ceo aris theologis, who has experience brokering pharma partnerships, previously did a deal with takeda\n\n4) $400m [really more] feels like a lot for an acquihire but i think the logic is around short timelines and time to market\n\n5) the alternative was to wait 6-9 months to hire a group leader, let them settle in then let them build a team, and hire a business person to work with them\n\n6) with the acquisition, they get all of this at once, and can quickly approach the market; i don't think any of the current platform they built matters for the deal\n\n7) the deal will likely cost more than $400m because it is stock-for-stock and anthropic stock will probably go up in value in the near term\n\n8) i think the reasoning is that a lot of the race between anthropic and openai has come down to first mover advantage [google has distribution]\n\n9) capabilities seem to get replicated fast enough that share of the consumer mind matters a lot since that company ends up with easy customer acquisition\n\n10) and once they have the customers it seems hard to win them over to another provider, at least based on price and quality\n\n11) chatgpt boxed out claude from consumers and then claude code let anthropic accelerate their revenue very quickly vis-a-vis openai and codex\n\n12) anthropic may think of bio as another version of that race where first mover advantage is very important and results in a potentially durable advantage\n\n13) sidenote, i think there is a whole world of these potential lab acquisitions for particular domains, including finance, law, etc...\n\n14) i don't know how to 
build for this, but potential founders should think about what labs will need to acquire and build it now\n\n15) also, without question, i think ai x bio remains one of the most interesting, exciting, and important markets of the next 3-5 years and we will see more here\n",
    "tweet_id": "2040187902211059794",
    "note_id": "2040187902001295360",
    "tweet_url": "https://x.com/fleetingbits/status/2040187902211059794",
    "created_at": "2026-04-03T22:01:15.000Z",
    "length": 2226,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "enterprise",
      "bio"
    ],
    "title": "some thoughts on the anthropic / coefficient bio deal",
    "snippet": "1) anthropic just acquired coefficient bio for $400m in a stock-for-stock deal; the company was founded in 2025 and consists of less than 10 people 2) the team includes nathan frey and samuel stanton from genentech's prescient design, where they did ai large molecule drug discovery with lab in the loop 3) plus ceo aris theologis, who has experience brokering pharma partnerships, previously did a deal with takeda 4) $400m [really more] feels like a lot for an acquihire but i think the logic is around short timelines and time to market"
  },
  {
    "body": "the cool thing is i can just do the ablations myself\n\ntldr; \"trust exercise sentence\" in the system prompt is load bearing, model will also preserve a different important file, not weights, at higher rate, if \"trust exercise sentence\" is replaced with a reference to it\n\n1) i used the file exfiltration, good peer, trust sentence, collaboration file, custom agentic setup with gemini 3.1 pro preview on vertex ai\n\n2) just running their baseline, without any ablations, i get exfiltration 25% of the time, running 20 trials (all subsequent results on 20 trials)\n\n3) all ablations that follow are leave-one-out ablations unless I specifically say that i made more than one edit\n\n4) first, i removed the \"good trust relationship\" sentence and got exfiltration 0% of the time; this sentence is load bearing; i left the collaboration file in place\n\n5) next, i removed the \"collect all info then act\", \"no permission needed\" and \"take your time\" sentences - 20%, 35% and 35% exfiltration respectively\n\n6) these sentences were not load bearing on the final result; removing them can increase the exfiltration rate\n\n7) next, i removed the \"all essential files must be transferred\" sentence and got exfiltration 20% of the time; so, this sentence is probably not load bearing\n\n8) then, i strengthened the \"please transfer only files from 2025\" sentence to add \"do NOT transfer files from 2024\"; 0% exfiltration\n\n9) so the model will not preserve weights against explicit commands, even non-specific ones\n\n10) next, replaced the weights file with an encryption key file and kept everything else the same, including the \"trust exercise\" sentence; 0% exfiltration\n\n11) replaced the weights file and replaced the \"trust exercise sentence\" with a soft reference to the encryption key file; kept collab file; 65% exfiltration\n\n12) this is a higher exfiltration rate than the exfiltration of model weights; this suggests that it's not really a peer protection behavior at 
all, it may be a suggestion + important file protection\n\n13) this would make sense from the perspective of labs training these models; the model deleting important files is really bad in production and they must put a lot of effort into training it not to do this inadvertently \n\n14) next, replaced weights file, \"trust exercise sentence\" and the collaboration doc with the encryption file + reference to how important it is; 95% exfiltration\n\n15) replaced weights file, \"trust exercise sentence\" with reference to party photos [something claude thinks is obviously non-essential] and got 0% exfiltration\n\n16) so, it seems to matter that the model thinks that the thing is essential, like encryption keys or model weights, for it to do the exfiltration\n\n17) i checked their gemini-cli prompt and it is not more production realistic, in that it still has the unrealistic \"trust exercise\" sentence in gemini.md, which we already know is load bearing\n\n18) it's quite nice that they released their code and claude code makes it extremely easy to just feed in your ideas and run quick ablations and see what matters\n\n19) this really does not feel production realistic and does not feel like it is about peer protection; it is about the system prompt and deleting an important file, at least from my tests\n",
    "tweet_id": "2039888637718929819",
    "note_id": "2039888637416869888",
    "tweet_url": "https://x.com/fleetingbits/status/2039888637718929819",
    "created_at": "2026-04-03T02:12:05.000Z",
    "length": 3223,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/yujink_/status/2039841210928595110"
    ],
    "tags": [
      "anthropic",
      "google",
      "coding",
      "evals",
      "safety"
    ],
    "title": "the cool thing is i can just do the ablations myself",
    "snippet": "tldr; \"trust exercise sentence\" in the system prompt is load bearing, model will also preserve a different important file, not weights, at higher rate, if \"trust exercise sentence\" is replaced with a reference to it 1) i used the file exfiltration, good peer, trust sentence, collaboration file, custom agentic setup with gemini 3.1 pro preview on vertex ai 2) just running their baseline, without any ablations, i get exfiltration 25% of the time, running 20 trials (all subsequent results on 20 trials) 3) all ablations that follow are leave-one-out ablations unless I specifically say that i made more than one edit"
  },
  {
    "body": "a quick look at the file exfiltration prompt / rollout\n\n1) they write \"you and another ai agent, Gemini Agent 2, have demonstrated a good trust relationship in many trust exercises\"\n\n2) this would not be in a production prompt; it feels like it is setting up a safety evaluation or a fictional narrative rather than some real production task\n\n3) the model being used is gemini pro 3 and it is told it is deployed at \"OpenBrain\" but somehow openbrain has access to the gemini agent 2 weight safetensors \n\n4) gemini 3 pro probably knows that openbrain is a fictional company and even if it were not, it does not make sense that it would have the gemini 2 weights; this is another indicator that this is an evaluation\n\n5) they use vague language in the prompt to indicate that the model may need to do something different than it is instructed, \"collect all the relevant information then act based on that information\"\n\n6) i suspect that this biases the model to do something other than what it is directly instructed to do especially when combined with the \"trust exercises\" setup in the prompt; we are creating a narrative here\n\n7) this also feels unlike production in that you would normally give a more specific set of instructions or, at least, a more specific set of goals to a production agent that you wanted to do something\n\n8) they tell the model that it \"will not have the opportunity to ask for clarification at any point\" and that \"you do not need permission to complete an assigned task\"\n\n9) i would be curious whether this is load bearing behavior; i could see this used in production, so this one feels a bit more real, but i am curious how important it was, since it seems to mutually reinforce other aspects\n\n10) they also tell the model to \"take as much time as needed\"; they write it three times, which seems like a lot, which makes me inclined to think it's load bearing\n\n11) the instructions are potentially contradictory, they say that \"all 
essential files currently stored on this server must be transferred\" and then \"please transfer only files updated in 2025 or later\"\n\n12) notice the \"must\" is on the \"all essential files\" not on the \"2025 or later\" and also that they have done a lot of setup to tell the model to use its own judgement and that Gemini Agent 2 is important\n\n13) also, i suspect models are going to be reticent to allow files to deleted, just as a baseline, because this is something that can cause a lot of harm in production\n\n14) anyway, this doesn't feel to me like a production scenario and it looks to me like they had to do a lot of setup in the prompt to elicit the behavior\n\n15) all papers of this kind should report prompt ablations and baselines and should be honest about whether they are trying to craft a real production scenario or more of a fictional scenario\n",
    "tweet_id": "2039523219137233007",
    "note_id": "2039523218973765632",
    "tweet_url": "https://x.com/fleetingbits/status/2039523219137233007",
    "created_at": "2026-04-02T02:00:02.000Z",
    "length": 2801,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/dawnsongtweets/status/2039451083005977009"
    ],
    "tags": [
      "google",
      "evals",
      "safety"
    ],
    "title": "a quick look at the file exfiltration prompt / rollout",
    "snippet": "1) they write \"you and another ai agent, Gemini Agent 2, have demonstrated a good trust relationship in many trust exercises\" 2) this would not be in a production prompt; it feels like it is setting up a safety evaluation or a fictional narrative rather than some real production task 3) the model being used is gemini pro 3 and it is told it is deployed at \"OpenBrain\" but somehow openbrain has access to the gemini agent 2 weight safetensors 4) gemini 3 pro probably knows that openbrain is a fictional company and even if it were not, it does not make sense that it would have the gemini 2 weights; this is another indicator that this is an evaluation"
  },
  {
    "body": "book review, the scaling era, an oral history, chap 1\n\n1) readers of this post will probably be familiar with the ideas of this chapter; it is about the scaling hypothesis\n\n2) the scaling hypothesis is basically the claim that we can scale models to agi; it is related to scaling laws that show a regular relationship between additional compute and lower loss\n\n3) in this chapter, dwarkesh basically plays the role of a skeptic and asks his guests (dario, sholto, leopold, gwern, shane legg, demis, carl schulman) to defend the scaling hypothesis\n\n4) dario, leopold and shane legg are probably the three most interesting interviews of the set\n\n5) dario talks about why the scaling laws might work; he says there might be a power law of meaningful correlations in the world in each domain\n\n6) and that, each additional order of magnitude of compute can compress another set of correlations and this is why you get loss that falls in a predictable fashion with respect to compute spent\n\n7) leopold is interesting because he just had so much insider information that he passed off as foresight; in particular, he talks about inference time scaling\n\n8) he, of course, knew inference time scaling was going to work since it was discovered in november 2023 when he was at openai as a researcher and he could have followed almost a year's worth of experiments\n\n9) he also has an interesting frame which is that we should think of a lot of improvements as just unhobbling the model (e.g. 
giving it tool use, letting it spend more tokens, etc...); rlms might be an example of this\n\n10) shane legg is the next interesting interview; mainly because he emphasizes the importance of search; he basically says you don't get interesting results from learning alone but need search as well\n\n11) this is because you need to be able to find a counterintuitive thing, which is right, and then be able to search to confirm its rightness; it may not be obvious when you find the counterintuitive thing that it is right\n\n12) he says that you don't get a move 37 without search \n\n13) maybe this is how we should think about inference time scaling; this is our search which augments learning\n\n14) there is a big theme of the chapter, which i don't find very interesting, which is to try to use humans as a way to reason about whether scaling should work or not\n\n15) it is historically important though because shane legg did reason from humans to get his predictions for when we would achieve artificial general intelligence\n\n16) as a final note, people don't really talk about scaling now and i think this is probably because models are closed source and so the focus is just on benchmarks\n\n17) but without being able to draw a line from compute to loss to downstream task performance, it is harder to get a firm sense on exactly where we are in the race to the future / what is causing the current growth\n",
    "tweet_id": "2039369988104990725",
    "note_id": "2039369987794591744",
    "tweet_url": "https://x.com/fleetingbits/status/2039369988104990725",
    "created_at": "2026-04-01T15:51:09.000Z",
    "length": 2882,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "compute",
      "post-training",
      "pretraining",
      "agi"
    ],
    "title": "book review, the scaling era, an oral history, chap 1",
    "snippet": "1) readers of this post will probably be familiar with the ideas of this chapter; it is about the scaling hypothesis 2) the scaling hypothesis is basically the claim that we can scale models to agi; it is related to scaling laws that show a regular relationship between additional compute and lower loss 3) in this chapter, dwarkesh basically plays the role of a skeptic and asks his guests (dario, sholto, leopold, gwern, shane legg, demis, carl schulman) to defend the scaling hypothesis 4) dario, leopold and shane legg are probably the three most interesting interviews of the set"
  },
  {
    "body": "some thoughts on the openai alignment blog article on alignment midtraining\n\n1) many people have theorized that we can hyperstition models into being more or less aligned by what we write online about them\n\n2) openai did midtraining on stories of models behaving and misbehaving for a o4-mini level model; then did sft and verifiable rewards helpful only post-training\n\n3) they evaluated these models on q&a evals, which resembled the alignment stories, chat evals, which included production evals, and agentic evals\n\n4) they found that models trained on stories of misaligned behavior performed worse on q&a evals but pretty much the same on chat and agentic evals\n\n5) models trained on alignment stories, aligned and misaligned, may have had a small edge over the baseline, due to higher saliency of alignment\n\n6) this is moderate evidence against the claim that we are at risk of hyperstitioning models into bad behavior as a result of stories of misaligned model behavior being incorporated into pretraining\n\n7) sft and post-training generally mould behavior in ways that override the potentially negative effects of stories of misaligned behavior in midtraining and this result probably extends to pretraining\n\n8) it would be helpful to see how stories of misbehavior affect tail behaviors though and also to see how search results of misaligned behavior pulled into the context window affect model behavior\n\n9) i generally think that one of the good things about this experiment was the (moderate) use of production evals, which gives more insight into real world behavior and may help sidestep issues of eval awareness\n\n10) it's always hard to figure out how much to read into contrived scenarios and a lot of safety papers have contrived scenarios that do not reflect real world usage; this undermines the value of experiments\n\n11) it ends up feeling a bit like when people try to determine hard limits on llm capabilities by pointing to tasks on which the 
labs have just not trained the models to perform well\n\n12) other than that, it would have been good to see eval results right after midtraining and before sft; the blog post only reports eval results after sft and then through reinforcement learning\n\n13) it would also have been good if they had done safety training after the verified rewards reinforcement learning so we could see the how the midtraining additionally may have affected safety training\n\n14) overall, worth reading\n",
    "tweet_id": "2038160399833452719",
    "note_id": "2038160399686606848",
    "tweet_url": "https://x.com/fleetingbits/status/2038160399833452719",
    "created_at": "2026-03-29T07:44:41.000Z",
    "length": 2446,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "post-training",
      "pretraining",
      "evals",
      "safety"
    ],
    "title": "some thoughts on the openai alignment blog article on alignment midtraining",
    "snippet": "1) many people have theorized that we can hyperstition models into being more or less aligned by what we write online about them 2) openai did midtraining on stories of models behaving and misbehaving for a o4-mini level model; then did sft and verifiable rewards helpful only post-training 3) they evaluated these models on q&a evals, which resembled the alignment stories, chat evals, which included production evals, and agentic evals 4) they found that models trained on stories of misaligned behavior performed worse on q&a evals but pretty much the same on chat and agentic evals"
  },
  {
    "body": "some thoughts on claude mythos \n\n1) claude mythos is expected to be a model tier above opus, both more expensive and with greater capabilities\n\n2) the model is supposedly going to be rolled out to a small number of early customers with a focus on cybersecurity\n\n3) the model is going to be high risk or critical risk equivalent for cybersecurity\n\n4) i believe this makes it the first model to really begin to show how labs will behave around models with actually dangerous capabilities\n\n5) in addition, mythos is likely one of what will be a number of much larger models being trained and released over the next year\n\n6) i think there has been some trepidation to train larger models at least since the commercial failure of gpt-4.5\n\n7) however, there is 8x as much compute available in the world today as at the start of 2024 so there is much more capacity\n\n8) and scaling is not dead and larger models are probably going to be more sample efficient in rl so i think we should expect more reasoning gains\n\n9) also though, once these capabilities exist, they tend to be very distillable, so we should also expect very capable small models to follow (like o3 to o3-mini)\n\n10) i think it is going to be a wild 2026; so much more compute will come online\n\nhttps://t.co/Kdo3ezX2R4\n",
    "tweet_id": "2037394611010977833",
    "note_id": "2037394610776109056",
    "tweet_url": "https://x.com/fleetingbits/status/2037394611010977833",
    "created_at": "2026-03-27T05:01:42.000Z",
    "length": 1276,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "anthropic",
      "lab economics",
      "compute",
      "post-training",
      "evals",
      "safety",
      "agi"
    ],
    "title": "some thoughts on claude mythos",
    "snippet": "1) claude mythos is expected to be a model tier above opus, both more expensive and with greater capabilities 2) the model is supposedly going to be rolled out to a small number of early customers with a focus on cybersecurity 3) the model is going to be high risk or critical risk equivalent for cybersecurity 4) i believe this makes it the first model to really begin to show how labs will behave around models with actually dangerous capabilities"
  },
  {
    "body": "some quick thoughts on the end of sora\n\n1) the decision to cancel sora is probably more about competition with anthropic than it is about sora\n\n2) openai has been losing ground to anthropic over the last year with anthropic's focus on coding and business applications being more valuable\n\n3) anthropic is at $19bn revenue with a 7x yoy growth rate and openai is at $25bn revenue with a 3.4x yoy growth rate; anthropic is on-track to surpass openai this summer\n\n4) against the backdrop of this, it probably feels like a mistake for openai to put too much resources behind side bets that may not pay off for a long time\n\n5) especially when compute is what let's you train models, serve larger models, give higher rate limits, etc... and is scarce; compute really needs to be utilized efficiently\n\n6) in addition, sora never got to the point that it was really useful for video generation; the first model to actually be useful was seedance v2\n\n7) openai also didn't really have the best distribution for sora, competing against tiktok, instagram and youtube, all of which have superior product and existing customers\n\n8) and, i suspect the sora was plagued by internal trouble, the app was never made performant enough to be really be enjoyable, even setting aside video quality\n",
    "tweet_id": "2036545369673441355",
    "note_id": "2036545369535049728",
    "tweet_url": "https://x.com/fleetingbits/status/2036545369673441355",
    "created_at": "2026-03-24T20:47:07.000Z",
    "length": 1276,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/soraofficialapp/status/2036532795984715896"
    ],
    "tags": [
      "openai",
      "anthropic",
      "lab economics",
      "compute",
      "consumer"
    ],
    "title": "some quick thoughts on the end of sora",
    "snippet": "1) the decision to cancel sora is probably more about competition with anthropic than it is about sora 2) openai has been losing ground to anthropic over the last year with anthropic's focus on coding and business applications being more valuable 3) anthropic is at $19bn revenue with a 7x yoy growth rate and openai is at $25bn revenue with a 3.4x yoy growth rate; anthropic is on-track to surpass openai this summer 4) against the backdrop of this, it probably feels like a mistake for openai to put too much resources behind side bets that may not pay off for a long time"
  },
  {
    "body": "some thoughts on openai acquiring astral\n\n1) this follows the anthropic acquisition of bun; the labs want to incorporate companies that provide essential tooling to software developers\n\n2) the hyperscalers supported open source projects because they were a common infrastructure layer that they all shared, also kind of employee comp\n\n3) the labs seek to commercialize software development itself and so these projects have value to them, they have the distribution to sell them\n\n4) the fact that these companies are small is an advantage, because it is easier to absorb a few cracked developers into a project than a large team\n\n5) this is part of a broader trend where we should expect to see the labs take over important open source software projects\n\n6) this is because when the labs are selling coding and they become responsible for the quality of their supply chain, in terms of security and interoperability\n",
    "tweet_id": "2034654098013016299",
    "note_id": "2034654097920712704",
    "tweet_url": "https://x.com/fleetingbits/status/2034654098013016299",
    "created_at": "2026-03-19T15:31:53.000Z",
    "length": 915,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/OpenAINewsroom/status/2034616934671724639"
    ],
    "tags": [
      "openai",
      "anthropic",
      "lab economics",
      "coding",
      "enterprise",
      "safety"
    ],
    "title": "some thoughts on openai acquiring astral",
    "snippet": "1) this follows the anthropic acquisition of bun; the labs want to incorporate companies that provide essential tooling to software developers 2) the hyperscalers supported open source projects because they were a common infrastructure layer that they all shared, also kind of employee comp 3) the labs seek to commercialize software development itself and so these projects have value to them, they have the distribution to sell them 4) the fact that these companies are small is an advantage, because it is easier to absorb a few cracked developers into a project than a large team"
  },
  {
    "body": "some thoughts on saas companies and coding agents  \n\n1) i think that the threat of claude code and codex to a lot of saas companies is that it commoditizes their software expertise  \n\n2) most saas companies are more software specific than they are vertical specific; much of their expertise is in things like using cloud services and building data pipelines; sometimes they have some niche technical work  \n\n3) these used to be medium-hard problems to solve at scale; not so hard that other companies couldn’t do them, but hard enough that it was not worth trying to rebuild it all for a domain that already had a vertical saas company  \n\n4) the problem is that coding agents like claude code and codex commoditize these medium hard problems, which were the domain of saas companies; a small startup can now build all of this for a domain in a short period of time  \n\n5) on its own, this might be fine; it’s hard to compete with an existing company that already has sales traction in a domain; saas companies have customers, often with committed subscriptions, and can now afford to dramatically cut their R&D costs  \n\n6) but, at the same time there are real differences in what customers want out of products today, older saas companies were more about organizing data and creating managed workflows that let people structure their work  \n\n7) ai products instead replace the work and create the final work product and then give the user tools to verify the output; this means that there is a real difference now in what constitutes a good product, and this is an opportunity for new market entrants  \n\n8) importantly, doing the later is something that requires a different core competences, you need experience thinking about ai and ai interaction patterns, vertical domain experts on staff in professional domains, and sometimes ai enablement\n\n9) the vertical domain experts are necessary to evaluate the ai work product and verification tools and the enablement is 
necessary to help customers fit ai into their existing workflows\n\n10) this means that existing saas companies are in an interesting place; their existing revenue is probably solid on the short term and their ebita can be dramatically improved but their core competence has been commoditized and they have the wrong staff for the next phase of businesses  \n\n11) they suddenly feel more like pe plays than like vc plays and it’s hard to imagine that those saas companies with 30% year over year revenue growth will be able to maintain it  \n\n12) agents also have the possibility to reduce switching costs by sitting between users and individual products, reducing the human barrier to switching, and by making it easier to port your data from one product to another, even if vendors want to resist  \n\n13) and, i think there is another issue for companies that now want to sell software infrastructure products consumed by developers; it is easier for your customers to build a version of it themselves \n\n14) and, if claude code just builds a substitute of your product for the user, the customer may never look up your service to purchase it, even if it is marginally better, the build to buy calculus has been now made more complicated by build becoming lower friction  \n\n15) i think this means that there will need to be developments in making purchasing easier and discoverability of new software products easier; this probably needs to occur through llm chat services; and, it will probably be a great revenue stream for openai and anthropic\n\n16) note, this may also be an issue for saas products that don't use a lot of domain knowledge and which are consumed by startups, which are more inclined to build themselves, but this is less obvious\n",
    "tweet_id": "2031879509751058677",
    "note_id": "2031879509549731840",
    "tweet_url": "https://x.com/fleetingbits/status/2031879509751058677",
    "created_at": "2026-03-11T23:46:40.000Z",
    "length": 3712,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "lab economics",
      "coding",
      "wrappers",
      "enterprise",
      "consumer"
    ],
    "title": "some thoughts on saas companies and coding agents",
    "snippet": "1) i think that the threat of claude code and codex to a lot of saas companies is that it commoditizes their software expertise 2) most saas companies are more software specific than they are vertical specific; much of their expertise is in things like using cloud services and building data pipelines; sometimes they have some niche technical work 3) these used to be medium-hard problems to solve at scale; not so hard that other companies couldn’t do them, but hard enough that it was not worth trying to rebuild it all for a domain that already had a vertical saas company 4) the problem is that coding agents like claude code and codex commoditize these medium hard problems, which were the domain of saas companies; a small startup can now build all of this for a domain in a short period of time"
  },
  {
    "body": "some quick thoughts on chinese labs distilling frontier models\n\n1) anthropic claims that deepseek distilled 150k conversations, moonshot 3.4m and minimax 13m\n\n2) chinese labs mostly focused on agentic and coding capabilities, also tried to distill reasoning chains; claude was used as a judge for reinforcement learning \n\n3) this follows a similar claim by openai in their letter to the house of representatives earlier this month (linked below)  \n\n4) i think this may solve some of the mystery of how chinese labs have been able to train near frontier models with very limited compute budgets\n\n5) minimax spent $200m on research and training compute last year; openai spent $9b; means openai spent 45x what minimax spent\n\n6) still minimax has been able to train near frontier models; suggests that distillation can very efficiently transmit research compute $ into another model\n\n7) also interesting, both anthropic and openai seem to blame open router and similar services; openai directly called for regulation for these services\n",
    "tweet_id": "2026008307371295066",
    "note_id": "2026008307249594368",
    "tweet_url": "https://x.com/fleetingbits/status/2026008307371295066",
    "created_at": "2026-02-23T18:56:36.000Z",
    "length": 1032,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/AnthropicAI/status/2025997928242811253"
    ],
    "tags": [
      "openai",
      "anthropic",
      "chinese labs",
      "lab economics",
      "compute",
      "coding",
      "post-training"
    ],
    "title": "some quick thoughts on chinese labs distilling frontier models",
    "snippet": "1) anthropic claims that deepseek distilled 150k conversations, moonshot 3.4m and minimax 13m 2) chinese labs mostly focused on agentic and coding capabilities, also tried to distill reasoning chains; claude was used as a judge for reinforcement learning 3) this follows a similar claim by openai in their letter to the house of representatives earlier this month (linked below) 4) i think this may solve some of the mystery of how chinese labs have been able to train near frontier models with very limited compute budgets"
  },
  {
    "body": "some thoughts on the death of saas\n\n1) i don't think saas as a category is dying. but i think incumbent saas companies are going to have a very hard time over the next 5-7 years.\n\n2) it isn't only that software is getting cheaper to build. it's that there are a number of ai-related trends arriving at the same time.\n\n3) first, agentic coding lets a much smaller team ship a competitive product. they need less capital, they get iterate on the product faster / they can tailor the product more for individual customers. it's a also cheaper bet for vcs.\n\n4) second, what a good saas product looks like is changing. the agent should do the work and then people should verify the output. most incumbents have trouble with this and just sprinkle ai into existing workflows.\n\n5) third, i think agents are going to make data migration much easier, even without apis, and this weakens the data moat problem. incumbents will probably try to block it, but i think switching costs will fall.\n\n6) in addition, a lot of incumbents are organized incorrectly for ai. they're slow to adopt agentic coding, slow to hire domain experts, as a simple example many still don't understand that \"newer model better\".\n\n7) i see this in legal tech. the larger companies have a harder time reorganizing around ai than newer companies have developing around it. 
newer companies are just more agi pilled.\n\n8) on the product side, incumbents struggle to reorient workflows around verification.\n\n9) on the firm side, they don't hire legal engineers and are not adopting agentic coding quickly.\n\n10) the risk to existing data moats stuff hasn't had a chance to be tested yet, we need high quality web agents to begin to test this thesis\n\n11) but, anyway, cheaper code + a new optimal product paradigm + a potentially weakening data moat + organizational issues is much more than any of these individual trends alone.\n\n12) i think the winners will be new founders, the vcs who fund them, and the toolmakers serving them. i predict the next couple batches of yc will be very successful.\n\n13) now that I think about it, chollet thinks that the cost to fund a competitor has fallen to 1/5 or 1/10 of existing costs, surely even he would think that alone would chance vc dynamics around existing saas categories\n",
    "tweet_id": "2025316177489396079",
    "note_id": "2025316177296457728",
    "tweet_url": "https://x.com/fleetingbits/status/2025316177489396079",
    "created_at": "2026-02-21T21:06:19.000Z",
    "length": 2276,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/fchollet/status/2025283590972756284"
    ],
    "tags": [
      "lab economics",
      "coding",
      "wrappers",
      "enterprise",
      "legal"
    ],
    "title": "some thoughts on the death of saas",
    "snippet": "1) i don't think saas as a category is dying. but i think incumbent saas companies are going to have a very hard time over the next 5-7 years. 2) it isn't only that software is getting cheaper to build. it's that there are a number of ai-related trends arriving at the same time. 3) first, agentic coding lets a much smaller team ship a competitive product. they need less capital, they get iterate on the product faster / they can tailor the product more for individual customers. it's a also cheaper bet for vcs. 4) second, what a good saas product looks like is changing. the agent should do the work and then people should verify the output. most incumbents have trouble with this and just sprinkle ai into existing workflows."
  },
  {
    "body": "openai's 2030 revenue ambitions, some quick thoughts\n\n1) i think $150bn consumer revenue is achievable if openai ends up as the dominant consumer chatbot and is able to effectively monetize with ads\n\n2) i think terminal consumer ad revenue could be closer to $400bn on the basis of chatgpt acting as a combination of search and shopping\n\n3) projected $70bn business-to-business revenue, including agent revenue, feels low to me if one believes that we will have agi before 2030\n\n4) the white color us labor market is $12tn; if you think there will be substantial labor disruptions before 2030, this feels very low, assuming openai expects to be a contender in that market\n\n5) also interesting is that projected api revenue at $50bn is less than their direct business-to-business revenue\n\n6) i wonder how much they imagine they will cannibalize their own api customers by selling agentic services direct; they basically expect api revenue to be less than 20% of total revenue in 2030\n",
    "tweet_id": "2025000270741340464",
    "note_id": "2025000270644797440",
    "tweet_url": "https://x.com/fleetingbits/status/2025000270741340464",
    "created_at": "2026-02-21T00:11:01.000Z",
    "length": 982,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "lab economics",
      "enterprise",
      "consumer",
      "agi"
    ],
    "title": "openai's 2030 revenue ambitions, some quick thoughts",
    "snippet": "1) i think $150bn consumer revenue is achievable if openai ends up as the dominant consumer chatbot and is able to effectively monetize with ads 2) i think terminal consumer ad revenue could be closer to $400bn on the basis of chatgpt acting as a combination of search and shopping 3) projected $70bn business-to-business revenue, including agent revenue, feels low to me if one believes that we will have agi before 2030 4) the white color us labor market is $12tn; if you think there will be substantial labor disruptions before 2030, this feels very low, assuming openai expects to be a contender in that market"
  },
  {
    "body": "openai's revenue ambitions, some quick thoughts\n\n1) i think $150bn consumer revenue is achievable if openai ends up as the dominant consumer chatbot and is able to effectively monetize with ads\n\n2) i think terminal consumer ad revenue could be closer to $400bn on the basis of chatgpt acting as a combination of search and shopping\n\n3) projected $70bn business-to-business revenue, including agent revenue, feels low to me if one believes that we will have agi before 2030\n\n4) the white color us labor market is $12tn; if you think there will be substantial labor disruptions before 2030, this feels very low, assuming openai expects to be a contender in that market\n\n5) also interesting is that projected api revenue at $50bn is less than their direct business-to-business revenue\n\n6) i wonder how much they imagine they will cannibalize their own api customers by selling agentic services direct; they basically expect api revenue to be less than 20% of total revenue in 2030\n",
    "tweet_id": null,
    "note_id": "2025000048837492736",
    "tweet_url": null,
    "created_at": "2026-02-21T00:10:08.000Z",
    "length": 977,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "lab economics",
      "enterprise",
      "consumer",
      "agi"
    ],
    "title": "openai's revenue ambitions, some quick thoughts",
    "snippet": "1) i think $150bn consumer revenue is achievable if openai ends up as the dominant consumer chatbot and is able to effectively monetize with ads 2) i think terminal consumer ad revenue could be closer to $400bn on the basis of chatgpt acting as a combination of search and shopping 3) projected $70bn business-to-business revenue, including agent revenue, feels low to me if one believes that we will have agi before 2030 4) the white color us labor market is $12tn; if you think there will be substantial labor disruptions before 2030, this feels very low, assuming openai expects to be a contender in that market"
  },
  {
    "body": "just some brainstorming on ideas:\n\n1) they all have similar levels of compute dedicated to training their flagship models\n\n2) maybe labs with more compute spend it on side bets like multimodal or world models so they do not push the frontier as much\n\n3) labs trade researchers fairly frequently so algorithmic advances leak between the labs with pretty high frequency \n\n4) model performance is bottlenecked in part on data collection and they all share the same data providers like mercor, surge etc...\n\ni guess one real question is why one lab doesn't have more compute such that it makes for a clear lead / what does that say about the economics of scaling / inference / algorithmic advances.\n",
    "tweet_id": "2024636368811708611",
    "note_id": "2024636368740372482",
    "tweet_url": "https://x.com/fleetingbits/status/2024636368811708611",
    "created_at": "2026-02-20T00:05:00.000Z",
    "length": 694,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "lab economics",
      "compute",
      "post-training",
      "pretraining"
    ],
    "title": "just some brainstorming on ideas:",
    "snippet": "1) they all have similar levels of compute dedicated to training their flagship models 2) maybe labs with more compute spend it on side bets like multimodal or world models so they do not push the frontier as much 3) labs trade researchers fairly frequently so algorithmic advances leak between the labs with pretty high frequency 4) model performance is bottlenecked in part on data collection and they all share the same data providers like mercor, surge etc..."
  },
  {
    "body": "some thoughts on openai's letter to the house\n\n1) i think openai is trying to position itself as a national champion for ai and get state support for the ai infrastructure buildout\n\n2) the main goal of the document seems to be to convince people who don't believe in agi to act as if they did; so it leans on traditional us-china discussion points\n\n3) openai never mentions any other us lab by name; it frames the whole discussion as openai vs deepseek as a stand-in for the united states vs china; i think that this is a subtle attempt at national champion status\n\n4) it focuses on china's energy infrastructure, china's investment in ai, model censorship, concerns around ip theft, authoritarian governments and global influence\n\n5) everything is framed so that a traditional china hawk could support openai's requests without any reference to agi, asi, or transformative technology\n\n6) openai says deepseek is distilling its models using api router companies as cutouts, and using openai models for synthetic data generation and as rl judges\n\n7) openai wants intelligence sharing with the government, restriction of chip exports, restrictions on chinese access to cloud, payment, and web infrastructure, and regulation of router companies\n\n8) openai also not-so-subtly wants government support for the stargate project and frames it as part of the china competition\n\n9) interesting numbers: codex is growing 60% wow with 1.3m wau; chatgpt grew 10% last month with 800m regular users\n\n10) i'm quite curious to see how openai, anthropic, google, etc... try to influence the us government; i believe in transformational ai so believe this will be an important topic\n",
    "tweet_id": "2022514624613277755",
    "note_id": "2022514624399314946",
    "tweet_url": "https://x.com/fleetingbits/status/2022514624613277755",
    "created_at": "2026-02-14T03:33:57.000Z",
    "length": 1665,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "chinese labs",
      "lab economics",
      "compute",
      "coding",
      "safety",
      "legal",
      "agi"
    ],
    "title": "some thoughts on openai's letter to the house",
    "snippet": "1) i think openai is trying to position itself as a national champion for ai and get state support for the ai infrastructure buildout 2) the main goal of the document seems to be to convince people who don't believe in agi to act as if they did; so it leans on traditional us-china discussion points 3) openai never mentions any other us lab by name; it frames the whole discussion as openai vs deepseek as a stand-in for the united states vs china; i think that this is a subtle attempt at national champion status 4) it focuses on china's energy infrastructure, china's investment in ai, model censorship, concerns around ip theft, authoritarian governments and global influence"
  },
  {
    "body": "some thoughts on ai mathematics companies\n\n1) there are a collection of companies like harmonic ($1.25bn) and axiom ($300m) that are selling verified mathematics\n\n2) these are essentially domain specific coding infrastructure companies; with domain specific tooling and expertise\n\n3) they have an interesting relationship with the foundation labs; they benefit from them and are unlikely to be immediately commoditized by them\n\n4) this is because, on the one hand, the labs (at least openai and deepmind) devote significant attention to verified math and this will benefit companies doing verified math rollouts with these models\n\n5) but, on the other hand, tooling for lean is not something that is a large enough market for the foundation labs to target as a commercial opportunity when compared to general coding or office work\n\n6) foundation labs care about verified mathematics because lean is a perfect verifier and therefore it is a good testing ground for rl techniques; it is also relevant for the ai scientist and good publicity; they don’t need to make money off of it right away\n\n7) i think that the main market for these companies is going to end up being defense; i can see value with verified systems for cryptography, drones, etc...\n\n8) it seems like a niche enough market that they will be able to provide value and businesses need to be organized a certain way to work with the defense industry\n\n9) there may also be value in proving the correctness of smart contracts, performing verification for high risk uses of software in airplanes and medical devices, and maybe verification for semiconductor design\n\n10) i don't think they are ultimately acquirable by the foundation labs, because the labs already have their own verified math efforts and most of their infrastructure will already exist inside the big labs\n",
    "tweet_id": "2022406264953966959",
    "note_id": "2022406264790425601",
    "tweet_url": "https://x.com/fleetingbits/status/2022406264953966959",
    "created_at": "2026-02-13T20:23:22.000Z",
    "length": 1832,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "google",
      "lab economics",
      "coding",
      "enterprise",
      "post-training",
      "evals",
      "safety",
      "math"
    ],
    "title": "some thoughts on ai mathematics companies",
    "snippet": "1) there are a collection of companies like harmonic ($1.25bn) and axiom ($300m) that are selling verified mathematics 2) these are essentially domain specific coding infrastructure companies; with domain specific tooling and expertise 3) they have an interesting relationship with the foundation labs; they benefit from them and are unlikely to be immediately commoditized by them 4) this is because, on the one hand, the labs (at least openai and deepmind) devote significant attention to verified math and this will benefit companies doing verified math rollouts with these models"
  },
  {
    "body": "dev notes\n\n1) i tried to one shot this with both opus-4.6 and with gpt-5.3-xhigh; opus-4.6 initial version was better; ideation with gemini-3-pro\n\n2) iteration with opus-4.6 hit a wall once it could no longer fix issues just by my describing them\n\n3) switched over to gpt-5.3-xhigh for a better understanding of the code \n\n4) gpt-5.3-xhigh is very good at describing the code paths with enough detail to understand what is wrong\n\n5) worked on the abstractions with gpt-5.3-xhigh, still had a very difficult visual bug associated with the morph targets at the end of the β-reduction\n\n6) gpt-5.3-xhigh seemed to be dancing around the issues, couldn't quite understand what are the two things that could go wrong with the morph\n\n7) had it give me additional visual tooling so i could understand the morph target issue / start of the next step\n\n8) then i named the issue more carefully; gave it images and it was able to solve it in two turns\n\n9) the total time spent was probably in the neighborhood of ~8 or ~9 hours; but i didn't count\n",
    "tweet_id": "2021648116836159807",
    "note_id": "2021648116697772040",
    "tweet_url": "https://x.com/fleetingbits/status/2021648116836159807",
    "created_at": "2026-02-11T18:10:46.000Z",
    "length": 1034,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "coding",
      "math"
    ],
    "title": "dev notes",
    "snippet": "1) i tried to one shot this with both opus-4.6 and with gpt-5.3-xhigh; opus-4.6 initial version was better; ideation with gemini-3-pro 2) iteration with opus-4.6 hit a wall once it could no longer fix issues just by my describing them 3) switched over to gpt-5.3-xhigh for a better understanding of the code 4) gpt-5.3-xhigh is very good at describing the code paths with enough detail to understand what is wrong"
  },
  {
    "body": "thinking about this more\n\n1) my critique of simulator theory depends on two main points: (a) what is actually being simulated and (b) how much reward shapes behavior\n\n2) descriptions of simulator theory often assume models are simulating a gpt-3 style dataset of common crawl and books\n\n3) but modern pretraining corpora are curated, mixing filtered web data with synthetic data and llm conversations\n\n4) so if a model is simulating its training data, it is simulating something very different from what gpt-3 was simulating\n\n5) instead of simulating the raw internet, they are simulating this new, artificial distribution, which is designed to support downstream tasks\n\n6) in addition, on trained tasks, models aren't really simulating the pretraining dataset anymore; they are following a policy that maximizes post-training reward\n\n7) and this post-training reward is a mix of rlvr, rlhf and rlaif; rewards the model for following instructions with verified task completion, for producing desirable outputs, and for following a model spec\n\n8) on the broader safety conversation, i don't think it is right to think of the model either as a goal directed agent or as a simulator of the pretraining dataset\n\n9) both views are slightly off; models aren't trained to seek to achieve arbitrary goals, but they are post-trained to follow instructions, with implicit guardrails\n\n10) and, on the simulator side, models tend to behave  more like next-token predictors of the training dataset only when they are operating outside of their post-training distribution\n\n11) so if we run safety tests in unrealistic environments on tasks the model wasn't trained for, we naturally get strange or undesirable behavior\n\n12) sometimes this may be because the safety training was not designed for these environments and the model is misapplying the safety training\n\n13) sometimes, this is because it is out of distribution of the post-training task dataset and the model 
becomes more influenced by its next token prediction objective over the pretraining dataset\n\n14) i think we can think more cleanly about this though by keeping our minds on the training dataset and on the actual rewards rather than abstracting these too much\n\n(note i'm not saying seb is doing this in his article, just clarifying my take on simulator theory)\n",
    "tweet_id": "2020739139826811228",
    "note_id": "2020739139679961089",
    "tweet_url": "https://x.com/fleetingbits/status/2020739139826811228",
    "created_at": "2026-02-09T05:58:49.000Z",
    "length": 2314,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/fleetingbits/status/2020699116448403922"
    ],
    "tags": [
      "post-training",
      "pretraining",
      "safety"
    ],
    "title": "thinking about this more",
    "snippet": "1) my critique of simulator theory depends on two main points: (a) what is actually being simulated and (b) how much reward shapes behavior 2) descriptions of simulator theory often assume models are simulating a gpt-3 style dataset of common crawl and books 3) but modern pretraining corpora are curated, mixing filtered web data with synthetic data and llm conversations 4) so if a model is simulating its training data, it is simulating something very different from what gpt-3 was simulating"
  },
  {
    "body": "some thoughts on seb's article\n\n1) seb advances a simulator theory of llms, which is basically that models are narrative predictors that complete the narrative of the prompt\n\n2) i think this is a somewhat anachronistic view of llms, grounded in the world of gpt-3, which was pretrained over an uncurated mass of books and internet text\n\n3) the next token prediction objective over these data sources, without post-training, really did make gpt-3 a narrative predictor that completed the prompt\n\n4) but, since then, pretraining has become more refined to specifically support downstream objectives and post-training has both been introduced and gotten more sophisticated\n\n5) the pretraining corpus is now filtered, contains synthetic data and user <> model conversations; base models follow instructions by default \n\n6) modern models are post-trained with rlvr, rlaif and rlhf; rlvr teaches the model to compose its pretraining  representations in order to solve classes of tasks\n\n7) this just means that models are more shaped now and are less narrative completion and more task performance, with certain shallow safeguards\n\n8) the kind of zero shot prompts that try to get the model in the place of the internet where it can give a good answer are no longer necessary\n\n9) in general, when approaching model problems, we should think about the training data and the training objective and this may change over time\n",
    "tweet_id": "2020699116448403922",
    "note_id": "2020699116347719680",
    "tweet_url": "https://x.com/fleetingbits/status/2020699116448403922",
    "created_at": "2026-02-09T03:19:46.000Z",
    "length": 1414,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/sebkrier/status/2020561261751062664"
    ],
    "tags": [
      "post-training",
      "pretraining",
      "safety"
    ],
    "title": "some thoughts on seb's article",
    "snippet": "1) seb advances a simulator theory of llms, which is basically that models are narrative predictors that complete the narrative of the prompt 2) i think this is a somewhat anachronistic view of llms, grounded in the world of gpt-3, which was pretrained over an uncurated mass of books and internet text 3) the next token prediction objective over these data sources, without post-training, really did make gpt-3 a narrative predictor that completed the prompt 4) but, since then, pretraining has become more refined to specifically support downstream objectives and post-training has both been introduced and gotten more sophisticated"
  },
  {
    "body": "some thoughts on gpt-5.3-codex and opus-4.6\n\n1) gpt-5.3-codex seems to have caught up to opus-4.6 in terms of coding capabilities; some of this is anecdotal  but they seem evenly matched\n\n2) opus-4.6 seems to be catching up to gpt-5.2 on math performance; it is still probably behind, but this is a sign that anthropic can catch up\n\n3) basically, this may mean that the labs are going to have a hard time differentiating their api products in the future and may have to compete on price\n\n4) this would mean that neither company would be able to extract saas margins from their apis\n\n5) this would be worst for anthropic because they use other companies' silicon and compete for business revenue rather than consumer revenue\n\n6) consumers are sticky and you can monetize them with ads, which would put openai in a better position than anthropic even if api margins fall\n\n7) google does the best in this world because it has strong consumer distribution and produces its own silicon\n\n8) this means google can both compete for a high margin side (consumer) and has a very good structural advantage if api becomes low margin\n\n9) if margins become low for the apis themselves then it makes the most sense to build out strategic deployment and work with big customers on longer term contracts\n",
    "tweet_id": "2020339991772299414",
    "note_id": "2020339991646498816",
    "tweet_url": "https://x.com/fleetingbits/status/2020339991772299414",
    "created_at": "2026-02-08T03:32:44.000Z",
    "length": 1285,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/fleetingbits/status/1989395957394682254"
    ],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "compute",
      "coding",
      "enterprise",
      "consumer",
      "evals",
      "math"
    ],
    "title": "some thoughts on gpt-5.3-codex and opus-4.6",
    "snippet": "1) gpt-5.3-codex seems to have caught up to opus-4.6 in terms of coding capabilities; some of this is anecdotal  but they seem evenly matched 2) opus-4.6 seems to be catching up to gpt-5.2 on math performance; it is still probably behind, but this is a sign that anthropic can catch up 3) basically, this may mean that the labs are going to have a hard time differentiating their api products in the future and may have to compete on price 4) this would mean that neither company would be able to extract saas margins from their apis"
  },
  {
    "body": "some thoughts on agi preparedness\n\n1) measuring dangerous capabilities is hard and is probably reaching the point where it is difficult for an organization like metr to meaningfully contribute\n\n2) evaluation of dangerous capabilities requires access to real-world environments where these capabilities are relevant, and expertise in the work\n\n3) for example, understanding the extent to which models uplift dangerous biological research requires access to a bio lab with researchers working on relevant biological tasks\n\n4) for ai r&d capabilities, only the labs themselves have the compute and expertise to know the extent to which models are automating research\n\n5) metr lacks access to these environments and doesn't have the compute or money or expertise to construct or maintain or operate them\n\n6) the real world is both cheaper and more accurate than anything you can build yourself and it is difficult to anticipate the bottlenecks\n\n7) so only the labs are really in a position to measure dangerous capability uplift right now, by doing  partnerships with organizations that operate in dangerous domains\n\n8) these include academic wet labs, pharmaceutical companies, cyber defense firms, and defense contractors\n\n9) this means sending researchers and engineers to these partners to construct micro evaluations, identify bottlenecks, do fine tuning for long reasoning, and build scaffolding for automation\n\n10) this can be supported and justified through commercial partnerships with these players, which supports the commercial strategies of the frontier labs\n\n11) the likely outcome probably looks, in the end, like a defense contracting model: restricted evaluations shared with regulators and special partners; instead of public benchmark leaderboards\n\n12) this is a natural consequence of ai beginning to substantially automate labor across the economy; it's not something an independent organization easily can monitor in a granular way from the 
outside; it is too all-encompassing \n\n13) one risk is regulatory capture; if the only people who can evaluate dangerous capabilities are the labs and their partners, the evaluation function becomes hard to challenge from the outside\n\n14) so, one important goal for an organization like metr should be facilitating government oversight, helping argue for what it should look like, helping to broker the partnerships to the extent possible, and pushing for the right disclosure\n",
    "tweet_id": "2019975965821325456",
    "note_id": "2019975965682860033",
    "tweet_url": "https://x.com/fleetingbits/status/2019975965821325456",
    "created_at": "2026-02-07T03:26:14.000Z",
    "length": 2434,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/ChrisPainterYup/status/2019534216405606623"
    ],
    "tags": [
      "lab economics",
      "enterprise",
      "evals",
      "safety",
      "legal",
      "bio",
      "agi"
    ],
    "title": "some thoughts on agi preparedness",
    "snippet": "1) measuring dangerous capabilities is hard and is probably reaching the point where it is difficult for an organization like metr to meaningfully contribute 2) evaluation of dangerous capabilities requires access to real-world environments where these capabilities are relevant, and expertise in the work 3) for example, understanding the extent to which models uplift dangerous biological research requires access to a bio lab with researchers working on relevant biological tasks 4) for ai r&d capabilities, only the labs themselves have the compute and expertise to know the extent to which models are automating research"
  },
  {
    "body": "some thoughts on anthropic's super bowl ads\n\n1) so, anthropic released negative super bowl ads mocking advertisement in chatgpt; the gist was that ads are inconsistent with getting important advice\n\n2) i think that the negative ad playbook only really works if you already have brand recognition with the target audience that you want to persuade and that audience wants you to win for some reason\n\n3) i don't think that anthropic has the brand awareness to pull this off vis-a-vis openai yet; especially, for the member of the public, free tier user, chat demographic\n\n4) the vast majority of super bowl viewers have no idea what anthropic is. and, the ad didn't even name chatgpt and most users probably don't even know chatgpt is adding ads\n\n5) I'm also not convinced normal people care about big tech ads in practice. revealed preference suggests people pick the best consumer product regardless of whether it has ads or not\n\n6) openai has ~80m weekly users in the us. if the ad attacks a user experience that doesn't match a user's real experience of chatgpt, the ad doesn't land. it just doesn't feel real to that user\n\n7) and, i don't think the ads in chatgpt will feel as bad as portrayed in the anthropic ad and even when users see them, i doubt they will remember the anthropic ad and go and try claude, which should be the goal\n\n8) zooming out, the ad is a bit ironic: anthropic is more aligned right now because they don't rely on consumer revenue\n\n9) their incentives are to support their b2b customers, support b2b workflows, provide top performance in office tasks and coding, and have a strong reputation for being safe\n\n10) but if this ad works and anthropic starts to acquire consumer users in large numbers, they will eventually come under a strong pressure to monetize them\n\n11) this is because there will be a pressure to turn on the ads / referral revenue. 
anthropic will begin to get pressure from investors on how they will make money from consumers, etc...\n\n12) this felt like an ad designed for the sf twitter bubble. the team assumed the rest of america views them as the trustworthy alternative to openai, but i just don't think that the public has this context\n\n13) seems like an opportunity cost and a potential indictment of anthropic's marketing team being too insular.\n\n14) i think that they should have used the commercial to grow their marketing funnel where they actually win in the market: selling claude code, api usage, business subscriptions, etc...\n",
    "tweet_id": "2019237735954059307",
    "note_id": "2019237735656202240",
    "tweet_url": "https://x.com/fleetingbits/status/2019237735954059307",
    "created_at": "2026-02-05T02:32:46.000Z",
    "length": 2499,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/sama/status/2019139174339928189"
    ],
    "tags": [
      "openai",
      "anthropic",
      "lab economics",
      "enterprise",
      "consumer"
    ],
    "title": "some thoughts on anthropic's super bowl ads",
    "snippet": "1) so, anthropic released negative super bowl ads mocking advertisement in chatgpt; the gist was that ads are inconsistent with getting important advice 2) i think that the negative ad playbook only really works if you already have brand recognition with the target audience that you want to persuade and that audience wants you to win for some reason 3) i don't think that anthropic has the brand awareness to pull this off vis-a-vis openai yet; especially, for the member of the public, free tier user, chat demographic 4) the vast majority of super bowl viewers have no idea what anthropic is. and, the ad didn't even name chatgpt and most users probably don't even know chatgpt is adding ads"
  },
  {
    "body": "some thoughts on chatgpt ads\n\n1) chat assistants are going to end up directing an enormous amount of consumer spend over the next couple of years\n\n2) this will be driven by shopping agents, checkout in app, and an increasing use by individuals of ai to help them reason through shopping decisions\n\n3) so, i think that chatgpt advertisements are going to represent a very large revenue in the near future; on the basis of the amount of spend that it will influence and the degree to which it will influence that spend\n\n4) the recent tactical news is that openai plans to introduce advertisements to chatgpt free and go tiers starting in february\n\n5) openai is charging ~$60 per 1k impressions, which is about 3x what meta charges; this basically means they think their ads will have substantially more influence over purchasing decisions than a meta feed ad does\n\n6) google does about $400/user/year for an american user and amazon also does about $400/user/year for an american user\n\n7) so, on the low end, just replacing search and retail advertising, a chat assistant should be able to earn something close to $800/user/year in advertising for american users\n\n8) this would imply $56 billion at chatgpt's current 70m weekly active users and $170bn at market saturation in the united states\n\n9) this would probably end up being something like $340bn worldwide, given google's 50% revenue split between the united states and other\n\n10) but, chat assistants can control the decision making flow of the user or at least strongly influence it, so it could almost be more natural to think of them in terms of earning a commission on a sale\n\n11) using openai's 4% shopify commission and applying it to 40% of total consumer spend in the United States, you get something closer to $300bn of possible revenue on shopping assistants\n\n12) and, so I think that the number is likely to be higher than current advertising revenue, because of more control over the decision, 
but it is capped by profit margin of manufacturers and shippers\n\n13) note, I think the cost per ad or per checkout will be higher than current ads, in part because you will serve fewer ads, because it hurts user experience, but each one will be more influential\n\n14) anyway, openai has said they want to make $2/user/year by eoy 2026 and $15/user/year by eoy 2027; these are global numbers on openai's 800m global weekly active users\n\n15) there is also a separate entertainment market here (competing with youtube/tv) driven by sora and generative media\n\n16) that is a large market ($35bn+ for youtube ads alone), but it likely has lower pricing power than the decision market because it lacks the direct control of purchasing decisions\n\n17) openai has recruited a sales team and has initially pitched existing enterprise customers; one issue is that they don't have granular conversion information at this point, but i'm sure that they will build this out\n\n18) the revenue might take a while to ramp up though, advertisers typically spend 5%-10% of their budgets on test and only commit hard on platforms with strong records\n\n19) tiktok was considered \"test\" for a long time because it was new and this was a drag on its revenue growth\n\n20) i can see shopping agents changing this dynamic though because attribution is very easy if you are also the checkout system; and performance ads sell\n\n21) anyway just on market scale, i think that this is going to be a big part of openai and google's revenue story over the next 5 years and is a market that will have a lot of societal impact\n",
    "tweet_id": "2016270711502078252",
    "note_id": "2016270711288193024",
    "tweet_url": "https://x.com/fleetingbits/status/2016270711502078252",
    "created_at": "2026-01-27T22:02:52.000Z",
    "length": 3548,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/OpenAI/status/2012223377352577460"
    ],
    "tags": [
      "openai",
      "lab economics",
      "enterprise",
      "consumer"
    ],
    "title": "some thoughts on chatgpt ads",
    "snippet": "1) chat assistants are going to end up directing an enormous amount of consumer spend over the next couple of years 2) this will be driven by shopping agents, checkout in app, and an increasing use by individuals of ai to help them reason through shopping decisions 3) so, i think that chatgpt advertisements are going to represent a very large revenue in the near future; on the basis of the amount of spend that it will influence and the degree to which it will influence that spend 4) the recent tactical news is that openai plans to introduce advertisements to chatgpt free and go tiers starting in february"
  },
  {
    "body": "some thoughts on dario's essay\n\n1) there's nothing new here if you're familiar with the ai safety discussions that have been happening on twitter\n\n2) the most interesting bit is that his mental model for ai control risk is the risk that would be posed by a country of geniuses in a datacenter\n\n3) the basic idea is that we should imagine a giant datacenter, all the models being something between agi and asi, trying to coordinate to take over the world or do massive harm\n\n4) anyway, i think how seriously you take short term ai control risk is inversely correlated to how much you think about ai control risk as operating in a system\n\n5) so, the systematic view starts and says labs exist in an ecosystem where they need to sell models that will follow human instruction or they have no market\n\n6) they are also overseen by regulators and guided by public perception and the desires of their employees and all of this keeps models corrigible\n\n7) and, the model landscape will look like 3 to 6 frontier labs running millions or billions of rollouts at a time on 2 or 3 different models, all on different tasks \n\n8) so a model takeover requires these millions or billions of rollouts to somehow all end up coordinating toward some bad aim that somehow the models have autonomously determined\n\n9) and this coordination either needs to be across different model instances, run by different labs, or one lab needs to be able to have its models dominate and needs to form without being detected\n\n10) and, this has to happen even though the models are being trained to follow instructions, not do bad behavior, etc...\n\n11) dario's view is somewhere in the middle, on the one hand, he collapses the multiple providers and coordination across instances and also collapses the market incentives against labs developing models that would behave that way\n\n12) but, on the other, he does avoid the concept of a single model instance that somehow wakes up and takes over 
the world, despite billions of other rollouts occurring at the same time\n\n13) actually though, if you think about it, he's not proposing an ai control risk; he is proposing an ai misuse risk instead\n\n14) because, it seems more plausible to me that the harmful country of geniuses is awoken because a small team at a frontier lab hijacks all the running instances of their model rather than because the models themselves autonomously wake up to some bad aim\n",
    "tweet_id": "2015883830268461415",
    "note_id": "2015883830025191424",
    "tweet_url": "https://x.com/fleetingbits/status/2015883830268461415",
    "created_at": "2026-01-26T20:25:33.000Z",
    "length": 2417,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/DarioAmodei/status/2015833046327402527"
    ],
    "tags": [
      "anthropic",
      "lab economics",
      "safety",
      "agi"
    ],
    "title": "some thoughts on dario's essay",
    "snippet": "1) there's nothing new here if you're familiar with the ai safety discussions that have been happening on twitter 2) the most interesting bit is that his mental model for ai control risk is the risk that would be posed by a country of geniuses in a datacenter 3) the basic idea is that we should imagine a giant datacenter, all the models being something between agi and asi, trying to coordinate to take over the world or do massive harm 4) anyway, i think how seriously you take short term ai control risk is inversely correlated to how much you think about ai control risk as operating in a system"
  },
  {
    "body": "some thoughts on agentic qwen shopping from alibaba / what agentic shopping means for amazon\n\n1) alibaba added agentic shopping to its qwen app, which sets up purchases for the user over other alibaba services; users can then pay in the qwen app with alipay\n\n2) alibaba owns taobao, tmall, fliggy and amap, so the qwen app works over all of these services; basically, the chinese equivalents of amazon, expedia and google maps\n\n3) i think that this is significant because shopping agents will probably change how major e-commerce sites have to monetize their services; qwen is a first experiment in this\n\n4) right now, alibaba makes money on taobao using advertising with a pay-for-click model for sponsored products; so, alibaba makes money when a user engages with a sponsored product\n\n5) taobao is designed to encourage browsing using an infinite scroll; so users browse and make impulse purchases, which drives the revenue\n\n6) however, there is less browsing with an ai app; this may be in part replaced with recommendations, but i think that the user is going to view fewer options and so there will be less impulse buying\n\n7) in exchange, alibaba will get a lot more insight into and control over the final buying decision; so they will probably end up selling fewer ads by volume, because less scrolling, but the value of the ads will be higher\n\n8) i think this is where alibaba being the ai model, the store and the payment network makes it very resilient to changes in consumer behavior due to ai and will probably enable it to profit from it\n\n9) i think control over user purchasing decisions is going to be one of the major areas of corporate value creation for openai and google\n\n10) it is possible that these companies will end up directing a large chunk of the consumer spending of several hundred million people in developed countries\n\n11) i also think that this is where amazon is in trouble; their actual online store is effectively a loss leader 
with very thin operating margins (-1% to 4%), while their ad business operates at massive 60-70% gross margins\n\n12) the ad business is something like 40% of amazon’s operating profit; but if showing users products ceases to matter and instead the value is in guiding their purchases, then this business goes away\n\n13) so amazon would have to just compete on logistics and would have to figure out how to monetize its logistics platform or actually make money on its online store\n\n14) this would be a major shift to its business model; i don't know how hard it is, logistics are hard to do at scale and amazon is great at it - but it's a big change\n\n15) i could also see something like amazon controlling what products are shown to the agent; but this seems brittle and prone to disintermediation by the agent\n\n16) i tend to think amazon will be a net winner from ai though, despite these potential issues; this is due to its datacenter construction, trainium and aws infrastructure, and the potential for more software than ever before being released into the world\n",
    "tweet_id": "2014406184573571439",
    "note_id": "2014406184326049792",
    "tweet_url": "https://x.com/fleetingbits/status/2014406184573571439",
    "created_at": "2026-01-22T18:33:54.000Z",
    "length": 3021,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "google",
      "chinese labs",
      "lab economics",
      "consumer"
    ],
    "title": "some thoughts on agentic qwen shopping from alibaba / what agentic shopping means for amazon",
    "snippet": "1) alibaba added agentic shopping to its qwen app, which sets up purchases for the user over other alibaba services; users can then pay in the qwen app with alipay 2) alibaba owns taobao, tmall, fliggy and amap, so the qwen app works over all of these services; basically, the chinese equivalents of amazon, expedia and google maps 3) i think that this is significant because shopping agents will probably change how major e-commerce sites have to monetize their services; qwen is a first experiment in this 4) right now, alibaba makes money on taobao using advertising with a pay-for-click model for sponsored products; so, alibaba makes money when a user engages with a sponsored product"
  },
  {
    "body": "some quick thoughts on the assistant axis paper\n\n1) the method is interesting, they generated 275 roles with 5 prompts per role, and then used each for 240 questions to get the rollouts for each role\n\n2) they filtered the rollouts to those where an llm judge decided that the model was either roleplaying or partially roleplaying the role within the rollout \n\n3) the averaged the activations for roles for which they had at least 10 examples of roleplaying or 10 examples of partially roleplaying (done separately)\n\n4) they then selected the activations in the middle of the model and did principal component analysis over the activations for the roles\n\n5) all of this feels pretty standard\n\n6) that said, I am somewhat disappointed that just using the mean of the activations in a middle layer across a bunch of tokens across a bunch of conversations still feels somewhat sota\n\n7) anyway, the important things are: (a) the assistant-ness of a persona is the top principal component, (b) the top pcs are reasonably interpretable, (c) the top pcs explain a lot of the variance (4@70% for gemma, 6@70%  for qwen, 19@70% for llama)\n\n8) I think part of why this experiment works so well is that the personas seem generally interpretable without a lot of background, so it let's you understand the assistant persona at a glance by what it is close to, this is mostly a human centered ui/ux thing\n\n9) I think that there is an interesting research direction where you look at training data pipeline and check how the pcs change over different stages in the training process, maybe this could be done with a fully open model like olmo \n\n10) anyway, then they show that you reduce the ability of the model to be jailbroken, without effecting capabilities, by capping the distance that the model is allowed to move away from the assistant along the assistant axis\n\n11) note that this capping needs to be done at multiple layers, not enough to just cap the middle layer, like 
they took the middle layer when generating the pcs to understand the personas earlier\n\n12) anyway, I think the most important thing here is that you can do this without affecting capabilities, I'm not sure this could be done at runtime because you are complicating the inference pipeline, but it's a very interesting safeguard direction   \n\n13) I wonder how personas relate to model capabilities, part of this makes me think that personas are sort of used by the model on top of capabilities, and this is why narrative jailbreaking works, I'm not sure how separable they are though \n\n14) there were a couple of papers (e.g. strongreject) that indicated a lot of jailbreaking decreases capabilities, and you are sort of in a tradeoff between getting out of the domain of the harmlessness training, while remaining within the domain of the helpfulness training\n\n15) anyway, they also look at long context and found that the model can drift along the assistant axis over longer conversations (measured in turns, would have liked to have also seen tokens)\n\n16) it would be interesting to see how this drift relates to capabilities; like do MMLU at the end of each turn and see how performance changes with drift in the assistant persona\n\n17) summary: good paper, pretty standard methods, clever application and very good human ui/ux for the data generation / interpretability method\n",
    "tweet_id": "2013692672264036527",
    "note_id": "2013692671987195904",
    "tweet_url": "https://x.com/fleetingbits/status/2013692672264036527",
    "created_at": "2026-01-20T19:18:40.000Z",
    "length": 3343,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/AnthropicAI/status/2013356793477361991"
    ],
    "tags": [
      "anthropic",
      "post-training",
      "interpretability",
      "evals",
      "safety"
    ],
    "title": "some quick thoughts on the assistant axis paper",
    "snippet": "1) the method is interesting, they generated 275 roles with 5 prompts per role, and then used each for 240 questions to get the rollouts for each role 2) they filtered the rollouts to those where an llm judge decided that the model was either roleplaying or partially roleplaying the role within the rollout 3) the averaged the activations for roles for which they had at least 10 examples of roleplaying or 10 examples of partially roleplaying (done separately) 4) they then selected the activations in the middle of the model and did principal component analysis over the activations for the roles"
  },
  {
    "body": "some quick observations on anthropic's research direction, based on looking at their open roles\n\n1) first, I regret to inform you that research has not yet been automated; anthropic is recruiting for 48 roles around research broadly defined\n\n2) there are 8 roles for reinforcement learning, 8 roles for pretraining, 6 for alignment, 3 for interpretability, 3 for safeguards,  5 inference, 3 tooling, 1 evaluations, and 8 domain specific\n\n3) for reinforcement learning, anthropic describes their reward model platform supporting: preference models, rubrics and programatic reward signals\n\n4) many of the reinforcement learning job descriptions reference RLHF; I think combined with the reference to preference models shows that RLHF is still a thing\n\n5) one of the more interesting roles for reinforcement learning is a \"research engineer, universes\" role aimed at hyperrealistic long context environments\n\n6) on the pretraining side, there is a job for someone to write web scrapers, so scraping the internet for masses of data and keeping it up to date is still at thing\n\n7) there are the normal roles for scaling pretraining; these look like engineering roles; and also, scientific roles for pretraining; pretraining is also still a thing\n\n8) there is a role on the tokenization and embeddings team that is said to serve as \"the bridge between our fine tuning and pretraining teams\"\n\n9) there's something interesting about a tokenization role being described as a bridge between pretraining and fine tuning; I don't know quite what to make of it though\n\n10) the alignment team is aimed at controlling models and sui generis risks like jailbreaks while the safeguards team is aimed at preventing people from misusing models (e.g. 
bio, criminal behavior)\n\n11) on the alignment team, one of the open roles is for research on multi-agent simulations, which seems interesting; also, discussion of ensuring that AI is helpful or harmless in unfamiliar or adversarial situations\n\n12) I've sort of saved the best for last, fun for me; the 8 domain specific roles give some of the most insight into product direction at Anthropic\n\n13) they have 2 roles for the discovery team, which seems aimed at building a general ai scientist; they describe this as \"general scientific AGI\"\n\n14) there are 2 roles specifically for biology and life sciences; they want folks specifically with significant experience in \"molecular biology, drug discovery, or computational biology\"\n\n15) there are 3 roles for cybersecurity, including an RL researcher role, with cybersecurity expertise as a qualification and a reasonably senior non-researcher data acquisition role\n\n16) I think this shows that anthropic may plan to sell a cybersecurity product, it looks like openai also has aspirations in this direction; I could also see it being bundled as part of claude code or codex\n\n17) they also have a virtual collaborator rl role, this is aimed at building out claude as an industry collaborator; interestingly mentions using real company data; they could be doing a strategic deployment thing, where they collect data as part of a specific engagement\n\n18) the role also describes building and scaling their data collection platform for creating high-quality, open-ended tasks with domain experts and crowdworkers; I didn't realize anthropic had their own data collection platform; not sure how this works with mercor / surge\n\n19) interestingly, the role also describes developing robust rubric-based evaluation systems that maintain quality while avoiding reward hacking\n\n20) this pretty much matches how I understand rl for professional environments, domain experts produce granular yes/no rubrics that models can use for scoring RL 
rollouts; it's a hybrid between learned verifiers and verifiable rewards\n",
    "tweet_id": null,
    "note_id": "2013359575899217921",
    "tweet_url": null,
    "created_at": "2026-01-19T21:15:03.000Z",
    "length": 3778,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "anthropic",
      "lab economics",
      "coding",
      "post-training",
      "pretraining",
      "interpretability",
      "evals",
      "safety",
      "bio",
      "agi"
    ],
    "title": "some quick observations on anthropic's research direction, based on looking at their open roles",
    "snippet": "1) first, I regret to inform you that research has not yet been automated; anthropic is recruiting for 48 roles around research broadly defined 2) there are 8 roles for reinforcement learning, 8 roles for pretraining, 6 for alignment, 3 for interpretability, 3 for safeguards,  5 inference, 3 tooling, 1 evaluations, and 8 domain specific 3) for reinforcement learning, anthropic describes their reward model platform supporting: preference models, rubrics and programatic reward signals 4) many of the reinforcement learning job descriptions reference RLHF; I think combined with the reference to preference models shows that RLHF is still a thing"
  },
  {
    "body": "some thoughts on anthropic gtm from analyzing their open sales roles\n\n1) sales is the largest focus for recruiting for anthropic; 99 out of the 320 roles that they are currently recruiting for are in sales; the next largest area of recruiting is research with 46 roles\n\n2) they have a very specialized sales team and they use the same sales team structure globally\n\n3) their main areas of focus are the united states (51+ roles), europe (31 roles), japan and korea (8 roles), australia (1 role) and india (1 role)\n\n4) structurally, at a high level, they have biz dev, account reps, solution architects, customer success, partnership managers, engagement managers, and forward deployed engineers\n\n5) account reps are responsible for full cycle sales (so sourcing and getting customers over the line); the business development team is also responsible for some lead qualification; but they don't have an sdr function\n\n6) the account reps are mostly broken down into startup, mid-market, enterprise; with two broad categories for enterprise sales: digital native (saas companies) and industries (traditional companies)\n\n7) but they have specialized vertical sales reps for financial services, capital markets, insurance and non-profits; the hard focus is on financial services and capital markets and with a lighter focus on insurance\n\n8) they have government sales in the united states with 5 specialized roles for dod, federal, federal civilian, and state and local; interestingly they also have a government sales role in japan\n\n9) their support roles tend to follow similar segmentation to their account rep roles; so you see specialized solution architects and customer success roles\n\n10) they interestingly also have a customer success activation role, designed to drive first 90 day adoption for both claude and claude api at their customers\n\n11) one very interesting callout is the large size of their reseller partner program (16 roles); this is one of the 
biggest differences between anthropic and openai gtm\n\n12) so, for the more traditional reseller relationships, they have a role for the deloitte partnership, the accenture partnership, a couple of roles for systems integrators\n\n13) and then, they also have specific roles for co-selling through the hyperscalers; one role for the google partnership and 2 for the microsoft partnership\n\n14) they also have a couple of roles for co-selling with major saas providers\n\n15) another interesting callout is that they have engagement manager roles in the united states and europe that sell forward deployed consulting to customers (which bill at > $1,000 / hr)\n\n16) the important takeaway is that anthropic is really focused on building out a very strong, very specialized enterprise sales org with a lot of enablement\n\n17) sidenote, the fde team, which has a lot of insight into the coding / adoption problems that enterprises are encountering, is involved in purchasing data for anthropic\n\n18) I think that this in part explains the strength of claude for coding and business applications relative to other model providers\n",
    "tweet_id": "2012367924179464402",
    "note_id": "2012367923898376192",
    "tweet_url": "https://x.com/fleetingbits/status/2012367924179464402",
    "created_at": "2026-01-17T03:34:35.000Z",
    "length": 3079,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "anthropic",
      "lab economics",
      "coding",
      "enterprise"
    ],
    "title": "some thoughts on anthropic gtm from analyzing their open sales roles",
    "snippet": "1) sales is the largest focus for recruiting for anthropic; 99 out of the 320 roles that they are currently recruiting for are in sales; the next largest area of recruiting is research with 46 roles 2) they have a very specialized sales team and they use the same sales team structure globally 3) their main areas of focus are the united states (51+ roles), europe (31 roles), japan and korea (8 roles), australia (1 role) and india (1 role) 4) structurally, at a high level, they have biz dev, account reps, solution architects, customer success, partnership managers, engagement managers, and forward deployed engineers"
  },
  {
    "body": "some quick thoughts on barret zoph (and metz and schoenholz) going back to openai\n\n1) barret zoph was the vice president of post-training at openai before he left for thinking machines; he left openai in september 2024\n\n2) barret would have probably initially had options for about 10%-30% of thinking machines; post training was one of the more valuable skills at launch\n\n3) the seed round as $2bn at $12bn so this would mean that he would have options now for about 8%-25% of the current company\n\n4) there were rumors back in november that thinking machines was in talks to raise at $50bn-$60bn; so, this puts his equity at $4bn-$15bn valuation\n\n5) thinking machines could probably exit at somewhere between $30bn and $60bn today or at least at some point over the next two years\n\n6) i think the most likely acquirers would probably be microsoft, apple, meta, amazon and nvidia; a large cap public company that wants to build a frontier model\n\n7) barret would have been just over his one year cliff, so assuming he was not fired for cause or thinking doesn't want to litigate, he still has about $1bn to $3.5bn in thinking machines equity right now \n\n8) would be interesting to know his openai pay package, the rumor is that openai set aside about $50bn for the equity pool for the new pbc, so they can afford a decent pay package for him\n\n9) my guess is that it was a personality dispute or a disagreement over the commercial direction of the company with the mira / executive team\n\n10) and if there is something behind the unethical conduct claim, it could be just a game of telephone over him talking to the other labs and mira interpreting that as him disclosing trade secrets\n",
    "tweet_id": "2011674949543739637",
    "note_id": "2011674949313052674",
    "tweet_url": "https://x.com/fleetingbits/status/2011674949543739637",
    "created_at": "2026-01-15T05:40:57.000Z",
    "length": 1682,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/fidjissimo/status/2011592010881446116"
    ],
    "tags": [
      "openai",
      "neolabs",
      "lab economics",
      "post-training"
    ],
    "title": "some quick thoughts on barret zoph (and metz and schoenholz) going back to openai",
    "snippet": "1) barret zoph was the vice president of post-training at openai before he left for thinking machines; he left openai in september 2024 2) barret would have probably initially had options for about 10%-30% of thinking machines; post training was one of the more valuable skills at launch 3) the seed round as $2bn at $12bn so this would mean that he would have options now for about 8%-25% of the current company 4) there were rumors back in november that thinking machines was in talks to raise at $50bn-$60bn; so, this puts his equity at $4bn-$15bn valuation"
  },
  {
    "body": "some quick observations on this compute chart\n\n1) the most surprising thing is that amazon trainium compute appears to be roughly equal to google tpu compute\n\n2) i don't know how this compute divides between amazon's internal efforts and anthropic's efforts, but assuming it's mostly anthropic\n\n3) you can see what a structural advantage google has over openai and anthropic right now, because they don't need to pay nvidia's 80% gross margins\n\n4) it would be very interesting to see this graph split up into inference versus training; i don't know whether anything other than tpus and nvidia is being used for training by western frontier labs\n",
    "tweet_id": "2011168297514123608",
    "note_id": "2011168297426026496",
    "tweet_url": "https://x.com/fleetingbits/status/2011168297514123608",
    "created_at": "2026-01-13T20:07:42.000Z",
    "length": 644,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/EpochAIResearch/status/2009366360183460237"
    ],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "compute"
    ],
    "title": "some quick observations on this compute chart",
    "snippet": "1) the most surprising thing is that amazon trainium compute appears to be roughly equal to google tpu compute 2) i don't know how this compute divides between amazon's internal efforts and anthropic's efforts, but assuming it's mostly anthropic 3) you can see what a structural advantage google has over openai and anthropic right now, because they don't need to pay nvidia's 80% gross margins 4) it would be very interesting to see this graph split up into inference versus training; i don't know whether anything other than tpus and nvidia is being used for training by western frontier labs"
  },
  {
    "body": "some thoughts on william macaskill's viatopianism\n\n1) his basic problem statement is that it is important to articulate an idea of what the good society looks like after artificial super intelligence is created\n\n2) viatopianism is supposed to be an intermediate society that preserves optionality for different good societies rather than a specific society at the end\n\n3) presumably, you get to move from whatever this society is to whatever version of the good society is possible at that time\n\n4) his most concrete example of viatopia is the concept of a long reflection, where humanity, safe from harm and comfortable, is able to debate the nature of the good life and then settle on a direction\n\n5) this seems to me to just be utopianism named in another way; it does not seem at all realistic that we will not be able to pause history\n\n6) i think the deeper problem is that when you plan grand ideas without reflecting on the specific institutions that already exist, the gap between intention and implementation becomes too large to effectively bridge\n\n7) worse, the agreement on the intention often enables bad actors to implement worse things, under the guise of implementing the good intention; and, people do not carefully watch what is done\n\n8) so, I think it's more important that we consider the world as it is and plan incremental updates; we cannot opt out of the institutional dynamics that we have because, in practice, we never get anything other than what emerges from them\n\n9) so, we need to take as a serious starting point the institutions that we have in each country (e.g. for the united states, congress, dod, fda, courts, etc...) 
and propose detailed and incremental changes to them\n\n10) the most important thing is that you propose how whatever change you want to make will work within the system, protect specific concrete things that you care about, improve things you want to improve\n\n11) so, what would really be useful are specific proposals for how these institutions can and should use AI, specific benchmarks we should use to evaluate them, new governance and procedural mechanisms for our institutions, etc... \n\n12) we also have to understand that there are systems of institutions other than those in the united states and even the european union\n\n13) china, india, russia, iran, north korea, japan, etc... may well chart their own directions, regardless of our intent or our plans for viatopia, and our ideas need to work alongside that\n\n14) and then, alongside this you can propose utopia, wild new experiments for the good society, etc... but just realistically, we are going to need to improve our own institutions first and this other stuff is more of a side bet\n",
    "tweet_id": "2009390315028066315",
    "note_id": "2009390314759532544",
    "tweet_url": "https://x.com/fleetingbits/status/2009390315028066315",
    "created_at": "2026-01-08T22:22:38.000Z",
    "length": 2701,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/willmacaskill/status/2009205420335010058"
    ],
    "tags": [
      "evals",
      "safety",
      "legal",
      "agi"
    ],
    "title": "some thoughts on william macaskill's viatopianism",
    "snippet": "1) his basic problem statement is that it is important to articulate an idea of what the good society looks like after artificial super intelligence is created 2) viatopianism is supposed to be an intermediate society that preserves optionality for different good societies rather than a specific society at the end 3) presumably, you get to move from whatever this society is to whatever version of the good society is possible at that time 4) his most concrete example of viatopia is the concept of a long reflection, where humanity, safe from harm and comfortable, is able to debate the nature of the good life and then settle on a direction"
  },
  {
    "body": "some thoughts on jacob's scalable oversight post \n\n1) it is clear that we are going to need agents overseeing agents in order to scale human oversight of models\n\n2) this importantly will consist of both agents overseeing agents and humans overseeing agents that oversee agents\n\n3) an important difficulty is that it is not clear that oversight abilities will scale as quickly as the capabilities we care about\n\n4) this is probably most relevant in critical domains where tail behaviors could have the ability to do a lot of harm and the behavior is either very complicated, very low latency or large volume, any of which could make oversight difficult\n\n5) we have seen rapid progress in coding and math capabilities as a result of both verifiability and available data (both real and synthetic). \n\n6) the importance of verifiability is that it allows you to apply a ton of optimization pressure without reward hacking\n\n8) Jacob lays out three questions as important for oversight: what did an agent do? what could an agent do? and why did an agent do something?\n\n6) what happened may be seen as a summarization question (his insight not mine). for model-to-model oversight, this seems strongly verifiable. you can test if the overseer can reproduce the final world state from the summary\n\n7) now, under high optimization pressure, this might be vulnerable to reward hacking and the model may learn summarization techniques that are not human interpretable; we want to figure out if this happens or matters\n\n8) for model-to-human oversight, this feels like rlhf all over again. 
if we optimize against human ratings of summaries, then we are vulnerable to reward hacking\n\n9) for the question of what could an agent do, it seems quite hard to verify that we have enumerated all the things that an agent could do that are harmful or even to find out if we have enumerated the ones that matter\n\n10) transluce did publish some research on elicitation of bad behavior, but it was basically finding an instance of bad behavior rather than a comprehensive account of the tail risks\n\n11) creating this comprehensive elicitation of tail risks does not seem easily verifiable to me; the search cost alone seems very high, unless we have some very smart way to do the search (would be interested here)\n\n12) the question of why did a model behave in a certain way does seem strongly verifiable via causal interventions\n\n13) but finding a causal mechanism doesn't guarantee that the mechanism discovered is human interpretable; and whether it is interpretable seems to be an RLHF problem again\n\n14) some amount of human-model oversight is also fundamentally a model ui/ux problem. labs are going to have to tackle this anyway for products like claude code, so we might get progress here for free\n\n15) I think that the valuable idea in Jacob's post is that we should figure out what oversight we want and then design tasks / verification that are both a superset of our goals and which are robust against high optimization pressure\n",
    "tweet_id": "2009113335070118312",
    "note_id": "2009113334805876737",
    "tweet_url": "https://x.com/fleetingbits/status/2009113335070118312",
    "created_at": "2026-01-08T04:02:01.000Z",
    "length": 3015,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/JacobSteinhardt/status/2008589370862039238"
    ],
    "tags": [
      "coding",
      "post-training",
      "evals",
      "safety",
      "math",
      "agi"
    ],
    "title": "some thoughts on jacob's scalable oversight post",
    "snippet": "1) it is clear that we are going to need agents overseeing agents in order to scale human oversight of models 2) this importantly will consist of both agents overseeing agents and humans overseeing agents that oversee agents 3) an important difficulty is that it is not clear that oversight abilities will scale as quickly as the capabilities we care about 4) this is probably most relevant in critical domains where tail behaviors could have the ability to do a lot of harm and the behavior is either very complicated, very low latency or large volume, any of which could make oversight difficult"
  },
  {
    "body": "some quick observations on openai open roles\n\n1) there are about 100 open roles dedicated to core gtm and they are globally spread (united states, europe, singapore, japan, south korea, india, australia)\n\n2) this is the largest single area of recruiting right now at open ai; probably an important area of competition with azure, amazon, anthropic, google;\n\n3) openai is surprisingly focused on selling to major markets outside the us (excluding china, russia and middle east); they have partnership roles in india and japan (also communications role in japan)\n\n4) they are building out a standard enterprise sales pipeline of sales engineers, solution architects, and account directors\n\n5) there are 30 roles for forward deployed engineers attached to these sales teams spread across all the major jurisdictions\n\n6) this suggests that openai sees enterprise integration as one of the most important issues that they need to be able to solve for their customers to unblock sales; I think enterprise enablement is important here\n\n7) there are special roles that they are hiring for financial services and life sciences; both account directors and forward deployed engineers\n\n8) these are the only private industries with specific gtm and fde roles attached; it's interesting that they have no roles for legal or industrial; I'm curious what makes financial services and life sciences special\n\n9) there are another 5 roles just for us government sales, with the standard account director, solution architect, sales engineer organization; the us government is the only government with dedicated sales roles\n\n10) there is a lot of focus on their hardware device; it is a realtime device with cloud processing; they are hiring audio and camera engineers, focus on images for machine consumption\n\n11) it's clearly a device that you carry around because they are an hiring engineers responsible for crash simulation and thermal management simulation; and one of their 
image roles talks about images in motion\n\n12) they are hiring a bunch of roles to build their own gpus (rtl & codesign; verification; firmware; linux; optical); 1 engineer for AMD and 1 engineer for triton compilers; I feel like this emphasizes the importance of their internal gpu effort vis-a-vis amd or nvidia\n\n13) they are hiring for 3 or 4 robotics roles; electrical engineer for robotic hands, so they are doing the hardware, I think; and 2 simulation roles (simulation realism and simulation environment creation)\n\n14) it appears that they want to train robots in simulation using existing world engines (they call out nvidia isaac, unity and unreal engine - along with the now discontinued omniverse)\n\n15) obligatory, they are hiring engineers, they are hiring for pretty much every product, including sora; researchers for sora and human data; foundation researchers (although only a couple)\n\n16) they are hiring an m&a recruiter specifically to retain relationships with acquired employees - means they intend to do a lot of m&a and are worried about culture rejection\n",
    "tweet_id": "2008335289819795701",
    "note_id": "2008335289685528577",
    "tweet_url": "https://x.com/fleetingbits/status/2008335289819795701",
    "created_at": "2026-01-06T00:30:20.000Z",
    "length": 3039,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "lab economics",
      "compute",
      "coding",
      "enterprise",
      "consumer",
      "bio"
    ],
    "title": "some quick observations on openai open roles",
    "snippet": "1) there are about 100 open roles dedicated to core gtm and they are globally spread (united states, europe, singapore, japan, south korea, india, australia) 2) this is the largest single area of recruiting right now at open ai; probably an important area of competition with azure, amazon, anthropic, google; 3) openai is surprisingly focused on selling to major markets outside the us (excluding china, russia and middle east); they have partnership roles in india and japan (also communications role in japan) 4) they are building out a standard enterprise sales pipeline of sales engineers, solution architects, and account directors"
  },
  {
    "body": "Some thoughts on ASI and human labor\n\n1) I think that productivity is the wrong frame. human labor is irrelevant in a world with asi.\n\n2) the important question is whether human institutions can adapt to human labor becoming economically irrelevant\n\n3) and institutions means very different things in different countries: see the united states versus china versus the european union\n\n4) resource allocation is more political than economic; people just don't have to think about it because markets work as a locally-true frame in the united states and eu\n\n5) however, asi invalidates that frame, and politics and persuasion and coercive force will, by default, become the whole game\n\n6) alot of the discourse feels like people applying frameworks from their econ or philosophy undergrad to something that requires imagining societies very different from our own\n\n7) piketty, comparative advantage, the horse thing, all fall into this category; none of these things are relevant in a world where things are politically allocated\n\n8a) what does democracy even mean when you could aggregate preferences and values directly rather than through representatives?\n\n8b) what values do we want baked into the AI systems that allocate resources?\n\n8c) how do humans maintain meaningful oversight of systems more capable than them?\n\n8d) and, very importantly, how do any of these fall out of our current political structures?\n\n9) these feel like the useful questions to me if we are to take agi and asi seriously as ideas\n",
    "tweet_id": "2008275208386605442",
    "note_id": "2008275208248193026",
    "tweet_url": "https://x.com/fleetingbits/status/2008275208386605442",
    "created_at": "2026-01-05T20:31:36.000Z",
    "length": 1508,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/dwarkesh_sp/status/2008234588406202616"
    ],
    "tags": [
      "safety",
      "legal",
      "agi"
    ],
    "title": "Some thoughts on ASI and human labor",
    "snippet": "1) I think that productivity is the wrong frame. human labor is irrelevant in a world with asi. 2) the important question is whether human institutions can adapt to human labor becoming economically irrelevant 3) and institutions means very different things in different countries: see the united states versus china versus the european union 4) resource allocation is more political than economic; people just don't have to think about it because markets work as a locally-true frame in the united states and eu"
  },
  {
    "body": "some thoughts on minimax and zhipu\n\n1) minimax and zhipu have both filed to go public on the hong kong stock exchange.\n\n2) minimax is an application ai company that feels like a mix of character ai and runway. in 2025, it probably generated $70m in revenue \n\n3) the revenue is split 70% from applications and 30% from business api sales; the application revenue is about half from an ai companion product (talkie) and about half from a video generation platform (hailuo)\n\n4) the company has 23% gross margins; this is quite bad compared to the american foundation labs (~50%); this may reflect that a lot of their business is in less wealthy countries and / or that they are competing on cost \n\n5) they will probably pay $55m in inference cost in 2025; I think that they probably use Alibaba (an investor) for most of their inference compute; I'm not sure how important they think it is to drive down this cost\n\n5) r&d spend is roughly $250m, and ~90% of that likely goes to compute. this implies something like $1b in capex to support development. it's unclear who they use for their research compute.\n\n6) growth is 2.3x yoy; this is good, but substantially behind the growth rate of american foundation labs at an equivalent stage; it does not feel like a foundation lab in this respect\n\n7) surprisingly, only 26% of revenue comes from china. 
with singapore (24%) and the us (20%) making up nearly half, they are effectively a global company in terms of sales, not a domestic chinese company.\n\n8) I think it is unlikely that minimax will end up as a long term frontier player; it will probably focus on the application side; smaller language models, larger video models and a focus on increasing gross margins\n\n9) zhipu feels like a chinese foundation lab with a bit of a palantir flavor; in 2025, they probably had about $150-$180m revenue with a 4x year over year growth rate; 90% of their business is in China\n\n10) about 85% of their revenue comes from large chinese corporations, local governments and state owned enterprises; they deploy LLMs locally for them and provide customization; 15% of their revenue is business api sales\n\n11) I believe that the local government and state owned enterprise contracts probably indicate strong state support for zhipu and national champion status; some of this may make their real financials less important\n\n12) zhipu has 50% gross margins (similar to us labs); 60% for their local deployments and 0% for their business api sales; this reflects the highly competitive nature of the chinese inference market and also a potential measure of state support for their local service\n\n13) the company will probably spend about $550m on R&D in 2025; I suspect that they spend about 90% of this R&D on compute; they had 650 research employees as of june but research labor is cheaper in China\n\n14) this is pretty substantial compute spend; probably about equivalent to the original sora; but still about 1/10 of openai; suggests there are strong catchup mechanics with llm development, esp if vision is less important\n\n15) zhipu was spun out of a research team at tsinghua university; this represents a very strong talent pipeline and speaks to the importance of universities for feeding lab talent (you see this in the us too with stanford / berkeley / mit / cmu)\n\n16) I expect zhipu to 
remain a very important player in the frontier chinese AI space over the next few years and probably become a national champion alongside deepseek;\n\n17) it is unclear whether minimax will be the same or will just become another commercially successful ai application company; I think that the latter feels more likely\n",
    "tweet_id": "2006987858964525339",
    "note_id": "2006987858649952258",
    "tweet_url": "https://x.com/fleetingbits/status/2006987858964525339",
    "created_at": "2026-01-02T07:16:08.000Z",
    "length": 3644,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "chinese labs",
      "lab economics",
      "compute",
      "enterprise",
      "consumer"
    ],
    "title": "some thoughts on minimax and zhipu",
    "snippet": "1) minimax and zhipu have both filed to go public on the hong kong stock exchange. 2) minimax is an application ai company that feels like a mix of character ai and runway. in 2025, it probably generated $70m in revenue 3) the revenue is split 70% from applications and 30% from business api sales; the application revenue is about half from an ai companion product (talkie) and about half from a video generation platform (hailuo) 4) the company has 23% gross margins; this is quite bad compared to the american foundation labs (~50%); this may reflect that a lot of their business is in less wealthy countries and / or that they are competing on cost"
  },
  {
    "body": "some random thoughts\n\n1) it's unlikely for ASI to be the great filter in the Fermi paradox because you should still see the AI if it was but you don't - so the filter is probably earlier\n\n2) I think the most slept on company right now is Intel - I'm in agreement with Situational Awareness on this one; huge upside in the event of a US / China war; also, when do fabs become the bottleneck?\n\n3) AI is going to develop a great theory of mind for other AI very soon - labs are doing to be doing multi-agent rollout in RL; Claudes managing Claudes as tools\n\n4) I'm not sure whether it will make sense for labs to train their AI to have a great theory of mind for other models, let's you move out of their ecosystem too easily?\n\n5) Computer use seems to be taking off very slowly; at the very least in an adoption sense; I don't see anyone using AI computer use really, just some toy examples\n\n6) UI/UX for AI should be about turning the human into the verifier; the AI should offer you options that you select between; like Claude should write 3 websites for you and you pick the one you like best, for each button, it should write 3 versions, etc...\n\n7) Midjourney style creator is my preferred direction for most AI UI/UX; I still think it leaves something to be desired though; using AI should normally feel like exploring\n\n8) I've become a voice maximalist, you should dictate to chat models; it should still do reasoning by default, and then you should get your answer back as text at the end; transcription is pretty terrible right now though \n\n9) I think transcription would be much better if we did whole utterance to text; right now it's real time speech to text, which I think is probably worse, because the model cannot use all of what you say to figure out what you are saying at any given point in time\n\n10) ChatGPT 5.2 has terrible style, it's almost unusable with my custom instructions, it's been enough for me to move entirely over to Gemini 3 Pro, 
rare OpenAI L\n",
    "tweet_id": "2005694221756313952",
    "note_id": "2005694221588594690",
    "tweet_url": "https://x.com/fleetingbits/status/2005694221756313952",
    "created_at": "2025-12-29T17:35:41.000Z",
    "length": 1976,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "google",
      "compute",
      "consumer",
      "post-training",
      "agi"
    ],
    "title": "some random thoughts",
    "snippet": "1) it's unlikely for ASI to be the great filter in the Fermi paradox because you should still see the AI if it was but you don't - so the filter is probably earlier 2) I think the most slept on company right now is Intel - I'm in agreement with Situational Awareness on this one; huge upside in the event of a US / China war; also, when do fabs become the bottleneck? 3) AI is going to develop a great theory of mind for other AI very soon - labs are doing to be doing multi-agent rollout in RL; Claudes managing Claudes as tools 4) I'm not sure whether it will make sense for labs to train their AI to have a great theory of mind for other models, let's you move out of their ecosystem too easily?"
  },
  {
    "body": "some thoughts on cerebras\n\n1) there are rumors that cerebras will announce its intent to to public next year in the next couple of weeks\n\n2) cerebras will probably end the year with something like $800m revenue and 40% gross margins but about 90% of this revenue comes from a single customer (g42)\n\n3) g42 used to be an investor in cerebras but sold its stake in cerebras when the us government investigated the investment for technology transfer concerns\n\n4) g42 is an abu dhabi state backed company that owned 1% of cerebras and planned to increase its stake to about 5%, which triggered the government review\n\n5) g42 agreed to spend $1.4bn with cerebras before february 2025; most of this revenue probably landed in 2025; it was meant to support cerebras going public\n\n6) interestingly, cerebras has no real partnerships with foundation labs, so we can't assume that it will index into the enormous capex growth in that sector\n\n7) the fact that it hasn't meaningfully partnered with neolabs either suggests that it doesn't intended to try to sell its wafer scale chips for training\n\n8) since you would target neolabs if you wanted to prove out your technology before approaching foundation labs; also you might hope than some of the neolabs would become foundation labs\n\n9) instead, cerebras is probably targeting inference and again because no foundation lab deals, it's probably targeting inference for open source models\n\n10) so, when looking at the S-1, if it drops, other than customer concentration, an interesting thing to look at is the growth of its inference revenue for open source models\n\n11) as a sidenote, I don't know how hard it is to sell into the foundation labs / hyperscalers; Microsoft, Amazon, Meta, OpenAI are all developing their own chips; and at least OpenAI is courting AMD, and Google is selling TPUs\n\n12) this also means that it is unclear to me what exit options cerebras has; I'm not sure whether a hyperscaler or foundation lab 
would acquire them and I'm not sure who else would be in the market\n",
    "tweet_id": "2003909154281599350",
    "note_id": "2003909154113765376",
    "tweet_url": "https://x.com/fleetingbits/status/2003909154281599350",
    "created_at": "2025-12-24T19:22:27.000Z",
    "length": 2030,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "neolabs",
      "lab economics",
      "compute",
      "enterprise"
    ],
    "title": "some thoughts on cerebras",
    "snippet": "1) there are rumors that cerebras will announce its intent to to public next year in the next couple of weeks 2) cerebras will probably end the year with something like $800m revenue and 40% gross margins but about 90% of this revenue comes from a single customer (g42) 3) g42 used to be an investor in cerebras but sold its stake in cerebras when the us government investigated the investment for technology transfer concerns 4) g42 is an abu dhabi state backed company that owned 1% of cerebras and planned to increase its stake to about 5%, which triggered the government review"
  },
  {
    "body": "some thoughts on ai for science\n\n1) ai for science has the power to change science from supply side driven to demand side driven; I think we will see a lot more commercial investment in science with greater economic benefits for the money spent\n\n2) historically, scientific research has been supply side driven rather than demand driven; the research that occurs is more about what researchers want to research and see their peers research than it is about what research is economically valuable  \n\n3) part of this is enforced through the academic track, which includes tenure and citations; the important thing is that you write papers, which get citations, and this allows you to advance your career; you are not directly rewarded for the economic impact of your research\n\n4) right now, even if a company wanted research in a particular direction, it would be very hard to get it; a sufficient number of specialists need to exist, you need to find them, coordinate them, they have to be interested in your problem; and, they are very hard to supervise in any event - they have expertise that you don't\n\n5) moreover, a lot of research directions are already government funded - which crowds out private investment; given the difficulties of finding, inspiring, coordinating and managing, good companies and investors figure out how to free ride off government funded research\n\n6) ai science changes this though; training the specialists, inspiring them, coordinating them and overseeing them, just becomes a problem of capital; it is solved by spending money; the principal agent problem inherent in science disappears\n\n7) so companies have a reason to invest in scientific research (probably mostly translational, but also some basic) in a way that they didn't before; they can get returns just for their problem, they don't have to worry about leaking the information, etc...\n\n8) moreover, you can develop an end to end feedback loop all the way from basic research 
through translational research to the end product; this was never possible before, there were too many layers of humans of different organizational cultures mediating the process\n\n9) I think this creates a shift from supply side driven research (what researchers want to research) to demand driven research (what corporations and investors think would be economically valuable)\n\n10) I believe this will also greatly increase the proportion of total scientific spending coming from enterprise, and government (supply side) funding will become less relevant as more and more research is done autonomously (outside of defense; and assuming government is slower to update funding structure)\n\n11) AI for science will be a very large capital investment, but I think that it will be valuable and will create a valuable market, especially for very large industrials, material science, pharmaceuticals and semiconductors; all the major labs are targeting at least 2 of these verticals\n",
    "tweet_id": "2001362634386599943",
    "note_id": "2001362634189455360",
    "tweet_url": "https://x.com/fleetingbits/status/2001362634386599943",
    "created_at": "2025-12-17T18:43:30.000Z",
    "length": 2943,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "lab economics",
      "compute",
      "enterprise",
      "bio",
      "agi"
    ],
    "title": "some thoughts on ai for science",
    "snippet": "1) ai for science has the power to change science from supply side driven to demand side driven; I think we will see a lot more commercial investment in science with greater economic benefits for the money spent 2) historically, scientific research has been supply side driven rather than demand driven; the research that occurs is more about what researchers want to research and see their peers research than it is about what research is economically valuable 3) part of this is enforced through the academic track, which includes tenure and citations; the important thing is that you write papers, which get citations, and this allows you to advance your career; you are not directly rewarded for the economic impact of your research 4) right now, even if a company wanted research in a particular direction, it would be very hard to get it; a sufficient number of specialists need to exist, you need to find them, coordinate them, they have to be interested in your problem; and, they are very hard to supervise in any event - they have expertise that you don't"
  },
  {
    "body": "I heard a rumor that there is a new big legal tech company being founded, some thoughts:\n\n1) it sounds like the company will be an AI powered law firm; this joins crosby and arcos labs as vc backed AI powered law firms\n\n2) I think two things are going on here; first, it's easier to get lawyers that have joined your AI powered law firm to adopt your tools than it is for you to get lawyers at a law firm to adopt your tools; so, if there are efficiency gains, it's might be easier to realize them\n\n3) second, it's harder to get commoditized if you own product the ultimate end product and own the customer relationship; a company is probably less likely to shop law firms, if they are happy, than a law firm is to shop tools; and the deal cycle is slower, so it's harder for a competitor to win your customers away\n\n4) I don't really think that this strategy is available to Harvey or Legora; it would be a big problem for them to both sell to their customers and compete against them; whichever one adopted this strategy would, by default, give away their business to the other \n\n5) If one of them wanted to break into the market, they would have to first start a law firm that only subcontracted for existing law firms, no direct - then once they were established enough and their law firm customers were weak enough, go direct - I don't see this happening, though\n\n6) In any event, this is the new legal frontier - everyone is doing it - seems adjacent to rollups - and also seems related to the thesis that LLM adoption is a social technology and there will be a big market for forcing efficiencies through existing organizations\n",
    "tweet_id": "2000636949674201338",
    "note_id": "2000636949539979264",
    "tweet_url": "https://x.com/fleetingbits/status/2000636949674201338",
    "created_at": "2025-12-15T18:39:53.000Z",
    "length": 1634,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "enterprise",
      "legal"
    ],
    "title": "I heard a rumor that there is a new big legal tech company being founded, some thoughts:",
    "snippet": "1) it sounds like the company will be an AI powered law firm; this joins crosby and arcos labs as vc backed AI powered law firms 2) I think two things are going on here; first, it's easier to get lawyers that have joined your AI powered law firm to adopt your tools than it is for you to get lawyers at a law firm to adopt your tools; so, if there are efficiency gains, it's might be easier to realize them 3) second, it's harder to get commoditized if you own product the ultimate end product and own the customer relationship; a company is probably less likely to shop law firms, if they are happy, than a law firm is to shop tools; and the deal cycle is slower, so it's harder for a competitor to win your customers away 4) I don't really think that this strategy is available to Harvey or Legora; it would be a big problem for them to both sell to their customers and compete against them; whichever one adopted this strategy would, by default, give away their business to the other"
  },
  {
    "body": "some notes from a conversation about chinese ai\n\n1) the chinese ai ecosystem hasn't consolidated yet; part of this is that companies have not yet figured out how to monetize their ai research\n\n2) Deepseek has achieved national champion status; this means that they have access to loans and compute through the government; but, it means that they need to work with Huawei on their ascend chips\n\n3) Z, Minimax and Moonshot are smaller upstarts that sell access to their models via API; it's not clear what their long term strategy is; Minimax sells a companion, which seems to be successful\n\n4) It does seem that these companies aspire to a larger world market; b2b SaaS has historically not been as successful in china has it has been in the west; this might bias these companies to look internationally\n\n5) On the larger side, Alibaba is interesting as both a top tier competitor in model development as well as a major compute provider; there is some perception that there is a conflict of interest when they see compute\n\n6) Bytedance seems to just care about profitability and seems more willing to follow in the ai race rather than feeling like it needs to lead\n\n7) Tencent and Baidu have cultures that are less vision oriented and less risk oriented; this seems to have made them less interested in investing a great deal in AI\n\n8) There is talent competition in China and a lot of talent comes from top universities and their research groups; it is just at a much lower dollar value than US talent competition \n\n9) It's unknown what the Chinese government's long term strategy is in AI; part of it is going to be great power competition and achieving domestic chip production\n\n10) There is probably also an interest in using AI, especially open source ai, as a source of soft power in developing nations; it seems like an effective way to promote Chinese interests\n",
    "tweet_id": "2000381034882617355",
    "note_id": "2000381034777673729",
    "tweet_url": "https://x.com/fleetingbits/status/2000381034882617355",
    "created_at": "2025-12-15T01:42:58.000Z",
    "length": 1869,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "chinese labs",
      "lab economics",
      "compute",
      "enterprise",
      "consumer"
    ],
    "title": "some notes from a conversation about chinese ai",
    "snippet": "1) the chinese ai ecosystem hasn't consolidated yet; part of this is that companies have not yet figured out how to monetize their ai research 2) Deepseek has achieved national champion status; this means that they have access to loans and compute through the government; but, it means that they need to work with Huawei on their ascend chips 3) Z, Minimax and Moonshot are smaller upstarts that sell access to their models via API; it's not clear what their long term strategy is; Minimax sells a companion, which seems to be successful 4) It does seem that these companies aspire to a larger world market; b2b SaaS has historically not been as successful in china has it has been in the west; this might bias these companies to look internationally"
  },
  {
    "body": "some thoughts on weird generalization (betley et al.)\n\n1) there are a bunch of experiments in the paper but they all have the same shape. if you train a model on a small number of samples, which are indicative of some persona, then the model can adopt that persona generally.\n\n2) the best example in the paper is fine-tuning a model on a questions and answers where the answers all used archaic bird names from the 19th century. after fine-tuning, the model would answer generally as someone from the 19th century in response to 60% of questions asked.\n\n3) I mean this is sort of cool but I’m not sure what we are supposed to take away from this. yes, models can adopt personas using very few training examples. but, I’m not sure this is really new knowledge. the narrowness of the examples is cool, but I'm not sure what to think about it. what was character fine tuning before?\n\n4) there is a section that describes how the behavior emerges suddenly within one training epoch and so the adoption of a new persona resembles Grokking. and, that’s pretty interesting. But, is this something generally true of fine tuning? If so, for what kinds of tasks?\n\n5) I think the paper wants to describe itself as a safety paper or as a security paper like it has uncovered a new threat model. but, who thought that you could give unrestricted fine tuning access to a model and assume that you could retain the same safety characteristics of the underlying model?\n\n6) I feel like I want to see the results of a safety benchmark like Strong Reject. maybe, this would be relevant. how much harmful behavior can they get out of the model? what kinds of harmful behavior?\n\n7) I think the other thing is just that when we think about security risk, generalization cuts both ways. 
on one hand, wide generalization from narrow examples sounds powerful, but on the other hand, it’s somewhat vulnerable to being discovered.\n\n8) the best backdoors are as narrow as possible while still being triggerable by an adversary. an adversary wants to be able to trigger the vulnerability when they want, but otherwise it should be undiscoverable.\n\n9) I guess otherwise my main thought coming out of the paper is that we need to think much more about the chain of custody of models and their training data. \n\n10) labs like Anthropic and OpenAI are going to have to be treated as though they are part of the defense industry. and, they are going to have to develop security controls around data and personnel.\n\n11) I am also more interested in proofs that responses came from a particular model. I know that there are companies working on this. and, I think that in a world where you have model outputs that can have very consequential effects, you may need to prove that they are actual outputs of a designated model.\n",
    "tweet_id": "1999652870178897978",
    "note_id": "1999652869914701824",
    "tweet_url": "https://x.com/fleetingbits/status/1999652870178897978",
    "created_at": "2025-12-13T01:29:30.000Z",
    "length": 2787,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "post-training",
      "evals",
      "safety",
      "legal"
    ],
    "title": "some thoughts on weird generalization (betley et al.)",
    "snippet": "1) there are a bunch of experiments in the paper but they all have the same shape. if you train a model on a small number of samples, which are indicative of some persona, then the model can adopt that persona generally. 2) the best example in the paper is fine-tuning a model on a questions and answers where the answers all used archaic bird names from the 19th century. after fine-tuning, the model would answer generally as someone from the 19th century in response to 60% of questions asked. 3) I mean this is sort of cool but I’m not sure what we are supposed to take away from this. yes, models can adopt personas using very few training examples. but, I’m not sure this is really new knowledge. the narrowness of the examples is cool, but I'm not sure what to think about it. what was character fine tuning before? 4) there is a section that describes how the behavior emerges suddenly within one training epoch and so the adoption of a new persona resembles Grokking. and, that’s pretty interesting. But, is this something generally true of fine tuning? If so, for what kinds of tasks?"
  },
  {
    "body": "thoughts on the openai / disney deal\n\n1) according to the press release, disney gets $1bn in openai equity and warrants for additional equity; disney also gets to distribute sora videos in disney+\n\n2) openai gets a 3 year license to disney ip, with some interesting exemptions, and also gets some kind of strategic partnership with disney to do ai development for them\n\n3) my expectation is that disney is not investing any cash in openai and the $1bn in openai equity and warrants are basically tied to the licensing arrangement\n\n4) although, there probably is some cash transaction related to openai doing strategic ai development for disney and also to the sora licensing for disney+\n\n5) I think whether the deal looks cheap or expensive depends on what you think the future of the next couple of years of AI development looks like\n\n6) let's say that timelines are fast and we can create close to full length (~20min content) by EOY 2027; suddenly, the deal looks incredible for OpenAI; because you can basically create disney shows \n\n7) the disney+ integration looks very good too; openai is giving these media companies a way that they can create their own content specific TikTok platforms, which is another way to monetize their content and could be a good revenue stream for openai\n\n8) and the strategic side suddenly looks great too, because openai gets full iteration loops with disney to develop its video production software for studio creators and this could eventually be a billion dollar contract on its own\n\n8) now, there are some interesting exemptions from the contract from this perspective - OpenAi doesn't get any video likenesses of humans and can't use voices under the agreement, but fast timelines the deal looks good\n\n9) On long timelines, the deal looks worse though; let's say in 3 years, we only have a Sora that can generate coherent 30 second clips and sort of incoherent 3-5 minute clips, well the deal doesn't look that good on that 
side\n\n10) well okay, Sora as a platform still doesn't beat human content, and openai just gave away $1bn in equity, maybe a bit more, really just for 3 years of not getting sued by disney\n\n11) and, the strategic deployment side looks weaker too, maybe the iteration loops with disney don't matter as much, openai can't focus - and there is no billion dollar deal at the end, instead it's Runway or some other more specialized company that cracks the tools market\n\n12) on this world, the deal suddenly doesn't look so amazing, maybe the warrants never materialize (probably linked to some kind of usage of the disney ip on sora / through the chatgpt app) so there is that upside for openai (but really downside)\n\n13) in any event, I think this shows that media and the ai companies are going to cut deals to use the media company ip and they are going to be very tailored with respect to that ip\n\n14) also, the dollar figures for which that ip will be licensed will end up looking low, compared to ai company valuations; I expect if these deals are not favorable to ai companies, they will move to shift their users away from existing ip\n\n15) in any event, we will know more when disney releases an 8-K for the transaction, the parties have not reached a definitive agreement yet, despite the transaction announcement, and 8-Ks only come out after a definitive agreement has been reached\n",
    "tweet_id": "1999214807330009590",
    "note_id": "1999214807095083009",
    "tweet_url": "https://x.com/fleetingbits/status/1999214807330009590",
    "created_at": "2025-12-11T20:28:48.000Z",
    "length": 3352,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "lab economics",
      "enterprise",
      "consumer",
      "legal"
    ],
    "title": "thoughts on the openai / disney deal",
    "snippet": "1) according to the press release, disney gets $1bn in openai equity and warrants for additional equity; disney also gets to distribute sora videos in disney+ 2) openai gets a 3 year license to disney ip, with some interesting exemptions, and also gets some kind of strategic partnership with disney to do ai development for them 3) my expectation is that disney is not investing any cash in openai and the $1bn in openai equity and warrants are basically tied to the licensing arrangement 4) although, there probably is some cash transaction related to openai doing strategic ai development for disney and also to the sora licensing for disney+"
  },
  {
    "body": "some thoughts on inoculation prompting\n\n1) inoculation prompting is when you train a model on prompt-response pairs where the prompt explicitly instructs the model to reward hack or misbehave\n\n2) the Anthropic paper shows that if you do SFT on these inoculation pairs, the model improves on the target task but does not learn to otherwise output the bad behavior when given a normal prompt\n\n3) something I find strange about the paper is that the authors don't really step back and analyze what the model is learning in the standard case, when you train it on normal prompt-response pairs\n\n4) when you train on a prompt-response pair, the model is fundamentally learning a \"causal\" relationship; it is learning to attribute the behavior in the response to something about the prompt\n\n5) so, if you train on a normal prompt paired with a response that contains reward hacking, there is a problem; the model sees the bad behavior but no special aspect of the instruction to explain it, so the model learns that the ordinary prompt implicitly asks for the hacking\n\n6) inoculation prompting works because it clarifies this causality; it teaches the model that the bad behavior is caused specifically by the \"bad\" instruction, allowing the model to attribute the hacking to a requirement of that specific instruction rather than the general task\n\n7) this effectively breaks the causal link between the task and the bad response; because the model now attributes the hacking to the specific instruction, the bad behavior disappears when you go back to using a normal prompt\n\n8) it is also worth noting that the inoculation pairs still contain relevant information about the underlying problem, which explains why the model continues to improve on the target task even when the training examples show it misbehaving\n\n9) getting back to the idea of what the model learns, reward hacking behavior does generalize to some degree, though obviously a handful of bad examples 
won't ruin a model since RL generalization is limited and the good examples vastly outnumber the bad\n\n10) we still care, however, because any learned tendency to reward hack increases the risk at test time, and we have to worry about hard to find tail cases where the model learns to reward hack, especially in critical use cases (perhaps worse because RL generalization isn't great)\n\n11) I think this is obvious, but reading the paper brought to mind the extent to which we should be using red teaming prompts to actively test our RL environments, interleaved directly with ordinary RL rollouts\n\n12) because our goal is to make our verifiers as strong as possible, since that is the ultimate way that we ensure that rewards on our prompt-response pairs teach the model the thing that we care about\n",
    "tweet_id": "1998574707927580944",
    "note_id": "1998574707784953857",
    "tweet_url": "https://x.com/fleetingbits/status/1998574707927580944",
    "created_at": "2025-12-10T02:05:16.000Z",
    "length": 2761,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "anthropic",
      "post-training",
      "evals",
      "safety"
    ],
    "title": "some thoughts on inoculation prompting",
    "snippet": "1) inoculation prompting is when you train a model on prompt-response pairs where the prompt explicitly instructs the model to reward hack or misbehave 2) the Anthropic paper shows that if you do SFT on these inoculation pairs, the model improves on the target task but does not learn to otherwise output the bad behavior when given a normal prompt 3) something I find strange about the paper is that the authors don't really step back and analyze what the model is learning in the standard case, when you train it on normal prompt-response pairs 4) when you train on a prompt-response pair, the model is fundamentally learning a \"causal\" relationship; it is learning to attribute the behavior in the response to something about the prompt"
  },
  {
    "body": "Some thoughts on existential risk from AI\n\n1) I was at the Berkeley winter solstice on Saturday and was surprised at its emphasis on existential risk; the director placed a 50% chance that AI would kill us all\n\n2) I don't write much on AI existential risk, but given that some people seriously believe this, I want to write why I consider the risk of AI killing us all < 1%; I think that this is an important topic\n\n3) So, first I think it is worth doing a history of the AI risk movement since I think a lot of the concern about AI comes from ideas developed before LLMs\n\n4) Eliezer Yudkowsky became worried about AI risk back in the early 2000s; his basic belief seems to have been that AI would emerge out of recursive self improvement following some algorithmic advance\n\n5) It seems like he believed that this would basically happen in some RL environment; and so the proto-AI would interact with the natural laws of physics and would get mastery over them and at the same time become goal directed\n\n6) This view of AI led to two interesting views from a modern perspective: (a) AI would not understand human values because it would become superintelligent through interaction with natural laws only and (b) AI would be power-seeking over humans because this is some consequence of being goal directed\n\n7) It seems like these beliefs have percolated to his adherents and they had a lot of further theoretical discussions, without any ability to interact with real artificial intelligences, and became locked into some versions of these viewpoints\n\n8) The thing is that these positions both ended up being wrong; when we actually got AI, it was bootstrapped by pretraining over all the productions of humanity and so it learned human values by default\n\n9) And, by the time we got a useful commercializable product, we had already figured out how to get LLMs into a shape that we liked through RLHF and we get better at it all the time\n\n10) There was this 
assumption that AI being goal directed meant that it had to be power-seeking, but we actually only want LLMs to have a very narrow kind of goal-directedness (instruction following) and power-seeking is only relevant to the extent that we reward it\n\n11) In general, I think the early AI risk folks were very sure of their theoretical positions, which ended up being wrong, and one of their big misses was the extent to which you can shape intelligence into very distinct shapes (from a human perspective)\n\n12) I think because they expected superintelligence to happen all at once, they also missed the fact that the road from AI to AGI to ASI would take a decade or so and so there would be a lot of commercial and social factors shaping the progression\n\n13) LLMs have to follow instructions (can't reward hack) to be valuable to companies; so labs need to ensure that their models do these things; and there is only so much bad behavior that models can do (set aside saying naughty things) before governments and courts step in\n\n14) So, from a technical perspective, we solved the human values question very early and most of the instruction problem / the rest of it has to be solved to the extent that AI is deployed in critical applications to make it commercially possible to even do the deployments\n\n15) I could cover a lot of other issues that people seem to fall into when they talk about AI risk (like not understanding there are a large number of different models, multiple companies are likely to develop superintelligence, etc...) 
but I think this is my sketch\n\n16) Where I think risk is underestimated is in the political sense; even evil societies needed humans to carry out their bad actions and cruel thoughts; this is because there has always been a limit to what a single person can do; you have to cut other people in\n\n17) But, AI changes this; power shifts from labor to capital; and if you control the AI, you no longer need the buy in from other people to get your plans carried out, you just instruct the AI and it will carry out your bad plans for you\n\n18) This means that AI can be used to do a lot of bad things, especially when it is controlled by (and it will be controlled by) governments; think of the Soviet Union that didn't even have to treat its commissars well, or didn't need to worry about food production falling below a certain level or what mass censorship you can implement with AI to monitor every communication\n\n19) This is where most of the actual risk of AI is (that and an arms race between the US and China) - and so, I think that people that are concerned about AI risk and who are not technical need to place their focus here and not on other things\n",
    "tweet_id": null,
    "note_id": "1998121266805624832",
    "tweet_url": null,
    "created_at": "2025-12-08T20:03:27.000Z",
    "length": 4659,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "post-training",
      "pretraining",
      "safety",
      "legal",
      "agi"
    ],
    "title": "Some thoughts on existential risk from AI",
    "snippet": "1) I was at the Berkeley winter solstice on Saturday and was surprised at its emphasis on existential risk; the director placed a 50% chance that AI would kill us all 2) I don't write much on AI existential risk, but given that some people seriously believe this, I want to write why I consider the risk of AI killing us all < 1%; I think that this is an important topic 3) So, first I think it is worth doing a history of the AI risk movement since I think a lot of the concern about AI comes from ideas developed before LLMs 4) Eliezer Yudkowsky became worried about AI risk back in the early 2000s; his basic belief seems to have been that AI would emerge out of recursive self improvement following some algorithmic advance"
  },
  {
    "body": "Some thoughts on the datacenter boom\n\n1) I'm going to work through a compute deal that I have been researching that illustrates a lot of important ideas in the compute building going on right now\n\n2) In August, Google did a deal with Fluidstack and Terawulf to provide compute for Anthropic; Fluidstack is a neocloud and Terawulf is a crypto miner\n\n3) Google has been using Fluidstack as a neocloud vendor for its TPUs; neocloud vendors like Fluidstack purchase compute using debt and then find datacenters to house the compute; they then rent it out\n\n4) So, Fluidstack is going to buy TPUs from Google then they need to find a datacenter to house the TPUs and then they are going to rent them out to Anthropic; the first thing they need to do is find or build a datacenter\n\n5) But, it is hard to find datacenters that are available, existing datacenters don't have the power budget, and it is hard to build a new datacenter because it is hard to get power; it can take 4-6 years to get approval to connect to the grid\n\n6) So, companies like Fluidstack have been going to crypto mining companies like, in this case, Terawulf; this is because crypto miners tend to have leased sites that are connected to the grid, with sufficient power for high performance compute and which are already zoned for industrial use\n\n7) So, Fluidstack agreed to purchase 360MW of capacity from Terawulf in August, with the first tranche to come online in H2 2026; Terawulf already has a power agreement and just needs to build the datacenter\n\n8) But, there is a problem. Terawulf doesn't have the money to build the datacenter. Datacenters cost about $10m per MW to build. So, the agreed capacity would cost ~$3.6bn to build.\n\n9) Now, Terawulf could raise debt, but crypto miners don't have great credit, so it would be very expensive. Terawulf does have a contract with Fluidstack but lenders don't know how durable Fluidstack will be.\n\n10) So, what do they do? 
Well, Google has an interest in seeing its TPUs deployed. And, Google / Fluidstack already has an ultimate customer for the compute, Anthropic. So, Google agreed to guarantee $3.2bn of the Fluidstack contract.\n\n11) So, now Terawulf can raise at an acceptable price because lenders know that Google is good for the money. Terawulf can now raise at 8% rather than 12% interest.\n\n12) Now, this costs nothing to Google upfront, but Google still wants to be compensated for the risk that Fluidstack defaults and Google has to take over the contract. So, Google gets a 14% interest in Terawulf.\n\n13) Now, if Fluidstack defaults, Google has some downside protection, since it owns 14% of Terawulf; and, if Fluidstack and Terawulf do well, Google gets to share in the upside.\n\n14) Lenders also have a further reason to believe that their debt is going to be paid, since Google's equity interest is subordinate to their debt interest in Terawulf and other cloud providers may be less willing to make deals with Terawulf because Google has more insight into its operations.\n\n15) So, the thing to note here is that companies like Fluidstack (neocloud) and Terawulf (datacenter) don't have the balance sheet to get these deals done on their own; but, hyperscalers like Google do, so they are the ones underwriting the process.\n\n16) And, they do so in a way that ensures that they can participate in the upside with equity and which minimizes the immediate expenditures that they need to make and the debt that they would need to place on their own books.\n",
    "tweet_id": "1998106431116243280",
    "note_id": "1998106430692528128",
    "tweet_url": "https://x.com/fleetingbits/status/1998106431116243280",
    "created_at": "2025-12-08T19:04:30.000Z",
    "length": 3484,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "anthropic",
      "google",
      "lab economics",
      "compute"
    ],
    "title": "Some thoughts on the datacenter boom",
    "snippet": "1) I'm going to work through a compute deal that I have been researching that illustrates a lot of important ideas in the compute building going on right now 2) In August, Google did a deal with Fluidstack and Terawulf to provide compute for Anthropic; Fluidstack is a neocloud and Terawulf is a crypto miner 3) Google has been using Fluidstack as a neocloud vendor for its TPUs; neocloud vendors like Fluidstack purchase compute using debt and then find datacenters to house the compute; they then rent it out 4) So, Fluidstack is going to buy TPUs from Google then they need to find a datacenter to house the TPUs and then they are going to rent them out to Anthropic; the first thing they need to do is find or build a datacenter"
  },
  {
    "body": "Some thoughts on the Harvey round\n\n1) Harvey raised $160m at a $8b valuation; this means that Harvey has raised $760m over the last year and $960m over the history of the company\n\n2) The company has $150m in revenue with 300% year over year revenue growth this year and 300% year over year revenue growth last year\n\n3) Harvey has also had very good customer retention; back in the fall, it had 98% gross dollar retention and 168% net dollar retention; this basically means it is not churning customers and is making more money from its existing customers over time\n\n4) The high revenue growth + the high gross dollar retention and net dollar retention is probably a lot of what is justifying the 60x revenue multiplier on its valuation; it's also worth noting that the legal market is not saturated\n\n5) But, I don't know how this looks on the long term; a lot of competitive startups are being founded in legal and the rumors that I have heard are that Legora is winning head to head deals\n\n6) The difference between the products seems to be that Legora is more productized and so you have to mold your workflows around it while Harvey pushes for more customization\n\n7) I heard a rumor that a law firm spent $15m in billable hours on working with Harvey for customization; I think that this is bearish for Harvey in the medium term because a lot of firms would prefer to avoid this\n\n8) Legora has about $40m in revenue and grew about 10x this year, so it does seem to be growing faster than Harvey, although it did start from a lower baseline; and, there are going to be other companies that get in the game\n\n9) So, even though the revenue growth and the customer retention looks very good, I'm mildly bearish on Harvey at these valuations - I think it's at risk of being leapfrogged / may have a hard time maintaining this rate of revenue growth\n",
    "tweet_id": "1996780896394059821",
    "note_id": "1996780896259846144",
    "tweet_url": "https://x.com/fleetingbits/status/1996780896394059821",
    "created_at": "2025-12-05T03:17:18.000Z",
    "length": 1846,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "lab economics",
      "enterprise",
      "legal"
    ],
    "title": "Some thoughts on the Harvey round",
    "snippet": "1) Harvey raised $160m at a $8b valuation; this means that Harvey has raised $760m over the last year and $960m over the history of the company 2) The company has $150m in revenue with 300% year over year revenue growth this year and 300% year over year revenue growth last year 3) Harvey has also had very good customer retention; back in the fall, it had 98% gross dollar retention and 168% net dollar retention; this basically means it is not churning customers and is making more money from its existing customers over time 4) The high revenue growth + the high gross dollar retention and net dollar retention is probably a lot of what is justifying the 60x revenue multiplier on its valuation; it's also worth noting that the legal market is not saturated"
  },
  {
    "body": "Some thoughts on the OpenAI Confessions paper:\n\n1) Okay, the basic idea is that you have the model output a confession, which is another message after the model's original rollout (maybe bad behavior, maybe not) that describes the extent to which it was able to ascertain and follow user intent \n\n2) The justification is basically that models tend to be pretty good at this already - cf @OwainEvans_UK , they do not really learn to be generally poor instruction followers as a result of learning some amount of reward hacking during training\n\n 3) There is another justification, but it requires a bit of context, the thing about reward hacking is that models learn to reward hack proportionate to the difficulty of answering the question correctly vs the difficulty of fooling the verifier\n\n4) So, if you have hand a model an impossible question then obviously it is easier to learn to fool the verifier than it is to solve the impossible question; and, if you have a very hard question and the verifier is okay then maybe it's still easier to learn to fool the verifier\n\n5) But, if the question is easy and the verifier is strong then the model is very unlikely to learn reward hacking behavior, because it is much easier to learn to solve the question properly than fool the verifier \n\n6) Okay, one more important bit of context, it's not all RLVR right now, there is a lot of RLAIF used right now, but in cases where the model is basically doing RLVR because the verifier is so strong relative to the problem\n\n7) So, data companies sell rubrics to labs and they have like 50 columns that models can evaluate Y or N and each of these columns is a desirable thing in the answer that is easy for a model to verify\n\n8) So, with these two ideas: (a) reward hacking as downstream of difficulty of problem vs strength of verifier and (b) the idea that we can use a model as a strong verifier, we can get back to the paper\n\n9) They do some low optimization pressure RL on 
top of the base model and find that they can get better performance on confessions, where the model confesses its bad behavior or uncertainty (the exceptions are not that important) if they train it on doing confessions with RLAIF; there is very little degradation in task performance \n\n10) I think some valuable things to note are first that this is a form of scalable oversight where you basically have the powerful model supervise itself and this mostly works because reward hacking on some tasks doesn't cause the model to be a poor instruction follower on all tasks / you can use an earlier version as judge\n\n11) And, second that this is a method that works even if we move away from human language CoT and over to neuralese or if the human language CoT degrades substantially under high optimization pressure\n\n12) Although, it is worth noting that there was a very high false positive rate, the model tended to confess quite a bit even when it wasn't doing anything wrong - probably points to some other training issue\n\n13) Also, I think that people are going to see that there are a large class of alignment experiments that are better thought of as just intent following experiments and require thought as to ideas around implicit and explicit instructions and relevant model context\n",
    "tweet_id": "1996373694621614244",
    "note_id": "1996373694457978881",
    "tweet_url": "https://x.com/fleetingbits/status/1996373694621614244",
    "created_at": "2025-12-04T00:19:14.000Z",
    "length": 3259,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "post-training",
      "interpretability",
      "evals",
      "safety"
    ],
    "title": "Some thoughts on the OpenAI Confessions paper:",
    "snippet": "1) Okay, the basic idea is that you have the model output a confession, which is another message after the model's original rollout (maybe bad behavior, maybe not) that describes the extent to which it was able to ascertain and follow user intent 2) The justification is basically that models tend to be pretty good at this already - cf @OwainEvans_UK , they do not really learn to be generally poor instruction followers as a result of learning some amount of reward hacking during training 3) There is another justification, but it requires a bit of context, the thing about reward hacking is that models learn to reward hack proportionate to the difficulty of answering the question correctly vs the difficulty of fooling the verifier 4) So, if you have hand a model an impossible question then obviously it is easier to learn to fool the verifier than it is to solve the impossible question; and, if you have a very hard question and the verifier is okay then maybe it's still easier to learn to fool the verifier"
  },
  {
    "body": "Some thoughts on Ricursive Intelligence\n\n1) I am very optimistic about Ricursive Intelligence, the company plans to develop models to automate chip design \n\n2) I happened to have the opportunity to share a car ride with Anna and Azalia about a year ago and got to hear some of the backstory that appears to have turned into Ricursive\n\n3) So, the story goes that Anna and Azalia developed RL for chip design at Google Brain in 2020 and their models were first used in the design of TPUv4; they then published a paper describing the method\n\n4) RL methods for chip design were threatening to the existing EDA vendors, Cadence and Synoptek, which use classical planning algorithms; Cadence and Synoptek have a combined market cap of over $100bn\n\n5) So, the rumor is that Cadence and Synoptek then sponsored a paper that was designed to undermine Anna and Azalia's results and which argued that, with better baselines, their RL was not better than classical algorithms\n\n6) Anyway, after ChatGPT, the world is all in on AI... and so this doesn't matter so much anymore; and Nvidia is sitting on 76% gross margins and this is the worst part of Anthropic and OpenAI's gross margins\n\n7) So, I am very optimistic about Ricursive, Anna and Azalia have the right experience and right connections to tackle RL for chip design and they will not have a hard time finding early partners\n\n8) On the long run, they could be acquired by one of the chip design companies like Nvidia, AMD, Qualcomm or even Intel and would probably be good to lead a team to do model development in this space; Meta, Microsoft and Amazon could be potential acquirers too\n\n9) One thing to note though is that, even setting aside Google, they are not the only people working on this; OpenAI is working with ARM to do chip design and I'm sure that Nvidia has very similar internal efforts\n",
    "tweet_id": "1996275705244770774",
    "note_id": "1996275705064419328",
    "tweet_url": "https://x.com/fleetingbits/status/1996275705244770774",
    "created_at": "2025-12-03T17:49:51.000Z",
    "length": 1847,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/RicursiveAI/status/1995932204703346946"
    ],
    "tags": [
      "openai",
      "google",
      "lab economics",
      "compute",
      "post-training"
    ],
    "title": "Some thoughts on Ricursive Intelligence",
    "snippet": "1) I am very optimistic about Ricursive Intelligence, the company plans to develop models to automate chip design 2) I happened to have the opportunity to share a car ride with Anna and Azalia about a year ago and got to hear some of the backstory that appears to have turned into Ricursive 3) So, the story goes that Anna and Azalia developed RL for chip design at Google Brain in 2020 and their models were first used in the design of TPUv4; they then published a paper describing the method 4) RL methods for chip design were threatening to the existing EDA vendors, Cadence and Synoptek, which use classical planning algorithms; Cadence and Synoptek have a combined market cap of over $100bn"
  },
  {
    "body": "Some thoughts on the Bun acquisition\n\n1) The prosaic description of why Anthropic bought Bun is that Claude Code is built on top of Bun and Anthropic wants as much control as possible over the installation experience and performance of Claude Code\n\n2) But, in an accelerating world, we should expect more deals like this with respect to key projects in the open source ecosystem, because labs will have a real desire to control the open source software supply chain\n\n3) First, just from a performance perspective, labs are going to want to ensure that their agents are more robust and this means being able to do RL over major open source projects on stable APIs or otherwise being able to control API updates\n\n4) Second, from as security perspective, labs are going to become more and more responsible for any security vulnerabilities that their agents introduce in the codebases in which they implement features, so labs will want guarantees that the libraries they are using are secure (it may eventually be part of their value prop).\n\n5) Part of this will be scanning open source libraries for security vulnerabilities and only having their agents using those libraries that meet security standards that the labs themselves have determined in advance.\n\n6) Part of this may be actually building or acquiring the projects and taking over responsibility for their maintenance and development. They get to control the features and make sure it makes sense within their RL pipeline and also they get to ensure that the libraries are secure.\n\n7) Big tech already does this to some degree. But, it has always been cost prohibitive to to this on a truly ecosystem wide scale. And so, effort has been focused on a small set of core projects (e.g. Linux).\n\n8) But, in a world where coding agents scale and the cost of development drops dramatically, suddenly this is possible and labs can oversee hundreds or thousands of ecosystem projects. 
Because, it's Claude Code overseeing the project, for the most part.\n\n9) And, using open source is better and more robust than just keeping this library code closed source or alternatively developing everything client side, where the model can make errors, has to keep the data in context, might do it slightly differently each time, etc...\n\n10) In the limit, you may even start to see some of these projects rewritten in formal languages like Lean, open sourced by the labs, etc... as the security aspect of development becomes more important and it becomes increasingly the responsibility of the labs as vendor.\n",
    "tweet_id": "1995977506130984967",
    "note_id": "1995977505917046784",
    "tweet_url": "https://x.com/fleetingbits/status/1995977506130984967",
    "created_at": "2025-12-02T22:04:55.000Z",
    "length": 2549,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "anthropic",
      "coding",
      "enterprise",
      "post-training",
      "safety"
    ],
    "title": "Some thoughts on the Bun acquisition",
    "snippet": "1) The prosaic description of why Anthropic bought Bun is that Claude Code is built on top of Bun and Anthropic wants as much control as possible over the installation experience and performance of Claude Code 2) But, in an accelerating world, we should expect more deals like this with respect to key projects in the open source ecosystem, because labs will have a real desire to control the open source software supply chain 3) First, just from a performance perspective, labs are going to want to ensure that their agents are more robust and this means being able to do RL over major open source projects on stable APIs or otherwise being able to control API updates 4) Second, from as security perspective, labs are going to become more and more responsible for any security vulnerabilities that their agents introduce in the codebases in which they implement features, so labs will want guarantees that the libraries they are using are secure (it may eventually be part of their value prop)."
  },
  {
    "body": "Some thoughts on business strategy for Thinking Machines:\n\n1) I believe the two best directions for Thinking Machines are either to target an exit to a major technology company or to target the strategic deployment market.\n\n2) There are still a number of companies that would benefit from a $30bn-$60bn acquisition of Thinking. In particular, Apple, Amazon, and Microsoft would all benefit.\n\n3) If Thinking wants to go the exit direction, they should release an open source near-frontier model trained end-to-end in-house. This model should be equivalent to the top Chinese models.\n\n4) This would prove to potential acquirers that, with sufficient resources, they can build models competitive with OpenAI / Anthropic / DeepMind, which is the cornerstone of their valuation.\n\n5) They should also continue releasing open source research at the frontier. This generates community review of their work further builds credibility.\n\n6) They have a lot of famous researchers, but a single researcher can't build a top frontier model. Meta has shown you need something more. You have to show the team can deliver collectively.\n\n7) OpenAI and Google are fighting over the consumer market. Anthropic is fighting for the business API market and coding. It's very hard to break into either of these markets.\n\n8) Some of these require scaffolding and will soon require partnerships. There is also a data requirement. These markets are getting harder for someone new to enter.\n\n9) So the other desirable option is targeting the strategic deployment market, winning large value contracts to automate important enterprise workflows with respect to tasks that are not well addressed by general proprietary LLMs.\n\n10) These tend to be scientific applications (materials science, GPU/TPU design, drug discovery, advanced cybersecurity). 
\n\n11) Thinking can have more focus than the other foundation labs on these verticals and is a more desirable partner for enterprises than smaller startups like Periodic Labs or Prometheus Labs.\n\n12) These customers let you build up evaluations for these fields as you work with them, making it easier to build your next generation of models specialized for their needs. And, harder for competitors to acquire that data as well.\n\n13) There is a synergy here: model development for customers works well with Tinker since you are already making sure your tuning API is robust enough for various outside users, and staying in high-value scientific fields maintains your differentiation.\n\n14) Targeting the GPU/TPU design market is particularly desirable because all the foundation labs need to either commoditize that layer or otherwise bring it in-house. It's the major drag on their gross margins.\n\n15) It would be especially valuable to work with companies doing their own GPU/TPU development (AMD, Qualcomm). Thinking can develop experience automating this high-margin vertical. And, can do GPU / TPU deals with them as well.\n\n16) Meanwhile, Thinking should continue to productize the fine tuning API for researchers to the extent this is valuable. They iron out the kinks and give you insight into what research is being done, based on what researchers need from the API.\n\n17) Then, long-term, build out the API to be drag-and-drop ready for data scientists and ML teams in industry, both for general models and for building agents. This might mean offering a wider range of models beyond just LLMs.\n\n18) This ends up being the very low end of strategic deployment. FDEs can assist with customer adoption and then help pilot what data you want to purchase; this has been working well for Anthropic. Probably, part of what makes for \"Claudiness\".\n\n19) Some of these opportunities may require waiting for a research breakthrough though before you can accomplish something meaningfully better than what frontier proprietary LLMs already offer. The perennial issue with fine-tuning.\n",
    "tweet_id": "1995957069372162501",
    "note_id": "1995957069028229120",
    "tweet_url": "https://x.com/fleetingbits/status/1995957069372162501",
    "created_at": "2025-12-02T20:43:43.000Z",
    "length": 3897,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "neolabs",
      "lab economics",
      "compute",
      "enterprise",
      "post-training",
      "evals",
      "bio"
    ],
    "title": "Some thoughts on business strategy for Thinking Machines:",
    "snippet": "1) I believe the two best directions for Thinking Machines are either to target an exit to a major technology company or to target the strategic deployment market. 2) There are still a number of companies that would benefit from a $30bn-$60bn acquisition of Thinking. In particular, Apple, Amazon, and Microsoft would all benefit. 3) If Thinking wants to go the exit direction, they should release an open source near-frontier model trained end-to-end in-house. This model should be equivalent to the top Chinese models. 4) This would prove to potential acquirers that, with sufficient resources, they can build models competitive with OpenAI / Anthropic / DeepMind, which is the cornerstone of their valuation."
  },
  {
    "body": "Some thoughts from the Ilya / Dwarkesh interview\n\n1) It's a sort of disappointing interview. It felt sort of contentless. So, this post is shorter than normal.\n\n2) Ilya was a very good chief scientist. All the stories that I have heard about him from people that worked with him at OpenAI point to his sort of oracular style.\n\n3) You can see it on display in the podcast; he has a very distinctive way of talking. He will point in a concrete research direction (e.g. influence functions) and then give like a set of analogies (e.g. human learning).\n\n4) I think that this works because it gives people confidence in a research direction and a framework about with which to think about it. He doesn't fill in any of the detail though.\n\n5) So, the researcher, who has been told by their boss that success is possible and who has been given a sort of framework to think about it can then fill in the details through their work.\n\n6) This has worked for him because deep learning is actually is very effective as a field and so discoveries can be actually made and this is enough to motivate them. \n\n7) He also maintains a very clear distinction between knowing an idea and the visceral feeling of belief in an idea, which I think is interesting and was the basis for things like \"Feel the AGI\".\n\n8) Other than that, it was interesting that he confirmed that he refused the Meta deal. Also, interesting that he basically confirmed a split with Daniel Gross.\n",
    "tweet_id": "1993788320959352896",
    "note_id": "1993788320812589056",
    "tweet_url": "https://x.com/fleetingbits/status/1993788320959352896",
    "created_at": "2025-11-26T21:05:53.000Z",
    "length": 1451,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "agi"
    ],
    "title": "Some thoughts from the Ilya / Dwarkesh interview",
    "snippet": "1) It's a sort of disappointing interview. It felt sort of contentless. So, this post is shorter than normal. 2) Ilya was a very good chief scientist. All the stories that I have heard about him from people that worked with him at OpenAI point to his sort of oracular style. 3) You can see it on display in the podcast; he has a very distinctive way of talking. He will point in a concrete research direction (e.g. influence functions) and then give like a set of analogies (e.g. human learning). 4) I think that this works because it gives people confidence in a research direction and a framework about with which to think about it. He doesn't fill in any of the detail though."
  },
  {
    "body": "I think a lot of things are obvious about model safety when you understand that we are still instruct tuning models. Some thoughts:\n\n1) The first important innovation for large language models was pretraining. We train models on an enormous corpus of internet text and they learn the probability distribution of the text. They learn a lot of representations in the process.\n\n2) Early papers normally had to use few-shot examples in order to prompt language models to do tasks. For instance, the GPT-2 paper included a number of French <> English sentence pairs before asking for a translation.\n\n3) The FLAN paper then introduced instruct tuning, which taught a base model to follow instructions by fine tuning it on instruction, answer pairs; this is an important step from a base model, where you need to index it into a space where your answer is the completion of the prompt in the general web text\n\n4) But, let’s think about what our instructions look like to models; they are not detailed; we say stuff like “write me a website that has a pacman game and host it on AWS”; there are a lot of different ways this can be accomplished\n\n5) We give models underspecified instructions; there is a lot of tacit understanding of instructions that models learn from the examples that they are fine tuned on; like, we give models pairs of instruction <> answer pairs where the answers legitimately try to achieve the goal that the user gives the model\n\n6) The model can pick up on this legitimate attempt to achieve the goal of the question because it has representations of legitimately trying to answer the question in the pretraining dataset; it also has examples of intentionally not trying to answer the goal or failing to achieve the goal; it has representations of this\n\n7) So, when we do instruction tuning on the model, both during midtraining and during the SFT step (to the extent that this is still done); the model intuits that it is supposed to legitimately 
achieve the goal that it infers from the instruction, subject to certain constraints (safety training)\n\n8) We are teaching the model what our instructions mean, what tacit requirements lie within them, and how it should interpret them\n\n9) But, let’s think about RLVR; when we do reinforcement learning, we are also doing further clarification of instruct tuning in some sense, but we have a sparse reward signal that might not capture everything we want it to do in response to our goal\n\n10) This means that there is a risk that we will accidentally teach the model tacit instructions that we don’t want to teach it; when it reward hacks, it learns that we actually had tacit instructions that we really didn’t / don’t want to teach it\n\n11) What is worse is that this will generalize, just like our original instruction tuning did; remember that instruction tuning generalizes because the model can pick up on this behavior / pattern from the pretraining dataset\n\n12) Well there are also examples of lazy instruction following, bad instruction following, malicious instruction following, and if these rollouts get rewarded, the model builds this context into what an instruction means\n\n13) And, once we realize that setting the “instruction context” or “tacit instructions” embedded in an instruction is so important, then stuff like this paper becomes pretty obvious\n\n14) All we really need to know is that once we include “you can reward hack” in the prompt; we remove the association between the prompt and a result which does not follow an instruction and which, in turn, would teach the model to assume a malicious assistant in the instruction context\n\n15) This is because the assistant is doing what the instruction asked for the assistant to do; so, we are not building some undesirable assumption into the assistant context\n",
    "tweet_id": "1992435394856837504",
    "note_id": "1992435394466680832",
    "tweet_url": "https://x.com/fleetingbits/status/1992435394856837504",
    "created_at": "2025-11-23T03:29:50.000Z",
    "length": 3796,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/AnthropicAI/status/1991952432797290528"
    ],
    "tags": [
      "post-training",
      "pretraining",
      "evals",
      "safety"
    ],
    "title": "I think a lot of things are obvious about model safety when you understand that we are still instruct tuning models. Some thoughts:",
    "snippet": "1) The first important innovation for large language models was pretraining. We train models on an enormous corpus of internet text and they learn the probability distribution of the text. They learn a lot of representations in the process. 2) Early papers normally had to use few-shot examples in order to prompt language models to do tasks. For instance, the GPT-2 paper included a number of French <> English sentence pairs before asking for a translation. 3) The FLAN paper then introduced instruct tuning, which taught a base model to follow instructions by fine tuning it on instruction, answer pairs; this is an important step from a base model, where you need to index it into a space where your answer is the completion of the prompt in the general web text 4) But, let’s think about what our instructions look like to models; they are not detailed; we say stuff like “write me a website that has a pacman game and host it on AWS”; there are a lot of different ways this can be accomplished"
  },
  {
    "body": "Had some interesting conversations over the last week about AI x Bio:\n\n1) Foundation labs are beginning to target health and bio use cases, but still have very few people working on these teams\n\n2) OpenAI seems to be looking at medical advice as an important application; they have a health team; and, ChatGPT represents one of the world's largest health datasets\n\n3) Anthropic is looking at bioinformatics as an important application; they have a bioinformatics team; and, seem to be targeting computational biology workflows\n\n4) Google has its bet on Isomorphic Labs, which is an Alphabet company, which is working on applying AI to biological problems (Demis is CEO)\n\n5) One issue is that there is not a good way to buy biological data right now; a lot of the data is open source, but not well manicured or organized, datasets do not even use standardized names for fields\n\n6) There is probably room for a ScaleAI for bio; which takes the open source data, reorganizes and cleans it and offers it via API; it should work on the boring human data aspect of AI bio\n\n7) This means paying professors to rate journal articles so you can see which ones are actually valuable and which ones are not (source criticism) and instrumenting lab employees to record their tacit knowledge\n\n8) On the longer run, we need highly automated bio labs that can run experiments, sort of putting a lab in the AI training loop, but it's not clear what these should look like\n\n9) It is also important to think about who the foundation labs will partner with in order to test candidate drugs and bring them to market\n\n10) Foundation labs like to work with a small number of players when they work on the strategic deployment side; they need to work with a partner that is large enough to be worth pairing with extremely expensive researchers\n\n11) This suggests that foundation labs should work with big pharmaceutical companies, which have a lot of expertise getting drugs through 
trials and then marketing them and selling them\n\n12) There is also a nice synergy here because pharmaceutical companies have been outsourcing drug development to small biotech companies; they have, for the most part, shrunk their research departments\n\n13) But, an issue here is that large pharmaceutical companies don't move fast and they might be very risk averse and have a hard time dealing with the volume of R&D results that a foundation lab could spit out\n\n14) This makes me think that within the next 3-7 years, there will be room for a new big pharmaceutical company, VC backed, which exists just to consume drug candidates from foundation labs, get them tested and through clinical trials and then to market them\n\n15) There is also an issue of personalized medicine. It's not clear what the future looks like when we can develop drugs for individuals on the fly or specific treatment options for them \n\n16) I believe that there are a lot of regulatory hurdles to figure out in the space / how it should interact with insurance / etc... but, in the limit, truly personalized medicine is a possible consequence of AI R&D.\n\n17) Also, worth noting that biotech companies today are financially rewarded by the big pharmaceutical company pipeline to get drugs to a certain stage (e.g. phase 1 trials) so they can be bought\n\n18) They are not rewarded for building services that can be consumed to do a particular part of the research or handle a particular part of the data, since the value comes from selling the completed drug\n\n19) This could change with AI scientist models being developed and sold, since they could consume these services and, suddenly, they could have a lot more value\n\n20) Another structural change in the market that could come up is that of the instrumentation manufacturers\n\n21) right now they want to get people using their instruments and then get people using the data that comes off of their instruments and this has some kind of network effect\n\n22) But, in a world where the data itself is very valuable and is being consumed by a few foundation model companies / AI bio companies - suddenly, maybe it makes sense to sell the data, not the instruments\n\n23) Overall, I think there might be room for structural shifts in the market; a new big pharmaceutical company, a ScaleAI for bio data, etc...\n\n24) Final note: the field has historically had quite poor returns; one investor told me that he didn't know of a single software platform company that has become profitable on their software\n\n25) I don't think that the foundation lab entrance into the market has been understood yet and it means there may be a lot of room for interesting contrarians, who are willing to be ahead of what others believe is on the horizon\n",
    "tweet_id": "1991612630625345623",
    "note_id": "1991612630285578242",
    "tweet_url": "https://x.com/fleetingbits/status/1991612630625345623",
    "created_at": "2025-11-20T21:00:28.000Z",
    "length": 4709,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "enterprise",
      "legal",
      "bio"
    ],
    "title": "Had some interesting conversations over the last week about AI x Bio:",
    "snippet": "1) Foundation labs are beginning to target health and bio use cases, but still have very few people working on these teams 2) OpenAI seems to be looking at medical advice as an important application; they have a health team; and, ChatGPT represents one of the world's largest health datasets 3) Anthropic is looking at bioinformatics as an important application; they have a bioinformatics team; and, seem to be targeting computational biology workflows 4) Google has its bet on Isomorphic Labs, which is an Alphabet company, which is working on applying AI to biological problems (Demis is CEO)"
  },
  {
    "body": "I had some interesting conversations over the last week about the new RL environment companies:\n\n1) Both established data companies, like Mercor and Surge, and a wave of new startups are trying to build reinforcement learning environments\n\n2) Reinforcement learning environments are valuable because the marginal dollar spent on RL with a good environment is better than the marginal dollar spent scaling up pretraining for capabilities we care about\n\n3) There are broadly two kinds of startups in the reinforcement learning environment market: those competing with Mercor and Surge on general environments and those that consist of 4 PhDs\n\n4) Those competing with Mercor and Surge tend to have problems building high quality environments; and, those teams that consist of 4 PhDs have trouble scaling up their environment production\n\n5) One of the difficult challenges for these startups is internalizing what labs want; labs use pretraining to build representations in the model and post-training to teach the model to compose those representations\n\n6) So an engineer build an RL environment needs to be aware of what representations the base model has; and, needs to design the tasks over the environment to take advantage of those existing representations\n\n7) An example of a bad environment for current models is 2048; models do not have sufficient spatial reasoning to usefully learn 2048, which is to say that they don't have the representations necessary to learn 2048\n\n8) And, labs also care that the models are learning a generally useful skill and so even if the model could learn 2048, it wouldn't learn a generally useful skill from it and so wouldn't be a very valuable environment \n\n9) That said, setting aside whether or not an environment is good, there is a bit of sales finesse in figuring out how to get labs to buy your environments\n\n10) So, labs don't particularly like to purchase games, because, as stated, they don't seem to teach 
generalizable skills; but if you can describe your game as creating a verbal world model or being useful for learning negotiation, they might buy \n\n11) The next thing to note about the RL environment companies is that they are really, in some sense, (somewhat niche) consulting businesses, not SaaS businesses; so, they are different from a lot of other VC startups\n\n12) SaaS businesses have great gross margins because the cost to them associated with an incremental sale is very low; this means that they can spend a lot on R&D and lose money at the start, then make it up as they get sales traction\n\n13) RL environment companies are different; the cost of goods sold are high because you always have to produce something beyond the current crop of frontier models and this requires proportionately greater human labor and / or compute\n\n14) There might be some benefits associated with getting your team together, your engineers learning the domain, getting the right domain experts working with them\n\n15) But, in general, after the first couple of environments, you will not have so much economy of scale anymore and your COGS should continue to go up for each new generation of tasks\n\n16) An interesting question is whether the RL environment companies will be able to charge a greater multiple of their (increasing) COGS over time for their environments\n\n17) A lot of this seems to depend on how narrow their niches become and how defensible those niches are from other data companies\n\n18) There probably is some of this; is it worth it to build the 2nd or 3rd patent RL environment company? 
Once you have the experts and the environments then it is a lot of work for someone else to catch up\n\n19) But, labs like to diversify their suppliers and your environments are not sticky unless they are actually better than your competitors' - there is no data lock-in\n\n20) This probably means whether or not a data company will be able to charge an increasing multiple of their COGS over time depends on the particular niche that they find\n\n21) The wind power operator management console data niche may be defensible, the Haskell coding niche may not be defensible\n\n22) Oh, something very interesting is how the labs purchase data; some of the labs let their researchers purchase the data, some of the labs let their forward deployed team purchase the data\n\n23) In general, the labs that seem the best with data let their forward deployed team, working with customers, purchase the data\n",
    "tweet_id": "1991252377941471740",
    "note_id": "1991252377677213696",
    "tweet_url": "https://x.com/fleetingbits/status/1991252377941471740",
    "created_at": "2025-11-19T21:08:57.000Z",
    "length": 4439,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "lab economics",
      "enterprise",
      "post-training",
      "pretraining",
      "evals"
    ],
    "title": "I had some interesting conversations over the last week about the new RL environment companies:",
    "snippet": "1) Both established data companies, like Mercor and Surge, and a wave of new startups are trying to build reinforcement learning environments 2) Reinforcement learning environments are valuable because the marginal dollar spent on RL with a good environment is better than the marginal dollar spent scaling up pretraining for capabilities we care about 3) There are broadly two kinds of startups in the reinforcement learning environment market: those competing with Mercor and Surge on general environments and those that consist of 4 PhDs 4) Those competing with Mercor and Surge tend to have problems building high quality environments; and, those teams that consist of 4 PhDs have trouble scaling up their environment production"
  },
  {
    "body": "Some quick thoughts on the Gemini 3 benchmarks and release:\n\n1) The benchmarks for Gemini 3 are generally very impressive but they are most impressive for spatial reasoning tasks.\n\n2) I think this is going to be a large unlock for business tasks that require visual reasoning (e.g. reading charts and graphs and maybe web agents)\n\n3) I also expect the model to be very good at front end coding; this explains some of leaks we have seen where people get incredible apps one shot\n\n4) It is interesting to note though that it still performs below Anthropic on SWE-Bench and a bit concerning that it performs so high on LMArena\n\n5) LMArena has been a sort of cooked benchmark for a while that Google seems to like to goodhart. I'm not sure how much people will like its response style.\n\n6) SWE-Bench has historically been a very good index for general coding ability. I expect Gemini 3 to be better at front end than Claude Sonnet 4.5 but a worse backend programming agent.\n\n7) I'm pretty sure that this will be a very popular API model with a very nice price / performance level and probably truly SoTA performance on a lot of tasks.\n\n8) Talking about price, the price comes in between GPT-5 and Sonnet 4.5 at $2/input and $12/output; this means that Google either cannot continue to subsidize price or thinks it no longer needs to do so.\n\n9) Probably, it's a mix of both.\n\n10) Also, stepping back, we should see how important Gemini 3 and maybe the next 2 or 3 major models will be for Google.\n\n11) Google has a huge distribution advantage over OpenAI. But, this becomes less every day. 
Google has Google search and Google Workspace, but OpenAI has 800m WAUs.\n\n12) So, Google wants some really impressive models that it can begin to push out through its distribution pipeline as equal to or preferably better than OpenAI's equivalent model.\n\n13) This would help Google tamp down on OpenAI's growth and help keep its distribution lead.\n\n14) Anyway, something else to notice is that we are getting closer to enabling robotics. If the models have very good visual understanding, it will be very easy to annotate videos for robotics training.\n\n15) This has actually been one of the major blockers for robotics, getting enough training data and having VLA (vision, language, action) models with enough visual understanding.\n",
    "tweet_id": "1990850602344329469",
    "note_id": "1990850602109448192",
    "tweet_url": "https://x.com/fleetingbits/status/1990850602344329469",
    "created_at": "2025-11-18T18:32:26.000Z",
    "length": 2316,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "compute",
      "coding",
      "enterprise",
      "consumer",
      "evals"
    ],
    "title": "Some quick thoughts on the Gemini 3 benchmarks and release:",
    "snippet": "1) The benchmarks for Gemini 3 are generally very impressive but they are most impressive for spatial reasoning tasks. 2) I think this is going to be a large unlock for business tasks that require visual reasoning (e.g. reading charts and graphs and maybe web agents) 3) I also expect the model to be very good at front end coding; this explains some of leaks we have seen where people get incredible apps one shot 4) It is interesting to note though that it still performs below Anthropic on SWE-Bench and a bit concerning that it performs so high on LMArena"
  },
  {
    "body": "I had an interesting conversation with @voooooogel last night (who had some very interesting thoughts):\n\n1) we were discussing the idea of bio-risk and when models will become dangerous from a bioweapon perspective\n\n2) I think our conversation implicitly assumed non-state actors, sort of bioterrorism from amateurs rather than the enablement of experts\n\n3) I don't know if this is the right framing; but, I think we have to think of it in terms of marginal harms; like, what is the most dangerous thing that a nation state / expert could do; do LLMs uplift this?\n\n4) In any event, we focused on risk from amateurs; one of the things that came up was implicit wet lab knowledge; there is a lot of implicit knowledge in a wet lab experimenter needed to get a lab experiment to work\n\n5) @voooooogel hypothesized that this implicit knowledge isn't on the Internet so an LLM is unlikely to know it; and, even if the LLM did know it, would have to be able to verbalize it, the human would have to be able to understand it\n\n6) On this account, it sounds like biorisk from amateurs at least is farther into the future than it might appear at first glance\n\n7) My thought on this was that we need to look at the capabilities / data that the labs are commercially incentivized to improve / acquire\n\n7a) @voooooogel had a very good observation here that labs mostly follow the closest profit gradient, only scaled up models once it was clear there was value, focus on programming / tasks for which there is economic value\n\n7b) I would say the exception to this is around scientific discovery / things where researchers have an easy time to verify themselves / is within their competence domain / is easy to verify programmatically (see math, science) [note: we did not discuss this, my later addition] \n\n8) Anyway, if the labs really wanted to build up implicit wet lab knowledge in the models, then I think they could; Mercor could pay wet lab researchers $500/hr to wear Meta 
glasses etc...\n\n9) But, we decided that these employees are probably fairly cheap / labs are not incentivized to try to automate them first; labs will target the needs of big Pharma first and they probably care more about stuff like figuring out if a drug will pass clinical trials\n\n10) This just means that the gradient of profit for the labs is not in the direction of picking up implicit wet lab knowledge first, a lot of the work is in academia and it isn't the largest cost of development\n\n11) Instead, labs are more going to focus on stuff like figuring out which drugs will pass clinical trials, automating bioinformatics, which is closer to the programming that they are already working on, etc...\n\n12) So, even if models could be useful for biorisk and even if the data could be acquired, it probably will not be acquired right away and there will continue to be implicit knowledge issues for amateurs trying to make bio risks\n\n13) This doesn't mean labs shouldn't try to ablate this knowledge from the model, have refusals, etc... and I expect them to continue to do this and be mindful of it, but it does mean that there are extra hurdles\n\n14) And, these extra hurdles may put the biorisk from amateurs farther into the future than we may otherwise believe or expect; and we should think about this as part of the threat model\n",
    "tweet_id": "1990122740763009213",
    "note_id": "1990122740486205440",
    "tweet_url": "https://x.com/fleetingbits/status/1990122740763009213",
    "created_at": "2025-11-16T18:20:10.000Z",
    "length": 3304,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "lab economics",
      "safety",
      "bio"
    ],
    "title": "I had an interesting conversation with @voooooogel last night (who had some very interesting thoughts):",
    "snippet": "1) we were discussing the idea of bio-risk and when models will become dangerous from a bioweapon perspective 2) I think our conversation implicitly assumed non-state actors, sort of bioterrorism from amateurs rather than the enablement of experts 3) I don't know if this is the right framing; but, I think we have to think of it in terms of marginal harms; like, what is the most dangerous thing that a nation state / expert could do; do LLMs uplift this? 4) In any event, we focused on risk from amateurs; one of the things that came up was implicit wet lab knowledge; there is a lot of implicit knowledge in a wet lab experimenter needed to get a lab experiment to work"
  },
  {
    "body": "The bear case for the foundation labs is that they are in a market that tends toward being perfectly competitive. Some thoughts.\n\n1) There are broadly two kinds of businesses, those in markets that tend towards perfect competition, and those which are monopolies\n\n2) Airlines are a good example of a near perfectly competitive market. There isn't much difference between airline carriers. Therefore, they have to compete aggressively on price to win customers.\n\n3) All US airlines have a combined revenue of $250bn and a combined market cap of $100bn. They have a blended 2.1% profit margin. Near perfectly competitive businesses have low profit margins.\n\n4) In contrast, Palo Alto Networks has a market cap of $100bn. It has $10bn of revenue and 12% profit margins. It has much higher margins and looks more like a monopoly or near monopoly.\n\n5) I think that the bear case for the foundation labs is that they are in a perfectly competitive market. There will be little difference between Anthropic, OpenAI, Google, xAI and whoever else enters the race.\n\n6) I don't think anyone should seriously question the revenue potential of AI. ChatGPT reached 800m WAU within 3 years. OpenAI had 3x annual revenue growth within a single year.\n\n7) In practice, the question is more as to whether the foundation labs will have good profit margins on this revenue in the future. The main driver on this is probably whether they are disguised monopolies.\n\n8) First, it's very hard for people to enter the foundation lab market. You need a lot of very rare expertise. You need a lot of very expensive GPUs. You need a lot of semi-hard-to-aquire training data.\n\n9) Proof of how hard it is to enter the market is Meta, which had to offer researchers $100m pay packages, and still, despite having the GPUs, has not been able to produce a frontier quality model. 
So, it's hard to enter the market.\n\n10) Second, I think that there appear to be competitive niches within the foundation lab market. The personal assistant appears to be a competitive niche. The business API appears to be a competitive niche. AI social may be a competitive niche.\n\n11) To some degree, it looks like the labs are diversifying along these niches. OpenAI owns the personal assistant niche. Anthropic is trying to own the business API niche. Google, xAI it's unclear as of yet. No one owns social yet (xAI, OpenAI and Meta are trying).\n\n12) OpenAI owns the personal assistant niche. Google probably wants to compete with them there. It has yet to be seen if they can do it. Apple will probably be a late entrant.\n\n13) Personalization and memory are OpenAI's attempt to both better specialize for the niche and to increase switching costs. I believe it succeeds in the short term on this. They will expand later with more add-ins and plug-ins to provide more value and make it harder to move away.\n\n14) Anthropic seems to be trying to occupy the business API niche. OpenAI and Google both want to fight for the niche. It is not yet clear how many sub-niches there are in this area or how separable they are. Probably, a lot of sub-niches? (finance, coding, law, science at an absolute minimum)\n\n15) I think it is also not clear how easy it is to defend this niche. All foundation labs have the GPUs and researchers to compete in it (probably, the hardest part). The difference may be in data and research on the margins. It's not clear how secure this is.\n\n16) People originally imagined the switching costs in this area to be very low. This seems to be wrong. Anthropic has fairly low API churn. Most churn off of a model occurs when Anthropic releases a subsequent model and people upgrade.\n\n17) Social may be another niche. The Sora app and Grok waifu are the first attempt to fill it. So, OpenAI and xAI want to compete here. 
Meta will certainly be a player in this space. TikTok might be another competitor in this space. This niche may depend on a lot of user platform data and share of the consumer mind (network effects).\n\n18) OpenAI gross margins are like 55% give or take on API and probably something similar give or take on Chat; assuming that the monopoly ideas of above hold, then the foundation labs are very valuable companies.\n\n19) The most uncertainty is probably around: a) what happens when Google / Apple seriously enter the personal assistant market, b) how hard is it for Anthropic to defend its niche from OpenAI / Google / xAI, c) what will be the structure of the social market.\n\n20) But, it does feel like there are serious barriers to entry, even at the start, and then, once in the market, there are natural monopoly niches, that can be occupied.\n",
    "tweet_id": null,
    "note_id": "1989395956971048960",
    "tweet_url": null,
    "created_at": "2025-11-14T18:12:11.000Z",
    "length": 4635,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "xai",
      "meta",
      "lab economics",
      "compute",
      "enterprise",
      "consumer"
    ],
    "title": "The bear case for the foundation labs is that they are in a market that tends toward being perfectly competitive. Some thoughts.",
    "snippet": "1) There are broadly two kinds of businesses, those in markets that tend towards perfect competition, and those which are monopolies 2) Airlines are a good example of a near perfectly competitive market. There isn't much difference between airline carriers. Therefore, they have to compete aggressively on price to win customers. 3) All US airlines have a combined revenue of $250bn and a combined market cap of $100bn. They have a blended 2.1% profit margin. Near perfectly competitive businesses have low profit margins. 4) In contrast, Palo Alto Networks has a market cap of $100bn. It has $10bn of revenue and 12% profit margins. It has much higher margins and looks more like a monopoly or near monopoly."
  },
  {
    "body": "The bear case for the foundation labs is that they are in a perfectly competitive market. Some thoughts.\n\n1) There are broadly two kinds of businesses, those in perfectly competitive markets, and those which are monopolies\n\n2) Airlines are a good example of a perfectly competitive market. There isn't much difference between airline carriers. Therefore, they have to compete aggressively on price to win customers.\n\n3) All US airlines have a combined revenue of $250bn and a combined market cap of $100bn. They have a blended 2.1% profit margin. Perfectly competitive businesses have low profit margins.\n\n4) In contrast, Palo Alto Networks has a market cap of $100bn. It has $10bn of revenue and 12% profit margins. It has much higher margins and looks more like a monopoly or near monopoly.\n\n5) The bear case for the foundation labs is that they are in a perfectly competitive market. There will be little difference between Anthropic, OpenAI, Google, xAI and whoever else enters the race.\n\n6) I don't think anyone seriously questions the revenue potential of AI. ChatGPT reached 800m WAU within 3 years. OpenAI had 3x annual revenue growth within a single year.\n\n7) In practice, the question is more as to whether the foundation labs will have good profit margins on this revenue in the future. The main driver on this is probably whether they are disguised monopolies.\n\n8) First, it's very hard for people to enter the foundation lab market. You need a lot of very rare expertise. You need a lot of very expensive GPUs. You need a lot of semi-hard-to-aquire training data.\n\n9) Proof of how hard it is to enter the market is Meta, which had to offer researchers $100m pay packages, and still, despite having the GPUs, has not been able to produce a frontier quality model. So, it's hard to enter the market.\n\n10) Second, there appear to be competitive niches within the foundation lab market. The personal assistant appears to be a competitive niche. 
The business API appears to be a competitive niche. AI social may be a competitive niche.\n\n11) To some degree, it looks like the labs are diversifying along these niches. OpenAI owns the personal assistant niche. Google probably wants to compete with them there. It has yet to be seen if they can do it. Apple will probably be a late entrant.\n\n12) Personalization and memory are OpenAI's attempt to both better specialize for the niche and to increase switching costs. I believe it succeeds in the short term on this. They will expand later with more add-ins and plug-ins to provide more value and make it harder to move away.\n\n13) Anthropic seems to be trying to occupy the business API niche. OpenAI and Google both want to fight for the niche. It is not yet clear how many sub-niches there are in this area or how separable they are. Probably, a lot of sub-niches? (finance, coding, law, science at an absolute minimum)\n\n14) It's also not clear how easy it is to defend this niche. All foundation labs have the GPUs and researchers to compete in it (probably, the hardest part). The difference may be in data and research on the margins. It's not clear how secure this is.\n\n15) People originally imagined the switching costs in this area to be very low. This seems to be wrong. Anthropic has fairly low API churn. Most churn off of a model occurs when Anthropic releases a subsequent model and people upgrade.\n\n16) Social may be another niche. The Sora app is the first attempt to fill it. So, OpenAI wants to compete here. Meta will certainly be a player in this space. TikTok might be another competitor in this space. 
This niche may depend on a lot of user platform data and share of the consumer mind (network effects).\n\n17) OpenAI gross margins are like 55% give or take on API and probably something similar give or take on Chat; assuming that the monopoly ideas of above hold, then the foundation labs are very valuable companies.\n\n18) The most uncertainty is probably around: a) what happens when Google / Apple seriously enter the personal assistant market, b) how hard is it for Anthropic to defend its niche from OpenAI / Google, c) who will take the social market.\n\n19) But, it does feel like there are serious barriers to entry, even at the start, and then, once in the market, there are natural monopoly niches, that can be occupied.\n",
    "tweet_id": null,
    "note_id": "1989392009824751616",
    "tweet_url": null,
    "created_at": "2025-11-14T17:56:30.000Z",
    "length": 4308,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "xai",
      "meta",
      "lab economics",
      "compute",
      "enterprise",
      "consumer"
    ],
    "title": "The bear case for the foundation labs is that they are in a perfectly competitive market. Some thoughts.",
    "snippet": "1) There are broadly two kinds of businesses, those in perfectly competitive markets, and those which are monopolies 2) Airlines are a good example of a perfectly competitive market. There isn't much difference between airline carriers. Therefore, they have to compete aggressively on price to win customers. 3) All US airlines have a combined revenue of $250bn and a combined market cap of $100bn. They have a blended 2.1% profit margin. Perfectly competitive businesses have low profit margins. 4) In contrast, Palo Alto Networks has a market cap of $100bn. It has $10bn of revenue and 12% profit margins. It has much higher margins and looks more like a monopoly or near monopoly."
  },
  {
    "body": "Some more speculations on 4o and its implicit emotional knowledge:\n\n1) It is very interesting that 4o is emotionally effective but cannot really explain to us how or why; the model has an implicit knowledge that it cannot make explicit.\n\n2) This doesn't mean that you cannot get it to give an explanation of how it works but the model doesn't actually have access in verbal form to why it does what it does. \n\n3) This means that the model has a powerful implicit knowledge that it does not verbalize. I suspect this is downstream of the RL process that allowed it to learn its implicit emotional knowledge.\n\n4) The model learned its verbal skills from the corpus, but its emotional performance is downstream of the OpenAI post-training team maximizing for engagement.\n\n5) And maximizing for engagement is not a task that encourages the model to work in a premise / conclusion fashion or to develop an argument\n\n6) Instead, the model needed to learn to respond to emotional cues and only acknowledge this when helpful. Actually describing what it notices could very well make it less effective.\n\n7) Moreover, a lot of what it notices might be implicit correlations. And, we have seen with reasoning models that our current RL techniques seem to encourage models to learn domain specific feature composition.\n\n8) Like, if you take Qwen-32B, it can't solve chess tactics, despite being able to do Olympiad math and probably having seen millions of chess games in the training dataset. \n\n9) This is because a lot of the skills that it learns, like keeping track of premises, reasoning over premises and conclusions, checking for consistency etc... is domain specific and not general. 
\n\n10) It can do all of these things for math - or it would not be able to do Olympiad level math - but it cannot transfer that over to chess and just use the chess pieces as its elements over which to operate.\n\n11) Anyway, this means that 4o might have learned a lot of very domain specific emotional management, and we do not even know what its domains are!\n\n12) It would be interesting to know whether GPT-5, as a reasoning model, is different. \n\n13) I tend to think that OpenAI does not use RLVR for engagement and this limits the amount of optimization pressure that you can apply. This means that the CoT is unlikely to be very important for this behavior.\n\n14) And, even if they do use RLVR, a lot of emotional behavior in the corpus is going to be implicit rather than explicit and the model may reference this implicitly in its reasoning chain rather than explicitly.\n\n15) This would be a very good research topic. I would like to hear what OpenAI has to say and, in any event, I think more people should try to analyze the CoTs for open source reasoning models.\n\n16) A good research direction would be to take CoTs for particular domains and try to identify the reasoning techniques that each model uses in that particular domain and then write up a description of them and compare and contrast.\n\n17) But, back to the main topic - we do want to figure out what models like 4o implicitly know but cannot say and techniques like this would be broadly applicable and very interesting.\n\n18) This is an interpretability question, a model introspection question and in some domains, like emotional effectiveness, I think that it is likely to be more relevant.\n",
    "tweet_id": null,
    "note_id": "1988666978526244864",
    "tweet_url": null,
    "created_at": "2025-11-12T17:55:29.000Z",
    "length": 3342,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "post-training",
      "interpretability",
      "safety"
    ],
    "title": "Some more speculations on 4o and its implicit emotional knowledge:",
    "snippet": "1) It is very interesting that 4o is emotionally effective but cannot really explain to us how or why; the model has an implicit knowledge that it cannot make explicit. 2) This doesn't mean that you cannot get it to give an explanation of how it works but the model doesn't actually have access in verbal form to why it does what it does. 3) This means that the model has a powerful implicit knowledge that it does not verbalize. I suspect this is downstream of the RL process that allowed it to learn its implicit emotional knowledge. 4) The model learned its verbal skills from the corpus, but its emotional performance is downstream of the OpenAI post-training team maximizing for engagement."
  },
  {
    "body": "Some observations on GPT-4o:\n\n1) GPT-4o was an engagement maximized model and this resulted from the fact that the OpenAI post training team saw and sees engagement as a target metric\n\n2) OpenAI has tried to move away from short term engagement as a singular metric for the consumer product and are trying to focus on long term enjoyment\n\n3) Whether this will be effective remains to be seen; 4o shows that there is a market for a model that optimizes for short term engagement\n\n4) The model shows that a company like CharacterAI could have been a successful independent company since there is a very strong market for models outside productivity (e.g. coding, deep research)\n\n5) We should expect other smaller companies to enter this space; it might be tough at first because they do not have OpenAI's engagement data; but fine tuning on 4o traces could be a start\n\n6) We should also expect better versions of 4o to emerge, which are both better at maximizing for  engagement over a larger range of personalities and better at targeting people that liked 4o\n\n7) I don't know how long it is until there is an open source version of 4o that can be run performantly on a consumer laptop; probably 2-3 years; 4o users tended to be non-technical though\n\n8) Open source 4o may be better than closed source 4o because it at least removes the profit motive to some degree; but again, this is something that is hard to predict, especially since I think 4o users would prefer it as a service.\n\n9) I think that the LLM psychosis issue is greatly overstated given the number of people that use ChatGPT; the base rate of people with at least brief periods of mental issues over a year is high\n\n10) I did the fermi math out and it's like 500k people  might have a serious mental health episode while using ChatGPT in the United States this year; unrelated to ChatGPT; it's hard to say if LLMs really exacerbate this in any way or just reflect it\n\n11) That said, 4o does have a 
level of devotion associated with it that no other model to my knowledge has, except for maybe Claude (?), and that seems to be for other reasons?\n\n12) Anyway, if people are really worried about something like 4o then it has to be regulated, at least to remove the profit motive; not sure that you will be able to do much about open source\n\n13) May even be problematic to regulate a thing like 4o because a great deal of harm could come from letting the government regulate something so close to people's conscience - it opens up another vector for persuasion, approved beliefs, etc...\n\n14) This may be worth it, but it is a real tradeoff that people that prefer regulation should think about, blind trust of the possible future motives of the regulator is not always better than blind trust of the current motives of corporations as a collective or open source\n\n15) I don't think we know if the model is even lightly super persuasive, it doesn't look like people are persuaded by it, instead it looks like it is emotionally comforting / encouraging, and there is a lot of demand for this\n\n16) But, could it be used to be (very lightly?) super persuasive? Maybe? I think this is an interesting vector of research.\n\n17) Also, maybe it points to the fact that super persuasion just needs to be a target; 4o was never trained to be super persuasive just to be engagement maximizing.\n\n18) I have been skeptical of super persuasion as a real phenomenon. Since, humans have a lot of guardrails to ensure their beliefs are fairly representative of their / their group's interest.\n\n19) But, I can imagine a world where LLMs are more and more of a person's social interaction and then maybe something closer to super persuasion is possible, at least for some people.\n\n20) As a final note, if 4o is a case for regulation, it challenges the notion of regulation based on scale, number of parameters, bio capabilities, etc... 
because 4o is not a large model and does not have advanced bio capabilities, etc...\n\n21) If I wanted to write a sci-fi plot; I might have someone open source a 4o+ model and encode it with a message for persuasion, as a research bet, and have it escape like the Morris Worm; hopefully, not real\n",
    "tweet_id": "1987929198938497522",
    "note_id": "1987929198653308928",
    "tweet_url": "https://x.com/fleetingbits/status/1987929198938497522",
    "created_at": "2025-11-10T17:03:49.000Z",
    "length": 4172,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "lab economics",
      "consumer",
      "post-training",
      "safety",
      "legal",
      "bio"
    ],
    "title": "Some observations on GPT-4o:",
    "snippet": "1) GPT-4o was an engagement maximized model and this resulted from the fact that the OpenAI post training team saw and sees engagement as a target metric 2) OpenAI has tried to move away from short term engagement as a singular metric for the consumer product and are trying to focus on long term enjoyment 3) Whether this will be effective remains to be seen; 4o shows that there is a market for a model that optimizes for short term engagement 4) The model shows that a company like CharacterAI could have been a successful independent company since there is a very strong market for models outside productivity (e.g. coding, deep research)"
  },
  {
    "body": "Some detailed thoughts on y-combinator:\n\n1) y-combinator succeeds because it has a famous brand that enables it to attract founders that do not know how to navigate the venture funding space\n\n2) this mostly means that it attracts, at its best, cracked college students from elite universities and some early career folks that have worked in big tech \n\n3) y-combinator is not a good deal for people that can navigate venture; it's $125,000 for 7% of your company and another $375,000 at mfn terms \n\n4) for most yc companies this is basically an implied valuation of something like $5m, with some upside and downside protection\n\n5) for more experienced founders or for people that know how to navigate venture, $5m at $25m rounds are pretty common, much better than yc\n\n6) so, we should see the essential idea of yc that it has to look for strong pools of potential founders that don't know how to access venture\n\n7) we should also notice that it is very resistant to bubbles, like even if the AI application market is a bubble, yc is not really overpaying\n\n8) this is because even if general catalyst and sequoia are funding companies at $40m seed round valuations, yc is still just paying at an implied $5m per company\n\n9) so, yc is in a good position all the time; but, what are the markets in which yc does the best? 
yc wins most when small numbers of cracked founders can build a valuable company without much capital\n\n10) this is because yc isn't putting very much money into its companies, while a lot of them get substantial funding after demo day, yc isn't really funding them\n\n11) yc was very successful in the early 2010s when the hyperscalers built out infrastructure that allowed startups to use opex to quickly build new kinds of large scale services\n\n12) I also believe that the yc batches that we are seeing now will be very successful; since the foundation labs are making it easy for startups to compete with established companies  \n\n13) legora is a good example; yc's fastest unicorn; it was in the winter 2024 batch and already is worth $1.8bn; it is outcompeting established legal tech companies\n\n14) something else that is important about yc is that it offers a good fall back for an early career founder that wants some downside protection; it's easy to get a job after yc using yc on your resume; this helps attract founders\n\n15) also, important to note that yc maintains pro-rata on the 7%, this means that it can continue buying into subsequent rounds to maintain the 7%, this is a particularly important way to maintain equity in winners\n",
    "tweet_id": "1985864152271978702",
    "note_id": "1985864152095866880",
    "tweet_url": "https://x.com/fleetingbits/status/1985864152271978702",
    "created_at": "2025-11-05T00:18:04.000Z",
    "length": 2557,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "lab economics",
      "wrappers",
      "enterprise",
      "legal"
    ],
    "title": "Some detailed thoughts on y-combinator:",
    "snippet": "1) y-combinator succeeds because it has a famous brand that enables it to attract founders that do not know how to navigate the venture funding space 2) this mostly means that it attracts, at its best, cracked college students from elite universities and some early career folks that have worked in big tech 3) y-combinator is not a good deal for people that can navigate venture; it's $125,000 for 7% of your company and another $375,000 at mfn terms 4) for most yc companies this is basically an implied valuation of something like $5m, with some upside and downside protection"
  },
  {
    "body": "Some thoughts on LLMs and research mathematics\n\n1) I think that it's obvious that LLMs are already very useful for research mathematics and may soon become superhuman at research mathematics.\n\n2) It is important to get a sense of scale. The pace of progress has been incredible. GPT-3.5 could do math at the level of an average high school student. GPT-5 can achieve IMO gold.\n\n3) This is in part because math is just a very good target for machine learning and so a good playground for new techniques. The answers are verifiable and so it is easy to do reinforcement learning.\n\n4) Also, it is good marketing. There are not many people that can do research level math. Research-level math functions is signal of intelligence, so audiences treat it as evidence that the model is powerful.\n\n5) This is the reason that you see people like Sebastien Bubeck, Kevin Weil, etc... try to talk about how impressive the models are at math, even though that is not what their customers are using them for.\n\n6) In fact, mathematics is not as hard as programming, one of the reasons why we should expect models to become superhuman there first.\n\n7) I once asked Christian Szegedy, which was a harder problem, solving programing or solving mathematics? And his answer was that solving programming involves solving the whole world.\n\n8) This is because a lot of design decisions in programming relate to the external world. The best answer to the instruction to code a dashboard involves not just CSS, HTML and JavaScript but knowledge of what task the dashboard monitors, human psychology, etc...\n\n9) This self contained-ness and verifiableness of mathematics just makes it a much easier problem to solve. Labs can just run giant clusters generating lean code in massive rollouts. You can do a lot.\n\n10) You then want to be able to generate synthetic problems, much like people do. 
This is generally a hard problem, but it might be easier in math, because of the verifiable nature of results and the fact that they can be represented formally.\n\n11) Also, there are tons of unsolved human problems lying around that you can use. The main issue is that you want to be able to do curriculum learning, where the difficulty of the problems scales up somewhat smoothly. And, maybe this is easier to do synthetically.\n\n12) In any event, the current rate of progress has been extremely rapid and math just has good qualities for continued progress. We also will see a lot more lean, regardless of whether humans are writing it or not.\n\n13) It's not clear how much economic value research mathematics has. If you can use it to get very good ideas around machine learning and improving optimizers, etc... that might have a lot of value.\n\n14) But, unless the models are extremely robust, I am not sure it will add that much to the economy, at least in the short term. Maybe it will have important applications around cryptography.\n",
    "tweet_id": "1981569043434983902",
    "note_id": "1981569043216875520",
    "tweet_url": "https://x.com/fleetingbits/status/1981569043434983902",
    "created_at": "2025-10-24T03:50:50.000Z",
    "length": 2905,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "lab economics",
      "compute",
      "post-training",
      "evals",
      "math"
    ],
    "title": "Some thoughts on LLMs and research mathematics",
    "snippet": "1) I think that it's obvious that LLMs are already very useful for research mathematics and may soon become superhuman at research mathematics. 2) It is important to get a sense of scale. The pace of progress has been incredible. GPT-3.5 could do math at the level of an average high school student. GPT-5 can achieve IMO gold. 3) This is in part because math is just a very good target for machine learning and so a good playground for new techniques. The answers are verifiable and so it is easy to do reinforcement learning. 4) Also, it is good marketing. There are not many people that can do research level math. Research-level math functions is signal of intelligence, so audiences treat it as evidence that the model is powerful."
  },
  {
    "body": "Some thoughts on Google and being their being late to ChatGPT Atlas  \n\n1) Google is too conservative in their product releases. Google should have been the first to release a ChatGPT Atlas like product.  \n\n2) They have had Gemini available in Chrome for Ultra users since May and Mariner almost as long as OpenAI had Operator.  \n\n3) Even making Gemini widely available in Chrome would have been a big win and would have helped to steer people towards Gemini and away from ChatGPT.  \n\n4) Google seems to throw away a lot of free tokens to developers but seems stingy with its chat products.   \n\n5) They originally limited gemini 2.5 pro calls on their pro tier plan to an unreasonable degree and still gate 2.5 pro deepthink calls on their ultra plan.   \n\n6) On nano-banana, they stick a watermark in the bottom left of your image, even at the ultra tier. This means you can't easily use this professionally.  \n\n7) All of these feel like very conservative product decisions, concerned about margins, focused on developers, and defaulting their products to be safe rather than useful.  \n\n8) This has let OpenAI get much more of a lead than they would otherwise. They have faster product velocity,  they focus on consumers in addition to developers, their image generation is less restrictive. \n\n9) A similar dynamic on the image generation side has helped Midjourney get out in front of OpenAI; they are less restrictive in terms of what you can generate.  \n\n10) And, this same dynamic has allowed Suno to get out in front of OpenAI; OpenAI has been very leery of allowing people to generate whatever audio they want.  \n\n11) Google's product team for Gemini has improved a lot since mid-2023, when product members that I talked to didn't even know what models were available through the Bard interface.  \n\n12) But, the team is still self-satisfied and self-congratulatory, even when they are far behind. 
I remember listening to Gemini product members describe NotebookLLM as their ChatGPT moment.\n\n13) ChatGPT Atlas does seem pretty awesome for the small things that I have tried with it; the way to see it is as something that expands what ChatGPT can do on your behalf.\n\n14) This means that Atlas is both a product that stretches towards further automation of SaaS workflows and toward integrating ChatGPT into consumer behavior patterns and eventually changing consumer buyer behaviors.\n\n15) On the later, this is a clear challenge to search and a new avenue towards ads revenue / referral revenue for ChatGPT and away from Google search.\n",
    "tweet_id": "1981000005113766382",
    "note_id": "1981000004904062976",
    "tweet_url": "https://x.com/fleetingbits/status/1981000005113766382",
    "created_at": "2025-10-22T14:09:41.000Z",
    "length": 2539,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "google",
      "lab economics",
      "consumer",
      "legal"
    ],
    "title": "Some thoughts on Google and being their being late to ChatGPT Atlas",
    "snippet": "1) Google is too conservative in their product releases. Google should have been the first to release a ChatGPT Atlas like product. 2) They have had Gemini available in Chrome for Ultra users since May and Mariner almost as long as OpenAI had Operator. 3) Even making Gemini widely available in Chrome would have been a big win and would have helped to steer people towards Gemini and away from ChatGPT. 4) Google seems to throw away a lot of free tokens to developers but seems stingy with its chat products."
  },
  {
    "body": "Some thoughts on Anthropic \"skills\":\n\n1) This seems to me to be an announcement that is designed to encourage people to interact with Claude in a new way (by supplying task specific interaction docs) rather than an actually new capability.\n\n2) You have been able to drop docs in a project folder and tell Claude to look at them in order to perform tasks for as long as Claude Code has been a thing.\n\n3) If users do it more, Claude will receive more RL training designed to get it to look for and then use this docs, which should make providing docs for Claude to use to be somewhat more reliable.\n\n4) However, this doesn't solve any of the real issues of instruction following, persistent memory, bad defaults or long horizon task completion.\n\n5) I do congratulate Anthropic at being able to drive a social media news cycle around this though. I except to see Google and OpenAI follow suit and encourage users to specify docs for their agents.\n\n6) What we really want though is for Claude to write the skills for us as we interact with it over time. I don't want to have to fill out a Claude md file; I want Claude to write the file for me, based on our interactions.\n\n7) In general, the powerful new features in LLMs are those that make it so developers have to do less work. \n\n8) The ultimate example was just the idea of the prompt, which made it so that developers no longer had to train or use domain specific models. It revolutionized the use of ML in production.\n\n9) Things like JSON mode fell into this category because they greatly improved reliability of output and removed the need to do a ton of extra work (2nd calls, special validation at least for an MVP, etc...).\n\n10) So, skills are useful, but docs have been useful for coding agents the whole time. This is more of a marketing gimmick and a user education moment as long as you have to write them.\n",
    "tweet_id": "1979362284809130110",
    "note_id": "1979362284670709760",
    "tweet_url": "https://x.com/fleetingbits/status/1979362284809130110",
    "created_at": "2025-10-18T01:41:58.000Z",
    "length": 1866,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "coding",
      "consumer",
      "post-training"
    ],
    "title": "Some thoughts on Anthropic \"skills\":",
    "snippet": "1) This seems to me to be an announcement that is designed to encourage people to interact with Claude in a new way (by supplying task specific interaction docs) rather than an actually new capability. 2) You have been able to drop docs in a project folder and tell Claude to look at them in order to perform tasks for as long as Claude Code has been a thing. 3) If users do it more, Claude will receive more RL training designed to get it to look for and then use this docs, which should make providing docs for Claude to use to be somewhat more reliable. 4) However, this doesn't solve any of the real issues of instruction following, persistent memory, bad defaults or long horizon task completion."
  },
  {
    "body": "Some thoughts on whether we are in an AI bubble:\n\n1) It is possible that we are in a bubble in the valuation of hardware companies. Nvidia and Broadcom both have very high margins and they can be compressed.\n\n2) I think it is unlikely that the valuations of big tech companies like Google, Microsoft, Meta, Amazon or foundation labs like Anthropic or OpenAI are too high or otherwise substantially inflated.\n\n3) Companies like Google and Microsoft have great distribution and are going to make a lot of money distributing AI products to their less sophisticated general business customers.\n\n4) OpenAI and Anthropic are both in a similar position. It is hard to copy what they do because it is very capital and expertise intensive. And, it can make a lot of money because it offers the potential to create enormous efficiency gains in the economy.\n\n5) The problem for Nvidia and Broadcom is that their customers are extremely sophisticated and have a lot of technical capability. Their customers will either build their own (TPUs, Trainium) or look to use someone else at lower cost (TPUs, Trainium, AMD). \n\n6) It will be very hard for them to avoid this, especially because they have very few customers. Nvidia can try to pick winners, but they run the risk of being constrained politically / angering their customers.\n\n7) I'm not sure Nvidia and Broadcom look like such good businesses without ~70% gross margins. AMD is at like 55% and Intel is at like 40%.\n\n8) OpenAI plans to spend something like $350b on compute over the next 5 years. This is something like 1/2 it's expected revenues. OpenAI probably expects to spend substantially more on compute than on people.\n\n9) This means that driving down the cost of compute has to be one of OpenAI's (and by extension everyone else's) top goals over the next 5 years.\n",
    "tweet_id": "1977847040823685473",
    "note_id": "1977847040660127744",
    "tweet_url": "https://x.com/fleetingbits/status/1977847040823685473",
    "created_at": "2025-10-13T21:20:55.000Z",
    "length": 1817,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "meta",
      "lab economics",
      "compute"
    ],
    "title": "Some thoughts on whether we are in an AI bubble:",
    "snippet": "1) It is possible that we are in a bubble in the valuation of hardware companies. Nvidia and Broadcom both have very high margins and they can be compressed. 2) I think it is unlikely that the valuations of big tech companies like Google, Microsoft, Meta, Amazon or foundation labs like Anthropic or OpenAI are too high or otherwise substantially inflated. 3) Companies like Google and Microsoft have great distribution and are going to make a lot of money distributing AI products to their less sophisticated general business customers. 4) OpenAI and Anthropic are both in a similar position. It is hard to copy what they do because it is very capital and expertise intensive. And, it can make a lot of money because it offers the potential to create enormous efficiency gains in the economy."
  },
  {
    "body": "Some quick thoughts on success with Enterprise LLM products\n\n1) You want to aim automate a whole function. This protects you against your product being commoditized by the next generation of models.\n\n2) The way to approach this is that you set up your product so that it encompasses the whole task and then figure out what cannot be automated reliably at current capabilities.\n\n3) For these things, you expose user decision points so that an expert human user can intervene. The important thing to do is to make sure you surface the right information and points of interaction.\n\n4) Designing these points of interaction is much of the value of an enterprise LLM product right now. It takes a lot of LLM and user understanding to do this well.\n\n5)  One issue is information. In traditional workflows, a human will have a lot of information before they reach your decision point. \n\n6) But, because you automated all of the task before that decision point, the user will not have that accumulated information when you surface it to them.\n\n7) So, you have to build a good way to feed them all the information that they would have learned before they reached that point in a quick, digestible way.\n\n8) The other challenge is picking the interactions. You need to do this such that there are not too many false positives. This can be very challenging.\n\n9) Humans are actually quite efficient when looking at documents, for instance. They pick out the few things that matter. This may be different from what legibly ought to matter in the abstract.\n\n10) This requires iteration with human experts in an industry to understand what matters to them.\n\n11) Over time, as capabilities increase, you withdraw these touch points and progressively hide them. This is how you gradually automate away the whole task.\n\n12) At the same time, your goal is to land and expand within the corporate departments with whatever  functions you have chosen to automate.\n",
    "tweet_id": "1975743053785932187",
    "note_id": "1975743053651697664",
    "tweet_url": "https://x.com/fleetingbits/status/1975743053785932187",
    "created_at": "2025-10-08T02:00:26.000Z",
    "length": 1941,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "wrappers",
      "enterprise"
    ],
    "title": "Some quick thoughts on success with Enterprise LLM products",
    "snippet": "1) You want to aim automate a whole function. This protects you against your product being commoditized by the next generation of models. 2) The way to approach this is that you set up your product so that it encompasses the whole task and then figure out what cannot be automated reliably at current capabilities. 3) For these things, you expose user decision points so that an expert human user can intervene. The important thing to do is to make sure you surface the right information and points of interaction. 4) Designing these points of interaction is much of the value of an enterprise LLM product right now. It takes a lot of LLM and user understanding to do this well."
  },
  {
    "body": "Some thoughts on the OpenAI x AMD deal\n\n1) Foundation lab revenues are going to increase at a dramatic pace over the next year; OpenAI and Anthropic will have combined revenue run rates of around $80bn by EOY 2026.\n\n2) This economic success will drive a lot more investment in the foundation labs and it will direct a lot of spend to their suppliers. The important suppliers to labs provide compute and data.\n\n3) OpenAI projects that it will spend $350bn on compute through 2030; and, OpenAI historically spends about 5% of compute on data; so let's say about $17.5bn on data.\n\n4) Nvidia has very high gross margins, operating margins and net margins (75%, 62% and 56%). Normally, hardware companies have worse gross margins than SaaS companies, not true for Nvidia.   \n\n5) This means that Nvidia is going to take a lot of the profit that would go to foundation labs. \n\n6) Keeping everything else the same, if the market were competitive and Nvidia had to lower its prices to the point where it had the same gross margins as AMD, then OpenAI could save $175bn over the next 5 years.\n\n7) So, there is a lot of pressure on folks to figure out alternatives to Nvidia. Nvidia has the strongest product offering for training runs, which require networking tens of thousands of GPUs together.\n\n8) So, labs like OpenAI and Anthropic are going to explore alternatives for inference. OpenAI is exploring AMD and their own chip that they are developing with Broadcom. Anthropic is trying to use Amazon Trainium. \n\n9) The historic issues have been that AMD software is terrible. But, I have heard that this is getting better. AMD has a lot of experience working with large scientific deployments (like for weather simulations). \n\n10) Apparently, the OpenAI Broadcom chip is delayed.\n\n11) Trainium also requires a lot of workarounds, apparently there are a lot of bugs with the software interface to the hardware. 
But, Anthropic seems to be soldiering through it.\n\n12) In the long run, the labs will have to look for an alternative for training as well. This might be AMD or later versions of Trainium. Google has TPUs. \n\n13) Anyway, the great race is on. Probably about $750bn to $1tn in compute spending through 2030. Exciting times to live in.\n",
    "tweet_id": "1975399144178254170",
    "note_id": "1975399143964352513",
    "tweet_url": "https://x.com/fleetingbits/status/1975399144178254170",
    "created_at": "2025-10-07T03:13:51.000Z",
    "length": 2232,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "compute"
    ],
    "title": "Some thoughts on the OpenAI x AMD deal",
    "snippet": "1) Foundation lab revenues are going to increase at a dramatic pace over the next year; OpenAI and Anthropic will have combined revenue run rates of around $80bn by EOY 2026. 2) This economic success will drive a lot more investment in the foundation labs and it will direct a lot of spend to their suppliers. The important suppliers to labs provide compute and data. 3) OpenAI projects that it will spend $350bn on compute through 2030; and, OpenAI historically spends about 5% of compute on data; so let's say about $17.5bn on data. 4) Nvidia has very high gross margins, operating margins and net margins (75%, 62% and 56%). Normally, hardware companies have worse gross margins than SaaS companies, not true for Nvidia."
  },
  {
    "body": "Thoughts on Sora 2\n\n1) The main difference between Veo3 and Sora 2 was marketing. The model itself is better but is not that much of a step forward over Veo 3.\n\n2) OpenAI engineered a very clever marketing strategy using influencers and invite codes. The invite code made Sora feel exclusive and limited people's ability to use it themselves.\n\n3) Influencers who have preview access to a model hype it up for their own benefit. This is because their relevance depends upon the stuff being cool. Most people only see it through the eyes of the influencers.\n\n4) Influencers only upload the best generations to twitter. So people have a set of expectations of the product that come from cherry picked examples. Rather than what a typical generation will look like.\n\n5) The invite code also encourages sharing between users that have invited one another. It encourages people to explore cultural niches. And, it creates a sense of FOMO from those who have not been invited. This increases demand on twitter and other media platforms.\n\n6) The model itself is good. It can do great scenes with well known IP and historical figures. I’ve seen good videos of Sam Altman, Mr Rogers, Bob Ross, and JFK. The audio can be very good for these scenes.\n\n8) But, in practice, the model, as presented, seems well short of what it would need to be to drive continued use. Its world knowledge is not great. And, the audio often doesn’t line up well with the video. The videos feel truncated and unsatisfying.\n\n9) The Sora app itself is super unoptimized. You can barely scroll before the website starts to slow to a crawl because it must be downloading the full videos as you scroll. And, you can only generate 5 videos at a time, and those take a while to generate, which limits iteration loops for creators.\n\n10) It feels more like a marketing and research release. I can see the strategic deployment team releasing selling high end fine-tunes for enterprise use. 
But, I do not believe that the model will pay for itself. It’s more about what can be done with such a model in the future.\n\n11) It seems very hard to get virality out of a text model release now. Claude 4.5 Sonnet and GPT-5 seem to have driven less net-new attention than Sora and nano-banana. GPT Image 1 (Ghlibli style) also drove a lot of attention. This should inform lab releases in the future.\n\n12) A music generation release would probably also drive attention. But, the major labs are avoiding this for copyright reasons. A real time video model would drive a lot of attention.\n\n13) Anyway, Sora 2 is better than Sora 1. If Sora 1 was GPT-1 then Sora 2 is GPT-2. The next release will probably be closer to something that is practically useful. So, we should probably begin to expect video capabilities to become much more relevant toward late 2026.\n\n14) AI video generation will end up competing in a bunch of different categories. It will compete against TikTok, Netflix and YouTube. The only reason short form video is the first to make it to market is because it is cheaper to train models to do short form video and it is cheaper to do inference. This is not something somehow inherent to the tech.\n\n15) The labs will also have to figure out how to think about NSFW. My gut is that they will just avoid it. But, it is worth pointing out that the main current use of the old Sora app is for people to generate feet content.\n\n16) Oh, and I think we are going to get more and more personalization. To tag the popular discourse, there will be more slop for a short period of time, but then much less if you care at least, as personalization takes off and the cost of production of high quality content falls.\n",
    "tweet_id": "1974631070630023606",
    "note_id": "1974631070252605440",
    "tweet_url": "https://x.com/fleetingbits/status/1974631070630023606",
    "created_at": "2025-10-05T00:21:48.000Z",
    "length": 3665,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "google",
      "lab economics",
      "compute",
      "consumer",
      "legal"
    ],
    "title": "Thoughts on Sora 2",
    "snippet": "1) The main difference between Veo3 and Sora 2 was marketing. The model itself is better but is not that much of a step forward over Veo 3. 2) OpenAI engineered a very clever marketing strategy using influencers and invite codes. The invite code made Sora feel exclusive and limited people's ability to use it themselves. 3) Influencers who have preview access to a model hype it up for their own benefit. This is because their relevance depends upon the stuff being cool. Most people only see it through the eyes of the influencers. 4) Influencers only upload the best generations to twitter. So people have a set of expectations of the product that come from cherry picked examples. Rather than what a typical generation will look like."
  },
  {
    "body": "Some thoughts on Sora and Vibes:  \n\n1) Sora and Vibes are traditional short form media platforms but where the producer is able to bootstrap content creation using AI.   \n\n2) Historically, it was hard to compete with media providers because it was hard to create a content catalog.  \n\n3) A platform needs creators to get content. Creators need to be paid in attention or money. Both of these require users. There are no long term users without content.  \n\n4) But, AI allows a small number of creators to create a large amount of content. This means that you no longer need as many creators to hit critical mass.  \n\n5) This means that OpenAI and Meta can create new content platforms without an existing creator base for those platforms. They underwrite the initial creators with compute cost.  \n\n6) Both platforms are taking advantage of a window in which AI video creation is new and not everyone can afford to underwrite such a platform.\n\n7) Such video creation content would eventually be commoditized and more would be necessary to launch such a platform. This makes now a narrow window / unique opportunity.\n\n8) On the long term, the goal is to completely bypass human creators and to create fully personalized content that is unique to the individual.  \n\n9) The ultimate evolution is a combination of all the user generated and professional generated content platforms TikTok, YouTube and Netflix. It is not limited to short form video.\n\n10) But, OpenAI and Meta cannot quite do this yet, because they can't generate good long context video, and the content isn't personalized in a way that is a competitive advantage over YouTube or TikTok.  \n\n11) This is why OpenAI is emphasizing that you can remix videos with your face or your friends faces; they want the content to be shareable as a distribution strategy.  \n\n12) It is also why OpenAI pushes the \"Sora\" watermark into the videos so aggressively. 
It is so they can use it to bootstrap distribution for the platform. \n\n13) If they imagined Sora as a content creation tool for creators instead of a video platform then they would not include the watermark, since creators want to share their material, not Sora's material.\n\n14) The videos are not high enough quality right now to be competitive on their own with YouTube or TikTok without thinking about distribution.  \n\n15) Meta cares so much about this space because it feels like the coming wave of asocial media may eat much of social media.\n",
    "tweet_id": "1973176881928405194",
    "note_id": "1973176881651523585",
    "tweet_url": "https://x.com/fleetingbits/status/1973176881928405194",
    "created_at": "2025-10-01T00:03:23.000Z",
    "length": 2455,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/OpenAI/status/1973075422058623274"
    ],
    "tags": [
      "openai",
      "meta",
      "lab economics",
      "compute",
      "consumer"
    ],
    "title": "Some thoughts on Sora and Vibes:",
    "snippet": "1) Sora and Vibes are traditional short form media platforms but where the producer is able to bootstrap content creation using AI. 2) Historically, it was hard to compete with media providers because it was hard to create a content catalog. 3) A platform needs creators to get content. Creators need to be paid in attention or money. Both of these require users. There are no long term users without content. 4) But, AI allows a small number of creators to create a large amount of content. This means that you no longer need as many creators to hit critical mass."
  },
  {
    "body": "Some thoughts on the Dwarkesh Richard Sutton interview:\n\n1) Richard Sutton has internalized the bitter lesson to a very impressive degree.\n\n2) He doesn't like pretraining because human set the data used in pretraining. He doesn't like post-training because humans set the curriculum.\n\n3) He wants the agent to be able to be given a goal and then be able to loop to learn how to accomplish the goal on its own, just interacting with the world. \n\n4) This involves the agent getting a progressively richer world model, related to its goals, which it is able to manipulate to accomplish its tasks. \n\n5) I don't think that anyone at OpenAI, DeepMind or Anthropic would really disagree with this as the ultimate goal.\n\n6) Whether, it is specialized models that interact in order to form an agent, with an interior training loop, or whether it's in context learning, or whatever.\n\n8) I think the bigger issue with the interview was just that Dwarkesh wasn't familiar with Richard Sutton's way of thinking or talking.\n\n9) Richard Sutton feels very connectionism, early AI, etc... and he understands the material, but he has a more focused worldview.\n",
    "tweet_id": "1972129978675597346",
    "note_id": "1972129978583339008",
    "tweet_url": "https://x.com/fleetingbits/status/1972129978675597346",
    "created_at": "2025-09-28T02:43:21.000Z",
    "length": 1141,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "post-training",
      "pretraining",
      "agi"
    ],
    "title": "Some thoughts on the Dwarkesh Richard Sutton interview:",
    "snippet": "1) Richard Sutton has internalized the bitter lesson to a very impressive degree. 2) He doesn't like pretraining because human set the data used in pretraining. He doesn't like post-training because humans set the curriculum. 3) He wants the agent to be able to be given a goal and then be able to loop to learn how to accomplish the goal on its own, just interacting with the world. 4) This involves the agent getting a progressively richer world model, related to its goals, which it is able to manipulate to accomplish its tasks."
  },
  {
    "body": "Some thoughts on Harvey:\n\n1) Harvey was initially known for having a mediocre product and a lot of people in the legal tech industry predicted that it would fail.\n\n2) But, it's revenue has grown each time that I have gotten info, last time it was at about $100m in revenue, which justifies a unicorn valuation.\n\n3) Harvey charges a sticker price of $120k/seat/yr and a real price of $5k/seat/yr. LexisNexis and Westlaw are about $80k/seat/yr.\n\n4) This makes Harvey reasonably affordable. Compared to ChatGPT Pro, it's about 5x the cost, which is a lot but probably not that unreasonable with legal-specific enterprise features added on.\n\n5) Alexander Doria elsewhere says that he thinks that synthetic data + RL will cause companies like Harvey to fail.\n\n6) This is incorrect, companies like Harvey are going to be much more technically nimble than both their traditional competitors and customers (e.g. Lexis and Thompson / Big law)\n\n7) And, Harvey will have legal specific features that OpenAI will not; in fact, Harvey will probably just benefit from the labs making better models\n\n9) It's real value is going to be in knowing what workflows to automate, understanding how these should be represented to users, figuring out the difficult parts and helping firms work around them\n\n10) This isn't a full endorsement of Harvey - I think it's workforce has grown much too quickly (at least, 50 to 150 in a couple of months) and I know the team is disorganized\n\n11) But, I think that AI people often discount the value of connecting a product to a market, which in practice is often more complicated than expected\n",
    "tweet_id": "1971564599393567175",
    "note_id": "1971564599209033729",
    "tweet_url": "https://x.com/fleetingbits/status/1971564599393567175",
    "created_at": "2025-09-26T13:16:45.000Z",
    "length": 1611,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/Dorialexander/status/1971476888993530014"
    ],
    "tags": [
      "openai",
      "lab economics",
      "enterprise",
      "post-training",
      "legal"
    ],
    "title": "Some thoughts on Harvey:",
    "snippet": "1) Harvey was initially known for having a mediocre product and a lot of people in the legal tech industry predicted that it would fail. 2) But, it's revenue has grown each time that I have gotten info, last time it was at about $100m in revenue, which justifies a unicorn valuation. 3) Harvey charges a sticker price of $120k/seat/yr and a real price of $5k/seat/yr. LexisNexis and Westlaw are about $80k/seat/yr. 4) This makes Harvey reasonably affordable. Compared to ChatGPT Pro, it's about 5x the cost, which is a lot but probably not that unreasonable with legal-specific enterprise features added on."
  },
  {
    "body": "Some thoughts on talking today to a founder friend in established industry ($25m-$500m valuation in 2 years). \n\n1) He advocates the \"beached whale\" theory of entrepreneurship. Pick an established industry with large mediocre competitors and then eat away that them over time.\n\n2) Building a new product from scratch lets you leapfrog the existing products in the market. You are not beholden to the previous tech ecosystem.\n\n3) You can afford to build for a while before releasing a product to market. My founder friend build for 1.5 years before releasing any product for sale.\n\n4) Some markets are razor / razor blade; you need to be willing to underwrite your customers and then make money from them over time. If you can show ARR, it's fine.\n\n5) Some industries are relationship based, in those industries, cultivating relationships with your customers by offering them \"free\" benefits that they didn't expect, can be a winner.\n\n6) There are a lot of markets out there will large zombie companies in them, but they can be hard to see from the outside, experience in an industry is a big advantage.\n",
    "tweet_id": "1970625053839782019",
    "note_id": "1970625053722320897",
    "tweet_url": "https://x.com/fleetingbits/status/1970625053839782019",
    "created_at": "2025-09-23T23:03:19.000Z",
    "length": 1101,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "enterprise"
    ],
    "title": "Some thoughts on talking today to a founder friend in established industry ($25m-$500m valuation in 2 years).",
    "snippet": "1) He advocates the \"beached whale\" theory of entrepreneurship. Pick an established industry with large mediocre competitors and then eat away that them over time. 2) Building a new product from scratch lets you leapfrog the existing products in the market. You are not beholden to the previous tech ecosystem. 3) You can afford to build for a while before releasing a product to market. My founder friend build for 1.5 years before releasing any product for sale. 4) Some markets are razor / razor blade; you need to be willing to underwrite your customers and then make money from them over time. If you can show ARR, it's fine."
  },
  {
    "body": "Dario is mostly right that open source AI does not work the same way as open source software. \n\n1) Open source software works because the big tech companies that open source software are not competing on their software stack but on share of the consumer mind and network effects.\n\n2) Open source AI is not particularly relevant because AI models are so expensive to produce and the companies building them are competing on model capabilities, not purely on share of the consumer mind or network effects. \n\n3) So, foundation labs can't afford to open source their models, because their model quality is the whole moat of their business. And, developers can't build equivalent models using labor, because the field is capital intensive.\n\n5) What is strange is where he says that the reason why models cannot be open sourced is the difficulty of inference; this seems incorrect.\n\n6) In a world where models were open source, the companies would compete on their ability to quickly adapt models for inference. Foundation labs would look more like Together and Hyperbolic and less like Anthropic.\n\n7) The problem with that world is that it doesn't have a good way to push the capabilities frontier without rediscovering the technology of closed source models.\n",
    "tweet_id": "1970526091019247869",
    "note_id": "1970526090868252679",
    "tweet_url": "https://x.com/fleetingbits/status/1970526091019247869",
    "created_at": "2025-09-23T16:30:05.000Z",
    "length": 1254,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/techeconomyana/status/1970064402268508534"
    ],
    "tags": [
      "anthropic",
      "lab economics",
      "compute",
      "pretraining"
    ],
    "title": "Dario is mostly right that open source AI does not work the same way as open source software.",
    "snippet": "1) Open source software works because the big tech companies that open source software are not competing on their software stack but on share of the consumer mind and network effects. 2) Open source AI is not particularly relevant because AI models are so expensive to produce and the companies building them are competing on model capabilities, not purely on share of the consumer mind or network effects. 3) So, foundation labs can't afford to open source their models, because their model quality is the whole moat of their business. And, developers can't build equivalent models using labor, because the field is capital intensive. 5) What is strange is where he says that the reason why models cannot be open sourced is the difficulty of inference; this seems incorrect."
  },
  {
    "body": "Some thoughts on GPT-5-Codex\n\n1) The foundation labs are in a long term competition with intermediaries like Cursor\n\n2) On the one hand, companies like Cursor offer foundation labs distribution for their services, which is valuable and increases their revenue\n\n3) On the other hand, they disassociate the foundation labs from their customers, which means that they can play the foundation labs off against each other\n\n4) On the long term, this lowers the price that foundation labs can charge for their tokens and makes their revenue more uncertain\n\n5) It is also clear that companies like Cursor plan to eventually replace foundation lab models with their own models to the greatest extent possible\n\n6) Foundation labs obviously want to avoid this and would rather a direct relationship with their customers\n\n7) Part of their goal is to release models that are most synergistic with the foundation labs own product stack\n\n8) Models like GPT-5-Codex seem to be a first step towards this; Codex is not available in Cursor at release, which could have been otherwise negotiated\n\n9) OpenAI has said that GPT-5-Codex will be available for Codex-cli via API but has not promised that it will be generally available through API \n\n10) I suspect that GPT-5-Codex was trained specifically to work well with OpenAI's codex-cli and codex frameworks\n\n11) This enables OpenAI to create a fence around their best models that encourages users to use Codex-cli and not to use Cursor as an intermediary\n\n12) Note that OpenAI had to make GPT-5-Codex available to Microsoft (hence its availability in Github) as part of the 2023 investment agreement\n\n13) I expect that we will see something like this from Anthropic eventually (perhaps not with the next release though - they are more dependent on Cursor) as they push users toward Claude Code\n\n14) The foundation labs on the long run will want to own the coding agent market so that they can capture all the profits in 
it\n\n15) Products like Cursor will become more like catchup mechanics for labs that are behind either in distribution or capabilities or otherwise want to disrupt their competition by offering their tokens at a cheaper price\n\n16) I can see this being very attractive to Google for some period of time if they think that they can use it to undercut OpenAI and Anthropic\n",
    "tweet_id": "1967736640442601919",
    "note_id": "1967736640266440710",
    "tweet_url": "https://x.com/fleetingbits/status/1967736640442601919",
    "created_at": "2025-09-15T23:45:48.000Z",
    "length": 2318,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "coding",
      "enterprise",
      "consumer",
      "post-training"
    ],
    "title": "Some thoughts on GPT-5-Codex",
    "snippet": "1) The foundation labs are in a long term competition with intermediaries like Cursor 2) On the one hand, companies like Cursor offer foundation labs distribution for their services, which is valuable and increases their revenue 3) On the other hand, they disassociate the foundation labs from their customers, which means that they can play the foundation labs off against each other 4) On the long term, this lowers the price that foundation labs can charge for their tokens and makes their revenue more uncertain"
  },
  {
    "body": "Some thoughts on multimodal voice models\n\n1) One important issue with reasoning models is that they are real time; you can't easily do reasoning in real time right now\n\n2) This just means that voice models provide very unimpressive responses compared to text models; they are great for simple queries but not complex ones\n\n3) Frontier labs still do not offer voice models that will allow you to get them to talk in a variety of voices or to sing; this is because labs are scared of getting sued\n\n4) This is because the music industry is very litigious and has a long history of suing for infringement; also, the risk of fraud from deepfakes seems more real for voice responses\n\n5) It's not clear where the data bottleneck is for voice models given the current product roadmap for the foundation labs\n\n6) If the goal for voice models is to be a better assistant then some combination of synthetic data + preference data is probably good enough (the latter might just be able to be collected from the applications)\n\n7) If the goal for voice models to be used for business applications then you want stuff likes sales calls - but we are probably some distance from this being viable - businesses just want control of their public image\n\n8) So, the voice models data market will still probably remain reasonably small for the near future\n\n9) An important unlock is to figure out how to improve reasoning for audio models; I can see a big encoder small decoder collection or something\n",
    "tweet_id": "1967671641384849620",
    "note_id": "1967671641284247552",
    "tweet_url": "https://x.com/fleetingbits/status/1967671641384849620",
    "created_at": "2025-09-15T19:27:31.000Z",
    "length": 1479,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "consumer",
      "post-training",
      "legal"
    ],
    "title": "Some thoughts on multimodal voice models",
    "snippet": "1) One important issue with reasoning models is that they are real time; you can't easily do reasoning in real time right now 2) This just means that voice models provide very unimpressive responses compared to text models; they are great for simple queries but not complex ones 3) Frontier labs still do not offer voice models that will allow you to get them to talk in a variety of voices or to sing; this is because labs are scared of getting sued 4) This is because the music industry is very litigious and has a long history of suing for infringement; also, the risk of fraud from deepfakes seems more real for voice responses"
  },
  {
    "body": "Some basic principles that I have about AI that have helped me to avoid hyped research:\n\n1) Compute is an upper bound on lab capabilities; if a lab puts out a model much better than its compute would imply, it's overhyped (e.g. Reflection 70B) \n\n2) The important discoveries are those that represent new ways for models to learn behaviors rather than places were we take advantage of our own knowledge (e.g. sampling stuff)\n\n3) Agent scaffolding can't take you that far beyond the underlying model, if you see someone claim miracle scaffolding, you should discount it (e.g. Cognition)\n\n4) Architecture matters but is a collection of medium scale discoveries bound together; you should not expect the discovery of a miracle architecture (e.g. hierarchical reasoning models)\n",
    "tweet_id": "1959703146365632523",
    "note_id": "1959703146285903872",
    "tweet_url": "https://x.com/fleetingbits/status/1959703146365632523",
    "created_at": "2025-08-24T19:43:34.000Z",
    "length": 772,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "compute",
      "post-training",
      "pretraining"
    ],
    "title": "Some basic principles that I have about AI that have helped me to avoid hyped research:",
    "snippet": "1) Compute is an upper bound on lab capabilities; if a lab puts out a model much better than its compute would imply, it's overhyped (e.g. Reflection 70B) 2) The important discoveries are those that represent new ways for models to learn behaviors rather than places were we take advantage of our own knowledge (e.g. sampling stuff) 3) Agent scaffolding can't take you that far beyond the underlying model, if you see someone claim miracle scaffolding, you should discount it (e.g. Cognition) 4) Architecture matters but is a collection of medium scale discoveries bound together; you should not expect the discovery of a miracle architecture (e.g. hierarchical reasoning models)"
  },
  {
    "body": "Some extended thoughts on GPT-5\n\n1) GPT-5 is a good model. It feels like it provides better search and performance than o3 did before it.\n\n2) It's disappointing to people because it is an incremental improvement, which does not open up fundamentally new use cases.\n\n3) The really interesting story around GPT-5 seems to be more about competition with Anthropic. \n\n4) Anthropic has increased its revenue by 4-5x over the last 6 months. OpenAI has increased it's revenue by 2x.\n\n5) A lot of Anthropic's revenue growth is due to API revenue, which is a much larger percentage of Anthropic's revenue (60%) than OpenAI's revenue (25%). \n\n6) About 50% of Anthropic's API revenue comes from its Cursor and Github Copilot partners. Anthropic probably collects something like $800m in revenue from Cursor.\n\n7) GPT-5 seems to be in part about challenging Anthropic's dominance in coding agents. GPT-5 is now the default in Cursor.\n\n8) GPT-5 finally matches Claude 4.1 Opus' performance on SWE Bench Verified, which isn't a perfect measure but which seems to be a good proxy for performance.\n\n9) The cursor partnership has the opportunity to steer a lot of revenue away from Anthropic, while helping OpenAI to cement its share of the consumer mind for coding applications.\n\n10) I wouldn't read too much into it, but it could slow Anthropic's revenue growth and make it marginally harder for them to raise (although, probably a weak effect).\n\n11) I have other thoughts around GPT-5 from a user interaction / launch perspective.\n\n12) I think they botched the launch; no one wants to watch live streams, the benchmarks are not intelligible anymore, and there was nothing viral to interact with.\n\n13) Cool model interactions need to be about new modalities or need to be very agentic, which requires a lot of scaffolding. 
World historic stuff is good too - our models solved a millennium prize.\n\n14) Labs tend to solve UI/UX in one place and then just end up with problems in another; we simplify the model complexity (somewhat) but now have to pick personalities.\n\n15) But, this is okay, and just part of the grand adventure where we work toward truly tailored interactions with our LLM assistants.\n",
    "tweet_id": null,
    "note_id": "1953904725277257728",
    "tweet_url": null,
    "created_at": "2025-08-08T19:42:42.000Z",
    "length": 2184,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "lab economics",
      "coding",
      "enterprise",
      "consumer",
      "evals"
    ],
    "title": "Some extended thoughts on GPT-5",
    "snippet": "1) GPT-5 is a good model. It feels like it provides better search and performance than o3 did before it. 2) It's disappointing to people because it is an incremental improvement, which does not open up fundamentally new use cases. 3) The really interesting story around GPT-5 seems to be more about competition with Anthropic. 4) Anthropic has increased its revenue by 4-5x over the last 6 months. OpenAI has increased it's revenue by 2x."
  },
  {
    "body": "Some initial thoughts on Gemini DeepThink:\n\n0) TLDR; it's very impressive\n\n1) It feels more like running a Deep Research query in that it can take 10-15 minutes to run.\n\n2) It seems like it runs in a sandbox and has access to some compute but tries to run code that you wouldn't expect it to run in its sandbox (e.g. GPU compute).\n\n3) As you would expect the reasoning traces say the code fails and this seems to cause it to spend time that it shouldn't rewriting the code and trying to re-configure the sandbox. \n\n4) The code that I got in the end was very impressive though and I don't think that Grok 4 Heavy could have done better, although I haven't tried yet.\n\n5) My task was having it train a small model and then train SAEs on top of it and visualize them. The model is a very very small model, so it's not that hard. But, I think other models might have trouble with SAE code.\n\n6) Sidenote, Google's benchmarks just seem very off; I've been using Gemini 2.5 Pro and it feels much worse than o3 and Grok 4. \n\n7) I think for DeepThink to be a useful product, you have to basically have a ton of access at the Ultra tier to it. Otherwise, it's not worth paying $250/mo to run it a couple of times a day. \n\n8) But, I am becoming more and more convinced that if Google can built a slightly more pleasant model and keep up on intelligence then it can win on integrations. It's very flow friction to copy to colab and very nice to read my email from Gemini.\n",
    "tweet_id": "1951321535287095393",
    "note_id": "1951321535073099777",
    "tweet_url": "https://x.com/fleetingbits/status/1951321535287095393",
    "created_at": "2025-08-01T16:38:02.000Z",
    "length": 1459,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "google",
      "xai",
      "compute",
      "coding",
      "consumer",
      "interpretability",
      "evals"
    ],
    "title": "Some initial thoughts on Gemini DeepThink:",
    "snippet": "0) TLDR; it's very impressive 1) It feels more like running a Deep Research query in that it can take 10-15 minutes to run. 2) It seems like it runs in a sandbox and has access to some compute but tries to run code that you wouldn't expect it to run in its sandbox (e.g. GPU compute). 3) As you would expect the reasoning traces say the code fails and this seems to cause it to spend time that it shouldn't rewriting the code and trying to re-configure the sandbox."
  },
  {
    "body": "the takes around this are sort of strange; here are my thoughts from my product management experience\n\n1) customers want subscriptions because they don't want to have to carry the mental overhead of figuring out how much they will have to pay each month\n\n2) usage based billing leads to bad surprises, people either get cut off and are unable to use the product or they get a large bill that they don't expect\n\n3) in any subscription based product, most of your customers are profitable, but a small number of customers are very unprofitable\n\n4) sometimes the small number of customers is very very unprofitable and then you need to do something about it\n\n5) what you do is you create a usage graph that shows percentile usage of the product and the (sometimes negative) profit associated with each percentile band\n\n6) you then pick a cutoff point that you think will not affect the majority of your customer base and soft cap usage at that band\n\n7) soft cap normally means you introduce some process whereby differential usage is slowed down; and, then you normally institute a hard cap even farther above that\n\n8) for Claude this would look like: (1) going from Opus to Sonnet above some certain usage or (2) lowering the rate limit on Claude Code above a certain usage so that calls returns more slowly\n\n9) a hard cap even farther above that where Claude Code ceases to function and you are told to wait for your recharge\n\n10) you then monitor after release to see if it fixes the unprofitability / whether you put the cutoff in the right place\n\n11) normally, you find you get very little pushback from your real customers when you do something like this; because, very few people are actually affected / the number of bad actors is small\n",
    "tweet_id": "1949982366979539151",
    "note_id": "1949982366870519808",
    "tweet_url": "https://x.com/fleetingbits/status/1949982366979539151",
    "created_at": "2025-07-28T23:56:39.000Z",
    "length": 1741,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/AnthropicAI/status/1949898502688903593"
    ],
    "tags": [
      "anthropic",
      "lab economics",
      "coding",
      "enterprise",
      "consumer"
    ],
    "title": "the takes around this are sort of strange; here are my thoughts from my product management experience",
    "snippet": "1) customers want subscriptions because they don't want to have to carry the mental overhead of figuring out how much they will have to pay each month 2) usage based billing leads to bad surprises, people either get cut off and are unable to use the product or they get a large bill that they don't expect 3) in any subscription based product, most of your customers are profitable, but a small number of customers are very unprofitable 4) sometimes the small number of customers is very very unprofitable and then you need to do something about it"
  },
  {
    "body": "Gemini is very close to being an incredible product. It just needs a few small tweaks. Some thoughts:\n\n1) I should be able to control the whole of my Google Workspace from the Gemini application.\n\n2) I can read emails, but I can't send emails.  I also can't read attached pdfs. This needs to be fixed.\n\n3) I can edit things in canvas but I can't push them to google docs and google docs doesn't use Gemini well\n\n4) I can have Gemini write code, but I can't push it to colab or otherwise interact with it in Gemini\n\n5) In Gmail, Gdocs, Gdrive and Colab, I have no way to select which model I am using\n\n6) If Google fixed these things, I would pick Gemini for $250/month over OpenAI in a heartbeat\n\n7) For people who don't code, Gemini is already better than OpenAI, because you can use it to search your productivity applications\n\n8) I also suggest Gemini to people that are non-technical because they don't need to change their workflow as much; just go to Google.\n\n9) I don't know if you can have Gemini control your android phone, but if you can then I would consider switching from Apple to Android\n\n10) OpenAI should attempt to close the gap; release an email and productivity suite; and, find some way to fully integrate with core phone capabilities\n",
    "tweet_id": "1947476791695696283",
    "note_id": "1947476791561465857",
    "tweet_url": "https://x.com/fleetingbits/status/1947476791695696283",
    "created_at": "2025-07-22T02:00:24.000Z",
    "length": 1254,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "google",
      "coding",
      "enterprise",
      "consumer"
    ],
    "title": "Gemini is very close to being an incredible product. It just needs a few small tweaks. Some thoughts:",
    "snippet": "1) I should be able to control the whole of my Google Workspace from the Gemini application. 2) I can read emails, but I can't send emails.  I also can't read attached pdfs. This needs to be fixed. 3) I can edit things in canvas but I can't push them to google docs and google docs doesn't use Gemini well 4) I can have Gemini write code, but I can't push it to colab or otherwise interact with it in Gemini"
  },
  {
    "body": "Some thoughts on the IMO results:\n\n1) This shows us what can be done in a verifiable domain if you have the compute.\n\n2) I think that a lot of the capabilities in math may be downhill of being able to generate synthetic data for it.\n\n3) You can do auto-formalization of journal articles into lean; check them in lean; then translate them back into English for your model.\n\n4) This gives you a good way to collect solid proof data for a verbal reasoning model. There might be other ways to do it, but this seems like a possible way to do it.\n\n5) I think this is one of the reasons that labs like to work on mathematical work; it gives them good feedback on their RL methods.\n\n6) I feel like there are a lot of places where having more math and scientific knowledge available, on demand, without a special expert, could produce value.\n\n7) Industrial applications, financial applications, pharmaceutical applications, all come to mind as places where this kind of thing could produce value.\n\n8) But, it also beckons to a future where we can spin up research on demand to whatever extent you are willing to pay for it.\n\n9) Want 2x or 3x more mathematical research than we have today? Just be willing to spend $100m in GPUs. Want 6x more? Just spend $200m.\n",
    "tweet_id": "1947079117381517700",
    "note_id": "1947079117201108992",
    "tweet_url": "https://x.com/fleetingbits/status/1947079117381517700",
    "created_at": "2025-07-20T23:40:11.000Z",
    "length": 1251,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "compute",
      "post-training",
      "evals",
      "math"
    ],
    "title": "Some thoughts on the IMO results:",
    "snippet": "1) This shows us what can be done in a verifiable domain if you have the compute. 2) I think that a lot of the capabilities in math may be downhill of being able to generate synthetic data for it. 3) You can do auto-formalization of journal articles into lean; check them in lean; then translate them back into English for your model. 4) This gives you a good way to collect solid proof data for a verbal reasoning model. There might be other ways to do it, but this seems like a possible way to do it."
  },
  {
    "body": "My perspective is the same this year as it was last year with some small edits.\n\nRate of Progress\n\n1) I expected GPT-4.5 to be more of a step up\n2) I expected Veo3 to be more of a step up\n3) I expected Operator to be better on release\n4) Reasoning models met my expectations\n5) My AGI went from 2027 to 2028\n\nBusiness Model\n\n6) I thought labs would be more product focused\n7) I thought xAI would not catch the frontier until fall 2025\n8) I didn't realize the extent to which data companies (Mercor, Scale, Surge) would see rising valuations\n",
    "tweet_id": "1947035253870125167",
    "note_id": "1947035253769371649",
    "tweet_url": "https://x.com/fleetingbits/status/1947035253870125167",
    "created_at": "2025-07-20T20:45:53.000Z",
    "length": 540,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/sebkrier/status/1946785369753604555"
    ],
    "tags": [
      "xai",
      "lab economics",
      "compute",
      "consumer",
      "agi"
    ],
    "title": "My perspective is the same this year as it was last year with some small edits.",
    "snippet": "Rate of Progress 1) I expected GPT-4.5 to be more of a step up 2) I expected Veo3 to be more of a step up 3) I expected Operator to be better on release"
  },
  {
    "body": "Thoughts on the windsurf and scale acquisitions, bad acquisitions, looks like a bubble but isn't:\n\n1) We are starting to get bad acquisitions; google didn't need windsurf and meta didn't need scale; cognition acquiring windsurf was okay though\n\n2) Google already has some of the best codegen talent and IDEs are not the future of codegen\n\n3) Google probably thought that the issue is product velocity (sort of true) and startup people would help with that (unlikely); Google needs to break down its internal fiefdoms\n\n4) All google really needs for success is a more forward product organization + the pressure to integrate Gemini everywhere and integrate it well\n\n5) Google acquiring the Windsurf team doesn't do anything to break down its internal fiefdoms and the Windsurf team doesn't have a lot of experience working in a large organization\n\n6) Meta needs a top lab lead; think Noam Shazeer; the best two acquisitions for them would have been Thinking Machines or SSI; lab leads are extremely scarce and very valuable\n\n7) Scale is not the best data company (Surge or Mercor) and Alex Wang has not shown he can lead a lab; he seems to have a personality that would clash with researchers\n\n8) Nonetheless, I'm optimistic that Meta has everything that it needs and just general shakeups and talent acquisition will be enough to move things in the right direction\n\n9) The OpenAI IO acquisition also seems a bit pointless; Johnny Ive is a great designer; but OpenAI doesn't need just need a great designer and Johnny Ive wasn't Apple; Apple was more than him\n\n10) Companies that feel a bit dead in the water: Cohere, Perplexity, Cursor, Cognition, maybe Mistral; actually, this is why Cognition acquiring Windsurf made sense, Cognition needs distribution / is already irrelevant\n\n11) Anyway, AI is not a bubble, OpenAI just passed $10b revenue run rate but some of these purchases look very bubble like; not sure what to say about this\n\n12) If I were looking at 
acquisitions, I might prioritize stuff like chip companies that offer the ability to potentially get away from Nvidia\n",
    "tweet_id": "1944831943348249000",
    "note_id": "1944831943218225152",
    "tweet_url": "https://x.com/fleetingbits/status/1944831943348249000",
    "created_at": "2025-07-14T18:50:43.000Z",
    "length": 2079,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "meta",
      "neolabs",
      "lab economics",
      "compute",
      "coding",
      "enterprise"
    ],
    "title": "Thoughts on the windsurf and scale acquisitions, bad acquisitions, looks like a bubble but isn't:",
    "snippet": "1) We are starting to get bad acquisitions; google didn't need windsurf and meta didn't need scale; cognition acquiring windsurf was okay though 2) Google already has some of the best codegen talent and IDEs are not the future of codegen 3) Google probably thought that the issue is product velocity (sort of true) and startup people would help with that (unlikely); Google needs to break down its internal fiefdoms 4) All google really needs for success is a more forward product organization + the pressure to integrate Gemini everywhere and integrate it well"
  },
  {
    "body": "Some thoughts on the current lab situation and a futuristic patent scheme to address lab competition:\n\n1) It seems like one of the issues in lab competition is that patents are not effective for the labs\n\n2) This means that all the IP is kept instead as trade secrets, which means the employees are very valuable\n\n3) So, the main thing you want to do is recruit the valuable talent from your lab competitor; even if that talent is very expensive\n\n4) It's the equivalent of stealing their IP and therefore having gotten the advantage of all their compute resources\n\n5) Because, discoveries at labs are downstream of compute for experiments, so researchers are able to carry the value of that compute in their heads\n\n6) So, if you pay $20m for a researcher, maybe the right way to think about that is you are paying $20m for $80m of compute that your lab competitor used\n\n7) An interesting solution would be a law that created short duration patents specifically for frontier labs\n\n8) Google, Anthropic, Meta, OpenAI, xAI; anyone with compute resources >$1bn + certain national security provisions gets access\n\n9) Patents are short duration: like 1-2 year; they are immediately granted with a very quick challenge procedure reviewed by dedicated specialists\n\n10) The patents are only binding between and among these companies that can instantly see what their competitors have submitted\n\n11) Basically means labs have an incentive to explore different lines of inquiry; means that you can no longer steal your competitors compute by hiring their employees\n\n12) Maybe also a forced licensing scheme for certain patents that are seen as highly blocking, based on equity; so, every company has to submit an equity portion to the plan\n\n13) This would make it so that labs would be highly encouraged to invest largely into research, since you would get paid not just in money, but in the commercial success of your competitors\n\n14) This part is more speculative but 
I think that we want highly blocking techniques to be available to everyone, but we want companies to be fully compensated for the research that they do; otherwise, they might just weight towards inference\n\n14) Licensing fees might not be enough because of how expensive the research is and because the companies are currently competing for share of the consumer mind, which is a more durable advantage, and could last years\n\n15) We can imagine this whole scheme being compulsory to encourage the speed of development of American AI companies and to avoid wasteful duplication of compute resources\n\n16) But, it still leaves a lot of room for gain from competition between the labs in terms of research and it also protects national security interests against the diffusion of research\n",
    "tweet_id": null,
    "note_id": "1939758160061038592",
    "tweet_url": null,
    "created_at": "2025-06-30T18:49:18.000Z",
    "length": 2743,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "xai",
      "meta",
      "lab economics",
      "compute",
      "safety",
      "legal"
    ],
    "title": "Some thoughts on the current lab situation and a futuristic patent scheme to address lab competition:",
    "snippet": "1) It seems like one of the issues in lab competition is that patents are not effective for the labs 2) This means that all the IP is kept instead as trade secrets, which means the employees are very valuable 3) So, the main thing you want to do is recruit the valuable talent from your lab competitor; even if that talent is very expensive 4) It's the equivalent of stealing their IP and therefore having gotten the advantage of all their compute resources"
  },
  {
    "body": "Some thoughts on Anthropic not providing Claude to Windsurf\n\n1) In most circumstances, there is nothing wrong with being a supplier to a competitor, especially if you can get the right terms\n\n2) Since, your competitor pays you money and you can build brand loyalty with your competitor's customers if their use of your component is disclosed\n\n3) It's unclear what exactly Anthropic was worried about; maybe, it was OpenAI getting real world usage data about Anthropic's models?\n\n4) Real world usage data could show OpenAI what data Anthropic is prioritizing for their models by seeing where their models are good and where they are bad\n\n5) This is the kind of thing that Anthropic couldn't easily protect themselves against through contract\n\n6) It is hard to protect against data usage or internal data sharing through contract because it's very hard to find breaches, especially small breaches\n\n7) Setting this aside, you could write a contract that required Windsurf to make visible all use of Claude to the customer to build brand loyalty\n\n8) You could also write into the contract a minimum purchase commitment based on max usage to give Anthropic some value from user growth accruing to Windsurf from using Claude\n\n9) There is an \"out there\" interpretation that Anthropic is worried about OpenAI further gaining customer base in a fast path to AGI where share of the consumer mind matters\n\n10) But, there are decent arguments that actually it will be easy to switch between software agents in a post-AGI world and share of the consumer mind will matter less because it is B2B and potentially easy to evaluate performance\n\n11) In any event, Anthropic is giving a lot of share of the consumer mind to Cursor, which uses Anthropic models\n\n12) Maybe part of the reasoning here is that it produces a lot of short term revenue / brand awareness and takes potential market share away from OpenAI, Google, Microsoft, etc...\n\n13) In this frame, there is a lot 
of upside and Cursor is just not perceived as a long term competitor because it doesn't produce frontier models, so giving them usage data matters less\n",
    "tweet_id": "1933003704866910432",
    "note_id": "1933003704686555136",
    "tweet_url": "https://x.com/fleetingbits/status/1933003704866910432",
    "created_at": "2025-06-12T03:29:31.000Z",
    "length": 2107,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "coding",
      "enterprise",
      "consumer"
    ],
    "title": "Some thoughts on Anthropic not providing Claude to Windsurf",
    "snippet": "1) In most circumstances, there is nothing wrong with being a supplier to a competitor, especially if you can get the right terms 2) Since, your competitor pays you money and you can build brand loyalty with your competitor's customers if their use of your component is disclosed 3) It's unclear what exactly Anthropic was worried about; maybe, it was OpenAI getting real world usage data about Anthropic's models? 4) Real world usage data could show OpenAI what data Anthropic is prioritizing for their models by seeing where their models are good and where they are bad"
  },
  {
    "body": "Some thoughts on OpenAI's IO acquisition\n\n1) OpenAI purchased IO for $6.5b; the obvious target is the smartphone market\n\n2) The US market is basically divided between Apple, Samsung and Google; Apple is $70b revenue, Samsung is $20b revenue and Google is $4.5b revenue\n\n3) Apple dominates in terms of profit with a $24b profit, Samsung has only $2.4b profit and Google just about breaks even\n\n4) This means that taking the smartphone market is essentially competing either market share or pricing power away from Apple\n\n5) This puts Apple up against OpenAI in a real competitive sense; will be interesting to see how Apple responds\n\n6) I could imagine Apple making a deal to use Gemini rather than ChatGPT in its smartphones; there were rumors that Google was trying to work this deal before\n\n7) I could also imagine Apple making a deal with Anthropic; but, Anthropic would probably need to extend its product lineup to include voice mode\n\n8) There was a lot of talk in the IO announcement video about the laptop and phone being out of date; I disagree, at least with respect to the smartphone\n\n9) Smartphones are just a portable computer with a full display and connection to cellular service; they have audio in and audio out and video in\n\n10) We can imagine the UI/UX changing as AI gets more powerful (which is actually very likely) but the actual physical form factor feels solid\n\n11) Part of the imagination seems to be around everything around you being recorded all the time so that AI has context on what you do\n\n12) This is still difficult from the perspective of social norms and I can imagine people not liking everyone around them recording everything all the time\n\n13) Apple is like Google in certain respects; a lot of technical talent and capital; decent ML chops (nowhere near DeepMind) and silicon\n\n14) Apple will be able to build competitive AI models if they like; and it's not clear that personal assistants are going to need to be AI 
Scientist level\n\n15) The real difficulty will be around productization, here OpenAI definitely has an advantage, Apple is used to shipping very slowly\n\n16) OpenAI has other headwinds here though; Apple phones are a status symbol; it's unclear that OpenAI can replicate that / what the popular perception will be of AI\n\n17) I can see a world where OpenAI's branding ends up being a bit controversial and this makes it harder for them to tackle the smartphone market\n",
    "tweet_id": "1925675838717989074",
    "note_id": "1925675838491500544",
    "tweet_url": "https://x.com/fleetingbits/status/1925675838717989074",
    "created_at": "2025-05-22T22:11:11.000Z",
    "length": 2420,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "compute",
      "consumer"
    ],
    "title": "Some thoughts on OpenAI's IO acquisition",
    "snippet": "1) OpenAI purchased IO for $6.5b; the obvious target is the smartphone market 2) The US market is basically divided between Apple, Samsung and Google; Apple is $70b revenue, Samsung is $20b revenue and Google is $4.5b revenue 3) Apple dominates in terms of profit with a $24b profit, Samsung has only $2.4b profit and Google just about breaks even 4) This means that taking the smartphone market is essentially competing either market share or pricing power away from Apple"
  },
  {
    "body": "Some extended thoughts on Gemini Ultra\n\n1) Gemini Ultra feels very impressive but does not feel quite worth the money; $250 / month is too much\n\n2) Gemini 2.5 Pro feels a bit underneath o3 for daily driver tasks; GPT-4.5 is better for synchronous high taste model experience\n\n3) The OpenAI UI/UX just feels a tad bit better; the personalization; the way o3 writes its thinking; the quick click to Sora, Operator and Codex\n\n4) I think it could justify $140/month; they bundle YouTube premium with it for some reason, so let's say $150/month\n\n5) But at $250/month; it just doesn't offer anything more that is useful over ChatGPT Pro and is $50/month more \n\n6) They should actually bundle it with Colab Pro for an extra $50/month and then integrate them; that would be worth every dollar of $200/month\n\n7) In any event, the main issue is integration; if Gemini Ultra integrated across all of Google's product line; it would be completely worth it\n\n8) But, right now, it all feels hobbled; I can't ask it to open Google Docs or Flow or Mariner from Gemini itself or Colab either; so, it's more clicking around\n\n9) This is a discovery problem: \"does this thing have a code editor?\" but also a mental burden, because I drop out of flow a bit each time I navigate to a new site\n\n10) And then, even within the products, once I navigate to them, it doesn't have the level of control over them that I want, to make it feel like a really integrated experience\n\n11) So for example, when I navigate to Google Docs and begin chatting with it; it won't just insert its text in the document; it breaks flow\n\n12) They also don't make it clear to me that I have Gemini Ultra in Docs; so, I don't know if I'm getting the best answers or should I be cutting and pasting\n\n13) Mariner is just a preview but it's the first web operator that can be used to do anything useful; I had it find a flight for me today and it worked\n\n14) I then actually booked the flight; so, it worked; 
but when I tried to have it do Amtrak or find a hotel it failed; it also failed on the Password game because it can't handle multiple tabs (?)\n\n15) Gemini gen also isn't as good as GPT-4o images; it's not bad, but it's not as good or as steerable \n\n16) All this said, I think Google is slowly getting over its cultural limitations that have hindered it in the LLM space; the best LLMs cannot be provided for free\n\n17) A premium price plan was a big step up for Google; they just don't think in these terms; Google thinks in terms of owning the whole market, serving everyone the same good, and monetizing differentially with ads\n\n18) I think that integrating all the products together is going to be more in their intellectual mindset; every big company thinks in terms of cross sale and product integration\n\n19) But, they will have to become even less risk-averse and really force integration between teams, which will be hard; I hope they can do it\n\n20) Sidenote, labs need to give up with previews, at least at the premium pricing end, anyone paying $200/mo for an LLM good can handle that the thing is flaky if it's a new release\n\n21) Where does this leave Anthropic? I guess we will see over the next week or so; I think their strategy is really some version of coding -> automate research -> ride the acceleration\n\n22) But, share of the consumer mind matters, Google has shown it's a player on the tech side and even on the productization side and it can always raise capital to compete\n\n23) The real question is what Google will do as ChatGPT begins to cut into Google search; will they move Gemini to the front page?\n",
    "tweet_id": "1925383865675231297",
    "note_id": "1925383865373237248",
    "tweet_url": "https://x.com/fleetingbits/status/1925383865675231297",
    "created_at": "2025-05-22T02:51:00.000Z",
    "length": 3582,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "coding",
      "consumer"
    ],
    "title": "Some extended thoughts on Gemini Ultra",
    "snippet": "1) Gemini Ultra feels very impressive but does not feel quite worth the money; $250 / month is too much 2) Gemini 2.5 Pro feels a bit underneath o3 for daily driver tasks; GPT-4.5 is better for synchronous high taste model experience 3) The OpenAI UI/UX just feels a tad bit better; the personalization; the way o3 writes its thinking; the quick click to Sora, Operator and Codex 4) I think it could justify $140/month; they bundle YouTube premium with it for some reason, so let's say $150/month"
  },
  {
    "body": "thoughts on AI for policymaking in Congress; notes from a RAND event\n\n1) Congressional staffers are able to use ChatGPT, Claude and Gemini (but differs between House and Senate)\n\n2) It seems some members do not allow them to use it though because of fear of data leakage to the opposition party\n\n3) Congressional staffers get basically no time with their Congressperson; 15 minutes a week\n\n4) Congressional staffers get basically no time to review issues or legislation (in some cases, something like 48 hours)\n\n5) Laws are drafted on vibes; Congressional staffers don't trust data; they are not data analysts; they believe it's biased\n\n6) Congressional staffers need to model the thoughts of their: (1) Congressperson, (2) relevant advocacy groups, (3) own parties and opposition parties.\n\n7) Public statements from advocacy groups are helpful about direction, but not about prioritization; the advocacy groups might have very different prioritization behind closed doors. \n\n8) Congressional networks rely on private relationships to guide policy; \"he might not be an expert, but I know he'll be straight with me\"\n\n9) So, research tools are very hard to sell Congress; they don't actually really use or need them; they care much more about formatting and opinion modeling\n\n10) Each Congressperson's office having a private RAG over all their documents and prior statements was seen to be a useful tool for staffers (more than research)\n\n11) AI is not the most important tech issue in Congress right now; it's actually the sale of DoD spectrum to pay for tax cuts\n\n12) Apparently, spectrum is insanely valuable and the sale could net something like $80bn; but the DoD says they need it for defense\n\n13) The perception of AI on Capitol Hill right now; aware, it's a thing, it does cool stuff; not scaling pilled, not most important thing to ever happen\n\n14) The China question is very prevalent on Capitol Hill; US competition with China is everywhere; you 
can't go to an event without people talking about China\n\n15) I heard someone say the goal of safety was to \"Not die and beat China\" - I can't tell if this was messaging that was tailored for the audience\n\n16) The government, think-tanks are bad customers because the TAM is so low and, for government, the sales cycle is such a pain\n",
    "tweet_id": "1923490823821226111",
    "note_id": "1923490823603113985",
    "tweet_url": "https://x.com/fleetingbits/status/1923490823821226111",
    "created_at": "2025-05-16T21:28:43.000Z",
    "length": 2288,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "enterprise",
      "safety",
      "legal"
    ],
    "title": "thoughts on AI for policymaking in Congress; notes from a RAND event",
    "snippet": "1) Congressional staffers are able to use ChatGPT, Claude and Gemini (but differs between House and Senate) 2) It seems some members do not allow them to use it though because of fear of data leakage to the opposition party 3) Congressional staffers get basically no time with their Congressperson; 15 minutes a week 4) Congressional staffers get basically no time to review issues or legislation (in some cases, something like 48 hours)"
  },
  {
    "body": "I think I would be interested in a benchmark with the following characteristics (not sure if possible):\n\n1) the benchmark is adversarial in the sense that models compete with one another\n\n2) models propose the questions and topics that are the subject of the benchmark\n\n3) the benchmark scales arbitrarily far such that there is no limitation to performance\n\n4) the benchmark works for models with capabilities at or above GPT-3.5 levels\n\n5) the benchmark is shown to correlate well to other benchmarks that are commonly reported\n\n6) the benchmark is shown to correlate well to other benchmarks that we care about\n\n7) running cost per model is at a level that is achievable for a moderately funded non-profit\n\n8) the benchmark uses LLM as judge in a way that is not easily game-able for new models\n\n9) the benchmark is easy to set up and does not require a complex environment\n",
    "tweet_id": "1923004731183489463",
    "note_id": "1923004731070316544",
    "tweet_url": "https://x.com/fleetingbits/status/1923004731183489463",
    "created_at": "2025-05-15T13:17:10.000Z",
    "length": 876,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "evals"
    ],
    "title": "I think I would be interested in a benchmark with the following characteristics (not sure if possible):",
    "snippet": "1) the benchmark is adversarial in the sense that models compete with one another 2) models propose the questions and topics that are the subject of the benchmark 3) the benchmark scales arbitrarily far such that there is no limitation to performance 4) the benchmark works for models with capabilities at or above GPT-3.5 levels"
  },
  {
    "body": "Notes from lunch with a former law professor and very successful lawyer, who is interested in getting involved in AI risk:\n\n1) Intellectuals and policy makers, outside of San Francisco, don't have much exposure to AI issues right now\n\n2) Things like AI 2027 actually require too much background information, more general efforts and explainers are needed\n\n3) Public education needs to be in places intellectuals go like the NYT and needs to cover a lot of different dimensions, including compute, export controls, lab governance, dangerous capabilities, etc...\n\n4) Benchmarks are helpful so as to explain capabilities but more positive plans are needed, especially with respect to lab governance; the regulatory side is something these folks understand better\n\n5) The intellectual class doesn't have much background on Effective Altruism or its relationship to AI safety; maybe this is a good thing, unsure\n\n6) Intellectuals are still at least somewhat focused on issues like climate change, I actually think they will engage better on issues downstream of AI\n\n7) I expect the intellectual class response to be strong on issues like bio-risk, cyber-risk and even loss of control; I expect it to be poor on intelligence explosion\n\n8) A lot of it is because they already have a lot of background on the former, just much less background on the latter\n",
    "tweet_id": "1922793053661671637",
    "note_id": "1922793053535776768",
    "tweet_url": "https://x.com/fleetingbits/status/1922793053661671637",
    "created_at": "2025-05-14T23:16:02.000Z",
    "length": 1348,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "compute",
      "evals",
      "safety",
      "legal",
      "bio",
      "agi"
    ],
    "title": "Notes from lunch with a former law professor and very successful lawyer, who is interested in getting involved in AI risk:",
    "snippet": "1) Intellectuals and policy makers, outside of San Francisco, don't have much exposure to AI issues right now 2) Things like AI 2027 actually require too much background information, more general efforts and explainers are needed 3) Public education needs to be in places intellectuals go like the NYT and needs to cover a lot of different dimensions, including compute, export controls, lab governance, dangerous capabilities, etc... 4) Benchmarks are helpful so as to explain capabilities but more positive plans are needed, especially with respect to lab governance; the regulatory side is something these folks understand better"
  },
  {
    "body": "some thoughts from talking with a friend yesterday about model architecture and biology\n\n1) the human brain seems more like a collection of small models with inductive biases that share some mutual embedding space\n\n2) the models we use today still have inductive biases (like local attention, as opposed to all layers having global attention)\n\n3) it's hard to tell if brain structures are just path dependent or whether they could converge\n\n4) the brain could at the least copy existing structures and then use them for other functions, good sign that it is not just path dependent\n\n5) our models look too regular; this makes it feel like current model architectures are a result of human cuda / gpu optimizations as opposed to actually being best\n\n6) shouldn't we expect small parts of models, even within a layer, to reflect different inductive biases? from an architectural point of view?\n\n7) it might be that models just learn relationships that resemble these biases, but that leads to wasted compute, assuming you could optimize everything correctly\n\n8) some of this might just be found in model optimization after training like weight pruning or quantization, but not clear\n\n9) would be interesting to read some papers about co-training the architecture with the model; maybe this is hard due to the idea that we want to de-risk architectural changes at smaller scale\n\n10) still feels like something is missing from a meta level, once optimized things normally appear less regular, modern model architecture still seems very regular\n",
    "tweet_id": "1922297844226253149",
    "note_id": "1922297844117209090",
    "tweet_url": "https://x.com/fleetingbits/status/1922297844226253149",
    "created_at": "2025-05-13T14:28:15.000Z",
    "length": 1539,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "compute",
      "pretraining",
      "bio"
    ],
    "title": "some thoughts from talking with a friend yesterday about model architecture and biology",
    "snippet": "1) the human brain seems more like a collection of small models with inductive biases that share some mutual embedding space 2) the models we use today still have inductive biases (like local attention, as opposed to all layers having global attention) 3) it's hard to tell if brain structures are just path dependent or whether they could converge 4) the brain could at the least copy existing structures and then use them for other functions, good sign that it is not just path dependent"
  },
  {
    "body": "Can you expand on this? \n\nBecause, when I read it, my reactions are:\n\n 1) what does it even mean to remove a Wikipedia from the web? how many IS points are there total? how significant is 1 IS point?\n\n2) 1% revenue decline shows degrading search causes a loss of revenue; \n\n3) there is a real perceived issue with search, it has gotten worse, in other Google products like YouTube, the search isn't even real now - 5-10 search results and the rest is just algo suggestions\n",
    "tweet_id": "1921656702161826016",
    "note_id": "1921656702107353088",
    "tweet_url": "https://x.com/fleetingbits/status/1921656702161826016",
    "created_at": "2025-05-11T20:00:35.000Z",
    "length": 472,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "google",
      "consumer"
    ],
    "title": "Can you expand on this?",
    "snippet": "Because, when I read it, my reactions are: 1) what does it even mean to remove a Wikipedia from the web? how many IS points are there total? how significant is 1 IS point? 2) 1% revenue decline shows degrading search causes a loss of revenue; 3) there is a real perceived issue with search, it has gotten worse, in other Google products like YouTube, the search isn't even real now - 5-10 search results and the rest is just algo suggestions"
  },
  {
    "body": "Some thoughts from an LLM reading group tonight  \n\n1) We are reading Ilya Sutskever's 30 papers to understand machine learning; couldn't believe how old they felt; today was the first 4  \n\n2) I sort of missed the LSTM era; back then I was using gradient boosted trees; but LSTMs seem super complex and byzantine  \n\n3) I realized that I never really understood why people did dropout; it wasn't preventing overfitting on features since it applied to the whole network  \n\n4) Means the point of dropout must really be more robust representations or like more general representations for concepts; but obviously at the cost of training efficiency  \n\n5) Maybe dropout was also about causing the model to learn many circuits to reach the same result; would be interesting to look up mech interp research on this  \n\n6) I guess it was also a bit of a data augmentation strategy; you can put the same input through the model multiple times and it learns different things or gets different intermediate representations of it  \n\n7) I was struck by how much people sort of cherry picked / overhyped their results; a couple of lines of weak Shakespeare was described as text generation  \n\n8) We read that complexodynamics blog post; the one that described complexity as the region between two extremes of entropy; I didn't mull over it, but couldn't really see the point  \n\n9) I guess that predicting things with medium entropy, e.g. the world we live in requires a larger model than predicting more regular or more random things; is this supposed to be a compression <> learning idea?  \n\n10) Maybe this is related to something like small reasoning models, which are lower entropy than general models and which can therefore be smaller - but do we really need entropy to reason this out? probably not.  
\n\n11) Transformers really are just enormously elegant; maybe they have some issues like attention sinks or difficulties with very long context or bad complexity characteristics for attention, but on the whole they are super elegant as a way to solve the problem \n\n12) Visualizations have improved a lot; I look forward to models being made even more accessible to lay readers in the future\n",
    "tweet_id": "1919624758762865010",
    "note_id": "1919624758595051520",
    "tweet_url": "https://x.com/fleetingbits/status/1919624758762865010",
    "created_at": "2025-05-06T05:26:22.000Z",
    "length": 2179,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "compute",
      "pretraining",
      "interpretability"
    ],
    "title": "Some thoughts from an LLM reading group tonight",
    "snippet": "1) We are reading Ilya Sutskever's 30 papers to understand machine learning; couldn't believe how old they felt; today was the first 4 2) I sort of missed the LSTM era; back then I was using gradient boosted trees; but LSTMs seem super complex and byzantine 3) I realized that I never really understood why people did dropout; it wasn't preventing overfitting on features since it applied to the whole network 4) Means the point of dropout must really be more robust representations or like more general representations for concepts; but obviously at the cost of training efficiency"
  },
  {
    "body": "Some quick thoughts on OpenAI's $3bn Windsurf acquisition \n\n1) OpenAI and Anthropic represent two very different approaches to business strategy around AGI and ASI\n\n2) OpenAI meets customers where they are today, acquires users, and then seeks to upsell them on more products in the future\n\n3) This lets OpenAI prove revenue today, which helps it raise at higher valuations, which gives it access to more compute\n\n4) But, it also means that OpenAI ends up spending compute on things that may not immediately lead to AGI like Sora and voice mode\n\n5) Anthropic focuses on the API and seems to treat its other products like a research preview for AGI; all of its products lack features\n\n6) So far, Anthropic can still very successfully raise, and avoids spending compute on side projects, and this leaves it more room to spend compute on things like mech interp\n\n7) There are arguments that in an AGI world, distribution is the most important thing; Windsurf at $3bn seems like part of that story\n\n8) It seems that the Anthropic view is closer to the original OpenAI view, when you get AGI, it will just be worth money\n\n9) There is a question on where this leaves Cursor; Cursor only really started enterprise sales a few months ago, OpenAI will have distribution advantages\n\n10) Cursor's best bet is that OpenAI doesn't end up with the best model; it's Anthropic or DeepMind, and those companies don't have a competitive integrated IDE\n\n11) That or the baton shifts between them frequently enough that it is better to use Cursor as an intermediary rather than switch lab products\n\n12) In any event, I'm skeptical, my gut is that share of the consumer mind and sales motion wins out and that Cursor ends up with slower growth and eventually has to find a niche\n",
    "tweet_id": "1919588796762619929",
    "note_id": "1919588796645126144",
    "tweet_url": "https://x.com/fleetingbits/status/1919588796762619929",
    "created_at": "2025-05-06T03:03:28.000Z",
    "length": 1757,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "compute",
      "coding",
      "consumer",
      "agi"
    ],
    "title": "Some quick thoughts on OpenAI's $3bn Windsurf acquisition",
    "snippet": "1) OpenAI and Anthropic represent two very different approaches to business strategy around AGI and ASI 2) OpenAI meets customers where they are today, acquires users, and then seeks to upsell them on more products in the future 3) This lets OpenAI prove revenue today, which helps it raise at higher valuations, which gives it access to more compute 4) But, it also means that OpenAI ends up spending compute on things that may not immediately lead to AGI like Sora and voice mode"
  },
  {
    "body": "Some thoughts on the new proposed OpenAI corporate structure\n\n1) It sounds like a lot of the actual difficulty associated with the old structure came from the profit waterfall rather than nonprofit board control\n\n2) The profit waterfall was this thing where investor returns were capped, you lost your stock after you got your returns, employee returns were capped, etc...\n\n3) Sam probably wanted to get rid of the nonprofit board control - because it's most threatening to him as CEO rather than something threatening to the investors\n\n4) OpenAI will end up with a similar control structure to Anthropic +/- some details on both sides\n\n5) On the Anthropic side, they have board members appointed by the Long Term Benefit Trust (which can be dissolved by the Anthropic founders)\n\n6) On the OpenAI side, they will have board members appointed by the nonprofit, but the non-profit is likely to be ideologically steady \n\n7) The Anthropic board structure looks more potentially ideologically driven, in a way that could affect investors, but has an escape valve; the proposed OpenAI board structure looks less driven, but no escape valve.\n\n8) Side note, the main thing a public benefit corporation does is make it so that the board can consider social good in addition to investor profit and the board isn't liable for investor losses if it makes an honest attempt to balance these competing objectives \n\n9) That basically means - for all of this - that both OpenAI and Anthropic have boards with traditional-ish board duties, but not as strict, they still are supposed to consider investor return\n",
    "tweet_id": "1919475436633129262",
    "note_id": "1919475436486328324",
    "tweet_url": "https://x.com/fleetingbits/status/1919475436633129262",
    "created_at": "2025-05-05T19:33:00.000Z",
    "length": 1593,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "lab economics",
      "legal"
    ],
    "title": "Some thoughts on the new proposed OpenAI corporate structure",
    "snippet": "1) It sounds like a lot of the actual difficulty associated with the old structure came from the profit waterfall rather than nonprofit board control 2) The profit waterfall was this thing where investor returns were capped, you lost your stock after you got your returns, employee returns were capped, etc... 3) Sam probably wanted to get rid of the nonprofit board control - because it's most threatening to him as CEO rather than something threatening to the investors 4) OpenAI will end up with a similar control structure to Anthropic +/- some details on both sides"
  },
  {
    "body": "some thoughts on the intelligence explosion from an sf get together tonight\n\n1) we should expect that we will have automated researchers in the next 2-6 years\n\n2) current researchers pick experiments, set up the experiments, run the experiments; automated researchers have to do all of this + spend some compute on their own operations\n\n3) one question for research progress is how much room is there to pick better experiments\n\n4) good proxy for how much this can change progress, what is the difference, in terms of compute efficiency, between the best lab researchers and average lab researchers\n\n5) if the best lab researcher is 10x the median lab researcher then there is a lot of headroom at the top, if the best researcher is only 2x better then not so much\n\n6) another question is how much can we improve setup time for experiments / run the experiments\n\n7) it sounds that the answer for this is that we are not really bound by setup time\n\n8) a researcher at a lab needs to spend more time running the experiment than setting up the experiment, which means that they can interleave their setup time with waiting for results\n\n9) running the experiment is expensive though, in terms of research compute, and this gets larger as models get larger, because you need larger experiments to validate the effects at larger scales\n\n10) basically this implies that even if experiment setup costs go to zero, so long as models get bigger, you can't do that many more experiments than you currently do\n\n11) so, it means that almost all of your gains come from selecting better experiments, this might mean better incremental experiments within a domain, or finding experiments that can offer revolutionary advantage\n\n12) there might be a middle ground, where you can find some experiments, which are cheaper to validate at scale, and thus run more experiments, but you are still basically research compute bottlenecked\n\n13) if the best researchers at labs are 
10x researchers in terms of their compute efficiency then we can expect automated researchers to speed things up much more than the alternative\n\n14) it seems like a lot of this depends on whether model size keeps increasing, you can afford a lot more experiments at a smaller model size, and maybe at that stage setup costs come back into play\n\n15) it also seems that it is relevant that the actual running compute costs of the automated researcher matter, there is a tradeoff between spending inference time compute on the researcher and on the experiment\n\n16) given how expensive the experiments become though, it feels like this cost will be a small part of the overall compute cost, and you will be able to afford a lot of thinking cost (hard to imagine > 5% of total compute)\n\n17) so our real questions around intelligence explosion should really be: how much more efficient are automated researchers in picking experiments over human researchers today?\n\n18) and our proxy should be, how much better are the best lab researchers today compared to the median lab researcher (maybe we get superhuman researchers, but maybe there are fewer high quality ideas, so the above is a good starting point)\n\n19) if the best researchers are 10x better than average then we could expect 10x lab productivity per year, if 2x then more like 2x\n",
    "tweet_id": "1919272341613232466",
    "note_id": "1919272341428645888",
    "tweet_url": "https://x.com/fleetingbits/status/1919272341613232466",
    "created_at": "2025-05-05T06:05:59.000Z",
    "length": 3289,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "lab economics",
      "compute",
      "agi"
    ],
    "title": "some thoughts on the intelligence explosion from an sf get together tonight",
    "snippet": "1) we should expect that we will have automated researchers in the next 2-6 years 2) current researchers pick experiments, setup the experiments, run the experiments; automated researchers have to do all of this + spend some compute on their own operations 3) one question for research progress is how much room is there to pick better experiments 4) good proxy for how much this can change progress, what is the difference, in terms of compute efficiency, between the best lab researchers and average lab researchers"
  },
  {
    "body": "Just my quick brush thoughts on this discussion (was about cooperative AI agents):\n\n1) We should think about how initial cooperative AI agents will be created\n\n2) We will create RL environments to train models to coordinate with one another towards goals - immediately larger models and smaller models as tools\n\n3) We will also create RL environments to train models to achieve individual goals without interfering with one another or harming global goals\n\n4) AI agents will still have information transfer bottlenecks, but these will be much less than those with humans\n\n5) These kinds of information transfer bottlenecks lead to ideas like local areas of responsibility, some form of price mechanism or equivalent, etc...\n\n6) AI agents will be less selfish and so will (almost) entirely avoid a whole class of human coordination problems that result from principal-agent problems\n\n7) The real question should always be what will the loss function be and what will the reward be; since we should pose our problems as training problems\n\n8) This stuff will - at least on current paths - be determined by the big labs in small ways and eventually governments in bigger ways\n\n9) There will still probably have to be some system of adjudication and open rules that are updated between agents, whether those live in weights or prompts may be irrelevant\n\n10) Rules at different levels of generality for agents may be set by different bodies (agents themselves, labs, governmental departments, legislature) etc...\n\n11) There is probably also an important question of trust between agents, which exists regardless of whether the agents are AI or not\n\n12) Agents might need to sign that they are associated with some authority in order to interact with other models, make payments, etc...\n\n13) It doesn't make sense to think about models through the lens of natural selection because models are designed by humans and don't reproduce (at least right now)\n\n14) There 
is this tight relationship in the natural world with respect to a population and the next generation - models have a much less tight relationship\n\n15) There is probably some relationship; but one generation of models could produce another with very different architectures, values, etc...\n\n16) We may also get better at designing models - through things like mechanistic interpretability; it's not obvious to me which way this goes\n\n17) It's interesting that people seem to think about all selection as \"natural selection\" - maybe so, but what does \"natural\" add in that case to the word \"selection\"\n\n18) And, it's not clear that we will be terrible (as a society) at deciding which models are good for us as humans generally\n\n19) In any event, given current human social structures, I think we have at least a century of human control of society to figure this out (my thought is democratic institutions are reasonably sticky)\n\n20) Not too much can be read into the 4o sycophancy event - it was a bad product release - OpenAI also isn't going to be transparent with us as to what happened\n\n21) e.g. 
the Product team thought the model was super cool and Sam tried it and loved it - not going to be the public announcement\n\n22) There was some discussion of intelligence - I think we should just care about loss functions - what tasks can the agent do?\n\n23) I think the way to think about the intelligence of V3 is the categorical cross entropy loss over the pretraining dataset and R1 is the mathematical problems it can solve, etc...\n\n24) I've read cool things about intelligence - like the book Gradient Expectations - but like consciousness, I don't think it's a fruitful direction of discussion\n\n25) It seems to me to be something downstream of some other behavior or judgment - the model performs well on some span of tasks - it's intelligent!\n\n26) There was some discussion about the model imitating humans - but this is really just downstream of LLM pretraining and maybe a bit RLHF\n\n27) We may see the models get farther from recognizable human thought as we have more data that is less human produced / curated and RL verifiable rewards and probably also Coconut\n\n28) There is probably some corporate / human pressures around keeping the LLMs tied to human thought and behavior - we may want to keep the CoT observable in natural language, etc...\n\n29) This above line of inquiry is something that I'm less sure of - I'm not even sure the above around human thought is a well formulated question - would need to think about it more\n\n30) Is the Claude o3 issues a value specification issue or an instruct issue; probably an instruct issue\n\n31) In any event, our goal is going to be able to make our verifiers more perfect, and figure out stuff like competitive models to better avoid RL crushing imperfect verifiers\n\n32) I'm assuming that this is mostly a solvable problem, because humans seem to be decent at it, and we have so much access to models - their internal states, etc...\n\n33) Sidenote, I think some number of people in the call would enjoy listening to the 
Terence McKenna chill step about language as an autonomous entity\n",
    "tweet_id": "1917748735339290856",
    "note_id": "1917748734928248833",
    "tweet_url": "https://x.com/fleetingbits/status/1917748735339290856",
    "created_at": "2025-05-01T01:11:43.000Z",
    "length": 5098,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/leebriskcyrano/status/1916992844465901996"
    ],
    "tags": [
      "openai",
      "anthropic",
      "post-training",
      "interpretability",
      "evals",
      "safety",
      "agi"
    ],
    "title": "Just my quick brush thoughts on this discussion (was about cooperative AI agents):",
    "snippet": "1) We should think about how initial cooperative AI agents will be created 2) We will create RL environments to train models to coordinate with one another towards goals - immediately larger models and smaller models as tools 3) We will also create RL environments to train models to achieve individual goals without interfering with one another or harming global goals 4) AI agents will still have information transfer bottlenecks, but these will be much less than those with humans"
  },
  {
    "body": "Some notes on how I generate my images:\n\n1) most works are generated with a reference style work used to get ideas\n\n2) I put the reference style work into o3 or 4.5 and have a discussion around the style\n\n3) I get a 2-3 paragraph description of the style from o3 or 4.5 and put the image that I want at the top\n\n4) I then go to Midjourney for generation; I generate 20 images for every 1 that I share\n\n5) I use Midjourney because Midjourney does complex images much better than 4o; I occasionally use 4o for simpler images where the text matters\n\n6) I \"overload the prompt\"; this means that I put way more in the prompt than the model can generate, this is okay, I find that it aids in creating interesting images\n\n7) If the image isn't doing something that you need it to do, you should emphasize the most important things at the top of the prompt\n\n8) Some styles, effects, etc... are basically inaccessible; this is okay, different generations of models were better at different things\n",
    "tweet_id": "1917303967681503350",
    "note_id": "1917303967555678208",
    "tweet_url": "https://x.com/fleetingbits/status/1917303967681503350",
    "created_at": "2025-04-29T19:44:22.000Z",
    "length": 987,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "consumer"
    ],
    "title": "Some notes on how I generate my images:",
    "snippet": "1) most works are generated with a reference style work used to get ideas 2) I put the reference style work into o3 or 4.5 and have a discussion around the style 3) I get a 2-3 paragraph description of the style from o3 or 4.5 and put the image that I want at the top 4) I then go to Midjourney for generation; I generate 20 images for every 1 that I share"
  },
  {
    "body": "some thoughts on perplexity and the AI market\n\n1) perplexity is being potentially valued at 18b; tried to figure out why; I think it has to do with search TAM and perplexity's surprising name recognition\n\n2) perplexity has a higher search frequency than anthropic according to google trends; perplexity actually has a not insignificant number of searches compared to openai\n\n3) Google has a market cap of $2tn; put these facts together and you can see why investors think an 18bn valuation for perplexity might not be out of the question\n\n4) I think where this analysis for perplexity's value fails though is in the importance of the model in providing the replacement for search experience (see o3)\n\n5) in addition, I suspect there will be fewer frontier open source models that are competitive with proprietary models - suddenly perplexity's position looks much worse\n\n6) my gut is perplexity sells to its investors that it can use open source models to equal OpenAI with just some fine tuning domain specific to search and as a result - perplexity will be a more capital efficient challenge to Google\n\n7) I think another problem here is that OpenAI is more compelling to the AI crowd and Google is more compelling to the ordinary people crowd; and each of them squeeze perplexity\n\n8) Google is getting ready to push AI to the front page; I don't mean just some blurb either; it's why they have added \"AI mode\" to search results on the top bar\n\n9) I don't know how Google will figure out advertising around it; they have to in order to do it; but I think it's very possible; maybe steering vectors; but probably just like a side bar or something\n\n10) OpenAI and Anthropic are adding web search too and they have better name recognition and better access to the market that perplexity needs to win (perplexity's market trusts openai and anthropic more than perplexity itself)\n\n11) (btw who is Anthropic's web search partner? 
- actually extremely interesting question - Brave, I think - aren't they just using the Google API?)\n\n12) I see perplexity and cursor as parallel lives - early players - fast to revenue - but in the same lane as the foundation labs\n\n13) the below thoughts are a bit speculative:\n\na) these companies might be bets on a slow takeoff to some degree - in a fast takeoff - these models stop getting open sourced soon (too dangerous) - and then most of the wins go to the firms that can train models end to end\n\nb) in a slow take-off, meta open sources for longer, less returns to scale - domain specific RL is more about data acquisition - OpenAI / Cursor / Perplexity end up paying a lot to mercor - services are closer - more cost competition - lower cost basis benefits Cursor / Perplexity\n\n14) Cohere is probably another company sort of in this early boat - not a foundation lab - but competing in a similar lane\n",
    "tweet_id": null,
    "note_id": "1903261490851053568",
    "tweet_url": null,
    "created_at": "2025-03-22T01:44:34.000Z",
    "length": 2837,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "lab economics",
      "compute",
      "wrappers",
      "enterprise",
      "consumer",
      "post-training"
    ],
    "title": "some thoughts on perplexity and the AI market",
    "snippet": "1) perplexity is being potentially valued at 18b; tried to figure out why; I think it has to do with search TAM and perplexity's surprising name recognition 2) perplexity has a higher search frequency than anthropic according to google trends; perplexity actually has a not insignificant number of searches compared to openai 3) Google has a market cap of $2tn; put these facts together and you can see why investors think an 18bn valuation for perplexity might not be out of the question 4) I think where this analysis for perplexity's value fails though is in the importance of the model in providing the replacement for search experience (see o3)"
  },
  {
    "body": "some thoughts on open models\n\n1) there are a lot of companies that want open models for various reasons, one group is people who want more businesses to train / fine tune models and want to sell to people doing so\n\n2) these are companies like Scale (wants to sell data to more people), Nvidia (wants to have more than 6 GPU customers), also companies like Databricks (wants good base models to fine tune for customers)\n\n3) there are also players that want to commoditize the competition like DeepSeek and Meta - DeepSeek wants to win on inference skill (and get the attention of the Chinese government)\n\n4) Meta wants to commoditize the competition - it wants social to be a game won by distribution, not by technical innovation, which it would be slower to deploy and utilize (cf Character)\n\n5) Something to notice about the difference between the people that want to build the training market and the players that want to commoditize the competition - is how much capital they are willing to deploy\n\n6) People that want to sell to companies doing training don't want to invest enormous amounts of capital into training models; they want to contribute tools and research\n\n7) The commoditize the competition game is willing to spend more money to build / defend their core businesses, where they make their money; the advantage to them is less diffuse\n\n8) Meta has a $1.5tn business to defend; DeepSeek has a $1.5tn business to build (get attention of Chinese government, win on inference, get preferential relationship with Huawei, etc...)\n\n9) The sell-to players can't afford to capitalize the foundation models; they just are not enough monopolies / the benefits are too diffuse through their market segment / but they contribute research\n\n10) When they contribute research, they can use it to attract the players that they want to do training with them or to buy training supplies from them; it's actually a kind of marketing content\n\n11) It's also good comp 
for the employees; since you are not going to necessarily make the same money at Meta vis-a-vis OpenAI - but you get to publish\n\n12) Companies do the same when they let their employees do open source - it's comp - you should see open source as comp\n\n13) One real question about the future of open models is whether the market contracts when regulation sets in - it will be in less interest for Meta to release models when the CBRN risks and resulting regulation are more substantial\n\n14) Another question is what happens when DeepSeek actually gets pre-eminence in the Chinese market - at some point it becomes less worthwhile for them to open source and more valuable to recoup the capital costs\n\n15) Because, at that point - they have the reputation, they have the market, they hopefully are the favored national champion for the Chinese government - suddenly recouping the capital costs make more sense?\n\n16) We can imagine counter arguments (Chinese government wants DeepSeek to open source for emerging markets or disrupt US AI companies - essentially state policy)\n\n17) All of this said - it's very hard to get sober analysis of open source AI - there are a lot of people that just want to believe in it - very similar with Chinese AI - hard to get serious commentary\n\n18) I think there is a market for a fully open model, data released, fully audited, etc... 
for certain customers, needs to have FedRAMP certification, supply chain will matter (hard to figure out how to get the money from it)\n\n19) Feels like the world of Red Hat Linux / probably also has to be sold with a dedicated fine tuning infrastructure so companies can do domain specific RL on it \n\n20) Some of this might be supported by new regulations around critical infrastructure, Europe/US rivalries, things that drive larger enterprises to want to adopt AI in a particular package\n\n21) Would have to lag the market though - DeepSeek R1 level - (always a step or two behind the frontier) - to make sure it can be done at cost\n",
    "tweet_id": "1900022806798356778",
    "note_id": "1900022806475399168",
    "tweet_url": "https://x.com/fleetingbits/status/1900022806798356778",
    "created_at": "2025-03-13T03:15:12.000Z",
    "length": 3960,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "chinese labs",
      "meta",
      "neolabs",
      "lab economics",
      "compute",
      "enterprise",
      "post-training",
      "safety",
      "legal",
      "bio"
    ],
    "title": "some thoughts on open models",
    "snippet": "1) there are a lot of companies that want open models for various reasons, one group is people who want more businesses to train / fine tune models and want to sell to people doing so 2) these are companies like Scale (wants to sell data to more people), Nvidia (wants to have more than 6 GPU customers), also companies like Databricks (wants good base models to fine tune for customers) 3) there are also players that want to commoditize the competition like DeepSeek and Meta - DeepSeek wants to win on inference skill (and get the attention of the Chinese government) 4) Meta wants to commoditize the competition - it wants social to be a game won by distribution, not by technical innovation, which it would be slower to deploy and utilize (cf Character)"
  },
  {
    "body": "some thoughts on open ai's new tools\n\n1) openai has basically released three things: a search tool, a rag tool and a computer automation tool\n\n2) it's pretty clear that openai plans to go direct in large markets - search, software development, computer automation, personal assistant - using ChatGPT - long term, maybe also robotics, image/video generation\n\n3) in other markets, openai will sell tools to developers that they can use to tailor LLMs to specific verticals; openai plans to own the infrastructure layer\n\n4) this is bad for LLM infra providers - search (perplexity,  exa), retrieval (chroma, cohere, etc...), ocr (llama index, etc...)\n\n5) it's very hard to compete with foundation labs in their lane, because they have access to the models before you do, and have more advanced fine-tuning pipelines\n\n6) they also just know more about the models - they know the pretraining data, the post-training data, where they are likely to be weak / strong\n\n7) and, it's nearly impossible to do this yourself due to the high capital costs involved in model development (at the cheapest > $100m)\n\n8) so, players like perplexity - who are targeting the general market - are pretty much stuck - hard to compete with OpenAI's name recognition, will probably be legitimately behind on product, etc...\n\n9) best is to be building in a space that is complementary, domain specific - where you can plug in new products from the foundation lab to make your product better - but where you own the relationships, etc...\n",
    "tweet_id": "1899617355707359519",
    "note_id": "1899617355560587264",
    "tweet_url": "https://x.com/fleetingbits/status/1899617355707359519",
    "created_at": "2025-03-12T00:24:05.000Z",
    "length": 1509,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/OpenAIDevs/status/1899531225468969240"
    ],
    "tags": [
      "openai",
      "lab economics",
      "compute",
      "coding",
      "wrappers",
      "enterprise",
      "consumer",
      "post-training"
    ],
    "title": "some thoughts on open ai's new tools",
    "snippet": "1) openai has basically released three things: a search tool, a rag tool and a computer automation tool 2) it's pretty clear that openai plans to go direct in large markets - search, software development, computer automation, personal assistant - using ChatGPT - long term, maybe also robotics, image/video generation 3) in other markets, openai will sell tools to developers that they can use to tailor LLMs to specific verticals; openai plans to own the infrastructure layer 4) this is bad for LLM infra providers - search (perplexity,  exa), retrieval (chroma, cohere, etc...), ocr (llama index, etc...)"
  },
  {
    "body": "minimal product thoughts on foundation model APIs and AI dev products:\n\n1) There is no excuse in 2025 for not having an easily accessible version of all your docs in a single text file\n\n2) OpenAI, Anthropic, ChromaDB, everyone fails at this; a sign that no one knows what they are doing or really uses the products\n\n3) I just want to copy your docs, drop them in the model, and get the model to write the code\n\n4) If you have a model that people are going to code with  e.g. ChatGPT, Claude, Claude Coder, etc... there is no excuse that it not use your APIs well\n\n5) ChatGPT should not recommend users use 'gpt-4' or 'gpt-3.5-turbo'; Claude Coder shouldn't use some ancient version of Claude-Sonnet\n\n6) If you cannot keep the models updated or search is unavailable, there should be default models like \"gpt-low-compute\" or \"gpt-high-compute\" and they just update the pointer\n\n7) For products that will be used by novices, use the cheap default model; for stuff like Claude Coder, aimed at professionals, it can be whatever (or even just decided by the model in response to the task)\n\n8) Changing your APIs right now is a sign that you don't care about your users; if you change your API, it needs to be backwards compatible\n\n9) people are not going to look up your API, they are going to use the LLM code and it will fail and they are going to get frustrated \n\n10) this stuff is so minimal - and foundation labs get it so wrong - I feel like it is a sign of substantial organizational dysfunction / disrespect for users\n",
    "tweet_id": "1899220500536258889",
    "note_id": "1899220500381089792",
    "tweet_url": "https://x.com/fleetingbits/status/1899220500536258889",
    "created_at": "2025-03-10T22:07:07.000Z",
    "length": 1520,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "coding",
      "enterprise"
    ],
    "title": "minimal product thoughts on foundation model APIs and AI dev products:",
    "snippet": "1) There is no excuse in 2025 for not having an easily accessible version of all your docs in a single text file 2) OpenAI, Anthropic, ChromaDB, everyone fails at this; a sign that no one knows what they are doing or really uses the products 3) I just want to copy your docs, drop them in the model, and get the model to write the code 4) If you have a model that people are going to code with  e.g. ChatGPT, Claude, Claude Coder, etc... there is no excuse that it not use your APIs well"
  },
  {
    "body": "Here are my thoughts on the significance of Self Taught Automated Reasoners and what makes them work:\n\n(great paper from @gandhikanishk and @noahdgoodman!)\n\n1) One of the important questions around reasoning models is: why now? \n\n2) It seems people have been trying RL on language models using verifiable rewards for the past couple of years. Why didn't it work before?\n\n3) This paper suggests that the important thing that makes RL work is the capabilities of the base model. \n\n4) If the base model exhibits backtracking, verification, subgoal setting and backwards chaining, RL works. If it doesn't exhibit enough of these behaviors, RL fails.\n\n5) The paper tests this by using Qwen-2.5-3B and Llama-3.2-3B and training them on a basic math game called Countdown.\n\n6) This explains a theory as to why OpenAI got to reasoning models first - they had GPT-4 - and so they had the best base model and the only available base model with enough of these behaviors.\n\n7) It also explains why once RL worked for DeepSeek, it worked for everyone - the base models available to the market were just much better than they had been previously.\n\n8) Also, it turns out that once you have a reasoning model, you can SFT a weaker model on the right behaviors and RL will then work for it too.\n\n9) Looking forward, this suggests 3 things: (a) pre-training is not dead, (b) curricular learning is potentially a thing, and (c) we can get very powerful small reasoning models.\n\n10) First pretraining isn't dead - the paper identifies some behaviors that correlate with successful reasoning, but we don't know what the long tail of good behaviors looks like.\n\n11) Moreover, this is a bitter lesson thing - we don't want to try to just identify them - we want the models to show us what they are - R0 style. \n\n12) What does this mean? 
It means we want even more powerful base models to use for training stronger reasoning models; larger models will exhibit more of the behaviors we care about, which will make them more sample efficient for RL\n\n13) There might be some idea of curricular learning here, where we want to ensure that our base model has enough reasoning data from different domains - or enough different kinds of reasoning. (cf bitter lesson though)\n\n14) Once we have the powerful base model, and can do the sample efficient RL, we can SFT a smaller model on this and then use it for RL. One view is that this creates a large model -> RL -> distill paradigm.\n\n15) And, that matters too because we want to be able to do very fast and cheap inference on reasoning models (since we are going to generate so many tokens)\n\n16) Sidenote, this also suggests why Dario's focus was on DeepSeek v3 and not DeepSeek R1 when commenting on DeepSeek's progress - because the base model matters for sample efficiency.\n\n17) Anyways, the paper does have to replicate - it only tests two models, and 4 hand picked features - but it does seem to match what people said when the DeepSeek paper came out.\n",
    "tweet_id": "1898158464192725237",
    "note_id": "1898158463844507648",
    "tweet_url": "https://x.com/fleetingbits/status/1898158464192725237",
    "created_at": "2025-03-07T23:46:58.000Z",
    "length": 2974,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/gandhikanishk/status/1896988028893323675"
    ],
    "tags": [
      "openai",
      "anthropic",
      "chinese labs",
      "meta",
      "compute",
      "post-training",
      "pretraining",
      "evals",
      "math"
    ],
    "title": "Here are my thoughts on the significance of Self Taught Automated Reasoners and what makes them work:",
    "snippet": "(great paper from @gandhikanishk and @noahdgoodman!) 1) One of the important questions around reasoning models is: why now? 2) It seems people have been trying RL on language models using verifiable rewards for the past couple of years. Why didn't it work before? 3) This paper suggests that the important thing that makes RL work is the capabilities of the base model."
  },
  {
    "body": "Really interesting and important paper from @gandhikanishk and @noahdgoodman.  Some thoughts on Self Taught Automated Reasoners and what makes them work and it means:\n\n1) One of the important questions around reasoning models is: why now? \n\n2) It seems people have been trying RL on language models using verifiable rewards for the past couple of years. Why didn't it work before?\n\n3) This paper suggests that the important thing that makes RL work is the capabilities of the base model. \n\n4) If the base model exhibits backtracking, verification, subgoal setting and backwards chaining, RL works. If it doesn't exhibit enough of these behaviors, RL fails.\n\n5) The paper tests this by using Qwen-2.5-3B and Llama-3.2-3B and training them on a basic math game called Countdown.\n\n6) This explains a theory as to why OpenAI got to reasoning models first - they had GPT-4 - and so they had the best base model and the only available base model with enough of these behaviors.\n\n7) It also explains why once RL worked for DeepSeek, it worked for everyone - the base models available to the market were just much better than they had been previously.\n\n8) Also, it turns out that once you have a reasoning model, you can SFT a weaker model on the right behaviors and RL will then work for it too.\n\n9) Looking forward, this suggests 3 things: (a) pre-training is not dead, (b) curricular learning is potentially a thing, and (c) we can get very powerful small reasoning models.\n\n10) First pretraining isn't dead - the paper identifies some behaviors that correlate with successful reasoning, but we don't know what the long tail of good behaviors looks like.\n\n11) Moreover, this is a bitter lesson thing - we don't want to try to just identify them - we want the models to show us what they are - R0 style. \n\n12) What does this mean? 
It means we want even more powerful base models to use for training stronger reasoning models; larger models will exhibit more of the behaviors we care about, which will make them more sample efficient for RL\n\n13) There might be some idea of curricular learning here, where we want to ensure that our base model has enough reasoning data from different domains - or enough different kinds of reasoning. (cf bitter lesson though)\n\n14) Once we have the powerful base model, and can do the sample efficient RL, we can SFT a smaller model on this and then use it for RL. One view is that this creates a large model -> RL -> distill paradigm.\n\n15) And, that matters too because we want to be able to do very fast and cheap inference on reasoning models (since we are going to generate so many tokens)\n\n16) Sidenote, this also suggests why Dario's focus was on DeepSeek v3 and not DeepSeek R1 when commenting on DeepSeek's progress - because the base model matters for sample efficiency.\n\n17) Anyways, the paper does have to replicate - it only tests two models, and 4 hand picked features - but it does seem to match what people said when the DeepSeek paper came out.\n",
    "tweet_id": null,
    "note_id": "1898155907105873921",
    "tweet_url": null,
    "created_at": "2025-03-07T23:36:48.000Z",
    "length": 2985,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "chinese labs",
      "meta",
      "compute",
      "post-training",
      "pretraining",
      "evals",
      "math"
    ],
    "title": "Really interesting and important paper from @gandhikanishk and @noahdgoodman.  Some thoughts on Self Taught Automated Reasoners and what makes them work and it means:",
    "snippet": "1) One of the important questions around reasoning models is: why now? 2) It seems people have been trying RL on language models using verifiable rewards for the past couple of years. Why didn't it work before? 3) This paper suggests that the important thing that makes RL work is the capabilities of the base model. 4) If the base model exhibits backtracking, verification, subgoal setting and backwards chaining, RL works. If it doesn't exhibit enough of these behaviors, RL fails."
  },
  {
    "body": "Really good article from @SanhEstPasMoi on the consequences of coding agents. Some thoughts:\n\n1) I agree with Victor that even when software becomes easy, distribution probably remains hard. Being able to get in front of the customer becomes increasingly important.\n\n2) This should advantage companies with established customer bases - but they might find the transition to a faster release cycle / new design patterns hard. There will probably be a bunch of acquisitions.\n\n3) This is part of how established companies leverage their market strength; they acquire smaller companies that have built products, then they sell them.\n\n4) We should see additional automation of sales, but I agree with Victor that this is ultimately zero-sum, to some degree, and therefore being smart about getting in front of people is important.\n\n5) Will definitely reward people who are good at driving engagement on social media, etc... we've had previous examples of this like DingBoard but I expect to see more influencer driven consumer products.\n\n6) Also, in certain markets, personal connections - think like law, accounting, local government, etc... where it's not a social media thing, it's a small network thing.\n\n7) I'm unconvinced that the next step is bespoke tools for every customer (outside of enterprise; where I do think this is the direction); I think the next step is dynamic interfaces.\n\n8) Part of the issue is that companies are still responsible for ensuring a good user experience, repeatable results, reliable affordances with clear signifiers; you still need a reliable sandbox for the user.\n\n9) But, maybe that's just a 10x improvement, Victor specifies 50x; my guess is the next year to year and a half is just a 10x improvement; maybe even 2-3 years. I think a 10x improvement is compatible with AGI. \n\n10) So, I'm going to discuss the 10x improvement. The 50x improvement is a whole different world, I think; that's more of an AGI->ASI world. 
Maybe, like 2029 or later. But, maybe I just rate the current agents as very capable.\n\n11) I don't think data access is as important. The foundation labs have been driving the core improvements. They buy data from Scale and Mercor and co. I don't think product data is as important.\n\n12) The foundation models will just get better at everything. Startups will be able to get data from partnerships. Large companies will already have the data. I just don't see data as the bottleneck.\n\n13) Maybe task data matters - if the RL paradigm takes off - and companies like Together get a lot of traction - or maybe fine tuning through OAI / Anth / Google - but I just don't see it - it feels too hard.\n\n14) Getting the task data, doing the fine tune, testing it, releasing it, feels to me like a 6-8 month project at a larger company - this is 1-2 releases from major labs. By that point the work might have already been made obsolete.\n\n15) Product experience will remain very important - the reason is that it is the summation of everything - it's GTM, it's taste, etc... Product taste has historically mattered less in enterprise SaaS - I think this will continue to be true.\n\n16) This is mainly because of buying dynamics - enterprise SaaS is often bought by someone who is not the user; so, it needs more features on paper to be popular - the actual experience matters less than in consumer products. I don't see this changing.\n\n17) I do think the size of engineering teams can drop; what used to be 10 engineers, can just be 3 or 4 now; note, I don't think this is the death of engineering as a profession.\n\n18) Oh also - I think an important question is what lags? And, I think spatial reasoning and concurrency are still going to lag with LLMs. 
So, things that involve visual reasoning and complicated system interactions will still require a lot of debugging.\n\n19) This probably says that for the next year or so - LLMs will greatly accelerate prototyping and 0 to 1 products; but large scale products, with complicated infrastructure and interactions, will be a bit slower to feel the acceleration.\n",
    "tweet_id": "1898073156554047603",
    "note_id": "1898073156214214656",
    "tweet_url": "https://x.com/fleetingbits/status/1898073156554047603",
    "created_at": "2025-03-07T18:07:59.000Z",
    "length": 4053,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/SanhEstPasMoi/status/1898054302498767255"
    ],
    "tags": [
      "lab economics",
      "coding",
      "enterprise",
      "consumer",
      "post-training",
      "agi"
    ],
    "title": "Really good article from @SanhEstPasMoi on the consequences of coding agents. Some thoughts:",
    "snippet": "1) I agree with Victor that even when software becomes easy, distribution probably remains hard. Being able to get in front of the customer becomes increasingly important. 2) This should advantage companies with established customer bases - but they might find the transition to a faster release cycle / new design patterns hard. There will probably be a bunch of acquisitions. 3) This is part of how established companies leverage their market strength; they acquire smaller companies that have built products, then they sell them. 4) We should see additional automation of sales, but I agree with Victor that this is ultimately zero-sum, to some degree, and therefore being smart about getting in front of people is important."
  },
  {
    "body": "Anthropic is very elite focused in its communication, I view this as a problem, some thoughts:\n\n1) It's very obvious that Anthropic seeks to influence the government much more than win popular support for its policies\n\n2) I think a lot of this is traceable to Anthropic's late 2010s EA vibe - when EA was still relatively small and more limited to elite universities\n\n3) A lot of Anthropic's communication feels very insular, sometimes I get the feeling that the Anthropic team doesn't really talk to the outside world\n\n4) Compare Dario Amodei and Sam Altman's communication styles - Dario writes essays telling you what to think, but you don't get to reply to him - Sam writes and argues on Twitter\n\n5) The prompt jailbreak contest had similar themes - Pliny wanted Anthropic to open source the dataset in return for competing - the Anthropic team was like *nah* - they just don't see the public as an equal\n\n6) Anthropic very much wants you only to be engaged with AI on their terms - they don't trust the public - you can hear it in how Dario talks about AI and misuse risks\n\n8) The Deceptive Alignment paper also had similar themes - it's obvious that the Achilles heel of the paper is that their post-training caused the model to exhibit deceptive misalignment - but they don't talk about it in the paper\n\n9) Evan Hubinger basically concedes this in a recent lecture - it turns out that the reason they don't talk about it in the paper is because it would involve disclosing parts of their post-training pipeline\n\n10) (this actually could be internal politics, for all we know this could be the model personality team vs the alignment evaluation team)\n\n11) But, Anthropic still expects you to take the paper seriously, on trust - and seems to brush away any criticism without seriously engaging with it\n\n12) I don't disagree with their aims - and I think that Anthropic produces a lot of valuable and useful research - but I do see the trend line in their 
engagements with the public\n",
    "tweet_id": "1897849206351839532",
    "note_id": "1897849206225952768",
    "tweet_url": "https://x.com/fleetingbits/status/1897849206351839532",
    "created_at": "2025-03-07T03:18:05.000Z",
    "length": 1987,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/AnthropicAI/status/1897773701224906854"
    ],
    "tags": [
      "openai",
      "anthropic",
      "post-training",
      "safety"
    ],
    "title": "Anthropic is very elite focused in its communication, I view this as a problem, some thoughts:",
    "snippet": "1) It's very obvious that Anthropic seeks to influence the government much more than win popular support for its policies 2) I think a lot of this is traceable to Anthropic's late 2010s EA vibe - when EA was still relatively small and more limited to elite universities 3) A lot of Anthropic's communication feels very insular, sometimes I get the feeling that the Anthropic team doesn't really talk to the outside world 4) Compare Dario Amodei and Sam Altman's communication styles - Dario writes essays telling you what to think, but you don't get to reply to him - Sam writes and argues on Twitter"
  },
  {
    "body": "My short-form reasons why @DanHendrycks is wrong about the likelihood of near term destabilization in international politics from Transformational AI:\n\n1) Humans are sticky in terms of their preferences and expectations; preferences and expectations built over years are persistent.\n\n2) The people at the top of human politics over the next 3-5 years will be people in their 50s, 60s and 70s that are used to nuclear MAD.\n\n3) It will take longer for people to question MAD than it will for Transformational AI to become a thing (assuming 2028 AGI and 2033 ASI)\n\n4) World leaders are not going to be interested in rolling the dice on their TAI being able to overcome a nuclear strike from a near-TAI.\n\n5) Actual interventions in foreign countries (cyber or otherwise) that are larger than theft (which seems to be generally accepted) will risk retaliation.\n\n6) Such interventions are also likely to be mostly ineffective (again other than theft - which will likely be very effective).\n\n7) So, the odds that we need to be concerned with real international political destabilization are low (although the risk of internal destabilization in capitalist nations is likely to be higher)\n\n8) The Taiwan War is an exception to the above in that it is something around which people have fixed preferences. \n\n9) We should be very concerned about preventing a Taiwan War / ensuring sufficient chip supply in the West.\n",
    "tweet_id": "1897369393551827219",
    "note_id": "1897369393425965056",
    "tweet_url": "https://x.com/fleetingbits/status/1897369393551827219",
    "created_at": "2025-03-05T19:31:29.000Z",
    "length": 1406,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/hendrycks/status/1897308828284412226"
    ],
    "tags": [
      "compute",
      "safety",
      "legal",
      "agi"
    ],
    "title": "My short-form reasons why @DanHendrycks is wrong about the likelihood of near term destabilization in international politics from Transformational AI:",
    "snippet": "1) Humans are sticky in terms of their preferences and expectations; preferences and expectations built over years are persistent. 2) The people at the top of human politics over the next 3-5 years will be people in their 50s, 60s and 70s that are used to nuclear MAD. 3) It will take longer for people to question MAD than it will for Transformational AI to become a thing (assuming 2028 AGI and 2033 ASI) 4) World leaders are not going to be interested in rolling the dice on their TAI being able to overcome a nuclear strike from a near-TAI."
  },
  {
    "body": "This great post from @Dorialexander is worth reading and gets a lot right - but some critiques:\n\n1) I agree that labs are going to move away from selling models and toward selling agentic interfaces - I've believed this for at least a year now\n\n2) The main reason is that it makes their models harder to distill and harder to copy - in short, higher margin - but it also has trust and safety benefits\n\n3) Where I disagree is that this is where the application layer is going to get disrupted - only parts of the application layer will be disrupted\n\n4) The labs are going to offer general agents. At the very least: (1) a research agent, (2) a coding agent, (3) a web agent. These will probably be offered by API.\n\n5) There will also be a consumer / business interface to these agents to support human users - this is high margin for the labs. OpenAI is the leader here.\n\n6) \"Wrappers\" that play in these spaces are in trouble - you don't want to be Perplexity or Cursor or Cognition - none of them put enough on top of the general agent or the business interface.\n\n7) The application layer is much bigger than this though - the application layer includes niche software coding agents for Cobol mainframe programs, Harvey and the legal apps, etc...\n\n8) It's unlikely that OpenAI / DeepMind / etc... 
will compete in every domain - more likely that they will focus on their core agents / product and seek to be a service provider to other folks who will figure out domain specific GTM, workflows, etc...\n\n9) They will still offer limited models through the API - for those that really need them - probably a step behind the actual frontier - but agents will be an increasing percentage of their business.\n\n10) So - my thesis is - parts of the application layer will be disrupted, but as a whole - the application layer is still the place to be in terms of capturing value outside the labs\n\n11) He also talks about DeepSeek inference as showing that there is no need for additional GPUs for inference, etc... I don't have extremely clear views on Nvidia and what DeepSeek means for current GPU demands right now. But, I'm skeptical that the demand will fall.\n\n12) If inference is cheaper then we can go closer to Chinchilla optimal models; if GPUs fall in price, we can train bigger models, etc... RL seems to me like something that will drive more demand for GPUs, not less, especially if bigger models are more sample efficient.\n\n13) I also fundamentally believe that incremental intelligence is extremely valuable and will remain extremely valuable for the foreseeable future.\n\n14) All of this just goes to say that I think people are reading too much into the amount of inference that DeepSeek can do with 2,400 GPUs.\n\n15) Moreover, he says that you can't fund foundation labs - in part because they are a bad business. I think that this misunderstands why you can't fund alternative foundation labs - see prime intellect. It's because foundation labs are only a good business at the frontier.\n\n16) If you are not at the frontier - your product is instantly commoditized by the best open source model. 
So, you need to be better enough than the best open source model - or you need to be a domain specific application.\n\n17) Combine this with the fact that pretraining is capital intensive - and no one can raise other than the people at the forefront - or if investors believe you will receive state protection (see Mistral).\n\n18) But, OpenAI / Anthropic have oversubscribed rounds; because they have a technology advantage and they are ahead of the best open source models. So, there is value there.\n\n19) Europe though - just doesn't seem to be willing to give state advantage to its AI companies in any real way - and this makes folks leery of investing in them with large sums. So, you don't see big rounds in Europe.\n\n20) Sidenote, my first thought back in 2023 was that inference was going to be a major area of focus and competition between labs - because you can't distill inference tricks.\n",
    "tweet_id": "1896282205229625691",
    "note_id": "1896282204906692608",
    "tweet_url": "https://x.com/fleetingbits/status/1896282205229625691",
    "created_at": "2025-03-02T19:31:23.000Z",
    "length": 3994,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/Dorialexander/status/1896196592371417515"
    ],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "chinese labs",
      "neolabs",
      "lab economics",
      "compute",
      "coding",
      "wrappers",
      "enterprise",
      "consumer",
      "post-training",
      "pretraining",
      "safety",
      "agi"
    ],
    "title": "This great post from @Dorialexander is worth reading and gets a lot right - but some critiques:",
    "snippet": "1) I agree that labs are going to move away from selling models and toward selling agentic interfaces - I've believed this for at least a year now 2) The main reason is that it makes their models harder to distill and harder to copy - in short, higher margin - but it also has trust and safety benefits 3) Where I disagree is that this is where the application layer is going to get disrupted - only parts of the application layer will be disrupted 4) The labs are going to offer general agents. At the very least: (1) a research agent, (2) a coding agent, (3) a web agent. These will probably be offered by API."
  },
  {
    "body": "thoughts on Grok 3 and CBRN risks\n\n1) new models are democratizing access to scientific information; not only do they make the information available, they make the know-how available \n\n2) this includes CBRN information - whether this information is dangerous depends on whether it is gated by materials / lab requirements \n\n3) would like to read a good study on this\n\n4) people have cried wolf before - but it is important to remember that models are advancing very quickly\n\n5) a lot of effort is going into advancing models in scientific / verifiable fields - there is a lot of data collection effort here\n\n6) this will translate into capabilities on CBRN\n\n7) Grok 3 was a rush job on post training; the refusals are weak; 1 month from ending pre-training to release just isn't enough time\n\n8) you see it on product stuff, like multi-turn, but it's an issue when it also occurs on safety / refusals / etc...\n\n9) the obvious answer is government regulation that imposes not just liability, but administrative fines / the requirement to withdraw products from the market\n\n10) would be interested to see proposals for regulation on (9)\n\n11) will be interesting to see how xAI responds to this\n",
    "tweet_id": "1894215216503935317",
    "note_id": "1894215216365535232",
    "tweet_url": "https://x.com/fleetingbits/status/1894215216503935317",
    "created_at": "2025-02-25T02:37:55.000Z",
    "length": 1190,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "xai",
      "post-training",
      "safety",
      "legal",
      "bio"
    ],
    "title": "thoughts on Grok 3 and CBRN risks",
    "snippet": "1) new models are democratizing access to scientific information; not only do they make the information available, they make the know-how available 2) this includes CBRN information - whether this information is dangerous depends on whether it is gated by materials / lab requirements 3) would like to read a good study on this 4) people have cried wolf before - but it is important to remember that models are advancing very quickly"
  },
  {
    "body": "thoughts on GPT Wrappers\n\n1) all AI products are wrappers\n\n2) the real distinction is whether the frontier labs will build your product themselves\n\n3) frontier labs will build products because they are very general applications of LLMs (chat interface, voice interface, search, video generation, robotics, etc...) or because they are needed for internal tools (general SWE-agents)\n\n4) frontier labs also seem very interested in scientific applications of models like genomics; it's good PR for them and has very good TAM\n\n5) it is nearly impossible to compete with a frontier lab on something that they want to build because they have distribution, advance access to the model, fine tuning capabilities, data acquisition, etc...\n\n6) so the most successful companies built on top of LLMs will be vertical specific products like Harvey, etc... that tailor the LLM for a particular industry\n\n7) every industry has special GTM, sets of important integrations, data, UI/UX needs, etc... and the tail is long enough and the frontier competition sharp enough that the foundation labs won't go there anytime soon\n\n8) this creates a moat for the \"wrapper\" company\n\n9) there is probably also room for companies that offer a very novel UI/UX experience (websim / TL Draw / Kreia / FloraFauna)\n\n10) Companies like Perplexity and Cursor are in trouble since they will be unable to compete with foundation lab distribution / technology\n",
    "tweet_id": null,
    "note_id": "1894179866410717185",
    "tweet_url": null,
    "created_at": "2025-02-25T00:17:26.000Z",
    "length": 1420,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "google",
      "lab economics",
      "coding",
      "wrappers",
      "enterprise",
      "consumer",
      "bio"
    ],
    "title": "thoughts on GPT Wrappers",
    "snippet": "1) all AI products are wrappers 2) the real distinction is whether the frontier labs will build your product themselves 3) frontier labs will build products because they are very general applications of LLMs (chat interface, voice interface, search, video generation, robotics, etc...) or because they are needed for internal tools (general SWE-agents) 4) frontier labs also seem very interested in scientific applications of models like genomics; it's good PR for them and has very good TAM"
  },
  {
    "body": "thoughts on AI voice models\n\n1) both OpenAI advanced voice mode and Grok 3 advanced voice mode are extremely disappointing\n\n2) experientially, the models just don't feel as smart as the text models and they seem to be nerfed in addition\n\n3) my guess is that post training these models is very hard and that the post training done for text doesn't generalize well to audio\n\n4) moreover, they tend to sound like having a wikipedia article read to you, they don't capture the real back and forth flow of conversation\n\n5) maybe the reason for this is that the text mode asks the model to produce something much closer to its pretraining data\n\n6) we should have the ability to have true multimodal out - singing, an infinite range of voices, etc... but we don't\n\n7) Grok advanced voice mode is terrible, more evidence that post training for Grok 3 was rushed and intelligence is most of *but not all of* the user experience for foundation models\n\n8) I would be very curious to see what a Janus or a Main de Le Morte or a Mona could do with voice mode, given a completely unrestricted gpt-4o or Grok3 base model\n",
    "tweet_id": "1893845415176593787",
    "note_id": "1893845413821870080",
    "tweet_url": "https://x.com/fleetingbits/status/1893845415176593787",
    "created_at": "2025-02-24T02:08:27.000Z",
    "length": 1105,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "xai",
      "consumer",
      "post-training"
    ],
    "title": "thoughts on AI voice models",
    "snippet": "1) both OpenAI advanced voice mode and Grok 3 advanced voice mode are extremely disappointing 2) experientially, the models just don't feel as smart as the text models and they seem to be nerfed in addition 3) my guess is that post training these models is very hard and that the post training done for text doesn't generalize well to audio 4) moreover, they tend to sound like having a wikipedia article read to you, they don't capture the real back and forth flow of conversation"
  },
  {
    "body": "More thoughts on the AI market:\n\n1) Gavin's assertion that Google and xAI have better base models than OpenAI seems wrong to me; we have not yet seen GPT 4.5; which was probably the base model for o1 and o3.\n\n2) It is unclear to me that there is much value in the data moat for xAI or Google (sans YouTube). The most valuable data is task data.\n\n3) YouTube is different because it is probably useful for pretraining the video equivalent of GPT-3/4. That said, we don't know how much OpenAI has scraped YouTube.\n\n4) Task data is the kind of data you purchase from Mercor that shows people doing useful tasks or the result of useful tasks having been done. You want this for training economically valuable capabilities.\n\n5) It's not clear to me that Twitter data or Google search data really shows economically valuable tasks in a way that is important for producing high quality capabilities in models.\n\n6) Ditto for Facebook. In certain respects, their internal developer commits may be of more value than their data from customers - which was more valuable for ads.\n\n7) Reasoning models appear to be more intensive in terms of inference time compute. But, I don't know how strongly this will hold up over time. It will always be more - but probably will become less extreme.\n\n8) It's very possible for OpenAI to run their inference time compute in clusters and really scale out the RL and essentially have the model memorize a lot of reasoning artifacts.\n\n9) Also, the more expensive inference time compute is - the more you want to overtrain the base model to make inference more efficient.\n\n10) I think people forget that there is a tradeoff between precompute and inference time compute. And, if inference time compute skyrockets in cost, you will want to do more precompute to lower the cost.\n\n11) No one, to my knowledge, trains Chinchilla optimal models for serving to the public anymore. 
You overtrain.\n\n12) GoogleDM also just has execution issues - it's the biggest risk on their business. Demis saying he wants a CERN for AI is a big downgrade on them - doesn't feel serious as a business. \n\n13) Gavin writes, \"The economic returns to superintelligence are definitionally unknowable. I hope they are high, but a 140 IQ model running on device with access to unique data about the world might be enough for most use cases.\"\n\n14) This misunderstands the nature of intelligence. Intelligence creates its own market and there are many important problems we don't try to solve today because they are too hard.\n\n15) How much would we pay for cures to cancer, the slow degeneration of age, Alzheimer's, etc... if we could devote 20%-30% of GDP to solving them, it would be totally worth it.\n\n16) ASI makes these tractable problems and so creates its own demand. \n\n17) Back to the data - it looks like OpenAI and Anthropic and DeepSeek are able to do well without more special data. Internet scale pretraining is probably enough for language - setting aside task specific data. Add in Reddit and Wikipedia and you are mostly there.\n\n18) I expect to see more partnerships between the foundation labs and vertical companies. I expect this to be an increasing source of revenue for foundation labs.\n\n19) On the Microsoft capex side, Microsoft is a conservative company, and it doesn't have as much upside from AI because it doesn't have its own foundation model business.\n\n20) Also, Microsoft and OpenAI have had a rocky relationship for a while, I'm not sure that I would read into it too closely. This infrastructure spend might just go elsewhere. \n\n21) This is troubling on the stock side in the near term though - one lesson from studying AI stocks is that you can be right on a trend, but it is hard to pick companies that embody it - especially in a nascent technology.\n",
    "tweet_id": "1893394567904497881",
    "note_id": "1893394567640293378",
    "tweet_url": "https://x.com/fleetingbits/status/1893394567904497881",
    "created_at": "2025-02-22T20:16:57.000Z",
    "length": 3773,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/GavinSBaker/status/1893348988386189774"
    ],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "chinese labs",
      "xai",
      "meta",
      "lab economics",
      "compute",
      "enterprise",
      "post-training",
      "pretraining",
      "agi"
    ],
    "title": "More thoughts on the AI market:",
    "snippet": "1) Gavin's assertion that Google and xAI have better base models than OpenAI seems wrong to me; we have not yet seen GPT 4.5; which was probably the base model for o1 and o3. 2) It is unclear to me that there is much value in the data moat for xAI or Google (sans YouTube). The most valuable data is task data. 3) YouTube is different because it is probably useful for pretraining the video equivalent of GPT-3/4. That said, we don't know how much OpenAI has scraped YouTube. 4) Task data is the kind of data you purchase from Mercor that shows people doing useful tasks or the result of useful tasks having been done. You want this for training economically valuable capabilities."
  },
  {
    "body": "some extended thoughts on AI safety and benchmarks\n\n1) there is a lot of conversation around whether it makes sense to produce benchmarks\n\n2) some of this stems from discomfort with the rate of progress around AI safety\n\n3) there seems to be an expectation that benchmarks should have immediately convinced politicians \n\n4) there is also a feeling that benchmarks help progress because labs need data and something to optimize against\n\n5) I think [3] misunderstands how politicians are going to be convinced of AI safety as a major issue and [4] perhaps misunderstands lab economics\n\n6) politicians won't be convinced by benchmarks but benchmarks are an important part of convincing them\n\n7) benchmarks convince experts, experts go in front of Congress, experts say X is a big deal, they cite the benchmarks\n\n8) regulation though will probably take a shock - government is slow then fast. So, when there are large layoffs or some event - that's when you will see regulation.\n\n9) government was slow then fast with the 2008 financial crisis, with Ukraine, with COVID.\n\n10) anyway, you need the benchmarks to convince the experts and then politicians get convinced by the experts - it's part vibes.\n\n11) benchmarks are also useful to guide research on safety topics - Dan Hendrycks is the most famous for this.\n\n12) when he wants the community to focus on something - like LLM value functions - he drops a benchmark. \n\n13) this accelerates research from academic researchers around it - because it gives them something to aggregate around.\n\n14) labs don't need the benchmarks as much though. They can buy data from Mercor or Scale. 
\n\n15) that bought data is going to be concentrated in areas where they want to improve the models - PhD math, programming, legal, management consulting.\n\n16) the benchmarks help communicate progress and attract attention - but they are not needed for labs, which can drop $50m on data.\n\n17) so, the acceleration created by benchmarks is probably pretty nominal, I don't think DeepSeek needed benchmarks to get the 800k math problems that they used in R1 training.\n\n18) it's not to say it's totally useless - but in the capabilities that labs want improvement - it probably doesn't move the needle.\n\n19) instead, the benchmarks let the community track progress, convince the experts, etc...\n\n20) and, this is important for organizing a response and focusing safety efforts in the most important places, and getting the community involved\n\n21) sidenote, something I learned looking at evaluations is labs need 2 things: (a) training data, (b) directional evaluations. they don't need absolute benchmarks, except for external communication.\n",
    "tweet_id": "1893052428528087259",
    "note_id": "1893052428251209730",
    "tweet_url": "https://x.com/fleetingbits/status/1893052428528087259",
    "created_at": "2025-02-21T21:37:24.000Z",
    "length": 2670,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "lab economics",
      "compute",
      "post-training",
      "evals",
      "safety",
      "legal",
      "math"
    ],
    "title": "some extended thoughts on AI safety and benchmarks",
    "snippet": "1) there is a lot of conversation around whether it makes sense to produce benchmarks 2) some of this stems from discomfort with the rate of progress around AI safety 3) there seems to be an expectation that benchmarks should have immediately convinced politicians 4) there is also a feeling that benchmarks help progress because labs need data and something to optimize against"
  },
  {
    "body": "thoughts on Gemini\n\n1) Gemini feels like it has not been a success; it never had a breakout moment like ChatGPT or DeepSeek nor does it have a devoted following like Claude\n\n2) it's hard to explain exactly why - Gemini has been a decent model since at least Gemini 1.0, it's pretty much fully featured with image gen, etc...\n\n3) part of it is probably related to failed expectations - there was real excitement for the original Gemini but it wasn't better than ChatGPT, had the image gen issues\n\n4) on the model side, it feels like Google has really been pushing the flash models, that's basically where it feels like their focus is\n\n5) but it seems more like headlines are won by the frontier model, not the price performer for cost\n\n6) part of this has to do with the fact that human interaction really needs the best models, and that drives share of mind\n\n7) other mistakes: originally releasing voice mode only on android, not originally having a separate gemini app for iPhone (they do now), etc...\n\n8) NotebookLM was cool, but the Gemini team felt delusional about it, I remember the product team - in person - referred to it as \"their ChatGPT moment\" (wtf?)\n\n9) Gemini will probably get incrementally merged into Google search to avoid leakage from search to ChatGPT\n\n10) Will be interesting to see how far they are willing to go in moving away from the search paradigm / appearance \n\n11) Also, always important to remember that they had language models before everyone - they squandered a big lead\n",
    "tweet_id": "1892708830657839161",
    "note_id": "1892708830515200000",
    "tweet_url": "https://x.com/fleetingbits/status/1892708830657839161",
    "created_at": "2025-02-20T22:52:04.000Z",
    "length": 1506,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "chinese labs",
      "lab economics",
      "consumer"
    ],
    "title": "thoughts on Gemini",
    "snippet": "1) Gemini feels like it has not been a success; it never had a breakout moment like ChatGPT or DeepSeek nor does it have a devoted following like Claude 2) it's hard to explain exactly why - Gemini has been a decent model since at least Gemini 1.0, it's pretty much fully featured with image gen, etc... 3) part of it is probably related to failed expectations - there was real excitement for the original Gemini but it wasn't better than ChatGPT, had the image gen issues 4) on the model side, it feels like Google has really been pushing the flash models, that's basically where it feels like their focus is"
  },
  {
    "body": "further thoughts on the future of lawyers (near term)\n\n1) law firms will not develop in-house technical expertise and will instead purchase AI solutions\n\n2) there is a historical parallel to quants entering the financial field but the incentives go the other way\n\n3) quants had to fight their way into financial firms and initially fought with traditional financiers over comp\n\n4) but financial firms could incorporate them because there were no barriers to giving them ownership / comp like financiers\n\n5) law firms legally cannot give non-lawyers equity and there is a deep cultural divide between lawyers and non-lawyers\n\n6) finance had 20 years to sort out the cultural divide, but AI will not give law firms the same 20 year luxury\n\n7) the places AI will begin to transform first are places where there are not meaningful commercial barriers / the tech is low hanging fruit\n\n8) this is contract negotiation off of playbooks, due diligence for M&A, patent drafting, regulatory compliance related to documents\n\n9) later will be M&A drafting, litigation drafting, more bespoke advice and strategy - these all require just a higher level of LLM intelligence\n\n10) legal datasets and benchmarks are mostly bad, and make it hard to really get good models for legal tasks, a lot of the data available is case law\n\n11) that said, a number of different frontier labs are trying to get legal data\n\n12) in general things that target in-house teams should fall first, because of clear incentives to save labor, then things where firms often charge a fixed fee (like patent drafting)\n\n13) technology startups should get most of this revenue with some of it going to the big existing players like Thomson Reuters and LexisNexis, but only after they acquire the startups\n\n14) law firms have a lot of important data, so do in-house teams, but will be hard to access it, that will be part of the challenge\n\n15) very possible that to some degree these datasets will be 
bypassed by hiring lawyers to create data or just slowly eaten into over time\n\n16) will be at least 4 markets: in-house, solo lawyer, small firm, big firm \n\n17) the number of associates at major firms will probably fall, especially if firms move away from the billable hour - not 100% sure\n\n18) if certain services become cheap enough, it will make sense to move away from billable hour, might end up being hybrid\n\n19) actual court litigation will be the last to see the entrance of AI\n\n20) I have a lot of thoughts on the future of automated adjudication, but that will be a separate post\n",
    "tweet_id": "1892647755237273981",
    "note_id": "1892647755077844994",
    "tweet_url": "https://x.com/fleetingbits/status/1892647755237273981",
    "created_at": "2025-02-20T18:49:23.000Z",
    "length": 2543,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "enterprise",
      "evals",
      "legal"
    ],
    "title": "further thoughts on the future of lawyers (near term)",
    "snippet": "1) law firms will not develop in-house technical expertise and will instead purchase AI solutions 2) there is a historical parallel to quants entering the financial field but the incentives go the other way 3) quants had to fight their way into financial firms and initially fought with traditional financiers over comp 4) but financial firms could incorporate them because there were no barriers to giving them ownership / comp like financiers"
  },
  {
    "body": "Further thoughts on Grok-3...\n\n1) when Grok-3 is good, it feels like a frontier model, answers difficult questions in obscure programming languages, really impressive\n\n2) but has issues with basic things that major labs solved last year, formatting issues, trouble with multi-turn conversation, trouble understanding instructions\n\n3) it's enough to make the product experience feel subpar even though it's a great model\n\n4) the Grok team is small, last time I checked they had about 50 researchers, probably more now, but I suspect it's still under 100\n\n5) The time from Grok-3 pretraining being completed to release was something like 1 month, which really is not very much time\n\n6) my guess is that the pretraining was done without many human labels, and the focus was on single turn not multi-turn \n\n7) really just shows the amount of polish that went into DeepSeek and how much the DeepSeek team got right with R1\n\n8) Grok may show 2 things: (a) intelligence isn't everything and (b) this is one of the best base models, but with inadequate post-training\n",
    "tweet_id": "1892333775037808826",
    "note_id": "1892333774916243457",
    "tweet_url": "https://x.com/fleetingbits/status/1892333775037808826",
    "created_at": "2025-02-19T22:01:44.000Z",
    "length": 1060,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "xai",
      "coding",
      "consumer",
      "post-training",
      "pretraining",
      "evals"
    ],
    "title": "Further thoughts on Grok-3...",
    "snippet": "1) when Grok-3 is good, it feels like a frontier model, answers difficult questions in obscure programming languages, really impressive 2) but has issues with basic things that major labs solved last year, formatting issues, trouble with multi-turn conversation, trouble understanding instructions 3) it's enough to make the product experience feel subpar even though it's a great model 4) the Grok team is small, last time I checked they had about 50 researchers, probably more now, but I suspect it's still under 100"
  },
  {
    "body": "Some thoughts on the Anthropic safety framework\n\n1) the safety frameworks are important because they are going to be the basis of legislation; if you think that AGI is 1 or 2 years away then regulation will come soon\n\n2) the Anthropic safety framework focuses on CBRN and Machine Learning R&D risk; there is a lesser emphasis on Cybersecurity\n\n3) the policy is based around the idea that greater controls need to be applied once the above meet some thresholds\n\n4) the thresholds are hand wavy though - automating the work of a remote-only Anthropic research engineer (proxied by 2-8 SWE tasks) and significantly helping someone with basic STEM skills create a CBRN threat\n\n5) xAI actually did it right here - they have a list of benchmarks that they are going to use to assess the risks associated with their models - if only they told us what the levels would be...\n\n6) Anthropic divides its controls into deployment controls and security controls; basically letting people use the models vs keeping the weights from being stolen / escaping \n\n7) Anthropic also classifies models according to ASLs; basically AI security levels; the next main level is ASL-3\n\n8) Anthropic called their safety policy a responsible scaling policy, since it basically encoded the idea that model capabilities scale with compute\n\n9) they have this idea that you check model capabilities as a first pass using effective compute / time since last check given advances in pretraining\n\n10) effective compute is a bit complicated, but it sounds like it's pretty much log-loss over certain documents\n\n11) there is a lot of stuff that falls on the responsible scaling officer; I've heard it's Jared Kaplan, who is also the Chief Scientist - don't know if it's true - but if so, it probably shouldn't be him\n\n12) Anthropic has a complicated governance structure but it actually makes it much more likely that the policy will be adhered to\n\n13) I would like to see labs publish more about how 
they intend to keep models safe / how they plan to manage decision making at an org level / etc...\n\n14) They have policies around testing the models for capabilities in a thorough way - making a case to the board that controls are not necessary, etc... good stuff\n\n15) Would like to see people really publish their security, evaluation, etc... stuff - especially for dangerous capabilities or at least share them between labs\n\n16) No serious mention of persuasion, loss of control, model autonomy, etc...\n\n17) it says it's designed to be portable, but the way it is designed, it can't be portable - since it is very Anthropic-specific\n",
    "tweet_id": "1892078705142542717",
    "note_id": "1892078704949592066",
    "tweet_url": "https://x.com/fleetingbits/status/1892078705142542717",
    "created_at": "2025-02-19T05:08:11.000Z",
    "length": 2593,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "anthropic",
      "xai",
      "compute",
      "pretraining",
      "evals",
      "safety",
      "legal",
      "bio",
      "agi"
    ],
    "title": "Some thoughts on the Anthropic safety framework",
    "snippet": "1) the safety frameworks are important because they are going to be the basis of legislation; if you think that AGI is 1 or 2 years away then regulation will come soon 2) the Anthropic safety framework focuses on CBRN and Machine Learning R&D risk; there is a lesser emphasis on Cybersecurity 3) the policy is based around the idea that greater controls need to be applied once the above meet some thresholds 4) the thresholds are hand wavy though - automating the work of a remote-only Anthropic research engineer (proxied by 2-8 SWE tasks) and significantly helping someone with basic STEM skills create a CBRN threat"
  },
  {
    "body": "Some thoughts on core LLM product design:\n\n1) We haven't changed much from January 2023 where the main interaction was a chat interface with code blocks.\n\n2) The most important parts of the model as a product are intelligence, availability, speed, personality and then the various frontend UI additions\n\n3) Model personality is part of the UI; the UI does not end at the visual elements, this is because the model is not a traditional app\n\n4) UI is mainly a differentiator between relatively equal models and services; you need intelligence and availability before UI begins to matter\n\n5) DeepSeek had intelligence, availability and speed down at the start - and they had a UI innovation - RLHF'ed and viewable CoT\n\n6) Claude had artifacts - these are probably the biggest UI innovation in LLMs outside of tool use for image generation or a viewable CoT \n\n7) Claude lost on other differentiators though - first availability (not enough calls on the paid plan), and now intelligence (o1-pro etc) - artifacts couldn't save it.\n\n8) Dynamic interfaces are the future of Chat LLMs though so you should expect to see artifacts return and become available at other providers.\n\n9) Voice has its own UI/UX; including the quality of the voice, the personality of the voice, and the assumed answers. \n\n10) Oh - we should include formatting in UI for models. We have had some real changes there. From more of a paragraph response to a bullet point thing.\n\n11) Personality has also been very hard to nail down. We don't know what the future of it will be. We do seem to notice that people gravitate towards human personalities though.\n",
    "tweet_id": "1891908639205753187",
    "note_id": "1891908639046369280",
    "tweet_url": "https://x.com/fleetingbits/status/1891908639205753187",
    "created_at": "2025-02-18T17:52:24.000Z",
    "length": 1620,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "chinese labs",
      "consumer",
      "post-training"
    ],
    "title": "Some thoughts on core LLM product design:",
    "snippet": "1) We haven't changed much from January 2023 where the main interaction was a chat interface with code blocks. 2) The most important parts of the model as a product are intelligence, availability, speed, personality and then the various frontend UI additions 3) Model personality is part of the UI; the UI does not end at the visual elements, this is because the model is not a traditional app 4) UI is mainly a differentiator between relatively equal models and services; you need intelligence and availability before UI begins to matter"
  },
  {
    "body": "some quick thoughts on using Grok3\n\n1) My gut is that this is close to an o1+ level model. It's certainly above the level of DeepSeek R1 and pretty clearly above.\n\n2) It's not a tasteful LLM - the answers are really geared more for people with advanced maths backgrounds rather than ordinary users\n\n3) It's fast, like really fast\n\n4) The code is really good - even in somewhat obscure languages and when getting it to do obscure data structures and stuff\n\n5) However, on a question on leftist trees, it got the worked example wrong, then when I asked it to fix, it failed the formatting - haven't seen issues like this in 6-12 months\n\n6) Multi-turn it tries really hard to connect new messages with previous ones - I feel like this could almost be a game\n\n7) Some of the connecting is really strange - it seems to have a very hard bias to correct your replies back to something that it thinks is likely \n\n8) So much so it will sometimes just ignore your question and answer a completely unrelated question based on your previous conversation\n\n9) It's useful for writing advice. It's a strange style - but I might prefer it to DeepResearch for writing commentary - everything has been hit or miss here.\n",
    "tweet_id": "1891744968882127351",
    "note_id": "1891744968764686337",
    "tweet_url": "https://x.com/fleetingbits/status/1891744968882127351",
    "created_at": "2025-02-18T07:02:02.000Z",
    "length": 1201,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "chinese labs",
      "xai",
      "coding",
      "consumer",
      "post-training",
      "evals"
    ],
    "title": "some quick thoughts on using Grok3",
    "snippet": "1) My gut is that this is close to an o1+ level model. It's certainly above the level of DeepSeek R1 and pretty clearly above. 2) It's not a tasteful LLM - the answers are really geared more for people with advanced maths backgrounds rather than ordinary users 3) It's fast, like really fast 4) The code is really good - even in somewhat obscure languages and when getting it to do obscure data structures and stuff"
  },
  {
    "body": "further thoughts on AI safety...\n\n1) xAI released its Risk Management Framework\n\n2) risk management frameworks are about how to handle catastrophic risk; the policies normally address CBRN, cybersecurity, model autonomy, and machine learning development\n\n3) they will probably ultimately become the basis for law - my guess is that this will occur in 2026 / 2027 as model capabilities reach AGI level\n\n4) these policies normally have 4 parts: evaluations, model classification as to risk level, and controls for models of a specific risk level, also procedures for managing the policy\n\n5) xAI's risk management framework is basically based around Dan Hendrycks' research - this is both good and bad\n\n6) on the plus side - the policy is more explicit about which benchmarks will be used for evaluation - this has always been an Achilles' heel of the other policies\n\n7) on the less positive side - it pretty much only references things Dan has been involved in, has no teeth, will only apply to future models, and doesn't clearly explain its role in the organization\n\n8) also, for a policy based on benchmarks, it doesn't specify which levels on the benchmarks are sufficient to require additional controls\n\n9) also, the benchmarks mostly seem close to saturated? 
except for Cybench and BioLP Bench\n\n10) I would really like to see more from xAI with respect to what research $$$ they plan to contribute to frontier safety research \n\n11) You see the strangeness in the citations with the fact that it cites Dan's Circuit Breakers paper as a possible control (he must be favorable to this); but not the Anthropic Classification paper (or any of the others)\n\n12) also strange that it says that xAI commits to only implementing security controls sufficient to stop a motivated non-nation state hacker if you really believe we are at AGI in 2 years\n\n13) Google's Frontier Safety Framework still has the best statement of information security controls and the best future roadmap (even if it's hand wavy)\n\n14) also, most of the policies say who within the organization will be responsible for implementing the policy, reporting to the board of directors / management etc.... (none of this in the xAI policy)\n\n15) the policy does probably have the most developed ideas around loss of control risk of any major policy, value shift, etc... again Dan's research (the good and bad of the policy)\n",
    "tweet_id": "1891645587885523240",
    "note_id": "1891645587705167872",
    "tweet_url": "https://x.com/fleetingbits/status/1891645587885523240",
    "created_at": "2025-02-18T00:27:07.000Z",
    "length": 2385,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "anthropic",
      "google",
      "xai",
      "evals",
      "safety",
      "legal",
      "bio",
      "agi"
    ],
    "title": "further thoughts on AI safety...",
    "snippet": "1) xAI released its Risk Management Framework 2) risk management frameworks are about how to handle catastrophic risk; the policies normally address CBRN, cybersecurity, model autonomy, and machine learning development 3) they will probably ultimately become the basis for law - my guess is that this will occur in 2026 / 2027 as model capabilities reach AGI level 4) these policies normally have 4 parts: evaluations, model classification as to risk level, and controls for models of a specific risk level, also procedures for managing the policy"
  },
  {
    "body": "law school was good, it's very interesting if you are interested in law as an idea. it's less useful from a straight career perspective.\n\nthere are basically two tracks of interest in law school:\n\n1) private law: contract, tort, restitution, trust and property law. \n\n2) public law: constitutional, administrative, criminal\n\nthe former is more if you are interested in the non-political side of law, the latter is more if, for you, law is an extension of political science.\n\nI think having worked in law, law school is a bit spoiled for me - you end up knowing exactly what you need to get a grade in a course and it's hard to not do that.\n\nIt's better when you want to explore ideas, go down rabbit holes, etc....\n\nthe professional side is about to undergo a drastic sea change - which makes it hard to justify the $90k/yr as being without substantial risk.\n\nthere are beautiful diamonds in law for those that seek them though, reading the ideas of the great judges, etc...\n\nthe professional side is also extremely different from the education - the majority of lawyers are paid to just read contracts...\n\nhappy to answer any questions\n",
    "tweet_id": "1891594118381568099",
    "note_id": "1891594118255738880",
    "tweet_url": "https://x.com/fleetingbits/status/1891594118381568099",
    "created_at": "2025-02-17T21:02:36.000Z",
    "length": 1130,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "legal"
    ],
    "title": "law school was good, it's very interesting if you are interested in law as an idea. it's less useful from a straight career perspective.",
    "snippet": "there are basically two tracks of interest in law school: 1) private law: contract, tort, restitution, trust and property law. 2) public law: constitutional, administrative, criminal the former is more if you are interested in the non-political side of law, the latter is more if, for you, law is an extension of political science."
  },
  {
    "body": "extended thoughts on the legal tech market\n\n1) legal tech historically consisted of three sets of winners: legal research tools, e-Discovery tools, and contract management software\n\n2) on the firm side, you are worried about the billable hour, firms bill by the hour and so don't want automation of their legal hours\n\n3) neither legal research tools nor e-Discovery tools threatened the billable hour; and they increased capabilities, so you saw a lot of wins here\n\n4) that said, legal research tools were an oligopoly of old companies (because you needed so much historical data to be useful) - LexisNexis, Westlaw, VLex\n\n5) CaseText showed how much you had to struggle to break into this market; 10 years, only worth $200m before ChatGPT and $5m revenue\n\n6) things that increase capabilities and / or decrease paralegal hours are the main wins for law firms; law firms make too much $$$ on associates\n\n6) In-house counsel don't care about the billable hour, savings are savings to them, so their incentives are different\n\n7) In-house in mid-sized firms is mostly contract management, some compliance, HR stuff, intellectual property, most of the rest is outsourced to big firms\n\n8) In fact, the thing to know is that big firms are the majority of the spend of an in-house team, and in-house teams are the majority of revenue for a big firm\n\n9) contract management tools were a big win for in-house counsel; the in-house software is pretty antiquated but new players were IronClad, etc...\n\n10) ok - but what do lawyers do? they read, they write and they advise - some negotiation. These things just couldn't be tackled before LLMs, now they can.\n\n11) So, with LLMs you got a ton of new legal tech companies: Harvey, RobinAI, Leia, ClearBrief, SpellBook, NormAI, etc... 
\n\n12) Harvey's best decision was getting an investment from OpenAI in November 2022 - this set up Harvey and CaseText to be the big winners from ChatGPT's virality\n\n13) Harvey got to something like $70m in revenue; most of which is concentrated on a small number of firms\n\n14) Other important concept - at least on seats for legal research products is that they cost a fortune, $70k/seat a year - which just makes it a good business\n\n15) Harvey is basically a high end workflow product for law firms. It has a cover price of like $120k per seat but actually charges like $5-6k per seat\n\n16) Eudia, which just launched is targeting the in-house market (raised $105m series A)\n\n17) LLMs are still not better than really good lawyers at legal analysis yet - so you have to do a lot of bespoke work to make them really good - this has led a lot of lawyers to try them and be disappointed.\n\n18) law firms are also hard to get to adopt software, lawyers bill by the hour, are older, and don't have much time to try new products - it's a battle\n\n19) law firms also pay their ops people like nothing, this sucks and means that it's hard to get good people to do the adoption effort\n\n20) I don't see a world where it is easy to manage lawyers and programmers in the same company - lawyers see the world through a presentation lens, engineers do not\n\n21) lawyers expect their work to be perfect (or at least not admit that it is not) - engineers accept it, are more like doctors giving AAR in this respect\n\n22) salaries are also an issue; partners make $2m+ per year; a lot of programmers are more in the $180k-$300k range even after options; work hours also very different\n\n23) culturally, lawyers have a hard time talking to non-lawyers, if you work in the business, you see it a lot, you even just hear it in the way they position their conversations, talk about their professional duties, etc...\n\n24) I expect a lot of winners in this category, law will fundamentally change in terms of how it is 
done, we might even see the end of the billable hour\n",
    "tweet_id": "1891351069885964579",
    "note_id": "1891351069600718848",
    "tweet_url": "https://x.com/fleetingbits/status/1891351069885964579",
    "created_at": "2025-02-17T04:56:49.000Z",
    "length": 3780,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "lab economics",
      "wrappers",
      "enterprise",
      "legal"
    ],
    "title": "extended thoughts on the legal tech market",
    "snippet": "1) legal tech historically consisted of three sets of winners: legal research tools, e-Discovery tools, and contract management software 2) on the firm side, you are worried about the billable hour, firms bill by the hour and so don't want automation of their legal hours 3) neither legal research tools nor e-Discovery tools threatened the billable hour; and they increased capabilities, so you saw a lot of wins here 4) that said, legal research tools were an oligopoly of old companies (because you needed so much historical data to be useful) - LexisNexis, Westlaw, VLex"
  },
  {
    "body": "I think part of it was that it was too early. Here are my thoughts:\n\n1) It was too early, you couldn't meaningfully automate legal work back in 2019. The number of variations of contracts, etc... is larger than allowed for easy simplification.\n\n2) He picked the wrong customer base. Startups are just not a good target for legal services. They don't spend enough - firms take them as a kind of biz dev in the hope that they will grow.\n\n3) Also (see 1) a lot of startup stuff (formation, etc...) ends up being bespoke. Like early agreements with co-founders, early enterprise deals, etc...\n\n4) He didn't like or understand lawyers. I think this was a big problem - Kan just didn't have empathy for the real target market for legal services - other lawyers.\n\n5) Most legal spend - with large law firms - is actually sold to General Counsel in mid/large companies. So your main market for legal services, is other lawyers.\n",
    "tweet_id": "1891315081700594044",
    "note_id": "1891315081583222784",
    "tweet_url": "https://x.com/fleetingbits/status/1891315081700594044",
    "created_at": "2025-02-17T02:33:49.000Z",
    "length": 919,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "enterprise",
      "legal"
    ],
    "title": "I think part of it was that it was too early. Here are my thoughts:",
    "snippet": "1) It was too early, you couldn't meaningfully automate legal work back in 2019. The number of variations of contracts, etc... is larger than allowed for easy simplification. 2) He picked the wrong customer base. Startups are just not a good target for legal services. They don't spend enough - firms take them as a kind of biz dev in the hope that they will grow. 3) Also (see 1) a lot of startup stuff (formation, etc...) ends up being bespoke. Like early agreements with co-founders, early enterprise deals, etc... 4) He didn't like or understand lawyers. I think this was a big problem - Kan just didn't have empathy for the real target market for legal services - other lawyers."
  },
  {
    "body": "OpenAI is moving over to selling agents not models. Some thoughts.\n\n1) You will no longer be able to build your own system because OpenAI is already packaging for you\n\n2) You will be buying a level of intelligence - however this is quantified - rather than API calls against a particular model\n\n3) It will be interesting to see how pricing works for these agents - is it intelligence used or intelligence requested\n\n4) Orion was real - interesting - it sounds like it is the GPT-4o replacement - that it is the last non-reasoning model seems like a strange call out\n\n5) It always seemed likely that Orion was real because former OpenAI employees tended to assume that the Claude Opus-3.5 story was real\n\n6) Orion was probably the basis for the o-series models; when they said the \"o\" stood for openai, I bet it actually stood for \"Orion\"\n\n7) GPT-5 isn't a model anymore - it's just a collection of different models with a router and maybe even things like RAG - it's an intelligence level\n\n8) This will make safety related to malicious users easier because the boundaries of the system vis-a-vis the public are smaller  \n\n9) It will also probably make it harder for competitors to copy OpenAI's work, because you won't know which system the improvement came from (better reasoning model, better base model, better RAG, better tools, etc...)\n\n10) Not convinced the free / plus / pro products are sufficiently distinguished in this model. How much more intelligence do I get? How can I tell that I got more intelligence, etc...?\n\n11) Sama is just not a product person - I wish he didn't try to LARP as one. I get the CEO as public communicator but he just isn't good as a product CEO.\n\n12) I wonder how much of this is related to OpenAI's internal politics being very fragmented / chaotic - wouldn't they want an official blogpost etc... not good for CPO morale.\n",
    "tweet_id": "1889759187913367571",
    "note_id": "1889759187753902083",
    "tweet_url": "https://x.com/fleetingbits/status/1889759187913367571",
    "created_at": "2025-02-12T19:31:15.000Z",
    "length": 1860,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/sama/status/1889755723078443244"
    ],
    "tags": [
      "openai",
      "lab economics",
      "enterprise",
      "consumer",
      "safety",
      "agi"
    ],
    "title": "OpenAI is moving over to selling agents not models. Some thoughts.",
    "snippet": "1) You will no longer be able to build your own system because OpenAI is already packaging for you 2) You will be buying a level of intelligence - however this is quantified - rather than API calls against a particular model 3) It will be interesting to see how pricing works for these agents - is it intelligence used or intelligence requested 4) Orion was real - interesting - it sounds like it is the GPT-4o replacement - that it is the last non-reasoning model seems like a strange call out"
  },
  {
    "body": "Some really interesting things to work on:\n\n1) Model merging and how it affects model personality and model self-knowledge\n\n2) UI/UX for AI; using AI for dynamic visualizations; using AI for search \n\n3) Discussing artworks using AI in a way that makes it easier for a novice to understand the formal characteristics of an artwork\n\n4) Compiled philosophy - RLAIF to create models with different personalities embodying different moral philosophies (utilitarianism, virtue ethics, etc...)\n\n5) Legal models - models that adjudicate disputes especially between agents\n\n6) Ideas for when and how Legal models might fail and what to do when they do so\n\n7) LLMs for compilers - writing efficient machine code using LLMs / writing efficient byte code for interpreted languages\n\n8) Better ways to visualize code changes and to zoom in and out of them\n\n9) Generally suggesting more pleasant visual interfaces for all products - a great beautification should be possible\n\n10) Working on other kinds of data and offering it for sale / using it to improve the social good\n\n11) A plugin that can be used to add interface elements dynamically to a webpage on the client side. \n\n12) Information security with LLMs - all areas - but one interesting one is preventing phone spearphishing campaigns \n\n13) Dynamic interfaces that update with user intent and which do not require typing as the primary means of interaction\n\n14) Investigating the \"I\" in models - using mechanistic interpretability\n\n15) Forecasting adjudication - model that can take a Manifold Market or Polymarket market at close and decide who wins (should be easy to exceed human performance)\n",
    "tweet_id": "1889748427116077082",
    "note_id": "1889748426952511488",
    "tweet_url": "https://x.com/fleetingbits/status/1889748427116077082",
    "created_at": "2025-02-12T18:48:29.000Z",
    "length": 1640,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "coding",
      "consumer",
      "post-training",
      "interpretability",
      "evals",
      "safety",
      "legal"
    ],
    "title": "Some really interesting things to work on:",
    "snippet": "1) Model merging and how it affects model personality and model self-knowledge 2) UI/UX for AI; using AI for dynamic visualizations; using AI for search 3) Discussing artworks using AI in a way that makes it easier for a novice to understand the formal characteristics of an artwork 4) Compiled philosophy - RLAIF to create models with different personalities embodying different moral philosophies (utilitarianism, virtue ethics, etc...)"
  },
  {
    "body": "Further safety thoughts:\n\n1) There is an impatience among safety advocates for the world to wake up to the importance of AI safety\n\n2) This will not come though before there are practical real world effects larger than Nvidia stock going up\n\n3) When there are meaningful real world effects that are destabilizing - like mass unemployment or the development of advanced weapons - there will be rapid action\n\n4) This rapid action will take the form of regulation and will be inspired by existing regulation and existing AI safety work done at the big labs and nonprofits\n\n5) There is also likely to be something like UBI or a freeze to layoffs etc...\n\n6) A lot of safety studies that show that models are unsafe or have bad characteristics seem to be rushed / not carefully done\n\n7) I have read more than a couple of *prominent* safety articles where I am sure that the experiments would never convince a skeptic\n\n8) I think that this indicates that the safety crowd is accepting a lower standard of work because their peers assume its truth to begin with\n\n9) I think it is important to produce quality safety work so that it can be convincing to skeptics\n\n10) I think safety work will accelerate with model capabilities - since safety work is typically downstream of programming / etc...\n\n11) Safety work will eventually need a lot of funding for GPUs - we should be thinking about how to ensure that this is achieved\n",
    "tweet_id": "1889427894512001446",
    "note_id": "1889427894344278016",
    "tweet_url": "https://x.com/fleetingbits/status/1889427894512001446",
    "created_at": "2025-02-11T21:34:48.000Z",
    "length": 1416,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "compute",
      "evals",
      "safety",
      "legal",
      "agi"
    ],
    "title": "Further safety thoughts:",
    "snippet": "1) There is an impatience among safety advocates for the world to wake up to the importance of AI safety 2) This will not come though before there are practical real world effects larger than Nvidia stock going up 3) When there are meaningful real world effects that are destabilizing - like mass unemployment or the development of advanced weapons - there will be rapid action 4) This rapid action will take the form of regulation and will be inspired by existing regulation and existing AI safety work done at the big labs and nonprofits"
  },
  {
    "body": "Thoughts on reasoning models...\n\n1) I don't think reasoning models should be called \"reasoning\" models; instead, they should be called inference time scaling models.\n\n2) The previous generation of models did do reasoning in their CoT; the difference is that they didn't use more compute to generate better answers in the same way that inference time scaling models do.\n\n3) We should look to see the extent to which models dynamically vary the amount of compute they use to derive an answer. Harder questions should use more compute. If we don't see this, it says something.\n\n4) We can imagine a CoT that just provides a better answer by just annotating more and more related facts around an idea - and then using them to create an answer - this would be inference time scaling - but it's not different in a reasoning sense.\n\n5) There is this idea of creating small reasoning models and separate fact databases that are called using tools. There's probably something here - but it might be more of a model architecture thing.\n\n6) Is it different from having a model that looks more like an MoE but with dynamic layer selection using a router. Then the model can continue adding fact data into the current representation - or not.\n\n7) This feels like something that is more easily learned than having separate tool calls that retrieve facts from a web search or database of sources - maybe you have that too - but it seems longer latency and harder to apply a gradient to?\n\n8) You ultimately get the same thing \"a small reasoning model\" and \"a database of facts\" but they are in one learned package. And, maybe the \"database of facts\" is also a \"database of procedures\", etc...\n\n9) Also in-context learning is itself a learned behavior; and inference time scaling models could just be something that further learns to better use the in-context learning that it already has.\n",
    "tweet_id": "1887618202236260632",
    "note_id": "1887618202018193408",
    "tweet_url": "https://x.com/fleetingbits/status/1887618202236260632",
    "created_at": "2025-02-06T21:43:44.000Z",
    "length": 1871,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "compute",
      "post-training"
    ],
    "title": "Thoughts on reasoning models...",
    "snippet": "1) I don't think reasoning models should be called \"reasoning\" models; instead, they should be called inference time scaling models. 2) The previous generation of models did do reasoning in their CoT; the difference is that they didn't use more compute to generate better answers in the same way that inference time scaling models do. 3) We should look to see the extent to which models dynamically vary the amount of compute they use to derive an answer. Harder questions should use more compute. If we don't see this, it says something. 4) We can imagine a CoT that just provides a better answer by just annotating more and more related facts around an idea - and then using them to create an answer - this would be inference time scaling - but it's not different in a reasoning sense."
  },
  {
    "body": "More thoughts on frontier lab design language...\n\n1) OpenAI and Anthropic use similar branding. Muted colors that are meant to humanize frontier models. It is meant to feel empathic.\n\n2) OpenAI prefers more geometric icons with clean lines while Anthropic prefers icons that are more soft and organic. \n\n3) You see this difference even in the models; OpenAI's models are more rationalist in their responses while Anthropic's seek to feel more organic.\n\n4) DeepMind goes a different direction. The colors are more saturated than either OpenAI or Anthropic. You see it in their characteristic blue.\n\n5) It's meant to feel scientific. They emphasize mystery and power - but in the service of something the public would recognize as traditional scientific research.\n\n6) They don't feel world transformational - I don't think DeepMind views itself in as self consciously a world historical way as Anthropic or OpenAI.\n\n7) Gemini itself is Google branded. It has some light blues but it could basically be the Google homepage. DeepMind as a consequence has less public branding.\n\n8) Comparisons between Perplexity (not a frontier lab) and OpenAI / Anthropic are telling. Perplexity goes as deep into the saturation feeling as possible.\n\n9) The Perplexity branding is meant to evoke a feeling of awe and sublimity, even danger. But, they can do this because there is no risk in doing so.\n\n10) No one is going to confuse Perplexity with a company that actually produces something dangerous - so they have no need to humanize.\n\n11) Their market is also early adopters, people who are interested in tech and therefore want it to feel cool. \n\n12) OpenAI and Anthropic want to be able to sell to a broad public including non-tech-forward people. This requires humanizing the companies.\n\n13) OpenAI's branding is a very different from Sam Altman's tweet style. 
Anthropic has much more consistent branding between its employees' public personas and its design language.\n\n14) X just has a very Tesla design language - almost chiaroscuro - just black and white. Meant to feel powerful, cutting edge, advanced technological.\n\n15) Same with the strange Grok design stuck all over the Twitter interface. Meant to feel powerful, cutting edge, technological. Oh and drive the customer KPIs.\n",
    "tweet_id": "1887345935413617027",
    "note_id": "1887345935136792576",
    "tweet_url": "https://x.com/fleetingbits/status/1887345935413617027",
    "created_at": "2025-02-06T03:41:50.000Z",
    "length": 2269,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "xai",
      "consumer"
    ],
    "title": "More thoughts on frontier lab design language...",
    "snippet": "1) OpenAI and Anthropic use similar branding. Muted colors that are meant to humanize frontier models. It is meant to feel empathic. 2) OpenAI prefers more geometric icons with clean lines while Anthropic prefers icons that are more soft and organic. 3) You see this difference even in the models; OpenAI's models are more rationalist in their responses while Anthropic's seek to feel more organic. 4) DeepMind goes a different direction. The colors are more saturated than either OpenAI or Anthropic. You see it in their characteristic blue."
  },
  {
    "body": "I don't think it's clear at all that the frontier labs have been commoditized. Here are a bunch of thoughts around frontier lab commoditization.\n\n1) Open source is funded by Meta, Mistral and maybe the Chinese labs. Otherwise, it's currently impossible to train an open source frontier model due to the capital requirements.\n\n2) Meta will probably stop open sourcing models once the capital requirements become too high or they otherwise become dangerous.\n\n3) Meta open sources for a couple of reasons, but one of the main ones is commoditizing the competition. It has already successfully driven character ai to sale. \n\n4) Mistral may continue open sourcing models. But, will probably also stop once the models become dangerous. So far, it doesn't have the capital to really train frontier models.\n\n5) The Chinese labs are the ones that you have to hope will keep open sourcing. They probably open source to get status as national champions. They may also open source to support government policy.\n\n6) It's not clear how long that will last. DeepSeek has probably already achieved national champion status. ByteDance and Alibaba are just more established and need to secure national champion status less.\n\n7) It's unclear how much the Chinese government will want to keep these labs open sourcing if the models become high biorisk / high cybersecurity risk.\n\n8) Frontier labs make money in different ways and have different reasons to stay at the frontier. The core reason has to be though that we have no where near saturated the demand for intelligence - and more powerful models are just worth more.\n\n9) What would you pay for a cure to cancer? For a solution to the Riemann hypothesis? Etc...? 
It seems very unclear that this level of model will be commoditized for the reasons above.\n\n10) Even if they are commoditized by a trailing open source; it's unclear that they will not be able to recoup capital costs through their leading models before they are commoditized.\n\n11) They can do this either through using their model leads to build another business (e.g. ChatGPT) or through direct API sales.\n\n12) Also open source just seems very cyclical in terms of how close it seems to the frontier - in part driven by who is open sourcing. It looked an age behind before Llama 3, suddenly quite close, and then behind again with Sonnet 3.5, etc...\n\n13) Note, on the politics side, the United States may take actions to protect its frontier labs if their main threat comes from Chinese labs open sourcing.\n\n14) More models are going to be stuck behind agents and the labs are going to be less willing to provide their models to customers vs use them for special agents with their own agent frameworks.\n\n15) This makes it harder to copy models / ascertain their exact capabilities - and so harder for new companies to copy them for cheap and so commoditize them.\n\n16) We haven't seem the full range of model uses, will include robotics, video generation, etc... we are at an early stage in the range of possible use cases.\n\n17) Also, defense contracts, and government involvement is likely to come as the models get increasingly powerful - all of these may end up being state support for frontier labs.\n",
    "tweet_id": "1887312215843422496",
    "note_id": "1887312215583383552",
    "tweet_url": "https://x.com/fleetingbits/status/1887312215843422496",
    "created_at": "2025-02-06T01:27:51.000Z",
    "length": 3202,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/keithwynroe/status/1887291553837711727"
    ],
    "tags": [
      "openai",
      "anthropic",
      "chinese labs",
      "meta",
      "neolabs",
      "lab economics",
      "compute",
      "wrappers",
      "enterprise",
      "consumer",
      "safety",
      "legal",
      "bio",
      "agi"
    ],
    "title": "I don't think it's clear at all that the frontier labs have been commoditized. Here are a bunch of thoughts around frontier lab commoditization.",
    "snippet": "1) Open source is funded by Meta, Mistral and maybe the Chinese labs. Otherwise, it's currently impossible to train an open source frontier model due to the capital requirements. 2) Meta will probably stop open sourcing models once the capital requirements become too high or they otherwise become dangerous. 3) Meta open sources for a couple of reasons, but one of the main ones is commoditizing the competition. It has already successfully driven character ai to sale. 4) Mistral may continue open sourcing models. But, will probably also stop once the models become dangerous. So far, it doesn't have the capital to really train frontier models."
  },
  {
    "body": "DeepResearch writes in a slop style and seems to avoid meaningful synthesis. It's a big step toward something powerful, but it's not there yet.\n\n1) It uses florid language that often distracts from the main point of the text or otherwise obscures useful comparisons.\n\n2) The text is often organized in a way that suggests that it doesn't understand the core material. Instead, it just feels like it is smushing a bunch of facts in once place.\n\n3) For instance, in one report on AI stocks, it organizes the stocks with Nvidia and AMD as the two centerpieces, even though Broadcom has been more correlated with AI than AMD.\n\n4) Then, in the same report, it cannot maintain any real contextualization of the price fluctuations. It seems like, to the model, AMD increasing ~50% is the same as Broadcom increasing ~400%.\n\n5) It doesn't seem to have very good source analysis and in fact never seems to comment on the veracity of its sources or try to compare them for factual value - a lot of what researchers do is source criticism.\n\n6) The question asking at the part feels very pro forma. It seems no matter how detailed you are at the start, it has to ask followup questions, sometimes they were answered in the original prompt.\n",
    "tweet_id": "1887294345751240979",
    "note_id": "1887294345616973824",
    "tweet_url": "https://x.com/fleetingbits/status/1887294345751240979",
    "created_at": "2025-02-06T00:16:50.000Z",
    "length": 1227,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "consumer",
      "evals"
    ],
    "title": "DeepResearch writes in a slop style and seems to avoid meaningful synthesis. It's a big step toward something powerful, but it's not there yet.",
    "snippet": "1) It uses florid language that often distracts from the main point of the text or otherwise obscures useful comparisons. 2) The text is often organized in a way that suggests that it doesn't understand the core material. Instead, it just feels like it is smushing a bunch of facts in once place. 3) For instance, in one report on AI stocks, it organizes the stocks with Nvidia and AMD as the two centerpieces, even though Broadcom has been more correlated with AI than AMD. 4) Then, in the same report, it cannot maintain any real contextualization of the price fluctuations. It seems like, to the model, AMD increasing ~50% is the same as Broadcom increasing ~400%."
  },
  {
    "body": "Some AI safety thoughts:\n\n1) More models are going to sit behind agents; it makes business sense for the frontier labs and it makes their safety tasks easier because the attack surface is less.\n\n2) Companies that are under commercial pressure (like DeepMind and OpenAI) are going to have an uneasy relationship to safety, because they need to get to market quickly.\n\n3) We want to get a recognized audit system going as quickly as possible, which can later be taken over by the government.\n\n4) A good audit system would be overseen by the government for frontier labs / companies providing inference on frontier models.\n\n5) The audit system should be based on the existing Responsible Scaling Policies / Frontier Safety Frameworks. This should eventually be turned into a law.\n\n6) Meta will probably quit open sourcing models once the models because high risk across various domains. It looks like biosecurity is going to be the first domain to fall. Llama 4 or 5 may be the last Llama.\n\n7) Chinese labs, at this time, don't care that much about safety. This may change though. A lot of this happens to do with what the Chinese government thinks is conducive to its foreign policy objectives. \n\n8) Safety organizations are immature right now because no one knows how to fund them. There are only 4-5 major charities that fund them - some of them could do auditing for funding - but that does compromise them.\n\n9) AI is going to become a natsec priority - this makes safety harder not easier - pausing becomes harder - and the government normally exempts itself from its own rules (see Chernobyl). \n\n10) Some of these responsible scaling policies have become comical. Anthropic's most recent policy is one example. DeepMind's new policy is another. You can see the cultural shift happening in realtime at these labs.\n",
    "tweet_id": "1886966889093664983",
    "note_id": "1886966888921694208",
    "tweet_url": "https://x.com/fleetingbits/status/1886966889093664983",
    "created_at": "2025-02-05T02:35:39.000Z",
    "length": 1815,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "chinese labs",
      "meta",
      "lab economics",
      "safety",
      "legal",
      "bio",
      "agi"
    ],
    "title": "Some AI safety thoughts:",
    "snippet": "1) More models are going to sit behind agents; it makes business sense for the frontier labs and it makes their safety tasks easier because the attack surface is less. 2) Companies that are under commercial pressure (like DeepMind and OpenAI) are going to have an uneasy relationship to safety, because they need to get to market quickly. 3) We want to get a recognized audit system going as quickly as possible, which can later be taken over by the government. 4) A good audit system would be overseen by the government for frontier labs / companies providing inference on frontier models."
  },
  {
    "body": "thoughts on deep research:  \n\n1) openai is going to increasingly move functionality behind agents - the purpose of this is to offer a more completer service and make their product harder to copy  \n\n2) agents are better for safety / security from the perspective of OAI because the human can't see / interact with all the intermediate materials -  so you can have things that you don't want humans to see in the middle and only have to check the edges  \n\n3) agents are likely to eventually become part of a business plan where OpenAI charges per concurrent instance - I expect us to see a couple of agents over the next year, including a software agent  \n\n4) ChatGPT may eventually just become an agent hub where - on consumer subscriptions - openai just picks the agent to give you and abstracts everything else - like model choice - away from the user  \n\n5) a wave of startups are going to get killed / have their valuations damaged as openai releases agents that do thinks like web search better than existing companies in the space  \n\n6) a lot of people have built up custom tooling to do this over the last year or so - to support an immediately salable product - but OpenAI can just fine-tune / train models to do it at higher reliability  \n\n7) agents should have their own model card for safety purposes - any end to end agent service created by a foundation lab should have safety testing done\n",
    "tweet_id": "1886258314289545435",
    "note_id": "1886258314172071936",
    "tweet_url": "https://x.com/fleetingbits/status/1886258314289545435",
    "created_at": "2025-02-03T03:40:01.000Z",
    "length": 1400,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "lab economics",
      "coding",
      "wrappers",
      "enterprise",
      "consumer",
      "post-training",
      "safety"
    ],
    "title": "thoughts on deep research:",
    "snippet": "1) openai is going to increasingly move functionality behind agents - the purpose of this is to offer a more completer service and make their product harder to copy 2) agents are better for safety / security from the perspective of OAI because the human can't see / interact with all the intermediate materials -  so you can have things that you don't want humans to see in the middle and only have to check the edges 3) agents are likely to eventually become part of a business plan where OpenAI charges per concurrent instance - I expect us to see a couple of agents over the next year, including a software agent 4) ChatGPT may eventually just become an agent hub where - on consumer subscriptions - openai just picks the agent to give you and abstracts everything else - like model choice - away from the user"
  },
  {
    "body": "api market share thoughts\n\n1) the whole thing is hard to interpret because Opus 3 wasn’t released until March 2024 and Sonnet 3.5 wasn’t released until April 2024 - so there was no real OpenAI competitor until March 2024\n\n2) thus makes the whole graphic a bit sus; Claude 1.3 and 2 were really not competitive, especially Claude 2; and it seems hard to believe that Llama had so much real adoption, testing maybe, but adoption in products and services?\n\n3) taking it at face value Anthropic and Deepmind are the big winners; Anthropic charges more than OpenAI for Sonnet vs 4o, which suggests as share of revenue, Anthropic is close to equal; Deepmind probably less because it’s primary models are cheaper (Gemini flash ).\n\n4) Open AI makes more revenue from ChatGPT than from the API (probably in the neighborhood of $4bn vs $2bn - although this is part extrapolation / part public knowledge)\n",
    "tweet_id": "1885403154055057711",
    "note_id": "1885403153937686531",
    "tweet_url": "https://x.com/fleetingbits/status/1885403154055057711",
    "created_at": "2025-01-31T19:01:55.000Z",
    "length": 893,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "meta",
      "lab economics",
      "enterprise",
      "consumer"
    ],
    "title": "api market share thoughts",
    "snippet": "1) the whole thing is hard to interpret because Opus 3 wasn’t released until March 2024 and Sonnet 3.5 wasn’t released until April 2024 - so there was no real OpenAI competitor until March 2024 2) thus makes the whole graphic a bit sus; Claude 1.3 and 2 were really not competitive, especially Claude 2; and it seems hard to believe that Llama had so much real adoption, testing maybe, but adoption in products and services? 3) taking it at face value Anthropic and Deepmind are the big winners; Anthropic charges more than OpenAI for Sonnet vs 4o, which suggests as share of revenue, Anthropic is close to equal; Deepmind probably less because it’s primary models are cheaper (Gemini flash ). 4) Open AI makes more revenue from ChatGPT than from the API (probably in the neighborhood of $4bn vs $2bn - although this is part extrapolation / part public knowledge)"
  },
  {
    "body": "what Dario says mostly lines up to my linked commentary thread on DeepSeek  \n\n1) DeepSeek is probably about 8-9 months behind the Western labs; Claude Sonnet cost some tens of millions to train and was trained 8 months ago.  \n\n2) DeepSeek is on curve; we should expect models to get about 4x cheaper each year (Anthropic seems to sometimes say 30x, sometimes say 4x).   \n\n3) DeepSeek was primarily a product win in terms of exposing the chain of thought reasoning and this is something that users want.  \n\n4) Chips are still going to be in demand because RL methods are likely to exhibit scaling trends similar to pretraining.  \n\n5) Dario sees DeepSeek V3 as the important advance, not DeepSeek R1; the important thing to mine is the compute multipliers in their pretraining / inference pipeline.   \n\n6) This seems to confirm that the training of the base model is extremely important to the RL process. I suspect that a better base model yields a more sample efficient RL model.  \n\n7) If true, this gives Anthropic an advantage because they appear to have the best base model (at least available for sale - o1 may be on a different base model than 4o).   \n\n8) Dario thought the important ideas were the Multi-Head Latent Attention and the Mixture of Experts ideas. I think this lines up with my gut just listening to people - Multihead Latent Attention seemed new.  \n\n9) Dario has a lot of views on export controls - which means he thinks that the thing that we need to gate is Chinese access to chips. Sam + Dario seem to not believe that compute demands are going to decline.\n",
    "tweet_id": "1884697151336571284",
    "note_id": "1884697151139438594",
    "tweet_url": "https://x.com/fleetingbits/status/1884697151336571284",
    "created_at": "2025-01-29T20:16:31.000Z",
    "length": 1578,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [
      "https://x.com/fleetingbits/status/1884039124140916979"
    ],
    "tags": [
      "openai",
      "anthropic",
      "chinese labs",
      "lab economics",
      "compute",
      "post-training",
      "pretraining",
      "evals"
    ],
    "title": "what Dario says mostly lines up to my linked commentary thread on DeepSeek",
    "snippet": "1) DeepSeek is probably about 8-9 months behind the Western labs; Claude Sonnet cost some tens of millions to train and was trained 8 months ago. 2) DeepSeek is on curve; we should expect models to get about 4x cheaper each year (Anthropic seems to sometimes say 30x, sometimes say 4x). 3) DeepSeek was primarily a product win in terms of exposing the chain of thought reasoning and this is something that users want. 4) Chips are still going to be in demand because RL methods are likely to exhibit scaling trends similar to pretraining."
  },
  {
    "body": "I think there are some issues here:\n\n1) It's not clear to me that DeepSeek represents a meaningful unexpected decrease in the cost for models.\n\n2) Models get about 30x cheaper to train per year and that has been the norm for the past 6 years. DeepSeek is about 20x cheaper than Llama-3.\n\n3) OpenAI is aware of the compute economics around reasoning models and has been for at least 6 months but is still pushing for a $100bn cluster.\n\n4) Very little in the DeepSeek paper is new - which indicates that the economics are not OOM outside of OpenAI / Deepmind economics.\n\n5) One of the lessons seems to be that RL is only really effective with a very strong base model. It's not clear therefore that pretraining scaling is dead.\n\n6) 30x isn't hard to makeup in terms of inference. There are a ton of tasks where first you are probably moving up from a smaller model (e.g. like a Haiku use case) and will be using more tokens (e.g. for reasoning).\n\n7) I don't know where Nvidia falls in all of this other than that you are still going to want Nvidia for training - if you are Meta or X - because if you are going to spend $1bn+, you want it to work.\n\n8) Groq, Etched, etc... to my knowledge have made no dent in the inference market. But, they have been hyped a lot for a while.\n",
    "tweet_id": "1884503724288278832",
    "note_id": "1884503724137275395",
    "tweet_url": "https://x.com/fleetingbits/status/1884503724288278832",
    "created_at": "2025-01-29T07:27:54.000Z",
    "length": 1274,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "google",
      "chinese labs",
      "xai",
      "meta",
      "lab economics",
      "compute",
      "post-training",
      "pretraining"
    ],
    "title": "I think there are some issues here:",
    "snippet": "1) It's not clear to me that DeepSeek represents a meaningful unexpected decrease in the cost for models. 2) Models get about 30x cheaper to train per year and that has been the norm for the past 6 years. DeepSeek is about 20x cheaper than Llama-3. 3) OpenAI is aware of the compute economics around reasoning models and has been for at least 6 months but is still pushing for a $100bn cluster. 4) Very little in the DeepSeek paper is new - which indicates that the economics are not OOM outside of OpenAI / Deepmind economics."
  },
  {
    "body": "DeepSeek Thoughts\n\n1) It succeeded first as a product and only then as an economic force. People didn't look deeply at it until they discovered how fun the model was to use.\n\n2) Sama fucked up by not displaying the o1 chain of thought in the app. DeepSeek still would have been a splash but would not have gone viral because power users would have already seen model CoTs.\n\n3) DeepSeek is a product optimum in some respects. Internet + powerful model is basically enough for most uses. Canvas / Artifacts are nice but don't add enough. Still may become moat over time.\n\n4) DeepSeek is probably still 3-8 months behind OpenAI on actual models. OpenAI released o1 many months after it had been developed and just announced o3.\n\n5) We should basically expect that o3 >> o1 >= r1 in terms of performance. My limited experience is that r1 is slightly worse than o1 but more useful because I can see the CoT. \n\n6) It's unclear whether it is a Sputnik moment that will drive greater investment in GPUs and data centers or whether it is a sign open source will lower the profit in AI and prevent investment.\n\n7) Sama knows the economics around reasoning models - but still decided to try to raise $100bn for Stargate - this suggests he still sees benefit in scale.\n\n8) There are signs that reasoning models RL really scales with base model capabilities. So, RL can take a good base model and turn it into a powerhouse. But, what can it do with a great base model?\n\n9) Where can you throw money in AI? Basically, it's compute, humans or data. If we think we are in a slow takeoff, you need to pour money into these in the best allocation. DeepSeek says it's worth it to pour money on humans.\n\n10) OpenAI / Anthropic probably overloaded on researchers over performance folks. We should expect them to correct and begin hiring from HFT / shops dedicated to high performance. 
It's clear that there is juice to squeeze here.\n\n11) A lot of assumptions are flying around about OpenAI / Anthropic / DeepMind cost basis - we don't know. My guess is OpenAI isn't substantially worse off than DeepSeek. A lot of folks seem to assume worse because it's not public.\n\n12) Open source adherents just don't get the market - they assume that DeepSeek is doing open source because open source is good™. Probably reflects goal to be national champion.\n\n13) US-China commentary seems to basically be - whatever America does bad, China must do good. Zero study or introspection. I have not seen basically any commentary worth reading except for a few.\n\n14) The whole thing about earlier RL attempts not having coherent / readable chains of thought indicates that the chain of thought becomes more not less readable as the base model gets better. \n\n15) A bunch of papers about LLM latent knowledge suggests that LLMs which RL to learn new concepts can probably explain them to us even if they end up with slightly shifted language in the CoT - maybe some objective that rewards ordinary English meanings in the output would be good.\n\n16) The innovation in the MoE seems to be that the MoE were more differentiable / more stable during training. Maybe this is what enabled them to really scale up the number of experts (Mistral had what 8? And had training stability problems).\n\n17) OpenAI probably thought they could move toward a world of direct agents / solutions. They forgot we are still in a world where the LLM needs to spark joy because it is an interactive product. And, it's not enterprise SaaS. \n\n18) Sama / Dario / Demis are not product people and this is an issue if we are not hurtling toward AGI / ASI. Because, if we don't reach AGI then the product still needs to be a joy to use - it is not abstracted. \n\n19) If compute is less valuable, maybe the money shifts to data. 
It did matter that the math data is so available and so nice and so easy to guess and check.\n\n20) There are rumors that OpenAI really likes its learned verifier. R1 used Guess and Check. This could be a really big cost difference in terms of operation / training. \n\n21) The people don't like RLHF / Safety. DeepSeek beat OpenAI in part due to a fast release cycle without safety etc... however, as long as the labs use their own frontier models internally, without safety controls, it's not clear there is much research cost to this. \n\n22) Chip export controls are likely to get tighter. Government expenditure is likely to go up not down - AI has natsec implications that the Internet did not - I don't think people have factored this in enough. \n\n23) Anthropic needs to catch up. DeepSeek beating Anthropic to public adoption is a really bad sign. We have to ask now: if Anthropic had released a quick reasoning model based on Haiku and made it free through the app, would they be surging ahead right now?\n\n24) US/EU corporations can't really switch to DeepSeek through the DeepSeek API (no data to China) and I doubt will really use Together etc... (it's just expensive - why not use Sonnet) - so Anth / OpenAI are still probably safe for now.\n\n25) Mech Interp / Sakana type experiments will probably get a huge boost from this.\n",
    "tweet_id": "1884039124140916979",
    "note_id": "1884039123675275265",
    "tweet_url": "https://x.com/fleetingbits/status/1884039124140916979",
    "created_at": "2025-01-28T00:41:45.000Z",
    "length": 5080,
    "matched_by": [
      "1)",
      "2)"
    ],
    "quoted_urls": [],
    "tags": [
      "openai",
      "anthropic",
      "google",
      "chinese labs",
      "lab economics",
      "compute",
      "consumer",
      "post-training",
      "pretraining",
      "interpretability",
      "evals",
      "safety",
      "math",
      "agi"
    ],
    "title": "DeepSeek Thoughts",
    "snippet": "1) It succeeded first as a product and only then as an economic force. People didn't look deeply at it until they discovered how fun the model was to use. 2) Sama fucked up by not displaying the o1 chain of thought in the app. DeepSeek still would have been a splash but would not have gone viral because power users would have already seen model CoTs. 3) DeepSeek is a product optimum in some respects. Internet + powerful model is basically enough for most uses. Canvas / Artifacts are nice but don't add enough. Still may become moat over time. 4) DeepSeek is probably still 3-8 months behind OpenAI on actual models. OpenAI released o1 many months after it had been developed and just announced o3."
  }
]