Market overviews • Jan 5, 2024
OS AI Ecosystem: Substantial growth in AI projects as well as contributors
Specifically for Gen AI, the term “open source” typically implies that the source code, along with any applicable model weights and training parameters, is publicly accessible, usable, and modifiable, and that its distribution is permitted.
Adhering to this definition, the open source AI stack includes a comprehensive set of tools to build Gen AI applications - foundational models (such as Llama, Mistral), developer tools & frameworks (such as Langchain, Fixie), model training platforms (such as Weights & Biases, Anyscale), and monitoring tools (such as Datadog, Seldon).
Open source AI innovation is thriving with new projects and developers
Open source Gen AI is seeing significant growth in both the number of projects and the number of contributors. Last year, Github witnessed 148% YOY growth in contributors and 248% YOY growth in the total number of Gen AI projects. As of 2023, there are 60K Gen AI projects on Github and over 400K models on Huggingface.
Contributor base is becoming increasingly global, not restricted to the US and Europe
Beyond the US and Europe, where a majority of open source projects originate, the highest numbers of individual contributors to open source Gen AI came from India and Japan in 2023. Developers from Hong Kong, the UK, Brazil, Germany and Singapore are also making numerous contributions to open source Gen AI. By 2027, India is projected to overtake the US as the largest developer community on Github.

Steady increase in serious contributors, while “tourist” interest has tempered since Q1 hype
Gen AI overall has shifted from widespread hype (peaking in Q1) to more focused, value-driven engagement - the “trough of disillusionment” phase, where initial excitement gives way to sustained, serious development.
A similar trend can be seen in # of stars across Github repos - the growth has tempered since Q1. On the other hand, serious developers (# of contributors to these projects) have grown steadily - 148% cumulatively in 2023.

Python is the preferred language for open source AI
While JavaScript was the top programming language on Github overall in 2023, Python is the top choice for AI repositories. The preference for Python in ML projects has carried over to Gen AI because of its comprehensive ML libraries such as TensorFlow and PyTorch. Python’s flexibility in data handling and its platform-independent nature make it highly adaptable for diverse AI projects.
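As a rough illustration of why these libraries keep Python at the top of the AI stack, the minimal sketch below defines a small classifier and runs one training step with PyTorch. The layer sizes, learning rate, and dummy batch are arbitrary choices for the example, not taken from any particular project.

```python
import torch
from torch import nn

# Minimal sketch: a small feed-forward classifier built with PyTorch.
# All sizes and the random batch below are placeholders for illustration.
model = nn.Sequential(
    nn.Linear(128, 64),  # project 128-d input features down to 64 dims
    nn.ReLU(),
    nn.Linear(64, 10),   # produce scores for 10 output classes
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a random dummy batch of 32 examples.
inputs = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
optimizer.step()
print(f"loss after one step: {loss.item():.4f}")
```

A comparable setup in most other languages would require substantially more boilerplate, which is part of what keeps Gen AI experimentation concentrated in Python.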
Mojo, a Python-compatible language that aims to combine the usability of Python with the performance of C++, is gaining traction as an AI-specific programming language. In Q4’23, Mojo saw a 73% MOM increase in Github stars, indicative of its popularity amongst developers.

AI repositories favouring more protective licensing
A disproportionate share of AI repos use the Apache License, which includes an express patent grant from contributors and so shields developers of derivative projects from patent claims over those contributions. The Apache license is more extensive in its legal terminology and therefore offers better patent protection than other permissive licenses. Though the simpler MIT license is the most popular across Github, Gen AI developers are predictably keen on securing their work with more protective licensing.
Market Map: Multiple projects/startups emerging across the Gen AI tech stack

Foundational models and developer tools, the core stack of AI, are the focus areas for new startups
Over 60% of new companies in the open source AI space are focusing on foundational models and developer tools, the core elements of the AI stack. This is expected, given that these components are fundamental for building, deploying, and managing generative AI applications across various use cases. Innovation in other areas like model training, fine-tuning tools, monitoring tools, and cloud computing services primarily revolves around these core AI stack elements.
High-quality open source AI reducing reliance on proprietary big tech AI, but data is key
The volume and quality of open source AI are now sufficiently robust for developers and startups to compete effectively with proprietary solutions. The OS model Mixtral 8x7B surpassed the closed source GPT-3.5 on chatbot as well as holistic performance benchmarks. Other OS models like Llama and Yi are not far behind.
However, a crucial advantage that big tech firms with closed systems hold is their access to extensive data resources. This is evident in the fact that some recent OS models, such as Llama-2 or Mistral 7B, do not open source their training data. Data is likely to be the key proprietary element in the space.
Funding Landscape: Robust funding in 2022-23; foundational models & training tools secure maximum dollars
Gen AI infrastructure, due to its heavy reliance on vast amounts of data, extensive research, and substantial compute power, requires significant capital investment, which has led to larger funding rounds compared to typical enterprise solutions.

Robust funding activity in 2022-23; foundational models and model training software secured maximum dollars
75% of open source AI startups secured funding in 2022-23. Foundational models and model training/fine-tuning software have attracted >70% of the investment dollars.
Nvidia, a leading manufacturer of AI chips, has been a strategic investor in the space, with investments in top startups like Mistral AI and Adept AI.