Trends and perceptions of youth entrepreneurship in China: a mixed-text mining analysis
LDA model overview
LDA is a generative probabilistic model that assumes each document is a mixture of topics and each topic is a distribution over words. Inference iteratively assigns words to topics and topics to documents, ultimately revealing the underlying thematic structure of the corpus. In other words, given a corpus, LDA estimates the probability distribution of topics within each document and of words within each topic, so that a document can be seen as a blend of various topics. With θ denoting a document’s topic proportions and α the Dirichlet prior parameter, the prior at the core of LDA is written as follows:
$$p\left(\theta \mid \alpha\right)=\mathrm{Dir}\left(\theta \mid \alpha\right)$$
LDA functions as a three-layer Bayesian probability model organized into document–topic–word hierarchies. As shown in Fig. 5, LDA selects topics probabilistically and then chooses words based on those topics, continuously repeating this process to generate all words in the document. This method allows for a clustering of terms to reveal latent semantic relationships across documents, while also reducing dimensionality and alleviating data sparsity. For this study, we applied LDA to Zhihu’s “College Student Entrepreneurship” dataset, preprocessing the questions and answers before training the model to categorize each response under relevant topics. This enabled a systematic analysis of recurring themes within youth entrepreneurship discussions.

LDA model workflow diagram.
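The generative process described above (draw topic proportions for a document, then repeatedly pick a topic and a word from that topic) can be sketched in a few lines. This is a minimal illustration, not the study's code: the vocabulary, the two topic-word distributions, and the function names `sample_dirichlet` and `generate_document` are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, topic_word, vocab, rng):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # document-level topic proportions
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position according to theta.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Pick a word from the chosen topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words


# Toy setup (hypothetical): two topics over a four-word vocabulary.
rng = random.Random(0)
vocab = ["team", "funding", "platform", "store"]
topic_word = [
    [0.50, 0.40, 0.05, 0.05],  # topic 0 favours "team" and "funding"
    [0.05, 0.05, 0.50, 0.40],  # topic 1 favours "platform" and "store"
]
theta, doc = generate_document(10, [0.5, 0.5], topic_word, vocab, rng)
print(theta, doc)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers the topic proportions and topic-word distributions that most plausibly generated them.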
Preliminary analysis of the data from a comprehensive perspective
Based on 7,493 Zhihu topic entries, Fig. 6 displays the trend in “College Student Entrepreneurship” discussions from 2011 to 2024. In 2012, government efforts to promote innovation and entrepreneurship sparked an initial surge in topic growth. This trend continued, peaking in 2014 as entrepreneurship became a mainstream concern. Although topic frequency declined slightly during the COVID-19 pandemic in 2020, it rebounded in 2022, as challenging employment conditions intensified public interest in entrepreneurship.

Trend of topic frequency in college student entrepreneurship.
These trends underscore a rising public interest in youth entrepreneurship, driven by market constraints and heightened job competition. Fig. 7 further shows word co-occurrences within the dataset: larger nodes signify higher word frequency and greater connectivity with other terms, and the 200 primary terms are linked by 19,881 connections. This dense interconnectedness illustrates a wide-ranging and interrelated public interest in youth entrepreneurship.

Word co-occurrence network diagram.
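To make the co-occurrence counts concrete, the sketch below builds undirected edges between terms that appear in the same entry. It assumes document-level co-occurrence and undirected connections, neither of which the text specifies; the sample entries and the function name `cooccurrence_edges` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs):
    """Count, for each unordered term pair, the number of documents containing both."""
    edges = Counter()
    for tokens in docs:
        # Use the set of distinct terms so a pair counts once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[(a, b)] += 1
    return edges


# Toy, already-segmented entries (hypothetical).
docs = [
    ["team", "funding", "market"],
    ["team", "market"],
    ["platform", "market"],
]
edges = cooccurrence_edges(docs)

# Upper bound on undirected connections among 200 terms: C(200, 2).
max_edges = 200 * 199 // 2
print(len(edges), max_edges)
```

Under these assumptions, 200 terms admit at most 19,900 distinct pairs, so 19,881 connections would mean that nearly every pair of primary terms co-occurs at least once.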
Data preprocessing and model training
Before applying the LDA model, we performed several preprocessing steps on the 7,493 Zhihu entries to clean the data, reduce noise, and transform the text into a format suitable for topic modeling. First, text segmentation was performed, which is fundamental for analyzing Chinese text. To enhance the accuracy of segmentation, particularly for specialized terms or specific vocabulary, we loaded a custom dictionary (‘dict.txt’). Simultaneously, part-of-speech (POS) tagging was applied to the segmented words. Next, the segmented words were filtered to remove noise and focus on terms with significant semantic meaning. The filtering criteria were:
(1) Removing non-Chinese characters.
(2) Excluding words present in a comprehensive stopword list and words shorter than 2 characters. The stopword list was constructed by manually downloading the ‘Chinese.txt’ file from the ‘punkt’ package, reading the stopwords, and removing irrelevant characters such as newline characters.
(3) Retaining only words with specified POS tags (nouns ‘n’, other proper nouns ‘nz’, verbal nouns ‘vn’).
These three criteria collectively ensured that the analysis focused on words carrying substantial semantic weight relevant to youth entrepreneurship discourse. Finally, text vectorization was performed using the CountVectorizer method, which converted the preprocessed text into a document-term matrix, the numerical format required for LDA modeling. After these preprocessing and vectorization steps, the LDA model was trained on the resulting matrix.
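The three filtering criteria can be sketched as a single pass over POS-tagged tokens. This is an illustrative sketch only: it assumes the segmenter yields (word, tag) pairs, and the sample tokens, their tags, the stopword set, and the function name `filter_tokens` are hypothetical rather than taken from the study.

```python
import re

# Matches strings made up entirely of common CJK ideographs.
CHINESE_ONLY = re.compile(r"[\u4e00-\u9fff]+")
# POS tags retained in the text: nouns, other proper nouns, verbal nouns.
KEEP_POS = {"n", "nz", "vn"}


def filter_tokens(tagged, stopwords):
    """Apply the three filters to a list of (word, pos) pairs."""
    kept = []
    for word, pos in tagged:
        if not CHINESE_ONLY.fullmatch(word):
            continue  # (1) drop tokens containing non-Chinese characters
        if len(word) < 2 or word in stopwords:
            continue  # (2) drop stopwords and single-character words
        if pos not in KEEP_POS:
            continue  # (3) keep only the selected POS tags
        kept.append(word)
    return kept


# Hand-written sample of tagged tokens (hypothetical).
tagged = [
    ("大学生", "n"), ("创业", "vn"), ("的", "uj"), ("App", "eng"),
    ("我们", "r"), ("钱", "n"), ("知乎", "nz"),
]
stopwords = {"我们", "的"}
tokens = filter_tokens(tagged, stopwords)
print(tokens)
```

Joining the surviving tokens of each entry with spaces gives a pre-segmented string that a count-based vectorizer can split on whitespace.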
A critical step in applying the LDA model is determining the optimal number of topics (K). Selecting an appropriate K is essential to ensure the generated topics are both statistically meaningful and theoretically interpretable, balancing model fit against human understanding. To this end, we evaluated model performance across a range of K values (from 2 to 8) using two widely accepted metrics for topic model evaluation: topic perplexity and topic coherence. Perplexity measures how well the trained model predicts a held-out set of documents; generally, a lower perplexity score indicates better generalization. Topic coherence measures the degree of semantic similarity between the high-scoring words within a topic; a higher coherence score suggests that the words forming a topic are more closely related and the topic is more interpretable.
Our primary criterion for selecting the optimal K was to find a point where perplexity was sufficiently low while coherence was maximized. In cases where perplexity and coherence did not indicate the exact same optimal K, we prioritized models with higher coherence, as coherent topics are more aligned with human interpretability and facilitate a more meaningful qualitative analysis of the themes. As shown in Fig. 8 (Topic Perplexity) and Fig. 9 (Topic Coherence Analysis), we observed that perplexity decreased significantly and plateaued around K = 7, while coherence reached its peak at K = 7 and began to decline thereafter. Based on the optimal balance clearly indicated by these quantitative metrics, and considering the potential for meaningful interpretation of the resulting topics, we selected K = 7 as the number of topics for our LDA analysis.


Topic coherence analysis.
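The selection procedure can be sketched as a scan over candidate K values that records perplexity and coherence for each fitted model. The sketch below uses scikit-learn's `CountVectorizer` and `LatentDirichletAllocation`; the coherence measure is a simple UMass-style score implemented by hand, which is an assumption, since the text does not state which coherence variant was used. The toy documents and the helper names `umass_coherence` and `scan_k` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def umass_coherence(components, X, top_n=5):
    """Mean UMass-style coherence of the top words of each topic."""
    present = X.toarray() > 0            # document-by-term presence matrix
    doc_freq = present.sum(axis=0)       # number of documents containing each term
    scores = []
    for weights in components:
        top = np.argsort(weights)[::-1][:top_n]  # highest-weight terms first
        total = 0.0
        for i in range(1, len(top)):
            for j in range(i):
                co = np.sum(present[:, top[i]] & present[:, top[j]])
                total += np.log((co + 1) / doc_freq[top[j]])
        scores.append(total)
    return float(np.mean(scores))


def scan_k(docs, k_values, seed=0):
    """Fit one LDA model per K and return (K, perplexity, coherence) tuples."""
    # Documents are assumed to be pre-segmented and joined by spaces.
    X = CountVectorizer(analyzer=str.split).fit_transform(docs)
    results = []
    for k in k_values:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed).fit(X)
        # Perplexity is computed on the training matrix here for brevity;
        # a held-out split would be needed to match the definition in the text.
        results.append((k, lda.perplexity(X), umass_coherence(lda.components_, X)))
    return results


# Toy corpus (hypothetical), already segmented.
docs = [
    "team funding market product", "team product company risk",
    "platform traffic user video", "platform content user traffic",
    "store cost price profit", "store location cost profit",
    "funding risk company market", "video content platform user",
]
results = scan_k(docs, k_values=[2, 3, 4])
# Prefer the K with the highest coherence, as the text does when metrics disagree.
best_k = max(results, key=lambda r: r[2])[0]
print(results, best_k)
```

On a real corpus the same loop would run over K = 2 to 8, and the two resulting curves would be inspected together rather than reduced to a single maximum.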
Entrepreneurship theme analysis and identification of core influencing factors
(1) LDA Topic Keyword Distribution and Visualization Analysis
Applying the LDA model with K = 7 yielded seven distinct topics within the “College Student Entrepreneurship” responses. Each topic’s top 15 keywords were extracted and classified, resulting in the themes shown in Table 3. To provide a more integrated and theoretically grounded understanding, Table 3 also maps these themes to relevant literature and theoretical bases. The analysis reveals that youth entrepreneurship interests revolve around team management, entrepreneurial preparation, online/offline integration, entrepreneurial pathways, business operations, market trends, and foundational experience.
To gain a deeper understanding of the discourse structure of youth entrepreneurship revealed by the LDA model, we mapped the seven themes onto existing theories and literature in entrepreneurship research and examined where they converge. This allows us not only to validate our findings but also to uncover their theoretical implications within specific contexts.
Topic 1: foundational elements of entrepreneurship and risk management
This theme reflects the core issue in entrepreneurship research: the process of new venture creation. Its keywords (“company,” “product,” “market,” “team,” and “resources”) form the basic framework of entrepreneurial activity, which closely aligns with the “opportunity, resources, team” three-element model of the classic Timmons Entrepreneurship Model (Timmons and Spinelli, 2008). Furthermore, terms like “investment,” “funding,” and “risk” highlight the critical role of resource acquisition and risk management in the entrepreneurial process, consistent with the core tenet of the Resource-Based View (Barney, 1991), which states that the source of a firm’s competitive advantage lies in its unique ability to possess and integrate valuable, rare resources. This theme can therefore be regarded as a “textbook” discussion of universal principles of entrepreneurship within the discourse of youth entrepreneurship.
Topic 2: personal growth and career preparation
This theme focuses on the “pre-stage” of entrepreneurial behavior, namely, individual preparation and accumulation within the campus environment. Keywords such as “university,” “school,” “skills,” and “specialization” point to Human Capital Theory, which posits that an individual’s knowledge, skills, and educational background are important determinants of their future productivity (Becker, 1964). At the same time, “friends,” “peers,” and “teachers” reflect the role of Social Capital, i.e., an individual’s ability to acquire information, support, and resources through social networks. Together, these two theories supply important antecedent variables for the Theory of Entrepreneurial Intention, explaining why university students represent a significant group within youth entrepreneurship (Ajzen, 1991; Sid et al., 2025).
Topic 3: internet platforms and digital entrepreneurship
This theme distinctly reflects a new paradigm of entrepreneurship in the digital age. Keywords such as “platform,” “e-commerce,” “traffic,” and “user” are core elements of Platform Economics theory, which involves creating value by connecting multi-sided groups (e.g., producers and consumers) through technological platforms (Parker et al., 2016). Specifically, terms like “video,” “content,” and “streaming” point to the rise of Digital Entrepreneurship and content creation models in recent years, characterized by asset-light operations, high iteration, and data-driven approaches. This finding indicates that, within the youth entrepreneurship discourse, business model innovation based on internet platforms has become an extremely important and active topic.
Topics 4 and 7: entrepreneurship competitions and project practice; scientific research innovation and technology application
These two themes collectively reveal a typical path of “academic-style” entrepreneurship: the incubation process from idea to project. In Topic 4, “competition,” “event,” and “campus” embody an important practical form of Entrepreneurship Education: business plan competitions. Extensive research shows that such competitions can effectively enhance students’ entrepreneurial skills and self-efficacy. Meanwhile, “research,” “scientific,” and “analysis” in Topic 7 narrow the focus to the early stages of Technology Commercialization, specifically how to transform scientific research findings into viable business ideas. This aligns with the “build-measure-learn” loop advocated by the Lean Startup methodology (Ries, 2011), emphasizing thorough analysis and validation before committing significant resources.
Topic 5: brick-and-mortar entrepreneurship and business management
Unlike high-tech or platform-based entrepreneurship, this theme depicts a more traditional and grounded entrepreneurial landscape. Words like “store,” “cost,” “price,” “profit,” and “location” are fundamental concepts found in Small Business Management textbooks. These types of entrepreneurial activities are often categorized as Necessity Entrepreneurship or Lifestyle Entrepreneurship, where the primary goal might be to secure employment or maintain a specific way of life rather than to pursue rapid growth (Szivas, 2001). This reminds us that youth entrepreneurship is not solely an elite pursuit; a significant amount of practice exists within the microeconomic activities of daily life.
Topic 6: macro environment and industry development
This theme examines youth entrepreneurship within a broader context. Broad terms such as “society,” “economy,” “national,” and “industry” point directly to Entrepreneurial Ecosystems theory (Isenberg, 2011; Stam, 2015). This theory emphasizes that the flourishing of entrepreneurial activity relies on the support of systemic environmental factors such as policy, finance, culture, and market. Our study found that in the discourse of Chinese youth entrepreneurship, particular attention is paid to national development (“national”), socio-economics (“society,” “economy”), and technological trends (“technology”). This may reflect the strong influence of top-down policy guidance and national development strategies in shaping the wave of youth entrepreneurship, which is also consistent with the tenets of Institutional Theory (North, 1990).
Fig. 10 visualizes the seven topics identified from the Zhihu data by the LDA model. In the clustering map, most topics do not overlap with each other, indicating that the model separates the topics well. The topic feature word distributions show the top 30 feature words within a topic, with light blue denoting a word’s frequency of occurrence and dark red denoting its topic weight. Topic 2 and Topic 3 share some related feature words, whereas Topic 4 and Topic 7 lie farther from the other topics in terms of Jensen-Shannon divergence (JSD), indicating that they differ more from the rest.
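Inter-topic distance of this kind is typically computed as the Jensen-Shannon divergence between two topics' word distributions. The sketch below shows the calculation on toy distributions; the example vectors and function names are hypothetical, and the study's own distances come from its visualization tool rather than from this code.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: the mean KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy topic-word distributions over a shared four-word vocabulary (hypothetical).
topic_a = [0.50, 0.40, 0.05, 0.05]
topic_b = [0.45, 0.40, 0.10, 0.05]  # similar to topic_a
topic_c = [0.05, 0.05, 0.50, 0.40]  # very different from topic_a

near = jensen_shannon(topic_a, topic_b)
far = jensen_shannon(topic_a, topic_c)
print(near, far)
```

The divergence is symmetric, equals zero for identical distributions, and is bounded above by ln 2 for distributions with no shared support, so topics plotted far apart are those whose keyword distributions overlap least.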
(2) Valuation of Influencing Factors

LDA thematic clustering map and topic feature word distributions.
After applying the LDA model with K = 7 and identifying the top keywords for each generated topic (as detailed in Table 3), we calculated each topic’s proportion in the corpus, which represents its relative influence in the Zhihu discussions. We then grouped these themes into broader categories (as detailed in Table 4) to provide a more integrated and theoretically grounded understanding of the factors influencing youth entrepreneurship from the public’s perspective. Drawing upon the innovation ecosystem framework, which differentiates between internal factors related to the entrepreneur or venture and external environmental conditions, we classified the seven topics into two overarching categories: Subjective Entrepreneurial Capabilities and Objective Social Factors. Subjective Entrepreneurial Capabilities encompass themes primarily focused on the skills, knowledge, mindset, experiences, operational aspects, and processes directly related to the entrepreneur and the venture’s internal functioning. Objective Social Factors include themes related to the external environment, market dynamics, industry context, and available resources within the ecosystem.
This two-category classification is guided by the fundamental components of the innovation ecosystem framework and reflects a natural division between internally-driven entrepreneurial elements and external ecosystem influences as discussed in the Zhihu data. This framework helps to conceptualize the interplay of individual agency and environmental factors in entrepreneurship. We assigned each of the seven LDA topics to one of these two categories based on the dominant semantic meaning of its top keywords:
- Subjective Entrepreneurial Capabilities: Topic 1 (Foundational Elements of Entrepreneurship and Risk Management), Topic 2 (Personal Growth and Career Preparation), Topic 4 (Entrepreneurship Competitions and Project Practice), Topic 5 (Brick-and-Mortar Entrepreneurship and Business Management), and Topic 7 (Scientific Research Innovation and Technology Application) were grouped here. This category aligns with theories emphasizing the individual’s role and internal attributes in entrepreneurial success, such as human capital theory, self-efficacy, and entrepreneurial intentions. For instance, Topic 4 (‘Entrepreneurial Pathways’), while influenced by external factors like market competition or internet platforms (as reflected in its keywords), predominantly features terms indicating the entrepreneur’s actions, planning, team involvement, and analytical efforts (project, business plan, competition, team, analysis, participation). These elements are strongly aligned with the subjective aspects of building and navigating a venture within the ecosystem. Similarly, the other topics in this category focus on internal skills, preparation, operations, and foundational experience.
- Objective Social Factors: Topic 3 (Internet Platforms and Digital Entrepreneurship) and Topic 6 (Macro Environment and Industry Development) were grouped here, as their core keywords directly pertain to external market conditions, industry characteristics, and broader developmental trends within the entrepreneurial ecosystem. These factors are key components of the entrepreneurial ecosystem framework, highlighting the influence of external support structures, market dynamics, and institutional environment on entrepreneurial activities.
Table 4 below presents this classification of the seven topics into these two overarching categories, along with their respective proportions in the Zhihu data, highlighting the relative perceived importance of these factor types in public discourse from this specific dataset.
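The aggregation from topic proportions to category proportions can be sketched as follows. The topic-to-category mapping follows the assignment listed above; everything else is an assumption for illustration: the text does not state how topic proportions were computed, so the sketch uses the share of entries whose dominant topic is each topic, and the document-topic rows are toy values, not results from the study.

```python
from collections import Counter

SUBJECTIVE = "Subjective Entrepreneurial Capabilities"
OBJECTIVE = "Objective Social Factors"

# Topic-to-category assignment as listed in the text (topics numbered 1 to 7).
CATEGORY_OF_TOPIC = {
    1: SUBJECTIVE, 2: SUBJECTIVE, 4: SUBJECTIVE, 5: SUBJECTIVE, 7: SUBJECTIVE,
    3: OBJECTIVE, 6: OBJECTIVE,
}


def topic_proportions(doc_topic):
    """Share of documents whose highest-weight (dominant) topic is each topic."""
    dominant = [max(range(len(row)), key=row.__getitem__) + 1 for row in doc_topic]
    counts = Counter(dominant)
    n_topics = len(doc_topic[0])
    return {t: counts.get(t, 0) / len(doc_topic) for t in range(1, n_topics + 1)}


def category_proportions(props):
    """Sum topic proportions within each overarching category."""
    totals = Counter()
    for topic, share in props.items():
        totals[CATEGORY_OF_TOPIC[topic]] += share
    return dict(totals)


# Toy document-topic matrix (hypothetical): one row per entry, seven topic weights.
doc_topic = [
    [0.60, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05],
    [0.10, 0.10, 0.50, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.10, 0.60, 0.05, 0.05],
    [0.10, 0.50, 0.10, 0.10, 0.10, 0.05, 0.05],
]
props = topic_proportions(doc_topic)
categories = category_proportions(props)
print(props, categories)
```

An alternative, equally plausible definition averages the topic weights over all entries instead of counting dominant topics; the two can differ when many entries mix several topics.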
Topics related to entrepreneurial skills (such as Team Management and Business Operations) had the greatest influence, indicating the importance of capabilities within the entrepreneurial cycle. In contrast, external factors like social environment and industry development were more supplementary. Notably, there was limited discussion of macro-policy topics, which could stem from two potential factors. One possibility is a gap in youth awareness or understanding of available entrepreneurial policies and their relevance. Alternatively, it may reflect the nature of Zhihu itself, a user-driven platform oriented toward practical problem-solving, where policy discussions are less frequent than practical concerns. Regardless of the primary cause, this finding highlights a potential disconnect that warrants attention from policymakers and educators regarding effective channels for communicating policy information and fostering policy engagement among young entrepreneurs.
