Before we begin, make sure you have installed all the additional required Python packages. (The instructions below use pip. You can use easy_install, too.) Also, consider using virtualenv for a cleaner installation experience instead of sudo.
In the following subsections, we will use the MAG and AMiner datasets to better understand how various paper characteristics, such as title, abstract, keywords, length, and references, have changed over time. We will use the Altair and Seaborn Python packages to visualize these trends and analyze how papers' properties have changed during the last century. We will start by observing how the number of published papers has changed over time.
In the last 160 years there has been a massive surge in the number of publications. In this subsection, we will analyze the number of publications and the language of publications over time.
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
import turicreate as tc
from configs import *
from visualization.visual_utils import *
import turicreate.aggregate as agg
from utils import detect_lang, filter_sframe_by_func
sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)["Paper ID", "Paper publish year", 'Ref Number']
sf = sf.rename({"Paper publish year": "Year"})
g = sf.groupby("Year", {"Number of papers": agg.COUNT()})
sf2 = sf[sf['Ref Number'] >= 5]
g2 = sf2.groupby("Year", {"Number of papers (Ref >= 5)": agg.COUNT()}) # Papers with fewer than 5 references can be reports, news, and letters
g = g.join(g2, how="left")
draw_features_yearly_chart_multi_lines(g, "Publication Type", "Total Papers", 1800, 2015, "MAG Number of Papers")
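The aggregation above can be mimicked on a toy list without Turi Create. The following is only a minimal sketch: the `papers` records and the helper `count_papers_by_year` are hypothetical and not part of the notebook's code. It counts papers per year, counts papers with at least 5 references per year, and left-joins the two results on the year.

```python
from collections import Counter

# Hypothetical toy records: (publication year, number of references)
papers = [(1990, 2), (1990, 7), (1991, 12), (1991, 0), (1991, 5)]

def count_papers_by_year(records, min_refs=0):
    """Count papers per year, keeping only papers with at least min_refs references."""
    return Counter(year for year, refs in records if refs >= min_refs)

all_counts = count_papers_by_year(papers)
ref5_counts = count_papers_by_year(papers, min_refs=5)

# Left join on year: a year without any qualifying paper gets None (a missing value)
joined = {year: (n, ref5_counts.get(year)) for year, n in sorted(all_counts.items())}
print(joined)
```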
As you can see from the chart above, the number of papers has increased exponentially over the years. The decline from 2015 forward is probably due to missing papers in our dataset for years after 2014. Therefore, for the rest of this notebook, we will use only papers that were published before 2015. Next, let's observe the dominant paper languages over the years by analyzing title languages for all papers with at least 5 references:
sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)["Paper ID", "Original paper title", "Paper publish year"]
sf = sf.rename({"Paper publish year": "Year", "Original paper title":"Title"})
sf["Title Language"] = sf["Title"].apply(lambda t: detect_lang(t))
g = sf.groupby("Title Language", {"Number of Papers": agg.COUNT()})
g = g.sort("Number of Papers", ascending=False)
top_langs = set(g['Title Language'][:10])
sf = filter_sframe_by_years(sf,1980,2014)
sf = sf[sf["Title Language"].apply(lambda t: t in top_langs)]
g = sf.groupby(["Title Language", "Year"], {"Number of Papers": agg.COUNT()})
g = g.sort(["Year", "Number of Papers"], ascending=False)
chart = alt.Chart(g.to_dataframe(), title="Papers Title Top-10 Languages over Time").mark_line().encode(
    alt.X('Year:Q', axis=alt.Axis(format='d'), scale=alt.Scale(zero=False)),
    alt.Y('Number of Papers:Q', scale=alt.Scale(zero=False)),
    color="Title Language"
)
chart
g = g[g["Title Language"] != "english"]
chart = alt.Chart(g.to_dataframe(), title="Papers with Top-9 Non-English Titles over Time").mark_line().encode(
    alt.X('Year:Q', axis=alt.Axis(format='d'), scale=alt.Scale(zero=False)),
    alt.Y('Number of Papers:Q', scale=alt.Scale(zero=False)),
    color="Title Language"
)
chart
It can be observed that since the 1980s there have been more and more publications in languages other than English, such as Japanese, Chinese, and Spanish.
A paper’s title is among the first things one looks for before deciding to read a paper. Moreover, using the right title can help other researchers find the paper using search engines. In this subsection, we will analyze how papers' title lengths and properties have changed over time.
sf = tc.load_sframe(AMINER_PAPERS_SFRAME)["year","title", "lang"]
sf = sf.rename({"year":"Year"})
sf = sf[sf["lang"] == "en"]
sf["title_lang"] = sf["title"].apply(lambda t: detect_lang(t))
sf = sf[sf["title_lang"] == "english"]
# Second, remove papers with very short titles
sf["Title Length"] = sf["title"].apply(lambda t: len(t.split()))
sf = sf[sf["Title Length"] > 1]
sf = sf[sf["Title Length"] < 50] # remove long titles that are probably the result of title-parsing problems
# Lastly, filter out papers that were published before 1850 or after 2014 (or have an invalid publication year)
sf = sf[sf["Year"] <= 2014]
sf = sf[sf["Year"] >= 1850]
sf = sf.sort("Year") # 102489813 titles
draw_feature_yearly_func_value(sf, "Title Length", "Average Number of Words", 1900, 2014, func_name="agg.AVG",
title="Average Title Length over Time")
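To make the title-length computation concrete, here is a small sketch on toy data. The `titles` list and the helper `average_title_length_by_year` are hypothetical illustrations; only the word-count rule (split on whitespace, keep titles with more than 1 and fewer than 50 words) is taken from the code above.

```python
from collections import defaultdict

# Hypothetical toy data: (year, title)
titles = [
    (1950, "On Waves"),
    (1950, "Notes"),  # single-word title, filtered out
    (2010, "A Large-Scale Study of Paper Titles over Time"),
    (2010, "Do Longer Titles Help?"),
]

def average_title_length_by_year(records, min_words=2, max_words=49):
    """Average number of whitespace-separated words per title, grouped by year."""
    lengths = defaultdict(list)
    for year, title in records:
        n_words = len(title.split())
        if min_words <= n_words <= max_words:
            lengths[year].append(n_words)
    return {year: sum(v) / float(len(v)) for year, v in lengths.items()}

avg_lengths = average_title_length_by_year(titles)
print(avg_lengths)
```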
Let's check additional trends in papers' titles, such as the average word length and the usage of punctuation marks. We will start with calculating the average title word length over time:
sf['Avg Word Number of Chars'] = sf["title"].apply(lambda t: np.average([len(w) for w in t.split()]))
draw_feature_yearly_func_value(sf, "Avg Word Number of Chars", "Average of Average Word Number of Chars", 1900, 2014, func_name="agg.AVG", title="Titles Average Word Length over Time")
sf['mark'] = sf["title"].apply(lambda t: 1 if ('!' in t or '?' in t) else 0 )
draw_feature_yearly_func_value(sf, "mark", "Percentage of Papers", 1900, 2014, func_name="agg.AVG", title="Percentage of Papers with '?' or '!' in Title")
We can see that in recent years the usage of '?' and '!' in titles has increased. Let's look at some concrete examples of interrobangs (?! and !?) in papers' titles:
sf['bangs'] = sf["title"].apply(lambda t: 1 if ('!?' in t or '?!' in t) else 0 )
draw_feature_yearly_func_value(sf, "bangs", "Percentage of Papers", 1850, 2014, func_name="agg.AVG", title="Percentage of Papers with Interrobangs in Title")
sf[sf["bangs"] == 1]["title"][1000:1100]
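The charts above average a 0/1 flag per year, which is why the result can be read as a share of papers. The sketch below shows the two flags on a few made-up titles; the function names and the sample titles are hypothetical and only mirror the lambdas used above.

```python
def has_question_or_exclamation(title):
    """Return 1 if the title contains '?' or '!', else 0."""
    return 1 if ('?' in title or '!' in title) else 0

def has_interrobang(title):
    """Return 1 if the title contains '?!' or '!?', else 0."""
    return 1 if ('?!' in title or '!?' in title) else 0

# Hypothetical sample titles
titles = ["Is Bigger Better?", "Really?! A Case Study", "A Survey of Methods"]
marks = [has_question_or_exclamation(t) for t in titles]
bangs = [has_interrobang(t) for t in titles]

# The mean of a 0/1 flag is the fraction of titles that carry the flag
share_with_marks = sum(marks) / float(len(marks))
print(marks, bangs, share_with_marks)
```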
It can be observed that over time papers' titles have become longer both by adding more words to the title and by using more characters in each word. Moreover, over time more and more titles use question and exclamation marks. We can assume that using these marks can make the titles more attractive, to convince other researchers to read the paper.
It is well-known that papers' average number of authors has increased sharply in recent years. Let’s use our datasets to validate this observation and explore how the importance of the order of authors has changed over time.
sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)[ "Paper publish year", 'Authors List Sorted', 'Authors Number', "Ref Number"]
sf = sf.rename({"Paper publish year": "Year"})
g = sf.groupby( "Year", {"Average Authors Number": agg.AVG('Authors Number')})
sf2 = sf[sf['Ref Number'] >= 5]
g2 = sf2.groupby("Year", {"Average Authors Number (Ref >=5)": agg.AVG('Authors Number')})
g = g.join(g2, how="left")
draw_features_yearly_chart_multi_lines(g, "Publication Type", "Average Authors Number", 1850, 2015, title="Average Authors Number over Time")
As we can see, the average number of authors per paper has increased sharply in recent years. Let's examine how the maximal number of authors per year has changed.
g2 = sf2.groupby("Year", {"Maximal Number of Authors (Ref >=5)": agg.MAX('Authors Number')})
draw_feature_yearly_func_value(sf2, "Authors Number", "Max Number of Authors", 1980, 2015, func_name="agg.MAX", title="Maximal Number of Authors over Time (Ref >=5)")
sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)[ "Paper publish year", 'Authors Number', "Ref Number", "Original paper title"]
sf2 = sf[sf['Ref Number'] >= 5]
sf2.sort('Authors Number', ascending=False)
We can notice that in recent years more and more papers were written by thousands of authors who worked together on extremely large projects. While it is clear that the average number of authors per paper has increasingly grown in recent years, there is an additional interesting aspect regarding the papers' author lists - the order of the names in the list. Let's view how the order of the authors list has changed over time and check if alphabetic order is the most common order.
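def get_last_names_list(l):
    # Collect the last name (last token) of each author; skip papers with single-token names
    last_names_list = []
    for d in l:
        n = d['name'].split()
        if len(n) < 2:
            return None
        last_names_list.append(n[-1])
    return last_names_list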
sf = tc.load_sframe(AMINER_PAPERS_SFRAME)['year','authors', 'lang']
sf = sf[sf["lang"] == "en"]
sf['authors_number'] = sf['authors'].apply(lambda l: len(l))
sf = sf[sf['authors'] != None]
sf = sf[sf['authors_number'] > 1]
sf['author_last_names'] = sf['authors'].apply(lambda l: get_last_names_list(l) )
sf = sf[sf['author_last_names'] != None]
Let's check the percentage of papers in which the authors appear in alphabetical order:
def is_alphabetical_order(l):
    for i in range(len(l) - 1):
        if l[i] > l[i+1]:
            return 0
    return 1
sf['is_alphabetical_order'] = sf['author_last_names'].apply(lambda l: is_alphabetical_order(l))
sf = sf.rename({"year": "Year"})
sf = sf.rename({"authors_number": "Authors Number"})
draw_feature_yearly_func_value(sf, "is_alphabetical_order", "Percentage of Papers", 1850, 2014, func_name="agg.AVG", title="Percentage of Papers with Authors in Alphabetical Order")
g = sf.groupby(['Year', "Authors Number"], {"Percentage of Papers": agg.AVG("is_alphabetical_order")})
g = g.sort(["Authors Number", "Year"] )
g = g[g["Year"] <= 2014]
g = g[g["Year"] > 1950 ]
g = g[g["Authors Number"] <= 10]
df = g.to_dataframe()
df = df.fillna(0)
df = df.sort_values(by=['Year'])
chart = alt.Chart(df, title="Papers with Authors in Alphabetical Order by Authors Number").mark_line().encode(
    alt.X('Year:Q', axis=alt.Axis(format='d'), scale=alt.Scale(zero=False)),
    alt.Y("Percentage of Papers:Q", scale=alt.Scale(zero=False)),
    color="Authors Number"
)
chart
We can observe that over the years there has been a decline in the usage of alphabetical order. Additionally, we can observe that the usage of alphabetical order sharply decreases in cases where there are 3 or more authors who wrote the paper.
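The alphabetical-order test can be tried on a few author lists. This is a minimal sketch: the helper `last_names` and the sample names are hypothetical, and it assumes, as in the code above, that the last whitespace-separated token of a name is the last name. Note that plain string comparison is case-sensitive, so a lowercase last name sorts after an uppercase one.

```python
def last_names(author_names):
    """Last token of each name; None if any name consists of a single token."""
    result = []
    for name in author_names:
        parts = name.split()
        if len(parts) < 2:
            return None
        result.append(parts[-1])
    return result

def is_alphabetical_order(names):
    """Return 1 if every name is <= the next one (case-sensitive), else 0."""
    return 1 if all(a <= b for a, b in zip(names, names[1:])) else 0

# Hypothetical sample author lists
print(is_alphabetical_order(last_names(["Ada Lovelace", "Alan Turing"])))
print(is_alphabetical_order(last_names(["Alan Turing", "Ada Lovelace"])))
print(last_names(["Plato", "Alan Turing"]))
```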
In many cases, the abstract section is one of the first sections a researcher looks at before deciding to read the paper. Therefore, a better abstract can improve the number of reads a paper gets. Let's check how the length of papers' abstract sections has changed over time.
sf = tc.load_sframe(AMINER_PAPERS_SFRAME)["year","abstract", "lang"]
sf = sf[sf["lang"] == "en"] # only abstract of English papers
sf = sf[sf["lang"] != None]
sf["Number of Words in Abstracts"] = sf['abstract'].apply(lambda a: len(a.split()))
sf = sf[sf["Number of Words in Abstracts"] > 10] # remove short abstracts
sf = sf[sf["Number of Words in Abstracts"] <= 2000] # remove long abstracts
sf["abstract_lang"] = sf["abstract"].apply(lambda t: detect_lang(t))
sf = sf[sf["abstract_lang"] == "english"]
sf = sf.rename({"year": "Year"})
draw_feature_yearly_func_value(sf["Year", "Number of Words in Abstracts"] , "Number of Words in Abstracts", "Average Number of Words", start_year=1900, end_year=2013, func_name="agg.AVG",title="Abstracts' Average Number of Words over Time")
sf = sf[sf["Number of Words in Abstracts"] <= 500]
draw_features_decade_dist(sf, "Number of Words in Abstracts", 1950, 2013, col_warp=2, sharex=True, sharey=True)
decades_list = [1950, 1980, 1990, 2000, 2010]
draw_layered_hist(sf, "Number of Words in Abstracts", decades_list, 1920, 2014)
From the above figures, it can be observed that over time abstracts have become longer. In the 1950s, most abstracts were shorter than 100 words. In recent decades, most abstracts have had over 100 words, and more and more abstracts contain between 200 and 300 words.
Papers' keywords are very helpful for understanding the general topic of a paper, and they make it easier to search for a paper on a specific topic. Let's check how the usage of keywords has changed over the years.
sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)["Paper publish year", "Keywords List", 'Ref Number', "Journal ID mapped to venue name"]
sf = sf.rename({"Paper publish year": "Year"})
sf = sf.fillna("Keywords List", [])
sf["Keywords Number"] = sf["Keywords List"].apply(lambda l: len(l))
k_sf = sf[sf["Keywords Number"] <= 20] # remove papers with too many keywords
k_sf['has_keywords'] = k_sf["Keywords List"].apply(lambda l: 1 if len(l) > 0 else 0)
has_k_sf = k_sf[k_sf['has_keywords']]
g = has_k_sf.groupby("Year", {"Papers with Keywords ": agg.COUNT()})
sf2 = has_k_sf[has_k_sf['Ref Number'] >= 5]
g2 = sf2.groupby("Year", {"Papers with Keywords (Ref >= 5)": agg.COUNT()})
g = g.join(g2, how="left")
draw_features_yearly_chart_multi_lines(g, "Publication Type", "Number of Papers over Time", 1800, 2014,title="Number of Papers with Keywords (MAG)")
j_sf = k_sf[k_sf["Journal ID mapped to venue name"] != '']
draw_feature_yearly_func_value(j_sf, "has_keywords", "Percentage of Journal Papers", 1900, 2014, func_name="agg.AVG", title="Percentage of Journal Papers with Keywords")
In the last century, the percentage of papers containing keywords has skyrocketed. However, since 2010, we can see a decline in the percentage of journal papers containing keywords. Similar observations can also be found in the AMiner dataset:
a_sf = tc.load_sframe(AMINER_PAPERS_SFRAME)["year", "keywords", "lang", "venue"]
a_sf = a_sf[a_sf["keywords"] != None]
a_sf = a_sf[a_sf["lang"] == "en"]
a_sf = a_sf.rename({"year": "Year", "keywords": "Keywords"})
g = a_sf.groupby("Year", {"Number of Papers": agg.COUNT()})
draw_features_yearly_chart(g, "Number of Papers", 1900, 2014, title="Number of Papers with Keywords over Time (AMiner)")
sf = a_sf[a_sf["Year"] == 2013]
sf = sf[sf["venue"] != None]
g = sf.groupby("venue", {"Number of Papers without Keywords (2013)": agg.COUNT()})
g.sort("Number of Papers without Keywords (2013)", ascending=False)
According to the above table, indeed many of the papers published in top journals in 2013 didn't contain keywords. This could be a result of format limitations or missing data in the dataset. For example, we can observe that many of the PLOS ONE papers don't have any keywords. Yet PLOS ONE papers and their matching keywords (defined as subject areas) are presented on the PLOS ONE website. However, from our observation these subject areas don't appear in the papers' PDF versions. Let's check how the average number of keywords per paper has changed over time.
#Only MAG Journal Papers
j_sf = k_sf[k_sf["Keywords Number"] != 0]
draw_feature_yearly_func_value(j_sf, "Keywords Number", "Average Keywords Number", 1900, 2014, func_name="agg.AVG",title="Average Journal Papers Number of Keywords over Time")
#AMiner papers
a_sf["Keywords Number"] = a_sf["Keywords"].apply(lambda l: len(l))
draw_feature_yearly_func_value(a_sf, "Keywords Number", "Average Keywords Number", 1900, 2014, func_name="agg.AVG", title="Papers Average Keywords Number (AMiner)")
years = set([1950,1970,1990, 2010])
sf = k_sf[k_sf["Year"].apply(lambda y: y in years)]["Year","Keywords Number"]
sf = sf[sf["Keywords Number"] <= 20]
sf = sf[sf["Keywords Number"] > 0]
sns.boxplot(x="Year", y="Keywords Number", data=sf.to_dataframe())
The MAG dataset maps each keyword to its matching field of study. Moreover, the dataset also contains the hierarchical order among the various fields of study. In our study, we used this hierarchical order to observe and measure the number of multidisciplinary papers that contain keywords from two or more different research fields.
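The idea can be sketched on toy data before applying it to the full dataset. The records and helper names below are hypothetical; the only rule taken from this notebook is that a paper counts as multidisciplinary when its keywords map to two or more top-level (L0) fields.

```python
from collections import defaultdict

# Hypothetical toy records: (year, top-level (L0) fields of the paper's keywords)
papers = [
    (1950, ["Physics"]),
    (1950, ["Biology"]),
    (2000, ["Biology", "Computer Science"]),
    (2000, ["Chemistry"]),
    (2000, ["Physics", "Mathematics", "Computer Science"]),
    (2000, []),
]

def is_multidisciplinary(l0_fields):
    """Flag a paper whose keywords map to two or more top-level fields."""
    return 1 if len(l0_fields) >= 2 else 0

def multidisciplinary_share_by_year(records):
    """Average the 0/1 flag per year, which gives the share of flagged papers."""
    flags = defaultdict(list)
    for year, fields in records:
        flags[year].append(is_multidisciplinary(fields))
    return {year: sum(v) / float(len(v)) for year, v in flags.items()}

share = multidisciplinary_share_by_year(papers)
print(share)
```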
f_sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)['Paper publish year',"Keywords List", "Fields of study parent list names (L1)", "Fields of study parent list (L1)",
"Fields of study parent list names (L0)", "Fields of study parent list (L0)"]
f_sf = f_sf.rename({"Paper publish year": "Year"})
f_sf = f_sf[f_sf["Fields of study parent list (L1)"] != None]
f_sf = f_sf.fillna("Fields of study parent list (L1)", [])
f_sf["Number of L1 Fields"] = f_sf["Fields of study parent list (L1)"].apply(lambda l: len(l))
f_sf["Number of L0 Fields"] = f_sf["Fields of study parent list (L0)"].apply(lambda l: len(l))
f_sf = f_sf[f_sf["Fields of study parent list (L0)"] != None]
f_sf = f_sf.fillna("Fields of study parent list (L0)", [])
f_sf = f_sf[f_sf["Number of L1 Fields"] <= 10 ] # remove papers with too many research fields
f_sf["Multidisciplinary Research"] = f_sf["Fields of study parent list (L0)"].apply(lambda l: 1 if len(l) >=2 else 0)
f_sf["Keywords List", "Fields of study parent list names (L0)", "Fields of study parent list names (L1)"]
g = f_sf[f_sf["Multidisciplinary Research"]]["Year", "Multidisciplinary Research"]
g = g.groupby("Year", {"Number of Papers": agg.COUNT()})
draw_features_yearly_chart(g, "Number of Papers", 1800, 2014, title="Number of Multidisciplinary Papers over Time")
draw_feature_yearly_func_value(f_sf, "Multidisciplinary Research", "Percentage of Multidisciplinary Papers", 1900, 2014, func_name="agg.AVG",title="Percentage of Multidisciplinary Papers over Time")
draw_feature_yearly_func_value(f_sf, "Number of L0 Fields", "Average Number of L0 Fields", 1900, 2014, func_name="agg.AVG", title="Average Number of L0 Fields over Time")
draw_feature_yearly_func_value(f_sf, "Number of L1 Fields", "Average Number of L1 Fields", 1900, 2014, func_name="agg.AVG",title="Average Number of L1 Fields over Time")
As can be observed from the above charts, over the last century the overall number of multidisciplinary papers and the average number of research fields per publication increased considerably until 2010. Then, in 2010, there is a sharp decrease, probably due to the decrease in papers with keywords or to other factors. Nevertheless, it is clear that during the last century considerably more multidisciplinary publications have been published.
Let's check how the characteristics of papers have changed over time. We start by examining how the number of references has changed over the years.
sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)["Paper publish year", "Ref Number"]
sf = sf.rename({"Paper publish year": "Year"})
g = sf.groupby("Year", {"Average References Number": agg.AVG("Ref Number")})
sf2 = sf[sf['Ref Number'] >= 5]
g2 = sf2.groupby("Year", {"Average References Number (>= 5)": agg.AVG("Ref Number")})
g = g.join(g2, how="left")
draw_features_yearly_chart_multi_lines(g, "Publication Type", "Average References Number", 1800, 2015, title="Average References Number over Time")
Let's use the Seaborn package to visualize how the distribution of the number of references has changed over the past decades. Looking at the data, we can observe that there are many papers with hundreds and even thousands of references. However, most papers have far fewer references. We will zoom in on papers that have between 5 and 100 references and observe how the number of references has changed over time.
sf = sf[sf["Ref Number"] <= 100]
sf = sf[sf["Ref Number"] >= 5]
sf = sf.rename({"Ref Number": "References Number"})
draw_features_decade_dist(sf, "References Number", 1960, 2013, col_warp=2, sharex=True, sharey=True)
decades_list = [1950, 1980, 1990, 2000, 2010]
draw_layered_hist(sf, "References Number", decades_list, 1950, 2014)
It can be observed that over time the percentage of papers with higher numbers of references has increased, especially in the last two decades. For example, while in 1960 only a few papers had over 20 references, since 2010 it is quite common to find papers with over 20 references. Moreover, while in the 1990s there were only a few papers with over 40 references, in recent years not only have a high number of papers been published each year, but also a higher percentage of them contains over 50 references. In the next subsection, we will be focusing on how the number of self-citations has changed over time.
Various papers claim that in recent years there has been a surge in the usage of self-citations. Let's check this claim at scale using the MAG dataset by analyzing the number of self-citations over time for papers with at least 5 references.
sf["Is Self Citation"] = sf["self citation"].apply(lambda i: 1 if i > 0 else 0)
sf2 = tc.load_sframe(EXTENDED_PAPERS_SFRAME)["Paper ID", "Paper publish year", "Ref Number"]
sf2 = sf2[sf2["Ref Number"] >= 5]
sf2 = sf2.rename({"Paper publish year":"Year"})
s_sf = sf.join(sf2)
g = s_sf.groupby("Year", {"Total Self Citations": agg.SUM("Is Self Citation")})
draw_features_yearly_chart(g, "Total Self Citations", 1900, 2014, title="Total Self-Citations over Time")
As can be seen in the above chart, the overall number of self-citations has exponentially grown over the last century. Let's observe how the average number of self-citations per paper has changed over the years.
g = s_sf.groupby(["Year", "Paper ID", "Ref Number"], {"Total Self Citations": agg.SUM("Is Self Citation")})
draw_feature_yearly_func_value(g, "Total Self Citations", "Average Number of Self-Citations per Paper", 1950, 2014, func_name="agg.AVG", title="Average Number of Self-Citations per Paper over Time")
g["Self Citation Ratio"] = g.apply(lambda r: r["Total Self Citations"]/float(r["Ref Number"]))
draw_feature_yearly_func_value(g, "Self Citation Ratio", "Average Self-Citation Percentage", 1950, 2014, func_name="agg.AVG", title="Average Self-Citation Percentage per Paper over Time")
draw_feature_yearly_func_value(g, "Total Self Citations", "Maximal Self-Citations Number", 1950, 2014, func_name="agg.MAX", title="Maximal Self-Citations Number over Time")
As can be observed from the above charts, the average number of self-citations has increased over the last century from about a single self-citation per paper in the 1980s to over 2.2 self-citations per paper in 2014. Moreover, in recent years there are papers with over 200 self-citations. However, due to the increase in the total number of references per paper, the overall percentage of self-citation references has decreased in recent years.
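The last point, a rising self-citation count together with a falling self-citation share, is easy to see on made-up numbers. In the sketch below the records and the helper `yearly_self_citation_stats` are hypothetical; the ratio is the number of self-citations divided by the paper's number of references, as in the code above.

```python
# Hypothetical papers: (year, number of references, number of self-citing references)
papers = [(1980, 10, 1), (1980, 20, 1), (2014, 40, 2), (2014, 60, 3)]

def yearly_self_citation_stats(records):
    """Per year: (average self-citations per paper, average self-citation ratio)."""
    by_year = {}
    for year, ref_number, self_cites in records:
        ratio = self_cites / float(ref_number)
        by_year.setdefault(year, []).append((self_cites, ratio))
    return {
        year: (sum(c for c, _ in v) / float(len(v)), sum(r for _, r in v) / len(v))
        for year, v in by_year.items()
    }

stats = yearly_self_citation_stats(papers)
for year in sorted(stats):
    print(year, stats[year])
```

In this toy example the average count grows between the two years while the average ratio shrinks, because the reference lists grow faster than the self-citations.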
We can use the papers that appear in both the MAG and AMiner datasets to calculate how papers’ lengths have changed over time. We first calculate the average paper lengths over time. Then, we calculate the average and median lengths of only journal papers over time.
#first we filter papers with no page start or page end values
def convert_to_int(i):
    # Return the value as an int, or None if it cannot be parsed
    try:
        return int(i)
    except (TypeError, ValueError):
        return None
am_sf = tc.load_sframe(AMINER_PAPERS_SFRAME)["year", "page_start", "page_end","references"]
am_sf["Ref Number"] = am_sf["references"].apply(lambda l: len(l))
am_sf = am_sf[am_sf["Ref Number"] >= 5]
# Calculating papers length
am_sf['page_start'] = am_sf['page_start'].apply(lambda p: convert_to_int(p))
am_sf['page_end'] = am_sf['page_end'].apply(lambda p: convert_to_int(p))
am_sf = am_sf[am_sf['page_start'] != None]
am_sf = am_sf[am_sf['page_end'] != None]
am_sf['Paper Length'] = am_sf.apply(lambda p: p['page_end'] - p['page_start'] + 1)
am_sf = am_sf.rename({"year": "Year"})
am_sf = am_sf[am_sf["Paper Length"] > 0]
am_sf = am_sf[am_sf["Paper Length"] <= 2000] # remove papers with incorrectly parsed page numbers
draw_feature_yearly_func_value(am_sf, "Paper Length", "Average Paper Length", 1950, 2014, func_name="agg.AVG", title="Average Paper Length over Time")
years = set([1950,1960, 1970,1980, 1990, 2010, 2013])
sf = am_sf[am_sf["Year"].apply(lambda y: y in years)]["Year","Paper Length"]
sf = sf[sf["Paper Length"] <= 50]
sns.boxplot(x="Year", y="Paper Length", data=sf.to_dataframe())
According to the above charts, it seems that over time the average and median lengths of papers have decreased.
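The page-based length calculation can be checked on a few values. This is a sketch under stated assumptions: the helper `paper_length` is hypothetical and simply combines the steps above (safe integer parsing, `page_end - page_start + 1`, and dropping non-positive or implausibly large lengths).

```python
def convert_to_int(value):
    """Return value as an int, or None if it cannot be parsed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def paper_length(page_start, page_end, max_length=2000):
    """Number of pages, or None when the page numbers are missing or implausible."""
    start, end = convert_to_int(page_start), convert_to_int(page_end)
    if start is None or end is None:
        return None
    length = end - start + 1
    return length if 0 < length <= max_length else None

print(paper_length("101", "110"))   # valid page range
print(paper_length("e12", "e20"))   # non-numeric page labels
print(paper_length(200, 190))       # end before start
```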
def no_citations_after_years(citations_dict, year, after_years=5):
    # Return 1 if the paper has no citation entry up to after_years after publication, else 0
    if citations_dict is None:
        return 1
    l = [v for k,v in citations_dict.items() if (int(k) <= (year + after_years))]
    if len(l) > 0:
        return 0
    return 1
m_sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)["Paper publish year" , "Ref Number", "Total Citations by Year", "Total Citations by Year without Self Citations"]
m_sf = m_sf.rename({"Paper publish year":"Year"})
m_sf["No Citations After 5 Years"] = m_sf.apply(lambda r: no_citations_after_years(r["Total Citations by Year"], r["Year"]))
draw_feature_yearly_func_value(m_sf, "No Citations After 5 Years", "Percentage of Papers", 1900, 2009, func_name="agg.AVG", title="Papers with No Citations After 5 Years")
m_sf = m_sf[m_sf["Ref Number"] >= 5]
draw_feature_yearly_func_value(m_sf, "No Citations After 5 Years", "Percentage of Papers", 1900, 2009, func_name="agg.AVG", title="Papers with No Citations after 5 Years (Ref >= 5)")
no_sf = m_sf[m_sf["No Citations After 5 Years"] == 1]
g = no_sf.groupby("Year", {"Number of papers": agg.COUNT()})
sf2 = no_sf[no_sf['Ref Number'] >= 5]
g2 = sf2.groupby("Year", {"Number of papers (Ref >= 5)": agg.COUNT()}) # Papers with fewer than 5 references can be reports, news, and letters
g = g.join(g2, how="left")
draw_features_yearly_chart_multi_lines(g, "Publication Type", "Total Papers", 1800, 2014, "Number of Papers with No Citations after 5 Years")
In both percentage charts, there is a steady decline in the percentage of papers with no citations after 5 years. However, even with this decline, over 70% of all published papers don't receive any citations, and over 20% of all papers with at least 5 references don't receive any citations. Let's repeat this calculation without considering self-citations.
m_sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)["Paper publish year" , "Ref Number", "Total Citations by Year", "Total Citations by Year without Self Citations"]
m_sf = m_sf.rename({"Paper publish year":"Year"})
m_sf["No Citations After 5 Years"] = m_sf.apply(lambda r: no_citations_after_years(r["Total Citations by Year without Self Citations"], r["Year"]))
draw_feature_yearly_func_value(m_sf, "No Citations After 5 Years", "Percentage of Papers", 1900, 2009, func_name="agg.AVG",title="Papers with No Citations other than Self-Citations after 5 Years")
m_sf = m_sf[m_sf["Ref Number"] >= 5]
draw_feature_yearly_func_value(m_sf, "No Citations After 5 Years", "Percentage of Papers", 1900, 2009, func_name="agg.AVG", title="Papers with No Citations other than Self-Citations after 5 Years (Ref >=5)")
These results indicate that each year huge amounts of resources are being spent on papers that probably have a limited impact. Nevertheless, even though the total number of uncited papers has increased over the years, the percentage of uncited papers has decreased.
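To see what the "no citations after 5 years" flag measures, consider the following sketch. It assumes, as the code above suggests, that the citations column is a dictionary keyed by year (possibly as strings); the sample dictionaries are made up, and the function is a compact re-statement rather than the notebook's own definition.

```python
def no_citations_after_years(citations_by_year, publish_year, after_years=5):
    """Return 1 if no citation entry falls within after_years of publication, else 0."""
    if not citations_by_year:
        return 1
    cutoff = publish_year + after_years
    return 0 if any(int(y) <= cutoff for y in citations_by_year) else 1

# Hypothetical papers published in 2000
print(no_citations_after_years({"2003": 4, "2010": 9}, 2000))  # cited within 5 years
print(no_citations_after_years({"2008": 1}, 2000))             # first cited too late
print(no_citations_after_years(None, 2000))                    # never cited
```

Averaging this flag per publication year gives the share of papers that were still uncited five years after publication.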
selected_decades = {1950,1960, 1970, 1980, 1990, 2000}
sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)['Total Citations by Year',"Paper publish year", "Ref Number"]
sf = sf[sf["Ref Number"] >= 5]
sf = sf.fillna('Total Citations by Year', {})
sf = sf.rename({"Paper publish year": "Year"})
sf["Decade"] = sf["Year"].apply(lambda y: y- y%10)
sf = sf[sf["Decade"].apply(lambda decade: decade in selected_decades)]
sf['Total Citations by Year'] = sf['Total Citations by Year'].apply(lambda d: {int(k):v for k,v in d.items()})
sf['Total Citations'] = sf.apply(lambda r: {int(k - r["Year"]): v for k, v in r['Total Citations by Year'].items()})
def citation_after_years(d, years):
    # Return the citation count at the latest offset (years since publication) that is <= years
    keys = [k for k in d.keys() if k<=years]
    if len(keys) == 0:
        return 0
    return int(d[max(keys)])
sf['Total Citations After 10 Years'] = sf['Total Citations'].apply(lambda d: citation_after_years(d, 10) )
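The two steps above (re-keying the dictionary by years since publication, then reading the value at the latest key not exceeding the cutoff) can be traced on one made-up paper. The sketch assumes the dictionary values are cumulative citation totals, which is what taking the value at the largest key implies; the helper `citations_by_age` and the sample data are hypothetical.

```python
def citations_by_age(citations_by_year, publish_year):
    """Re-key a {year: total citations} dict by years elapsed since publication."""
    return {int(year) - publish_year: count for year, count in citations_by_year.items()}

def citation_after_years(by_age, years):
    """Total citations at the latest recorded age that is <= years (0 if none)."""
    keys = [k for k in by_age if k <= years]
    if not keys:
        return 0
    return int(by_age[max(keys)])

# Hypothetical paper published in 1990, with cumulative totals per year
by_age = citations_by_age({"1995": 2, "1999": 7, "2008": 30}, 1990)
print(by_age)
print(citation_after_years(by_age, 10))
print(citation_after_years(by_age, 4))
```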
g = sf.groupby(["Decade", 'Total Citations After 10 Years'], {"Number of Papers": agg.COUNT()})
g2 = sf.groupby("Decade", {"Total Number of Papers": agg.COUNT()})
g = g.join(g2, on="Decade")
g['Paper Percentage'] = g.apply(lambda r: r['Number of Papers']/float(r["Total Number of Papers"]))
g = g.sort(["Decade", "Total Citations After 10 Years"])
import plotly.plotly as py
import plotly.graph_objs as go
import urllib
import numpy as np
# X-axis: citation number; Y-axis: decade; Z-axis: percentage of papers
traces = []
for decade in selected_decades:
    g2 = g[g["Decade"] == decade]
    x = []
    y = []
    z = []
    for r in g2:
        z.append([r["Paper Percentage"], r["Paper Percentage"]])
        y.append([decade, decade+2])
        x.append([r['Total Citations After 10 Years'], r['Total Citations After 10 Years']])
    # one ribbon (surface trace) per decade
    traces.append(dict(
        z=z,
        x=x,
        y=y,
        #colorscale=[ [i, 'rgb(%d,%d,255)'%(ci, ci)] for i in np.arange(0,1.1,0.1) ],
        showscale=False,
        type='surface',
    ))
layout = go.Layout(
    scene = dict(
        xaxis = dict(
            title='Citation Number After 10 Years'),
        yaxis = dict(
            title='Decade'),
        zaxis = dict(
            title='Percentage of Papers'),),
)
fig = { 'data':traces, 'layout':layout}
py.iplot(fig, filename='ribbon-plot-python')
To better understand whether the features described above have a positive or negative influence on a paper’s citation number, we calculated the correlations between the various features and the paper’s number of citations after 5 years.
join_sf = tc.load_sframe(AMINER_MAG_JOIN_SFRAME)['Paper publish year','Original paper title', 'Ref Number','Keywords List', 'Authors Number','Total Citations by Year',
'abstract', 'page_start', 'page_end']
join_sf = join_sf.rename({"Paper publish year": "Year", "Original paper title":"Title"})
join_sf = filter_sframe_by_years(join_sf, 1950, 2009)
join_sf["Title Language"] = join_sf["Title"].apply(lambda t: detect_lang(t))
join_sf = join_sf[join_sf['Ref Number'] != None]
join_sf = join_sf[join_sf['Title Language'] == 'english']
join_sf = join_sf[join_sf['abstract'] != None]
join_sf = join_sf[join_sf['page_start'] != None]
join_sf = join_sf[join_sf['page_end'] != None]
join_sf = join_sf[join_sf['Total Citations by Year'] != None]
join_sf = join_sf.fillna("Keywords List", [])
join_sf["Keywords Number"] = join_sf["Keywords List"].apply(lambda l: len(l))
join_sf = join_sf[join_sf["Keywords Number"].apply(lambda k: k>0)]
sf = join_sf
sf["Title Length"] = sf["Title"].apply(lambda t: len(t.split()))
sf = sf[sf["Title Length"] > 1]
sf = sf[sf["Title Length"] < 50] # remove long titles that are probably the result of title-parsing problems
sf['Title ?/! Marks'] = sf["Title"].apply(lambda t: 1 if ('!' in t or '?' in t) else 0 )
sf["Abstract Length"] = sf['abstract'].apply(lambda a: len(a.split()))
sf = sf[sf["Abstract Length"] > 10] # remove short abstracts
sf = sf[sf["Abstract Length"] <= 2000] # remove long abstracts
sf["abstract_lang"] = sf["abstract"].apply(lambda t: detect_lang(t))
sf = sf[sf["abstract_lang"] == "english"]
sf['page_start'] = sf['page_start'].apply(lambda p: convert_to_int(p))
sf['page_end'] = sf['page_end'].apply(lambda p: convert_to_int(p))
sf = sf[sf['page_start'] != None]
sf = sf[sf['page_end'] != None]
sf['Paper Length'] = sf.apply(lambda p: p['page_end'] - p['page_start'] + 1)
sf['Total Citations by Year'] = sf['Total Citations by Year'].apply(lambda d: {int(k):v for k,v in d.items()})
sf['Total Citations'] = sf.apply(lambda r: {int(k - r["Year"]): v for k, v in r['Total Citations by Year'].items()})
sf['Total Citations After 5 Years'] = sf['Total Citations'].apply(lambda d: citation_after_years(d, 5) )
sf = sf["Year","Title Length","Title ?/! Marks", "Authors Number", "Abstract Length", "Keywords Number", "Ref Number", "Paper Length","Total Citations After 5 Years"]
df = sf.remove_column("Year").to_dataframe()
corr = df.corr('spearman')
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
ax = sns.heatmap(corr, mask=mask, annot=True, vmax=.5)
We can also calculate the correlation for specific years and get similar results.
sf_1980 = sf[sf["Year"] == 1980]
df = sf_1980.remove_column("Year").to_dataframe()
corr = df.corr('spearman')
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
ax = sns.heatmap(corr, mask=mask, annot=True, vmax=.5)
sf_2000 = sf[sf["Year"] == 2000]
df = sf_2000.remove_column("Year").to_dataframe()
corr = df.corr('spearman')
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
ax = sns.heatmap(corr, mask=mask, annot=True, vmax=.5)
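The heatmaps above rely on Spearman's rank correlation, which is the Pearson correlation computed on ranks, with tied values sharing the average of their ranks. The sketch below re-implements that definition in plain Python only to illustrate what `df.corr('spearman')` measures; the function names and sample values are hypothetical and are not part of the notebook's code.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values: reference counts vs. citations after 5 years
print(spearman([1, 2, 3, 4], [10, 20, 30, 400]))
```

Because only ranks matter, a monotonic but strongly non-linear relationship, such as the one in the toy example, still yields a correlation of essentially 1.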