In the previous notebook, we reviewed how papers' properties and authors' publication behaviors have changed over time. In this notebook, we going to utilize the SJR and MAG datasets to observe how journals' properties have changed over the last two centuries. First, let's load the required packages.
from configs import *
import pandas as pd
import numpy as np
import altair as alt
alt.renderers.enable('notebook')
from visualization.visual_utils import *
import turicreate.aggregate as agg
In this section, we will examine how the number of journals has changed over time. Let's start by exploring the number of active journals each year, and also the number of new journals added each year.
j_sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)['Paper publish year','Journal ID mapped to venue name','Normalized venue name','Original venue name',"Paper ID"]
j_sf = j_sf[j_sf['Journal ID mapped to venue name'] != '']
j_sf = j_sf.rename({"Journal ID mapped to venue name": "Journal ID", 'Paper publish year': "Year"})
j_sf
g = j_sf.groupby("Year", {"Number of Active Journals": agg.COUNT_DISTINCT("Journal ID")})
draw_features_yearly_chart(g, "Number of Active Journals", 1800, 2014, title="Number of Active Journals over Time (MAG)")
j_start_sf = j_sf.groupby('Journal ID', {"Start Year": agg.MIN("Year")})
g = j_start_sf.groupby("Start Year", {"Number of New Journals": agg.COUNT()})
draw_features_yearly_chart(g.rename({"Start Year": "Year"}), "Number of New Journals", 1800, 2014, title="Number of New Journals over Time (MAG)")
sjr_sf = tc.load_sframe(SJR_SFRAME)["Rank", "Year", "Title", "Type", "SJR", "SJR Best Quartile", "H index","Total Docs.", "Cites / Doc. (2years)"]
sjr_sf = sjr_sf[sjr_sf['Type'] == 'journal']
sjr_sf = sjr_sf.unique()
sf = sjr_sf[sjr_sf["Total Docs."] > 0]
g = sf.groupby("Year", {"Number of Active Journals": agg.COUNT_DISTINCT("Title")})
draw_features_yearly_chart(g, "Number of Active Journals", 1999, 2016, title="Number of Active Journals over Time (SJR)")
g = sjr_sf.groupby("Title", {"Start Year": agg.MIN("Year")})
g2 = g.groupby("Start Year", {"Number of New Journals": agg.COUNT()})
# We will skip on the year 1999 that contains journals that were first published before 1999
draw_features_yearly_chart(g2.rename({"Start Year": "Year"}), "Number of New Journals", 2000, 2016, title="Number of New Journals over Time (SJR)")
According to both datasets, there are over 14,000 active journals in each year since 2010, and according to SJR there were 20,975 active journals in 2015. Moreover, in both datasets, we discovered a sharp increase in the number of new journals, followed by a decrease in new journals; in the MAG dataset the decrease started in 2009, and in the SJR dataset it started in 2011. Interestingly enough, according to the SJR dataset, which seems to have more accurate data for the last decade, in 2010 there were 1380 new journals, while in 2015 there were only 352 new journals. We believe that this can be a direct result of the publishing of mega-journals, such as PLOS ONE that changed the academic publications eco-system. However, in 2016 the number of new journals had recovered with 1232 new journals. Nevertheless, the number of active journals has skyrocketed over the last 50 years.
Let's observe the average number of published papers for journals over time using both datasets.
sf = j_sf.groupby(["Journal ID", "Year"], {"Number of Papers": agg.COUNT()})
g = sf.groupby('Year', {"Average Number of Papers": agg.AVG("Number of Papers")})
draw_features_yearly_chart(g, "Average Number of Papers", 1800, 2014, title="Journals' Average Number of Papers (MAG)")
sjr_g = sjr_sf.groupby("Year", {"Average Number of Papers": agg.AVG("Total Docs.")})
draw_features_yearly_chart(sjr_g, "Average Number of Papers", 1999, 2016, title="Journals' Average Number of Papers (SJR)")
sf = j_sf.groupby(["Journal ID", "Year", 'Original venue name'], {"Number of Papers": agg.COUNT()})
g = sf.groupby('Year', {"Max Number of Papers": agg.MAX("Number of Papers")})
draw_features_yearly_chart(g, "Max Number of Papers", 1800, 2014, title="Maximal Journals' Number of Papers (MAG)")
sjr_g = sjr_sf.groupby("Year", {"Max Number of Papers": agg.MAX("Total Docs.")})
draw_features_yearly_chart(sjr_g, "Max Number of Papers", 1999, 2016, title="Maximal Journals' Number of Papers (SJR)")
sf = sjr_sf[sjr_sf["Year"] == 2014 ]
sf.sort("Total Docs.", ascending=False)['Title', 'Total Docs.']
sf = sjr_sf[sjr_sf["Year"] == 1999 ]
sf.sort("Total Docs.", ascending=False)['Title', 'Total Docs.']
sf = sjr_sf[sjr_sf["Year"] == 2016 ]
sf = sf[sf["Total Docs."] >= 1000]
sf.materialize()
sf
From above results, not only has the number of active journals sharply increased over the years, but also on average there is an increase in the number of papers published in journals, where there are various high-ranked journals which publish thousands of papers each year.
There are various methods to measure a journal’s success, such as the journal's SJR, impact factor, H-index, journal quartile, etc. Let's try to use the SJR dataset to determine how this measure has changed since 1999. We will start by measuring how the number of papers in each quartile has changed over time.
sf = sjr_sf["Year", "SJR Best Quartile","Total Docs."]
g = sf.groupby(["Year", "SJR Best Quartile"], {"Papers Number": agg.SUM("Total Docs.")})
g = g.sort(["Year", "SJR Best Quartile"])
selected_years = {1999, 2005, 2010,2016}
g2 = g[g["Year"].apply(lambda y: y in selected_years)]
alt.Chart(g2.to_dataframe()).mark_bar(stroke='transparent').encode(
alt.X('SJR Best Quartile:N', title=""),
alt.Y("Papers Number:Q",axis=alt.Axis(grid=True)),
color='SJR Best Quartile:N',
column="Year:O",
).configure_axis(
domainWidth=0.2)
It can be observed that over the last few years the number of papers which were published in Q1 and Q2 journals more than doubled, from 550,109 Q1 papers and 229,373 Q2 papers in 1999, to 1,187,514 Q1 papers and 554,782 Q2 papers in 2016. Let's calculate the percentage of papers in each quartile in 2016:
g3 = g.groupby("Year", {"Total Papers": agg.SUM("Papers Number")})
g3 = g.join(g3,on="Year")
g3["Percentage of Papers"] = g3.apply(lambda r: r['Papers Number']/float(r['Total Papers']))
g3 = g3[g3["Year"] == 2016]
g3 = g3.sort(["Year", "SJR Best Quartile"])
g3
From the above results we can observe that over 51.3% (1.18 million papers) of the published papers were published in Q1 journals, while only 8.6% of the papers were published in Q4. Let's observe how the H-index has changed over the years.
sf = sjr_sf["Year", "H index"]
g = sf.groupby("Year", {"Average H-index": agg.AVG("H index")})
draw_features_yearly_chart(g, "Average H-index", 1999, 2016, title="Average H index over Time (SJR)")
g = sf.groupby("Year", {"H-index List": agg.CONCAT("H index")})
g["Median H-index"] = g["H-index List"].apply(lambda l: np.median(l))
g = g.remove_column("H-index List")
draw_features_yearly_chart(g, "Median H-index", 1999, 2016, title="Median H index over Time (SJR)")
selected_years = {1999, 2003,2007,2011, 2015}
g = sf[sf["Year"].apply(lambda y: y in selected_years )]
s = sns.boxplot(y="Year", x="H index", data=g.to_dataframe(), orient="h")
s.set(xlim=(0, 200))
According to the above charts, we can notice that with time the H-index has decreased, and since 2012 only half of the journals have an H-index equal to or less than 16. This result may indicate that many of the journals have limited impact. Moreover, mega-journals, such as PLOS ONE have a very high H-index value due to the fact that they publish tens of thousands of papers each year. Let's measure how the $\frac{Cites}{Docs} (2years)$ and SJR measures have changed over time.
sf = sjr_sf["Year", "SJR"]
g = sf.groupby("Year", {"Average SJR": agg.AVG("SJR")})
draw_features_yearly_chart(g, "Average SJR", 1999, 2016, title="Average Journals SJR over Time")
selected_years = {1999, 2003,2007,2011, 2015}
g = sf[sf["Year"].apply(lambda y: y in selected_years )]
s = sns.boxplot(y="Year", x="SJR", data=g.to_dataframe(), orient="h")
s.set(xlim=(0, 2))
sf = sjr_sf["Year", "Cites / Doc. (2years)"]
g = sf.groupby("Year", {"Average Cites / Docs": agg.AVG("Cites / Doc. (2years)")})
draw_features_yearly_chart(g, "Average Cites / Docs", 1999, 2016, title="Average Cites/ Docs over Time (SJR)")
selected_years = {1999, 2003,2007,2011, 2015}
g = sf[sf["Year"].apply(lambda y: y in selected_years )]
s = sns.boxplot(y="Year", x="Cites / Doc. (2years)", data=g.to_dataframe(), orient="h")
s.set(xlim=(0, 2))
It can be observed that both the H-index, $\frac{Cites}{Docs} (2years)$, and SJR measures have changed considerably since 1999. While the H-index has decreased sharply, the $\frac{Cites}{Docs}$ has increased sharply over the period of 18 years. These may indicate that these measures are much less effective than in the past to measure journal ranking, especially when taking into consideration the surge in the number of active journals and the emerging of mega-journals. The SJR measure has also changed over the last 18 years. However, as can be seen from the above charts, the change was less dramatic.
In this section, we are going to analyze how various properties have changed in the top-40 journals. Let's start by selecting the top-40 journals.
sf = sjr_sf[sjr_sf["Year"] == 2016]
sf = sf.sort("SJR", ascending=False)[:40]
top_journals_titles = set(sf["Title"])
sf["Title"]
Let's get these journals’ ISSNs (there can be several ISSNs for each journal) and match them to the journal IDs in the MAG dataset.
sf = tc.load_sframe(SJR_SFRAME)["Title","ISSN"]
sf = sf[sf["ISSN"] != None ]
top_journals_issn = set(sf[sf["Title"].apply(lambda t: t in top_journals_titles)]["ISSN"]) # the journals ISSN over the years
#let's match the top journals to the MAG dataset journals by matching ISSNs and names
join_sf = tc.load_sframe(AMINER_MAG_JOIN_SFRAME)["Original venue name", "Journal ID mapped to venue name", "issn"]
join_sf = join_sf[join_sf["issn"] != None]
join_sf = join_sf[join_sf["issn"] != '']
join_sf = join_sf[join_sf["Journal ID mapped to venue name"] != '']
join_sf["issn"] = join_sf["issn"].apply(lambda issn: issn.replace("-", ""))
join_sf = join_sf[join_sf["issn"].apply(lambda issn: issn in top_journals_issn)]
join_sf = join_sf[join_sf["Original venue name"].apply(lambda t: t in top_journals_titles)]
selected_mag_journals = join_sf["Original venue name", "Journal ID mapped to venue name" ].unique()
selected_mag_journals
At the end of the matching process, we succeeded in matching 30 top journals in the MAG dataset based on their SJR matching ranking. Let's deeply examine how various properties of these journals have changed over time. We will start by looking in general at journal features, such as author’s age, gender, etc., based on all papers in all 30 journals.
top_journal_ids_set = set(selected_mag_journals["Journal ID mapped to venue name"])
mag_sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)["Paper ID","Journal ID mapped to venue name","Paper publish year", "Ref Number" ]
select_papers_sf = mag_sf[mag_sf["Journal ID mapped to venue name"].apply(lambda i: i in top_journal_ids_set)]
select_papers_sf = select_papers_sf[select_papers_sf["Paper publish year"] <= 2014]
select_papers_sf = select_papers_sf[select_papers_sf["Paper publish year"] >= 1950]
select_papers_sf = select_papers_sf[select_papers_sf["Ref Number"] >= 5]
len(select_papers_sf)
#Let's load all the papers into a PaperCollection object for analysis
from papers_collection_analyer import *
papers_ids = list(select_papers_sf["Paper ID"])
pc = PapersCollection(papers_ids=papers_ids )
#This can take some time
logger.setLevel(logging.INFO) # don't print the debugging info
d = pc.calculate_feature_over_time("papers_number", 1950, 2014)
d = d['papers_number']
sf = tc.SFrame({"Year": d.keys(), "Number of Papers": d.values()})
draw_features_yearly_chart(sf, "Number of Papers", 1950, 2014, title="Selected Top Journals Number of Papers (MAG)")
d = pc.calculate_feature_over_time("authors_number", 1950, 2014)
d = d['authors_number']
sf = tc.SFrame({"Year": d.keys(), "Number of Authors": d.values()})
draw_features_yearly_chart(sf, "Number of Authors", 1950, 2014, title="Selected Top Journals Number of Authors over Time (MAG)")
Let's examine how authors’ average ages have changed over time.
d = pc.calculate_feature_over_time("first_authors_average_age", 1950, 2014)
d = d['first_authors_average_age']
first_sf = tc.SFrame({"Year": d.keys(), "First Authors Average Age": d.values()})
d = pc.calculate_feature_over_time("last_authors_average_age", 1950, 2014)
d = d['last_authors_average_age']
last_sf = tc.SFrame({"Year": d.keys(), "Last Authors Average Age": d.values()})
sf = first_sf.join(last_sf, on="Year", how="left")
draw_features_yearly_chart_multi_lines(sf, "Author Sequence", "Average Academic Age", 1950, 2014, title="Authors Average Academic Age in Top Journals over Time")
d = pc.calculate_feature_over_time("percentage_of_papers_with_first_authors_that_publish_before_in_the_same_venue", 1950, 2014)
d = d['percentage_of_papers_with_first_authors_that_publish_before_in_the_same_venue']
first_sf = tc.SFrame({"Year": d.keys(), "First Authors Return Percentage": d.values()})
d = pc.calculate_feature_over_time("percentage_of_papers_with_last_authors_that_publish_before_in_the_same_venue", 1950, 2014)
d = d['percentage_of_papers_with_last_authors_that_publish_before_in_the_same_venue']
last_sf = tc.SFrame({"Year": d.keys(), "Last Authors Return Percentage": d.values()})
sf = first_sf.join(last_sf, on="Year", how="left")
draw_features_yearly_chart_multi_lines(sf, "Author Sequence", "Return Authors Percentage", 1950, 2014, title="Percentage of Top-Journal Papers by Returning Authors")
From the above charts, we can observe that the average age of first and last author that published in the top-30 journals we selected increased sharply. For example, in 1990 the average academic age of last authors who published papers in these journals was 7.89 years, while in 2014 it was 18.35 years. A similar observation can be made regarding returning authors; in 1990 38% of papers' last authors had published in these top journals before, while in 2014 the percentage jumped to 46.2%. It worthwhile to take into consideration that not all of these top-30 journals were as highly ranked in 1990 as they have become in recent years. Let's observe similar statistics, this time on each journal separately, using the Venue class.
#Let's get venues details
sf = VENUE_FETCHER.get_valid_venues_papers_ids_sframe(min_ref_number=5, min_journal_papers_num=50)
sf = sf.join(selected_mag_journals, on='Journal ID mapped to venue name')
sf
from venue import Venue
venue_dict = {}
#create dict with venue object
for r in sf:
venue_id = r['Journal ID mapped to venue name']
venue_name = r["Original venue name"]
papers_ids = r['Paper IDs List']
venue_dict[venue_name] = Venue(venue_id=venue_id, venue_name=venue_name, papers_ids=papers_ids)
#Get selected journal features dict
features_dict = {}
features_list = ['papers_number',
'authors_average_age', 'first_authors_average_age', 'last_authors_average_age',
'percentage_of_papers_with_authors_that_publish_before_in_the_same_venue',
"percentage_of_papers_with_first_authors_that_publish_before_in_the_same_venue",
"percentage_of_papers_with_last_authors_that_publish_before_in_the_same_venue"]
#Create a feature sframe
features_dict = {
'Venue Name': [],
'Feature Name': [],
'Year': [],
'Value': []
}
for n, v in venue_dict.iteritems():
for f in features_list:
d = v.calculate_feature_over_time(f, 1980, 2014)[f]
for y,val in d.iteritems():
features_dict['Venue Name'].append(n)
features_dict['Feature Name'].append(f)
features_dict['Year'].append(y)
features_dict['Value'].append(val)
sf = tc.SFrame(features_dict)
We created a features SFrame, and now we can see how each feature changed over time across all the selected top-ranked journals.
def draw_journal_feature(sf, features_set, hue=None, sharex=False, sharey=False, xlabel="Year", ylabel='',
hue_dict=None):
sf = sf[sf['Feature Name'].apply(lambda s: s in features_set)]
if hue_dict is not None:
sf['Feature Name'] = sf['Feature Name'].apply(lambda f: hue_dict[f] if f in hue_dict else f)
sf = sf.sort(["Venue Name", "Year", 'Feature Name'])
df = sf.to_dataframe()
if hue is None:
c = sns.FacetGrid(df, col="Venue Name", sharex=sharex, sharey=sharey, col_wrap=5)
else:
c = sns.FacetGrid(df, col="Venue Name", hue=hue, sharex=sharex, sharey=sharey, col_wrap=5)
c.map(plt.plot, "Year", "Value", alpha=.7).set_titles("{col_name}")
if hue is not None:
c.add_legend()
c.set_axis_labels(xlabel, ylabel)
draw_journal_feature(sf, {"papers_number"}, ylabel='Papers Number')
featurens_names_labels = {'authors_average_age': "All Authors", 'first_authors_average_age': 'First Authors',
'last_authors_average_age': 'Last Authors'}
draw_journal_feature(sf, {'authors_average_age', 'first_authors_average_age', 'last_authors_average_age'}, ylabel='Average Age', hue="Feature Name", hue_dict=featurens_names_labels)
featurens_names_labels = {'percentage_of_papers_with_authors_that_publish_before_in_the_same_venue': 'At least One Return Author',
"percentage_of_papers_with_first_authors_that_publish_before_in_the_same_venue": 'Return First Author',
"percentage_of_papers_with_last_authors_that_publish_before_in_the_same_venue": ' Return Last Author'}
draw_journal_feature(sf, {'percentage_of_papers_with_authors_that_publish_before_in_the_same_venue', 'percentage_of_papers_with_first_authors_that_publish_before_in_the_same_venue', 'percentage_of_papers_with_last_authors_that_publish_before_in_the_same_venue'}, ylabel='Percentage of Papers', hue="Feature Name", hue_dict=featurens_names_labels)
x = sf[sf["Feature Name"] == 'percentage_of_papers_with_authors_that_publish_before_in_the_same_venue']
x = x[x["Venue Name"] == "Nature Genetics"]
x
x[x["Year"] == 2014]
From the above charts, we can observe that over the years most of the top journals have published more papers each year. Additionally, in all the selected journals the average career age of the published authors increased over the years. For example, the average career age of last authors in Nature journal increased from about 5 years in 1980 to about 17.5 years in 2014. Moreover, the percentage of papers with returning authors has increased sharply in recent years. For example, in Cell journal, in 2010, about 80% of all papers included at least one author who published in the journal before, while in 1980 this rate stood at less than 40%.