Before we begin, make sure you have installed all the required Python packages. (The instructions below use pip. You can use easy_install, too.) Also, consider using virtualenv instead of sudo for a cleaner installation experience. I also recommend running the code in an IPython Notebook.
Please download the KDD Cup 2016 data, and please also download the project files from our GitHub repository. Throughout this research, we use the constants defined in consts.py. Please set DATASETS_AMINER_DIR, DATASETS_BASE_DIR, and SFRAMES_BASE_DIR to local directories where the datasets will be downloaded and the project's SFrames will be saved.
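As a rough illustration of the directory setup described above, the following sketch shows one way such constants could be laid out. The three constant names come from the text; the paths themselves and the ensure_dirs helper are hypothetical placeholders, not the contents of the project's consts.py.

```python
from pathlib import Path

# Hypothetical local paths; replace them with directories on your machine.
DATASETS_BASE_DIR = Path("/path/to/datasets")
DATASETS_AMINER_DIR = DATASETS_BASE_DIR / "AMiner"
SFRAMES_BASE_DIR = Path("/path/to/sframes")


def ensure_dirs(*dirs):
    """Create each directory (and any missing parents) if it does not exist."""
    for d in dirs:
        Path(d).mkdir(parents=True, exist_ok=True)
```

Calling ensure_dirs(DATASETS_BASE_DIR, DATASETS_AMINER_DIR, SFRAMES_BASE_DIR) once before downloading anything avoids missing-directory errors later on.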
Note: Creating the following SFrames requires considerable computing power and can take a long time.
In this study, we used the following datasets:
The Microsoft Academic KDD Cup 2016 dataset - The Microsoft Academic KDD Cup Graph dataset (referred to as the MAG 2016 dataset) contains data on over 126 million papers. The main advantage of this dataset is that it has undergone several preprocessing iterations of author entity matching (each author is identified by an ID) and paper deduplication. Additionally, the dataset matches papers to their fields of study and includes the hierarchical structure and connections between the various fields of study.
AMiner dataset - The AMiner dataset contains information on over 154 million papers collected by the AMiner team. The dataset contains papers' abstracts, ISSNs, ISBNs, and other details on each paper.
SJR dataset - The SCImago Journal Rank open dataset (referred to as the SJR dataset) contains journal- and country-specific metric data starting from 1999. In this study, we used the SJR dataset to better understand how various journal metrics have changed over time.
The first step is to convert the dataset text files into SFrame objects using the code located under the SFrames creator directory:
from create_mag_sframes import *
from configs import *
create_all_sframes() # running this can take considerable time
The code above will create a set of SFrames with all the dataset data. The SFrames will include data on authors’ papers, keywords, fields of study, and more. Moreover, the code will construct the Extended Papers SFrame, which contains various metadata on each paper in the dataset.
import turicreate as tc
mag_sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)
mag_sf
In our study, we also analyzed how various authors' attributes, such as the number of published papers, number of coauthors, etc., have changed over time. To achieve this, we created an authors features SFrame using the following code:
from create_mag_authors_sframe import *
a = AuthorsFeaturesExtractor()
# This needs to run on a strong server and can take considerable time
a_sf = a.get_authors_all_features_sframe()
a_sf #the SFrame can be later loaded using tc.load_sframe(AUTHROS_FEATURES_SFRAME)
The above SFrame contains various features of each author, constructed by analyzing the author’s papers that have at least 5 references. Note that the authors SFrame also contains a gender prediction for each author. This column was created by obtaining first-name gender statistics from the SSA Baby Names and WikiTree datasets, which include over 115 thousand unique first names (see details in geneder_classifier.py).
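To make the idea of predicting gender from first-name statistics more concrete, here is a minimal sketch in plain Python. It is not the project's classifier: the function names, the 0.9 probability threshold, and the toy counts are all assumptions made for illustration.

```python
from collections import defaultdict


def build_name_stats(records):
    """Aggregate (first_name, gender, count) records into per-name counts."""
    stats = defaultdict(lambda: {"Male": 0, "Female": 0})
    for name, gender, count in records:
        stats[name.strip().lower()][gender] += count
    return stats


def predict_gender(first_name, stats, min_prob=0.9):
    """Return (gender, probability), or (None, None) if unknown or ambiguous."""
    counts = stats.get(first_name.strip().lower())
    if not counts:
        return None, None
    total = counts["Male"] + counts["Female"]
    if total == 0:
        return None, None
    gender = max(counts, key=counts.get)
    prob = counts[gender] / total
    # Refuse to guess when the name is not strongly associated with one gender.
    return (gender, prob) if prob >= min_prob else (None, None)


# Toy counts, invented for illustration only.
toy_records = [
    ("Mary", "Female", 990), ("Mary", "Male", 10),
    ("Jordan", "Male", 60), ("Jordan", "Female", 40),
]
name_stats = build_name_stats(toy_records)
```

With these toy counts, predict_gender("mary", name_stats) returns a confident prediction, while "Jordan" falls below the threshold and is left unlabeled. Leaving ambiguous names unlabeled trades coverage for precision.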
After downloading the AMiner dataset from the AMiner website, simply load it into an SFrame using the following code:
aminer_sf = tc.SFrame.read_json('%s/*.txt' % AMINER_DATA_DIR, orient='lines')
aminer_sf # the SFrame can be accessed also by using tc.load_sframe(AMINER_PAPERS_SFRAME)
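The orient='lines' argument tells read_json to treat every line of the input files as a separate JSON record. For readers who want to peek at the raw files without Turi Create, the same layout can be read with the standard library alone. The helper below is a sketch; its name and the use of sorted file order are our own choices.

```python
import glob
import json


def iter_json_lines(pattern):
    """Yield one dict per non-empty line from every file matching `pattern`."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines between records
                    yield json.loads(line)
```

Because the function is a generator, records are produced one at a time, so a few papers can be inspected without loading a whole file into memory.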
First, we download all the journal ranking files from the SJR website. Next, we use the following code to create a single SFrame with all the journal data:
from create_sjr_sframe import *
sjr_sf = create_sjr_sframe(SJR_FILES_DIR)
sjr_sf # the SFrame can also be accessed using tc.load_sframe(SJR_SFRAME)
The MAG and AMiner datasets have slightly different sets of features. While the MAG dataset identifies each author with a unique author ID, the AMiner dataset contains additional data on each paper, including the paper's abstract and the paper's ISSN or ISBN. Additionally, the SJR dataset contains data about each journal's ranking.
To combine the data from the authors' publication records and the journals' rankings, we join the datasets. First, we join the MAG and AMiner datasets by matching DOI values, after removing papers whose DOI appears more than once in either dataset (see also create_mag_aminer_sframe.py):
import turicreate as tc
import turicreate.aggregate as agg
# Load the MAG papers and drop every DOI that appears more than once
sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)
g1 = sf.groupby('Paper Document Object Identifier (DOI)', {'Count': agg.COUNT()})
s1 = set(g1[g1['Count'] > 1]['Paper Document Object Identifier (DOI)'])
sf = sf[sf['Paper Document Object Identifier (DOI)'].apply(lambda doi: doi not in s1)]
sf.materialize()
# Do the same for the AMiner papers
sf2 = tc.load_sframe(AMINER_PAPERS_SFRAME)
g2 = sf2.groupby('doi', {'Count': agg.COUNT()})
s2 = set(g2[g2['Count'] > 1]['doi'])
sf2 = sf2[sf2['doi'].apply(lambda doi: doi not in s2)]
sf2.materialize()
# Join the two datasets on DOI and keep only papers with a non-empty title
aminer_mag_sf = sf.join(sf2, {'Paper Document Object Identifier (DOI)': 'doi'})
aminer_mag_sf['title_len'] = aminer_mag_sf['title'].apply(lambda t: len(t))
aminer_mag_sf = aminer_mag_sf[aminer_mag_sf['title_len'] > 0]
aminer_mag_sf = aminer_mag_sf.rename({"Paper ID": "MAG Paper ID", "id": "Aminer Paper ID"})
aminer_mag_sf = aminer_mag_sf.remove_column('title_len')
aminer_mag_sf # this SFrame can be accessed using tc.load_sframe(AMINER_MAG_JOIN_SFRAME)
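The logic of the join above (discard every DOI that occurs more than once on either side, then match the remaining records one-to-one) can be sketched on small in-memory lists. The helper names and the toy records below are hypothetical; the real code operates on SFrames.

```python
from collections import Counter


def drop_ambiguous(rows, key):
    """Keep only the rows whose `key` value occurs exactly once."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] == 1]


def inner_join(left, right, left_key, right_key):
    """Inner-join two lists of dicts; `right_key` values must be unique."""
    index = {row[right_key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[left_key])
        if match is not None:
            joined.append({**row, **match})
    return joined


# Toy records, invented for illustration only.
mag = [
    {"Paper ID": "m1", "DOI": "10.1/a"},
    {"Paper ID": "m2", "DOI": "10.1/b"},
    {"Paper ID": "m3", "DOI": "10.1/b"},  # duplicated DOI, dropped on the MAG side
]
aminer = [
    {"id": "a1", "doi": "10.1/a", "title": "Some title"},
    {"id": "a2", "doi": "10.1/b", "title": "Another title"},
]
joined = inner_join(drop_ambiguous(mag, "DOI"), drop_ambiguous(aminer, "doi"), "DOI", "doi")
```

Only the paper with DOI 10.1/a survives, because 10.1/b is ambiguous in the toy MAG list. Dropping ambiguous DOIs before joining keeps each joined row tied to exactly one paper in each dataset.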
Using the joined dataset, we obtained an SFrame with the joint metadata of 28.9 million papers. Next, we join this SFrame with the SJR dataset:
import re
import turicreate as tc
def create_aminer_mag_sjr_sframe(year):
    """
    Creates a unified SFrame of the AMiner, MAG, and SJR datasets
    :param year: year to use for SJR data
    :return: SFrame with AMiner, MAG, and SJR data
    :rtype: tc.SFrame
    """
    sf = tc.load_sframe(AMINER_MAG_JOIN_SFRAME)
    # Drop papers without an ISSN
    sf = sf[sf['issn'] != None]
    sf = sf[sf['issn'] != 'null']
    sf.materialize()
    # Normalize ISSNs such as "1234-5678" to "12345678"
    r = re.compile(r"(\d+)-(\d+)")
    sf['issn_str'] = sf['issn'].apply(lambda i: "".join(r.findall(i)[0]) if len(r.findall(i)) > 0 else None)
    sf = sf[sf['issn_str'] != None]
    # Keep only the SJR rows of the requested year and join on the ISSN
    sjr_sf = tc.load_sframe(SJR_SFRAME)
    sjr_sf = sjr_sf[sjr_sf['Year'] == year]
    return sf.join(sjr_sf, on={'issn_str': "ISSN"})
create_aminer_mag_sjr_sframe(2015)
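The ISSN normalization step inside the function above can be tried on its own. The sketch below wraps the same regular expression in a standalone helper (the name normalize_issn is ours) so that its behavior on individual values is easy to check.

```python
import re

# Same pattern as in the function above: digits, a hyphen, then digits.
ISSN_RE = re.compile(r"(\d+)-(\d+)")


def normalize_issn(raw):
    """Return the first digits-hyphen-digits match without the hyphen, else None."""
    if raw is None:
        return None
    match = ISSN_RE.search(raw)
    return "".join(match.groups()) if match else None
```

For example, normalize_issn("0028-0836") yields "00280836", and a value with no match, such as "null", yields None. Note that an ISSN whose check character is "X" (for instance "1234-567X") is reduced to the seven digits "1234567" by this pattern, so such journals would not match an eight-character ISSN on the other side of the join.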
Using Turi Create and SFrame objects can help us get general data on how academic publication dynamics have changed over time, but it would be challenging to use this data to derive more complicated insights, such as the trends of a specific journal. To reveal such insights, we need to load the dataset into a different framework. In this study, we chose MongoDB as our framework for more complicated queries. We installed MongoDB on Ubuntu 17.10 using the instructions in the following link. After MongoDB is installed and running, please remember to set the user and password, and update the MONGO_HOST and MONGO_PORT variables in consts.py (one can also adjust the connection to include user and password authentication). The next step is to load the SFrames created above into MongoDB collections using mongo_connector.py:
from mongo_connector import *
load_sframe() # this will load the SFrames into the local MongoDB instance
At the end of the loading process, six collections will have been loaded into the journals database.
MD.client.journals.collection_names()
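As noted above, the connection can be adjusted to include user and password authentication. One way to do so is to build a MongoDB connection string from the host, port, and credentials. The helper below is a sketch and is not part of mongo_connector.py; its name and arguments are assumptions. Credentials are percent-escaped so that characters such as "@" or "/" do not break the URI.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port, user=None, password=None):
    """Build a mongodb:// URI, with escaped credentials when they are given."""
    if (user is None) != (password is None):
        raise ValueError("user and password must be given together")
    if user is None:
        return "mongodb://%s:%d/" % (host, port)
    return "mongodb://%s:%s@%s:%d/" % (
        quote_plus(user), quote_plus(password), host, port)
```

The resulting string can then be passed to pymongo.MongoClient in place of a separate host and port.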
In the second part of the tutorial, we will demonstrate how the above created MongoDB collections can be utilized to calculate various statistics on paper collections, authors, journals, and research domains.