PART I: Creating the Study's Datasets

0. Setup

Before we begin, make sure you have installed all the required Python packages. (The instructions below use pip; easy_install works too.) Consider installing into a virtualenv instead of using sudo for a cleaner installation experience. I also recommend running the code via IPython Notebook.

  • sudo pip install --upgrade turicreate
  • sudo pip install --upgrade repoze.lru
  • sudo pip install --upgrade networkx
  • sudo pip install --upgrade pymongo

Please download the KDD Cup 2016 data, and please also download the project files from our GitHub repository. Throughout this research, we use the various constants that appear in consts.py. Please change DATASETS_AMINER_DIR, DATASETS_BASE_DIR, and SFRAMES_BASE_DIR to your local directories, where you will download the datasets and save the project's SFrames.
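
As a rough illustration, the three constants might be set along the following lines. Only the constant names come from the text above. The placeholder paths and the sframe_path helper are assumptions made for this sketch, so adapt them to the actual contents of consts.py.

```python
from pathlib import Path

# Placeholder locations (assumptions); point these at your own directories.
DATASETS_BASE_DIR = Path("/path/to/your/datasets")
DATASETS_AMINER_DIR = DATASETS_BASE_DIR / "aminer"
SFRAMES_BASE_DIR = Path("/path/to/your/sframes")


def sframe_path(name):
    """Return where an SFrame called `name` would be saved (hypothetical helper)."""
    return SFRAMES_BASE_DIR / f"{name}.sframe"


print(sframe_path("extended_papers"))
```

Keeping every path derived from a few base directories means that moving the data to another disk only requires changing the base constants.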

Note: Creating the SFrames below requires considerable computing power and can run for a long time.

1. Creating the SFrames

In this study, we used the following datasets:

  • The Microsoft Academic KDD Cup 2016 dataset - The Microsoft Academic KDD Cup Graph dataset (referred to as the MAG 2016 dataset) contains data on over 126 million papers. The main advantage of this dataset is that it has undergone several preprocessing iterations of author entity matching (every author is identified by an ID) and paper deduplication. Additionally, the dataset matches papers to their fields of study and includes the hierarchical structure and connections between the various fields of study.

  • AMiner dataset - The AMiner dataset contains information on over 154 million papers collected by the AMiner team. The dataset contains papers' abstracts, ISSNs, ISBNs, and details on each paper.

  • SJR dataset - The SCImago Journal Rank open dataset (referred to as the SJR dataset) contains journal- and country-specific metric data starting from 1999. In this study, we used the SJR dataset to better understand how various journal metrics have changed over time.

1.1 The Microsoft Academic KDD Cup Dataset

The first step is to convert the dataset text files into SFrame objects using the code located under the SFrames creator directory:

In [1]:
import turicreate as tc  # the cells below refer to Turi Create as tc
from create_mag_sframes import *
from configs import *
create_all_sframes() # running this can take considerable time

The code above creates a set of SFrames with all the dataset data. The SFrames include data on authors’ papers, keywords, fields of study, and more. Moreover, the code constructs the Extended Papers SFrame, which contains various metadata on each paper in the dataset.

In [2]:
mag_sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)
mag_sf
Out[2]:
Paper ID Original paper title Normalized paper title Paper publish year Paper publish date
01B27BE8 Evaluating Polarity for
Verbal Phraseological ...
evaluating polarity for
verbal phraseological ...
2014 2014/11/16
027D0030 Automatic Monitoring the
Content of Audio ...
automatic monitoring the
content of audio ...
2012 2012/10/27
7CFE299E Towards a set of Measures
for Evaluating Software ...
towards a set of measures
for evaluating software ...
2009 2009/11
59BEBE1C Learning Probability
Densities of Optimiza ...
learning probability
densities of optimiza ...
2008 2008/10/27
5873C011 Towards a Model for an
Immune System ...
towards a model for an
immune system ...
2002 2002/04/22
7A1109E4 Approach Towards a
Natural Language Anal ...
approach towards a
natural language anal ...
2013 2013/11
0B00AFD8 Towards the creation of
semantic models based on ...
towards the creation of
semantic models based on ...
2012 2012/10/27
5C66D743 Comparison of Neural
Networks and Support ...
comparison of neural
networks and support ...
2009 2009/11/01
040121AE Multiple Kernel Support
Vector Machine Proble ...
multiple kernel support
vector machine proble ...
2014 2014/11/16
7DEADC9A A Set of Test Cases for
Performance Measures in ...
a set of test cases for
performance measures in ...
2008
Paper Document Object
Identifier (DOI) ...
Original venue name Normalized venue name Journal ID mapped to
venue name ...
Conference ID mapped to
venue name ...
10.1007/978-3-319-13647-9
_19 ...
mexican international
conference on artificial ...
micai 42D7146F
10.1007/978-3-642-37807-2
_11 ...
mexican international
conference on artificial ...
micai 42D7146F
10.1109/MICAI.2009.15 mexican international
conference on artificial ...
micai 42D7146F
10.1007/978-3-540-88636-5
_25 ...
mexican international
conference on artificial ...
micai 42D7146F
10.1007/3-540-46016-0_42 mexican international
conference on artificial ...
micai 42D7146F
mexican international
conference on artificial ...
micai 42D7146F
10.1007/978-3-642-37807-2
_26 ...
mexican international
conference on artificial ...
micai 42D7146F
10.1007/978-3-642-05258-3
_42 ...
mexican international
conference on artificial ...
micai 42D7146F
10.1007/978-3-319-13650-9
_14 ...
mexican international
conference on artificial ...
micai 42D7146F
mexican international
conference on artificial ...
micai 42D7146F
Paper rank Ref Number Total Citations by Year Total Citations by Year
without Self Citations ...
Authors List Sorted
19517 21 None None [834A11E2, 7E8BA14F,
852B2668] ...
19444 7 None None [7DB8825E, 6936139F]
18870 14 {'2015': 10.0, '2014':
9.0, '2011': 3.0, '20 ...
{'2015': 8.0, '2014':
7.0, '2011': 1.0, '20 ...
[81867464, 8106CCE6,
7D20CE86, 7C6C6BB9] ...
19444 7 None None [807DCA23, 811B0352,
2779E3F4] ...
19177 8 {'2003': 1.0, '2006':
3.0, '2007': 3.0, '20 ...
{'2003': 1.0, '2006':
2.0, '2007': 2.0, '20 ...
[7F553272, 7F830ACE,
7E7F1E07] ...
19555 0 None None [7EE331AF]
19476 10 None None [80CF45DD, 814339C1,
7ED13F21, 7F45D6E4] ...
19428 9 {'2015': 1.0, '2014':
1.0} ...
{'2015': 1.0, '2014':
1.0} ...
[7E2F72E3, 45F06265]
19468 9 None None [7677E6C4, 7EBBEA7F,
7F312412, 776D4ECC, ...
19394 9 {'2015': 3.0, '2014':
2.0, '2013': 2.0, '20 ...
{'2015': 3.0, '2014':
2.0, '2013': 2.0, '20 ...
[7E792787, 7EC1BF2D,
7897839F] ...
Keywords List Field of study list Field of study list names Fields of study parent
list (L0) ...
None None None None
None None None None
[measures, software
measurement, autonomy, ...
[0A9CB5A9, 0556B228,
03E623B0, 0ABCEA76, ...
[Measure, Software
measurement, Autonomy, ...
[0271BC14, 0895A350,
0205A1DB, 07982D63] ...
[optimization problem,
probability density] ...
[083736DA, 0BBED543] [None, Probability
density function] ...
[0205A1DB]
[process algebra, process
calculi, multi agent ...
[09A47029, 09A47029,
027A0232, 027A0232, ...
[Process calculus,
Process calculus, Multi- ...
[0271BC14]
[cognition, computational
linguistics, grammars, ...
[0A2079AC, 093E8748,
03365AB6, 044294F0, ...
[Cognition, Computational
linguistics, Rule-based ...
[0271BC14, 00F03FC7]
[computer aided design,
cad, ontologies] ...
[07245C42, 0B9C400C,
09F001E0] ...
[Computer Aided Design,
None, Ontology] ...
[0271BC14]
[dynamic system, dynamic
systems, neural network ...
[0AA68668, 0304C748,
0304C748, 0AA68668, ...
[Dynamical system,
Artificial neural ...
[0271BC14, 0B0FEB68]
None None None None
[multiobjective
optimization] ...
[04198571] [Multi-objective
optimization] ...
[]
Fields of study parent
list names (L0) ...
Fields of study parent
list (L1) ...
Fields of study parent
list names (L1) ...
Fields of study parent
list (L2) ...
None None None None
None None None None
[Computer Science,
Sociology, Mathematics, ...
[0BE4BA29, 0765A2E4,
093C4716, 06E88D7C] ...
[Law, Data mining,
Artificial intelligence, ...
[00F36ADC, 05A3DFDE]
[Mathematics] [064E5072] [Statistics] [007E3B49]
[Computer Science] [0C19BFCD, 0BE20181,
093C4716] ...
[Immunology, Programming
language, Artificial ...
[027A0232]
[Computer Science,
Psychology] ...
[0BE20181, 0C2DB2A7] [Programming language,
Natural language ...
[00E4DDF6, 0C199D1F,
093E8748, 044294F0] ...
[Computer Science] [093C4716] [Artificial intelligence] [07245C42]
[Computer Science,
Chemistry] ...
[0724DFBA] [Machine learning] [0304C748, 097464D7]
None None None None
[] [07868074] [Mathematical
optimization] ...
[02724C38]
Fields of study parent
list names (L2) ...
Authors Number Urls Fields of study parent
list (L3) ...
None 3 [http://link.springer.com
/content/pdf/10.1007% ...
None
None 2 [http://dl.acm.org/citati
on.cfm?id=2481834, ht ...
None
[Project management,
Politics] ...
4 [http://ieeexplore.ieee.o
rg/xpl/abstractAuthor ...
[0059F32E, 0556B228,
03E623B0, 0ABCEA76, ...
[Stochastic process] 3 [http://dx.doi.org/10.100
7/978-3-540-88636-5_2 ...
[0BBED543]
[Multi-agent system] 3 [http://dl.acm.org/citati
on.cfm?id=691909, htt ...
[09A47029, 0087AC0D]
[Speech synthesis,
Machine translation, ...
1 [http://ieeexplore.ieee.o
rg/lpdocs/epic03/wrap ...
[0322F49A, 0A2079AC,
03365AB6, 041AB807] ...
[Computer Aided Design] 4 [http://dl.acm.org/citati
on.cfm?id=2481852, ht ...
[09F001E0]
[Artificial neural
network, Nonlinear ...
2 [http://adsabs.harvard.ed
u/abs/2009LNCS.5845.. ...
[00BB2E8D, 0AA68668,
2078A8D7] ...
None 5 [http://link.springer.com
/content/pdf/10.1007% ...
None
[Linear programming] 3 [http://dx.doi.org/10.100
7/978-3-540-88636-5_4 ...
[04198571]
Fields of study parent
list names (L3) ...
None
None
[Software agent, Software
measurement, Autonomy, ...
[Probability density
function] ...
[Process calculus, Immune
system] ...
[Feature extraction,
Cognition, Rule-based ...
[Ontology]
[Support vector machine,
Dynamical system, Cop ...
None
[Multi-objective
optimization] ...
[126903970 rows x 28 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
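
The two citation columns in the output above hold dictionaries keyed by year (as strings), or None when no citation data is available. The sketch below shows one way to compare them. It assumes, based only on the column names, that the gap between the two values for a year is the number of self-citations. The function name is hypothetical and not part of the project code.

```python
def self_citations_by_year(total_by_year, without_self_by_year):
    """Return {year: total - non-self} for every year in total_by_year.

    Both arguments are dicts keyed by year strings, or None when a paper
    has no citation data (as in several rows of the output above).
    """
    if total_by_year is None:
        return {}
    without_self = without_self_by_year or {}
    return {
        year: total - without_self.get(year, 0.0)
        for year, total in total_by_year.items()
    }


# Values copied from the visible (truncated) part of the third row above.
total = {'2015': 10.0, '2014': 9.0, '2011': 3.0}
no_self = {'2015': 8.0, '2014': 7.0, '2011': 1.0}
print(self_citations_by_year(total, no_self))
# prints {'2015': 2.0, '2014': 2.0, '2011': 2.0}
```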

In our study, we also analyzed how various authors' attributes, such as the number of published papers, number of coauthors, etc., have changed over time. To achieve this, we created an authors features SFrame using the following code:

In [3]:
from create_mag_authors_sframe import *
a = AuthorsFeaturesExtractor()

# This needs to run on a strong server and can take considerable time
a_sf = a.get_authors_all_features_sframe()
a_sf #the SFrame can be later loaded using tc.load_sframe(AUTHROS_FEATURES_SFRAME)
Out[3]:
Author ID Papers by Years Dict Coauthors by Years Dict Affilation by Year Dict
00001F05 {2010: ['5DA0F250'],
2013: ['7AF8ABFE']} ...
{2013: ['77CE16EC',
'17B20BAE']} ...
{2010: [''], 2013: ['']}
00002AD3 {2009: ['7A0B348F'],
2010: ['795F56C6'], 2 ...
{2009: ['7FD1B86A',
'7921EA7D', '05390F01', ...
{2009: [''], 2010: [''],
2012: [''], 2006: [''], ...
00006A31 {2009: ['7CFAEB15']} {2009: ['7E24B147',
'7C3ED158', '79A6FF42', ...
{2009: ['']}
0000B5FA {2008: ['7714EB4E'],
2009: ['78B7C257'], 2 ...
{2008: ['54648B9B'],
2009: ['7ADCCDB0', ...
{2008: [''], 2009: [''],
2010: ['', '', ''], 2 ...
0001CF9B {2013: ['7C9BBC3A']} {2013: ['852809D8',
'77FE1F64', '80B5223D', ...
{2013: ['']}
00040294 {2009: ['7892886A',
'81263516', '81424AA7'], ...
{2009: ['75D367F6',
'82C1C4DE', '80824D21', ...
{2009: ['', '', ''],
2011: ['', ''], 2004: ...
00045553 {1987: ['77F10A3D'],
1988: ['7836E5B8'], 1 ...
{1987: ['85CAEB12'],
1988: ['819D7046', ...
{1987: ['new york medical
college'], 1988: [''], ...
0004B8AF {2010: ['77AFDEB4']} {2010: ['77227437',
'77CB65A7', '5EBF97A1', ...
{2010: ['']}
000510E2 {2011: ['80790612'],
2012: ['76E5D7F2', ...
{2011: ['82D84635',
'11F1B283'], 2012: ...
{2011: ['university of
queensland'], 2012: ['', ...
00063841 {2014: ['7790AFD4']} {2014: ['853EBBF2',
'7901305E', '0F71473E', ...
{2014: ['']}
Sequence Number by Year
Dict ...
Author name First name Last name Conference ID by Year
Dict ...
{2010.0: array('d',
[1.0]), 2013.0: ...
nancy praill nancy praill {2010: [''], 2013: ['']}
{2009.0: array('d',
[1.0]), 2010.0: ...
david s rebergen david rebergen {2009: [''], 2010: [''],
2012: [''], 2006: [''], ...
{2009.0: array('d',
[6.0])} ...
b zelazowska b zelazowska {2009: ['']}
{2008.0: array('d',
[1.0]), 2009.0: ...
lars goerigk lars goerigk {2008: [''], 2009: [''],
2010: ['', '', ''], 2 ...
{2013.0: array('d',
[5.0])} ...
orlando lastres
danguillecourt ...
orlando danguillecourt {2013: ['']}
{2009.0: array('d', [4.0,
7.0, 6.0]), 2011.0: ...
ivani bisordi ivani bisordi {2009: ['', '', ''],
2011: ['', ''], 2004: ...
{1987.0: array('d',
[1.0]), 1988.0: ...
miguel a pappolla miguel pappolla {1987: [''], 1988: [''],
1989: [''], 1990: ['', ...
{2010.0: array('d',
[7.0])} ...
dong zaijie dong zaijie {2010: ['']}
{2011.0: array('d',
[1.0]), 2012.0: ...
fairlie mcilwraith fairlie mcilwraith {2011: [''], 2012: ['',
'', ''], 2013: [''], ...
{2014.0: array('d',
[8.0])} ...
tiziano ponsetti tiziano ponsetti {2014: ['']}
Journal ID by Year Dict Venue by Year Dict Gender Dict
{2010: [''], 2013: ['']} {2010: [''], 2013: ['']} {'Gender': 'Female',
'Total Males': 2999, ...
{2009: ['0959867B'],
2010: ['0069C535'], 2 ...
{2009: ['Journal of
Occupational and ...
{'Gender': 'Male', 'Total
Males': 3700247, 'Total ...
{2009: ['036625C9']} {2009: ['Advances in
Medical Sciences']} ...
{'Gender': 'Unisex',
'Total Males': 536, ...
{2008: ['0A1986D0'],
2009: ['0ACEE946'], 2 ...
{2008: ['ChemPhysChem'],
2009: ['Physical ...
{'Gender': 'Male', 'Total
Males': 12459, 'Total ...
{2013: ['01F41F83']} {2013: ['International
Journal of Energy ...
{'Gender': 'Male', 'Total
Males': 47535, 'Total ...
{2009: ['08826C6E',
'0B483532', '05F694A1'], ...
{2009: ['Infection,
Genetics and Evolution', ...
{'Gender': 'Female',
'Total Males': 0, 'Total ...
{1987: ['096E1E70'],
1988: ['03C89659'], 1 ...
{1987: ['Synapse'], 1988:
['Human Pathology'], ...
{'Gender': 'Male', 'Total
Males': 173865, 'Total ...
{2010: ['0B0C2E2F']} {2010: ['Aquaculture
Research']} ...
{'Gender': 'Male', 'Total
Males': 317, 'Total ...
{2011: ['080AF648'],
2012: ['068E6FF5', '', ...
{2011: ['Drug and Alcohol
Review'], 2012: ['Drug ...
{'Gender': 'Unisex',
'Total Males': 7, 'Total ...
{2014: ['06FD8B4A']} {2014: ['Catalysis
Today']} ...
{'Gender': 'Male', 'Total
Males': 37, 'Total ...
[22443094 rows x 12 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

The above SFrame contains various features of each author, constructed by analyzing the author’s papers that have at least 5 references. Note that the authors SFrame also contains a gender prediction for each author. This column was created from first-name gender statistics obtained from the SSA Baby Names and WikiTree datasets, which include over 115 thousand unique first names (see details in geneder_classifier.py).
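
To make the idea of a first-name gender prediction concrete, here is a minimal sketch that labels a name from how often it was recorded for each gender. It is not the logic of the project's classifier: the function name, the parameters, and the 0.9 threshold are assumptions chosen for illustration. Only the labels 'Male', 'Female', and 'Unisex' are taken from the output above.

```python
def classify_first_name(male_count, female_count, threshold=0.9):
    """Label a first name from how often it was recorded for each gender.

    threshold is a hypothetical cut-off: a name is labelled 'Male' or
    'Female' only when that gender accounts for at least this share of
    the records; otherwise it is labelled 'Unisex'.
    """
    total = male_count + female_count
    if total == 0:
        return None  # the name does not appear in the statistics
    if male_count / total >= threshold:
        return 'Male'
    if female_count / total >= threshold:
        return 'Female'
    return 'Unisex'


# Made-up counts, for illustration only.
print(classify_first_name(950, 50))
print(classify_first_name(400, 600))
```

A share-based rule like this keeps rare but lopsided names classifiable while sending evenly split names to 'Unisex'. Where to set the threshold is a judgment call.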

1.2 The AMiner Dataset

After downloading the AMiner dataset from the AMiner website, simply load it into an SFrame using the following code:

In [4]:
aminer_sf = tc.SFrame.read_json('%s/*.txt' % AMINER_DATA_DIR,  orient='lines')
aminer_sf # the SFrame can be accessed also by using tc.load_sframe(AMINER_PAPERS_SFRAME)
Out[4]:
abstract authors doi id
None [{'name': 'G. Adam'},
{'name': 'K. Schreibe ...
10.1002/ange.19650770204 53e99784b7602d9701f3e130
None [{'name': 'R. Farahbod'},
{'name': 'V. Gervasi'}, ...
None 53e99784b7602d9701f3e131
The method to making
technology roadmap is ...
[{'name': 'MO Chou'},
{'name': 'CHEN Jiqing'}, ...
None 53e99784b7602d9701f3e132
Drought is the first
place in all the natural ...
[{'name': 'Peijuan
Wang'}, {'name': 'Jiahua ...
10.1109/IGARSS.2011.60495
03 ...
53e99784b7602d9701f3e133
Determination of total
sugar can serve to ...
[{'org': 'Yantai
Institute of Coastal ...
None 53e99784b7602d9701f3e135
Resumen: Uno de los
problemas que debemos ...
[{'name': 'CELSO
VARGAS'}] ...
None 53e99784b7602d9701f3e136
None [{'name': 'D J Lum'},
{'name': 'V Upadhyay'}, ...
10.1111/j.1365-2559.2007.
02817.x ...
53e99784b7602d9701f3e137
This paper discussed the
planning and design ...
[{'org': 'School of
Resource and ...
None 53e99784b7602d9701f3e139
Rough set is a
mathematical tool to ...
None None 53e99784b7602d9701f3e13a
None [{'name': 'F
THOUVENYPAISANT'}, ...
10.1016/S0221-0363(05)762
74-0 ...
53e99784b7602d9701f3e13b
isbn issn issue keywords lang n_citation page_end page_start pdf
None None 2 None en None 95 94 None
None None None None en None None None None
None None 19 [science and technology
production, technology ...
zh None 95 90 None
None None null [canopy parameters,
canopy spectrum, ...
en None 1933 1930 None
None None 07 [metabolites, Jerusalem
artichoke, total sugar, ...
zh 1 93+97 90 None
None None None None en None None None None
None None 5 None en None 707 704 None
None None 28 [Planning and design
method, Mountainous ...
zh 1 364 362 None
None None 11 [Data Mining, Rough Set,
Algorithm, Rules ...
zh 3 106 104 None
None None 10 None en None 1555 1555 None
references title url venue
[53e9a6e6b7602d970301a47d
, ...
1.4-N→N′-Acylwanderun
g bei einem ...
[http://dx.doi.org/10.100
2/ange.19650770204] ...
Angewandte Chemie
[53e9a1d0b7602d9702ac8f1b
, ...
Design and Specification
of the CoreASM Execution ...
None None
None Practice Research on
Technology Roadmap for ...
None Science and Technology
Management Research ...
[53e999c3b7602d970220b9b7
, ...
The relationship between
canopy parameters and ...
[http://dx.doi.org/10.110
9/IGARSS.2011.6049503] ...
IGARSS
None The effect of metabolites
on the determination of ...
None Food Science and
Technology ...
None El Humanista y la
Energía Nuclear ...
None None
[53e9b395b7602d9703e78794
, ...
Botryoid fibroepithelial
polyp of the urinary ...
[http://dx.doi.org/10.111
1/j.1365-2559.2007.02 ...
Histopathology
None Planning and Design
Method of Land ...
None Journal of Anhui
Agricultural Sciences ...
None A Data Mining Based on
Rough Set Theory ...
None Software Guide
None RI1 Embolisation des
varices stomiales par ...
[http://dx.doi.org/10.101
6/S0221-0363(05)76274 ...
Journal De Radiologie
volume year
77 1965
None None
None 2013
null 2011
None 2012
None 2013
51 2007
None 2012
None 2012
86 2005
[154771161 rows x 19 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
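
The orient='lines' argument used above means that every line of the input files is a separate JSON object. The standard-library sketch below shows that layout with two made-up records that reuse field names from the output. The read_json_lines helper is hypothetical and only illustrates the file format, not how Turi Create parses it.

```python
import io
import json


def read_json_lines(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records; the second one has no "lang" field.
sample = io.StringIO(
    '{"id": "a1", "title": "First paper", "year": 2012, "lang": "en"}\n'
    '{"id": "a2", "title": "Second paper", "year": 2013}\n'
)
papers = list(read_json_lines(sample))

# A field missing from a record comes back as None here, mirroring the
# None entries in the table above.
langs = [p.get("lang") for p in papers]
print(langs)
```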

1.3 The SJR Dataset

First, we download all the journal ranking files from the SJR website. Next, we use the following code to create a single SFrame with all the journal data:

In [5]:
from create_sjr_sframe import *
sjr_sf = create_sjr_sframe(SJR_FILES_DIR)
sjr_sf # the SFrame can also be accessed using tc.load_sframe(SJR_SFRAME)
Out[5]:
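+------+-----------------------------------------------------+------------+--------+-------------------+---------+-------------+----------------------+
| Rank | Title                                               | Type       | SJR    | SJR Best Quartile | H index | Total Docs. | Total Docs. (3years) |
+------+-----------------------------------------------------+------------+--------+-------------------+---------+-------------+----------------------+
| 1    | Astrophysical Journal Letters ...                   | journal    | 61.473 | Q1                | 82      | 5           | 7                    |
| 1    | Astrophysical Journal Letters ...                   | journal    | 61.473 | Q1                | 82      | 5           | 7                    |
| 2    | Annual Review of Biochemistry ...                   | journal    | 49.476 | Q1                | 248     | 30          | 81                   |
| 2    | Annual Review of Biochemistry ...                   | journal    | 49.476 | Q1                | 248     | 30          | 81                   |
| 3    | Cell                                                | journal    | 41.978 | Q1                | 616     | 354         | 1359                 |
| 3    | Cell                                                | journal    | 41.978 | Q1                | 616     | 354         | 1359                 |
| 4    | Annual Review of Immunology ...                     | journal    | 40.906 | Q1                | 254     | 29          | 81                   |
| 4    | Annual Review of Immunology ...                     | journal    | 40.906 | Q1                | 254     | 29          | 81                   |
| 5    | Annual Review of Cell and Developmental Biology ... | book serie | 33.882 | Q1                | 182     | 25          | 61                   |
| 5    | Annual Review of Cell and Developmental Biology ... | book serie | 33.882 | Q1                | 182     | 25          | 61                   |
+------+-----------------------------------------------------+------------+--------+-------------------+---------+-------------+----------------------+
+-------------+----------------------+------------------------+-----------------------+-------------+----------------+
| Total Refs. | Total Cites (3years) | Citable Docs. (3years) | Cites / Doc. (2years) | Ref. / Doc. | Country        |
+-------------+----------------------+------------------------+-----------------------+-------------+----------------+
| 350         | 493                  | 7                      | 72.75                 | 70.0        | United Kingdom |
| 350         | 493                  | 7                      | 72.75                 | 70.0        | United Kingdom |
| 5913        | 3445                 | 80                     | 35.38                 | 197.1       | United States  |
| 5913        | 3445                 | 80                     | 35.38                 | 197.1       | United States  |
| 15870       | 47390                | 1328                   | 34.36                 | 44.83       | United States  |
| 15870       | 47390                | 1328                   | 34.36                 | 44.83       | United States  |
| 5236        | 4030                 | 81                     | 46.69                 | 180.55      | United States  |
| 5236        | 4030                 | 81                     | 46.69                 | 180.55      | United States  |
| 4134        | 1770                 | 60                     | 26.55                 | 165.36      | United States  |
| 4134        | 1770                 | 60                     | 26.55                 | 165.36      | United States  |
+-------------+----------------------+------------------------+-----------------------+-------------+----------------+
+------+------------+----------+
| Year | Categories | ISSN     |
+------+------------+----------+
| 1999 |            | 20418213 |
| 1999 |            | 20418205 |
| 1999 |            | 15454509 |
| 1999 |            | 00664154 |
| 1999 |            | 00928674 |
| 1999 |            | 10974172 |
| 1999 |            | 07320582 |
| 1999 |            | 15453278 |
| 1999 |            | 15308995 |
| 1999 |            | 10810706 |
+------+------------+----------+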
[502524 rows x 17 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
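
In the head of the SFrame above, every journal appears twice with identical metrics but different ISSN values. That pattern is consistent with a journal record that lists several ISSNs being expanded into one row per ISSN. This is an inference from the output, not something stated about create_sjr_sframe, and the comma-separated input format below is likewise an assumption. A minimal sketch of such an expansion, with a hypothetical helper name:

```python
def explode_issns(row, issn_field="ISSN"):
    """Return one copy of row per ISSN found in a comma-separated field."""
    issns = [part.strip() for part in row[issn_field].split(",") if part.strip()]
    return [dict(row, **{issn_field: issn}) for issn in issns]


# Illustrative record assembled from the first two rows of the output above.
raw = {"Rank": 1, "Title": "Astrophysical Journal Letters",
       "Year": 1999, "ISSN": "20418213, 20418205"}
rows = explode_issns(raw)
for r in rows:
    print(r["Title"], r["ISSN"])
```

Having one row per ISSN lets a paper match its journal regardless of which of the journal's ISSNs the paper record carries.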

1.4 Joint Datasets

The MAG and AMiner datasets have slightly different sets of features. While the MAG dataset contains data on each author with a unique author ID, the AMiner dataset contains additional data on each paper, including the paper's abstract and the paper's ISSN or ISBN. Additionally, the SJR dataset contains data about each journal's ranking.

To combine the data from the author publication records and the journals' rankings, we joined the datasets. First, we joined the MAG and AMiner datasets by matching DOI values, using the following code (see also create_mag_aminer_sframe.py):

In [6]:
import turicreate as tc
import turicreate.aggregate as agg

# Drop MAG papers whose DOI appears more than once
sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)
g1 = sf.groupby('Paper Document Object Identifier (DOI)', {'Count': agg.COUNT()})
s1 = set(g1[g1['Count'] > 1]['Paper Document Object Identifier (DOI)'])
sf = sf[sf['Paper Document Object Identifier (DOI)'].apply(lambda doi: doi not in s1)]
sf.materialize()

# Drop AMiner papers whose DOI appears more than once
sf2 = tc.load_sframe(AMINER_PAPERS_SFRAME)
g2 = sf2.groupby('doi', {'Count': agg.COUNT()})
s2 = set(g2[g2['Count'] > 1]['doi'])
sf2 = sf2[sf2['doi'].apply(lambda doi: doi not in s2)]
sf2.materialize()

# Join the two datasets on DOI and keep only papers with a non-empty AMiner title
aminer_mag_sf = sf.join(sf2, {'Paper Document Object Identifier (DOI)': 'doi'})
aminer_mag_sf['title_len'] = aminer_mag_sf['title'].apply(lambda t: len(t))
aminer_mag_sf = aminer_mag_sf[aminer_mag_sf['title_len'] > 0]
aminer_mag_sf = aminer_mag_sf.rename({"Paper ID": "MAG Paper ID", "id": "Aminer Paper ID"})
# Assign the result: remove_column may return a new SFrame instead of modifying in place
aminer_mag_sf = aminer_mag_sf.remove_column('title_len')
aminer_mag_sf # this SFrame can be accessed using tc.load_sframe(AMINER_MAG_JOIN_SFRAME)
Out[6]:
MAG Paper ID Original paper title Normalized paper title Paper publish year Paper publish date
7C15F682 Ptychoptera deleta Novak,
1877 from the Early ...
ptychoptera deleta novak
1877 from the early ...
2011 2011
84A37D36 Sherborn’s
foraminiferal studies ...
sherborn s foraminiferal
studies and their ...
2016 2016/07/01
773B216E A new species of
hydrobiid snails ...
a new species of
hydrobiid snails moll ...
2011 2011/10/19
77C44F83 Revision of the
planthopper genus ...
revision of the
planthopper genus ...
2014 2014/10/12
75233F3E Female genitalia of
Seasogonia Young from ...
female genitalia of
seasogonia young from ...
2012 2012/11/01
7B5321C5 A taxonomic study on the
genus Ettchellsia ...
a taxonomic study on the
genus ettchellsia cam ...
2012 2012
3C77A5B8 An Asiatic Chironomid in
Brazil: morphology, DNA ...
an asiatic chironomid in
brazil morphology dna ...
2015 2015/07/27
8051111A A new species of
Smicromorpha ...
a new species of
smicromorpha hymenoptera ...
2009 2009/09/14
79BD8F37 Open exchange of
scientific knowledge and ...
open exchange of
scientific knowledge and ...
2014 2014/06/06
240B7EFF Four new species of
Epicephala Meyrick, 1880 ...
four new species of
epicephala meyrick 1880 ...
2015 2015/06/15
Paper Document Object
Identifier (DOI) ...
Original venue name Normalized venue name Journal ID mapped to
venue name ...
Conference ID mapped to
venue name ...
10.3897/zookeys.130.1401 ZooKeys zookeys 0BDFC074
10.3897/zookeys.550.9863 ZooKeys zookeys 0BDFC074
10.3897/zookeys.138.1927 ZooKeys zookeys 0BDFC074
10.3897/zookeys.462.6657 ZooKeys zookeys 0BDFC074
10.3897/zookeys.164.2132 ZooKeys zookeys 0BDFC074
10.3897/zookeys.254.4182 ZooKeys zookeys 0BDFC074
10.3897/zookeys.514.9925 ZooKeys zookeys 0BDFC074
10.3897/zookeys.20.195 ZooKeys zookeys 0BDFC074
10.3897/zookeys.414.7717 ZooKeys zookeys 0BDFC074
10.3897/zookeys.508.9479 ZooKeys zookeys 0BDFC074
Paper rank Ref. Number Total Citations by Year Total Citations by Year
without Self Citations ...
Authors List Sorted
19382 7 {'2015': 1.0} {'2015': 1.0} [855C02FD, 7B2C4199]
19555 None None None [84CB5028]
19402 12 {'2015': 3.0, '2014':
3.0, '2011': 1.0, '20 ...
{'2015': 1.0, '2014':
1.0, '2011': 1.0, '20 ...
[8439D30B]
19555 None None None [84C55AD3, 7E095DC6]
19427 6 None None [805137B8, 80CA5307]
19370 4 None None [80F9A983, 7FFB2555]
19555 None None None [6118B891, 7FE3B9C3,
79CC73E7, 7C17044D] ...
19157 3 {'2015': 4.0, '2014':
3.0, '2013': 2.0, '20 ...
{'2015': 4.0, '2014':
3.0, '2013': 2.0, '20 ...
[85B81F06, 7E5ABA3D]
19321 5 {'2015': 3.0, '2014':
1.0} ...
{'2015': 2.0} [7237B1F9, 7DAD3B1C,
78F96D88, 78A6ED0B] ...
19555 None None None [7CFFBD65, 862455E0,
7D640C0F] ...
Keywords List Field of study list Field of study list names Fields of study parent
list (L0) ...
[tertiary, biomedical
research, neogene, ...
[009377C6, 0660586C,
01A380F9, 039D5C06] ...
[Tertiary, None, Neogene,
Bioinformatics] ...
[]
None None None None
[biomedical research,
bioinformatics] ...
[0660586C, 039D5C06] [None, Bioinformatics] []
None None None None
[morphology, taxonomy] [06A2C3F5, 037ECF39] [Morphology, Taxonomy] [052C8328]
[taxonomy,
bioinformatics, ...
[039D5C06, 037ECF39,
0660586C] ...
[Bioinformatics,
Taxonomy, None] ...
[052C8328]
None None None None
[morphology, taxonomy] [06A2C3F5, 037ECF39] [Morphology, Taxonomy] [052C8328]
[taxonomy, intellectual
property rights, ...
[037ECF39, 0215A9CE,
039D5C06, 0660586C] ...
[Taxonomy, Intellectual
property, Bioinformat ...
[0895A350, 052C8328]
None None None None
Fields of study parent
list names (L0) ...
Fields of study parent
list (L1) ...
Fields of study parent
list names (L1) ...
Fields of study parent
list (L2) ...
[] [090B39EA, 039D5C06] [Paleontology,
Bioinformatics] ...
[0683829C]
None None None None
[] [039D5C06] [Bioinformatics] []
None None None None
[Biology] [027F4522] [Linguistics] [037ECF39, 06A2C3F5]
[Biology] [039D5C06] [Bioinformatics] [037ECF39]
None None None None
[Biology] [027F4522] [Linguistics] [037ECF39, 06A2C3F5]
[Sociology, Biology] [0BE4BA29, 039D5C06] [Law, Bioinformatics] [037ECF39]
None None None None
Fields of study parent
list names (L2) ...
Authors Number Urls abstract
[Stratigraphy] 2 [/pmc/articles/instance/3
260767/?report=abstra ...
The first fossil that was
described in ...
None 1 [http://zookeys.pensoft.n
et/lib/ajax_srv/artic ...
None
[] 1 [http://bionames.org/refe
rences/b38c8055b3453e ...
Anew minute valvatiform
species belonging to the ...
None 2 [http://www.researchgate.
net/publication/27090 ...
Chinese species in the
genus Nycheuma Fennah, ...
[Taxonomy, Morphology] 2 [/pmc/articles/instance/3
272620/?report=abstra ...
Seasogonia Young, 1986 is
a sharpshooter genus ...
[Taxonomy] 2 [http://bionames.org/refe
rences/4078cc33be92d7 ...
Three new species of
Ettchellsia Cameron, ...
None 4 [http://www.researchgate.
net/publication/28071 ...
None
[Taxonomy, Morphology] 2 [http://www.cabdirect.org
/abstracts/2009331573 ...
None
[Taxonomy] 4 [http://advocomplex.ch/fi
les/Open%20exchange%2 ...
Background. The 7(th)
Framework Programme for ...
None 3 [http://www.ncbi.nlm.nih.
gov/pubmed/26167120, ...
None
authors Aminer Paper ID isbn issn issue keywords lang
[{'org': 'Institute of
Biology, Pedagogical ...
55a4881d65ce31bc877df20d None 1313-2970 130 [cheb basin, cypris
formation, czech ...
en
[{'name': 'giles
miller'}] ...
56d81fffdabfae2eeeb5906d None None None None en
[{'org': 'Department of
Ecology and Systematics, ...
55a47a6865ce31bc877c5f09 None 1313-2970 138 [caenogastropoda,
daphniola eptalophos sp. ...
en
[{'org': 'The Special Key
Laboratory for ...
55a6b03265ce054aad712ec1 None 1313-2989 462 [delphacini, fulgoroidea,
hemiptera, nycheuma, new ...
en
[{'org': 'Institute of
Entomology, Guizhou ...
55a4953165ceb7cb02d2d096 None 1313-2970 164 [auchenorrhyncha,
cicadellinae, ...
en
[{'org': 'Laboratory of
Entomology, Faculty of ...
55a51ace65ceb7cb02e11911 None 1313-2989 254 [south east asia,
taxonomy, parasitic ...
en
[{'name': 'gizelle
amora'}, {'name': 'neusa ...
56d81fffdabfae2eeeb59087 None None None None en
[{'name': 'd c darling'},
{'name': 'norman f ...
53e9ac5cb7602d970362f86d None None 0 [morphology, taxonomy] en
[{'org': 'Plazi,
Zinggstrasse 16, 3007 ...
55a676ab65ce054aad68df46 None 1313-2989 414 [biodiversity knowledge,
european copyright, ...
en
[{'name': 'houhun li'},
{'name': 'zhibo wang'}, ...
56d82000dabfae2eeeb59093 None None None None en
n_citation page_end page_start pdf references title ...
3 305 299 None [53e99c60b7602d9702511119
, ...
Ptychoptera deleta
Novák, 1877 from the ...
...
None None None None None Sherborn’s
foraminiferal studies ...
...
None 64 53 None [56d8e4efdabfae2eee2affe9
, ...
A new species of
hydrobiid snails ...
...
None 57 47 None None Revision of the
planthopper genus ...
...
None 40 24 None [56d92143dabfae2eee9f1671
, ...
Female genitalia of
Seasogonia Young from ...
...
None 108 99 None [53e99ddab7602d970269b655
, ...
A taxonomic study on the
genus Ettchellsia ...
...
None None None None None An Asiatic Chironomid in
Brazil: morphology, DNA ...
...
None None None None [56d85c7cdabfae2eee5a5fe6
, ...
A new species of
Smicromorpha ...
...
9 135 109 None [55a4825f65ce31bc877d426d
, ...
Open exchange of
scientific knowledge and ...
...
None None None None None Four new species of
Epicephala Meyrick, 1880 ...
...
[28945815 rows x 44 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
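
The DOI join above has two steps: discard every DOI that occurs more than once within a dataset, then match the remaining records across the two datasets. The pure-Python sketch below reproduces that logic on toy records so it can be followed without loading the SFrames. The helper names and the sample DOIs are made up, and unlike the SFrame code it drops a missing (None) DOI explicitly.

```python
from collections import Counter

MAG_DOI = "Paper Document Object Identifier (DOI)"


def unique_by_key(records, key):
    """Map key value -> record, keeping only values that occur exactly once."""
    counts = Counter(r.get(key) for r in records)
    return {r[key]: r for r in records
            if r.get(key) is not None and counts[r[key]] == 1}


def join_on_doi(mag_records, aminer_records):
    """Inner-join two lists of dicts on DOI after dropping ambiguous DOIs."""
    mag = unique_by_key(mag_records, MAG_DOI)
    aminer = unique_by_key(aminer_records, "doi")
    joined = []
    for doi in mag.keys() & aminer.keys():
        merged = dict(aminer[doi])
        merged.update(mag[doi])
        joined.append(merged)
    return joined


# Toy records with made-up DOIs; "10.1/b" is ambiguous on the MAG side.
mag = [
    {"Paper ID": "M1", MAG_DOI: "10.1/a"},
    {"Paper ID": "M2", MAG_DOI: "10.1/b"},
    {"Paper ID": "M3", MAG_DOI: "10.1/b"},
]
aminer = [
    {"id": "A1", "doi": "10.1/a", "title": "Paper a"},
    {"id": "A2", "doi": "10.1/b", "title": "Paper b"},
    {"id": "A3", "doi": None, "title": "No DOI"},
]
result = join_on_doi(mag, aminer)
print(result)
```

Dropping ambiguous DOIs before joining trades coverage for precision: a duplicated DOI could otherwise attach one dataset's metadata to the wrong paper in the other.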

Using the joined dataset, we obtained an SFrame with the joint metadata of 28.9 million papers. We can take this SFrame and join it with the SJR dataset.
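
Joining on ISSN requires both datasets to write the ISSN in the same way. The AMiner output above shows hyphenated values such as 1313-2970, while the SJR output shows values without a hyphen such as 20418213. The sketch below is one possible normalization, not the project's own code. The function name and the strict four-plus-four pattern are assumptions, and accepting a trailing X check character (which the ISSN format allows) is an addition made for this example.

```python
import re

# Assumed shape of a hyphenated ISSN: four digits, a hyphen, three digits,
# and a final check character that is a digit or the letter X.
ISSN_PATTERN = re.compile(r"\b(\d{4})-(\d{3}[\dXx])\b")


def normalize_issn(issn):
    """Return the ISSN as eight characters without the hyphen, or None."""
    if issn is None:
        return None
    match = ISSN_PATTERN.search(issn)
    if match is None:
        return None
    return (match.group(1) + match.group(2)).upper()


print(normalize_issn("1313-2970"))
print(normalize_issn("null"))
```

Returning None for anything that does not look like an ISSN makes it easy to filter out unusable rows before the join.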

In [8]:
import re
def create_aminer_mag_sjr_sframe(year):
    """
    Creates a unified SFrame of AMiner, MAG, and the SJR datasets
    :param year: year to use for SJR data
    :return: SFrame with AMiner, MAG, and SJR data
    :rtype: tc.SFrame
    """
    sf = tc.load_sframe(AMINER_MAG_JOIN_SFRAME)
    sf = sf[sf['issn'] != None]
    sf = sf[sf['issn'] != 'null']
    sf.materialize()
    # Take the two digit groups of a hyphenated ISSN and concatenate them to match the SJR format
    r = re.compile(r"(\d+)-(\d+)")
    sf['issn_str'] = sf['issn'].apply(lambda i: "".join(r.findall(i)[0]) if len(r.findall(i)) > 0 else None)
    sf = sf[sf['issn_str'] != None]
    sjr_sf = tc.load_sframe(SJR_SFRAME)
    sjr_sf = sjr_sf[sjr_sf['Year'] == year]
    return sf.join(sjr_sf, on={'issn_str': "ISSN"})
create_aminer_mag_sjr_sframe(2015)
Out[8]:
MAG Paper ID Original paper title Normalized paper title Paper publish year Paper publish date
7C15F682 Ptychoptera deleta Novak,
1877 from the Early ...
ptychoptera deleta novak
1877 from the early ...
2011 2011
773B216E A new species of
hydrobiid snails ...
a new species of
hydrobiid snails moll ...
2011 2011/10/19
77C44F83 Revision of the
planthopper genus ...
revision of the
planthopper genus ...
2014 2014/10/12
75233F3E Female genitalia of
Seasogonia Young from ...
female genitalia of
seasogonia young from ...
2012 2012/11/01
7B5321C5 A taxonomic study on the
genus Ettchellsia ...
a taxonomic study on the
genus ettchellsia cam ...
2012 2012
79BD8F37 Open exchange of
scientific knowledge and ...
open exchange of
scientific knowledge and ...
2014 2014/06/06
778DE072 First report on
C-banding, fluorochrome ...
first report on c banding
fluorochrome staining ...
2013 2013/07/30
76034B62 Checklist of the families
Scathophagidae, Fanni ...
checklist of the families
scathophagidae fanniidae ...
2014 2014/09/19
781099DE Checklist of the family
Culicidae (Diptera) in ...
checklist of the family
culicidae diptera in ...
2014 2014
770A539F Two new species of
harvestmen (Opiliones, ...
two new species of
harvestmen opiliones ...
2014 2014/08/14
Paper Document Object
Identifier (DOI) ...
Original venue name Normalized venue name Journal ID mapped to
venue name ...
Conference ID mapped to
venue name ...
10.3897/zookeys.130.1401 ZooKeys zookeys 0BDFC074
10.3897/zookeys.138.1927 ZooKeys zookeys 0BDFC074
10.3897/zookeys.462.6657 ZooKeys zookeys 0BDFC074
10.3897/zookeys.164.2132 ZooKeys zookeys 0BDFC074
10.3897/zookeys.254.4182 ZooKeys zookeys 0BDFC074
10.3897/zookeys.414.7717 ZooKeys zookeys 0BDFC074
10.3897/zookeys.319.4265 ZooKeys zookeys 0BDFC074
10.3897/zookeys.441.7142 ZooKeys zookeys 0BDFC074
10.3897/zookeys.441.7743 ZooKeys zookeys 0BDFC074
10.3897/zookeys.434.7486 ZooKeys zookeys 0BDFC074
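+------------+-------------+-------------------------------------------------+----------------------------------------------------+----------------------------------------------+
| Paper rank | Ref. Number | Total Citations by Year                         | Total Citations by Year without Self Citations ... | Authors List Sorted                          |
+------------+-------------+-------------------------------------------------+----------------------------------------------------+----------------------------------------------+
| 19382      | 7           | {'2015': 1.0}                                   | {'2015': 1.0}                                      | [855C02FD, 7B2C4199]                         |
| 19402      | 12          | {'2015': 3.0, '2014': 3.0, '2011': 1.0, '20 ... | {'2015': 1.0, '2014': 1.0, '2011': 1.0, '20 ...    | [8439D30B]                                   |
| 19555      | None        | None                                            | None                                               | [84C55AD3, 7E095DC6]                         |
| 19427      | 6           | None                                            | None                                               | [805137B8, 80CA5307]                         |
| 19370      | 4           | None                                            | None                                               | [80F9A983, 7FFB2555]                         |
| 19321      | 5           | {'2015': 3.0, '2014': 1.0} ...                  | {'2015': 2.0}                                      | [7237B1F9, 7DAD3B1C, 78F96D88, 78A6ED0B] ... |
| 19485      | 15          | {'2015': 1.0, '2014': 1.0} ...                  | {'2015': 1.0, '2014': 1.0} ...                     | [7FDB7566, 80418DAE]                         |
| 19404      | 5           | None                                            | None                                               | [78FFCF1E, 7CD385BA]                         |
| 19424      | 6           | {'2015': 1.0}                                   | {'2015': 1.0}                                      | [7D2454E0]                                   |
| 19404      | 5           | None                                            | None                                               | [850AEC5F, 8306860B]                         |
+------------+-------------+-------------------------------------------------+----------------------------------------------------+----------------------------------------------+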
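+----------------------------------------------+----------------------------------------------+---------------------------------------------------+--------------------------------------+
| Keywords List                                | Field of study list                          | Field of study list names                         | Fields of study parent list (L0) ... |
+----------------------------------------------+----------------------------------------------+---------------------------------------------------+--------------------------------------+
| [tertiary, biomedical research, neogene, ... | [009377C6, 0660586C, 01A380F9, 039D5C06] ... | [Tertiary, None, Neogene, Bioinformatics] ...     | []                                   |
| [biomedical research, bioinformatics] ...    | [0660586C, 039D5C06]                         | [None, Bioinformatics]                            | []                                   |
| None                                         | None                                         | None                                              | None                                 |
| [morphology, taxonomy]                       | [06A2C3F5, 037ECF39]                         | [Morphology, Taxonomy]                            | [052C8328]                           |
| [taxonomy, bioinformatics, ...               | [039D5C06, 037ECF39, 0660586C] ...           | [Bioinformatics, Taxonomy, None] ...              | [052C8328]                           |
| [taxonomy, intellectual property rights, ... | [037ECF39, 0215A9CE, 039D5C06, 0660586C] ... | [Taxonomy, Intellectual property, Bioinformat ... | [0895A350, 052C8328]                 |
| [biomedical research, bioinformatics] ...    | [0660586C, 039D5C06]                         | [None, Bioinformatics]                            | []                                   |
| [bioinformatics, biomedical research] ...    | [039D5C06, 0660586C]                         | [Bioinformatics, None]                            | []                                   |
| [bioinformatics, biomedical research] ...    | [039D5C06, 0660586C]                         | [Bioinformatics, None]                            | []                                   |
| [taxonomy]                                   | [037ECF39]                                   | [Taxonomy]                                        | [052C8328]                           |
+----------------------------------------------+----------------------------------------------+---------------------------------------------------+--------------------------------------+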
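+--------------------------------------------+--------------------------------------+--------------------------------------------+--------------------------------------+
| Fields of study parent list names (L0) ... | Fields of study parent list (L1) ... | Fields of study parent list names (L1) ... | Fields of study parent list (L2) ... |
+--------------------------------------------+--------------------------------------+--------------------------------------------+--------------------------------------+
| []                                         | [090B39EA, 039D5C06]                 | [Paleontology, Bioinformatics] ...         | [0683829C]                           |
| []                                         | [039D5C06]                           | [Bioinformatics]                           | []                                   |
| None                                       | None                                 | None                                       | None                                 |
| [Biology]                                  | [027F4522]                           | [Linguistics]                              | [037ECF39, 06A2C3F5]                 |
| [Biology]                                  | [039D5C06]                           | [Bioinformatics]                           | [037ECF39]                           |
| [Sociology, Biology]                       | [0BE4BA29, 039D5C06]                 | [Law, Bioinformatics]                      | [037ECF39]                           |
| []                                         | [039D5C06]                           | [Bioinformatics]                           | []                                   |
| []                                         | [039D5C06]                           | [Bioinformatics]                           | []                                   |
| []                                         | [039D5C06]                           | [Bioinformatics]                           | []                                   |
| [Biology]                                  | []                                   | []                                         | [037ECF39]                           |
+--------------------------------------------+--------------------------------------+--------------------------------------------+--------------------------------------+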
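+--------------------------------------------+----------------+-------------------------------------------------------+------------------------------------------------------+
| Fields of study parent list names (L2) ... | Authors Number | Urls                                                  | abstract                                             |
+--------------------------------------------+----------------+-------------------------------------------------------+------------------------------------------------------+
| [Stratigraphy]                             | 2              | [/pmc/articles/instance/3260767/?report=abstra ...    | The first fossil that was described in ...           |
| []                                         | 1              | [http://bionames.org/references/b38c8055b3453e ...    | Anew minute valvatiform species belonging to the ... |
| None                                       | 2              | [http://www.researchgate.net/publication/27090 ...    | Chinese species in the genus Nycheuma Fennah, ...    |
| [Taxonomy, Morphology]                     | 2              | [/pmc/articles/instance/3272620/?report=abstra ...    | Seasogonia Young, 1986 is a sharpshooter genus ...   |
| [Taxonomy]                                 | 2              | [http://bionames.org/references/4078cc33be92d7 ...    | Three new species of Ettchellsia Cameron, ...        |
| [Taxonomy]                                 | 4              | [http://advocomplex.ch/files/Open%20exchange%2 ...    | Background. The 7(th) Framework Programme for ...    |
| []                                         | 2              | [/pmc/articles/PMC3764527/?report=abstract, ht ...    | In spite of various cytogenetic works on ...         |
| []                                         | 2              | [http://europepmc.org/abstract/MED/25337032, h ...    | A revised checklist of the Scathophagidae, ...       |
| []                                         | 1              | [http://europepmc.org/articles/PMC4200447?pdf= ...    | A checklist of the Culicidae (Diptera) ...           |
| [Taxonomy]                                 | 2              | [http://espace.library.curtin.edu.au/R?func=dbin- ... | Neopilionidae: Enantiobuninae) are ...               |
+--------------------------------------------+----------------+-------------------------------------------------------+------------------------------------------------------+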
+------------------------------------------------------+--------------------------+------+-----------+-------+--------------------------------------------------------+------+
| authors                                              | Aminer Paper ID          | isbn | issn      | issue | keywords                                               | lang |
+------------------------------------------------------+--------------------------+------+-----------+-------+--------------------------------------------------------+------+
| [{'org': 'Institute of Biology, Pedagogical ...      | 55a4881d65ce31bc877df20d | None | 1313-2970 | 130   | [cheb basin, cypris formation, czech ...               | en   |
| [{'org': 'Department of Ecology and Systematics, ... | 55a47a6865ce31bc877c5f09 | None | 1313-2970 | 138   | [caenogastropoda, daphniola eptalophos sp. ...         | en   |
| [{'org': 'The Special Key Laboratory for ...         | 55a6b03265ce054aad712ec1 | None | 1313-2989 | 462   | [delphacini, fulgoroidea, hemiptera, nycheuma, new ... | en   |
| [{'org': 'Institute of Entomology, Guizhou ...       | 55a4953165ceb7cb02d2d096 | None | 1313-2970 | 164   | [auchenorrhyncha, cicadellinae, ...                    | en   |
| [{'org': 'Laboratory of Entomology, Faculty of ...   | 55a51ace65ceb7cb02e11911 | None | 1313-2989 | 254   | [south east asia, taxonomy, parasitic ...              | en   |
| [{'org': 'Plazi, Zinggstrasse 16, 3007 ...           | 55a676ab65ce054aad68df46 | None | 1313-2989 | 414   | [biodiversity knowledge, european copyright, ...       | en   |
| [{'org': 'Department of Entomology, Y.S. Parmar ...  | 55a5ce0d65ce60f99bf5c02d | None | 1313-2989 | 319   | [c-banding, cma3, dapi, nor location, ...              | en   |
| [{'org': 'Finnish Museum of Natural History, ...     | 55a69aa665ce054aad6d871f | None | 1313-2989 | 441   | [diptera, finland, species list, ...                   | en   |
| [{'org': 'Finnish Museum of Natural History, ...     | 55a69aa665ce054aad6d8706 | None | 1313-2989 | 441   | [checklist, culicidae, diptera, finland, ...           | en   |
| [{'org': 'Dept of Environment and ...                | 55a688db65ce054aad6ae23a | None | 1313-2989 | 434   | [taxonomy, arachnids, cave biota] ...                  | en   |
+------------------------------------------------------+--------------------------+------+-----------+-------+--------------------------------------------------------+------+
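+------------+----------+------------+------+--------------------------------+-----------------------------------------------------+-----+
| n_citation | page_end | page_start | pdf  | references                     | title                                               | ... |
+------------+----------+------------+------+--------------------------------+-----------------------------------------------------+-----+
| 3          | 305      | 299        | None | [53e99c60b7602d9702511119, ... | Ptychoptera deleta Novák, 1877 from the ...         | ... |
| None       | 64       | 53         | None | [56d8e4efdabfae2eee2affe9, ... | A new species of hydrobiid snails ...               | ... |
| None       | 57       | 47         | None | None                           | Revision of the planthopper genus ...               | ... |
| None       | 40       | 24         | None | [56d92143dabfae2eee9f1671, ... | Female genitalia of Seasogonia Young from ...       | ... |
| None       | 108      | 99         | None | [53e99ddab7602d970269b655, ... | A taxonomic study on the genus Ettchellsia ...      | ... |
| 9          | 135      | 109        | None | [55a4825f65ce31bc877d426d, ... | Open exchange of scientific knowledge and ...       | ... |
| None       | 291      | 283        | None | [56d8b9c0dabfae2eee2562bf, ... | First report on C-banding, fluorochrome ...         | ... |
| None       | 367      | 347        | None | [56d90738dabfae2eeefe3d12, ... | Checklist of the families Scathophagidae, Fanni ... | ... |
| None       | 51       | 47         | None | [56d8d4a0dabfae2eeec120ac, ... | Checklist of the family Culicidae (Diptera) in ...  | ... |
| None       | 45       | 37         | None | [56d88f6bdabfae2eeedadea4, ... | Two new species of harvestmen (Opiliones, ...       | ... |
+------------+----------+------------+------+--------------------------------+-----------------------------------------------------+-----+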
[4498015 rows x 61 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

2. Loading the Dataset to MongoDB

Turi Create and SFrame objects give us a general picture of how academic publication dynamics have changed over time, but they make it challenging to derive more complicated insights, such as the trends of a specific journal. To reveal such insights, we need to load the dataset into a different framework; in this study, we chose MongoDB for the more complicated queries. We installed MongoDB on Ubuntu 17.10 using the instructions in the following link. Once MongoDB is installed and running, remember to set the user and password, and update the MONGO_HOST and MONGO_PORT variables in consts.py (the connection can also be adjusted to include user/password authentication). The next step is to load the SFrames created above into MongoDB collections using mongo_connector.py:

In [9]:
from mongo_connector import *
load_sframe()  # loads the SFrames into MongoDB collections
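
For the user/password authentication mentioned above, one possible way to assemble the connection string is sketched below. This is only an illustration and not the code in mongo_connector.py: the helper build_mongo_uri, the sample user name, and the sample password are hypothetical, and MONGO_HOST and MONGO_PORT are given placeholder values here instead of being read from consts.py.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port, user=None, password=None):
    """Return a MongoDB connection URI, adding credentials if both are given."""
    if user and password:
        # Percent-escape the credentials so characters such as '@' or ':'
        # cannot be confused with the URI's own separators.
        return "mongodb://%s:%s@%s:%d/" % (
            quote_plus(user), quote_plus(password), host, port)
    return "mongodb://%s:%d/" % (host, port)


# Placeholder values; in the project these would come from consts.py.
MONGO_HOST = "localhost"
MONGO_PORT = 27017

uri = build_mongo_uri(MONGO_HOST, MONGO_PORT, user="reader", password="p@ss:word")
print(uri)
```

The resulting string can be passed to pymongo.MongoClient in place of separate host and port arguments.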

At the end of the loading process, six collections will have been loaded into the journals database.
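
To give a rough idea of what loading a large table into a collection involves, here is a minimal sketch of batched insertion. It is an assumption about the general approach, not the actual implementation of load_sframe: the helpers iter_batches and load_rows and the FakeCollection stand-in are hypothetical. The only interface the sketch relies on is an insert_many method, which pymongo collections provide; the fake collection lets the example run without a MongoDB server.

```python
def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable of rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def load_rows(collection, rows, batch_size=1000):
    """Insert row dicts into a collection in batches; return the row count."""
    total = 0
    for batch in iter_batches(rows, batch_size):
        collection.insert_many(batch)
        total += len(batch)
    return total


class FakeCollection(object):
    """Hypothetical stand-in for a pymongo collection (no server needed)."""

    def __init__(self):
        self.docs = []

    def insert_many(self, docs):
        self.docs.extend(docs)


# Made-up rows that only mimic the shape of the data.
rows = [{"Paper ID": "id%d" % i, "Paper publish year": 2014} for i in range(2500)]
fake = FakeCollection()
loaded = load_rows(fake, rows, batch_size=1000)
print(loaded)
```

Inserting in batches keeps memory use bounded and avoids one round trip per document, which matters for tables with millions of rows.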

In [10]:
MD.client.journals.collection_names()
Out[10]:
[u'authoros_features',
 u'sjr_journals',
 u'aminer_mag_papers',
 u'fields_of_study_papers',
 u'papers_features',
 u'authors_features']
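
Note that collection_names() is not available in recent PyMongo releases (4.x), where list_collection_names() is used instead. The sketch below shows one way to support both; the helper get_collection_names and the two fake database classes are hypothetical and exist only so the example runs without a server.

```python
def get_collection_names(db):
    """Return the collection names of db with old or new PyMongo."""
    # Look the method up on the class, not the instance: a pymongo Database
    # treats unknown attribute names as collection names, so an instance-level
    # hasattr() check would always succeed.
    if hasattr(type(db), "list_collection_names"):
        return db.list_collection_names()
    return db.collection_names()  # older PyMongo releases only


class NewStyleDb(object):
    """Hypothetical stand-in for a Database in a recent PyMongo release."""

    def list_collection_names(self):
        return ["sjr_journals", "papers_features"]


class OldStyleDb(object):
    """Hypothetical stand-in for a Database in an older PyMongo release."""

    def collection_names(self):
        return ["sjr_journals", "papers_features"]


print(get_collection_names(NewStyleDb()))
print(get_collection_names(OldStyleDb()))
```

With a live connection, the call would be get_collection_names(MD.client.journals), assuming MD is the connector object used in the cell above.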

In the second part of the tutorial, we will demonstrate how the MongoDB collections created above can be used to calculate various statistics on paper collections, authors, journals, and research domains.