Given that grammatical and spelling errors have been found to influence perceived competence and credibility in written communication, this study examined how a student’s grammar and spelling errors affect how other students respond to the student’s postings in four online debates hosted in asynchronous threaded discussions. Message-response exchanges were sequentially analyzed to identify patterns in students’ replies to arguments and challenges with counter-challenges, explanations, and evidentiary support posted by students who exhibited low versus high numbers of grammatical and spelling errors. Although no causal inferences can be drawn from this study, the findings nevertheless suggest that: (a) arguments posted by high-error students are more likely to be challenged than arguments posted by low-error students; (b) exchanges between high-error students can amplify the effects of grammar/spelling errors; and (c) higher levels of argumentation can be achieved by placing students into groups that are heterogeneous in writing skills in general. The findings and methods used in this study lay the groundwork for further research on strategies for managing individual differences in students’ grammar and spelling (and other student behaviors in general) and increasing the level of critical discourse in online discussions.
]]>Learning Analytics (LA) is an emerging field in which sophisticated analytic tools are used to improve learning and education. It draws from, and is closely tied to, a series of other fields of study such as business intelligence, web analytics, academic analytics, educational data mining, and action analytics. The main objective of this research is to find meaningful indicators or metrics in a learning context and to study the inter-relationships between these metrics using the concepts of Learning Analytics and Educational Data Mining, thereby analyzing the effects of different features on students’ performance using Disposition analysis. In this project, the K-means clustering technique is used to obtain clusters, which are then mapped to find the important features of a learning context. Relationships between these features are identified to assess students’ performance.
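As a rough illustration of the clustering step described above, here is a minimal K-means sketch in plain Python. The abstract does not describe the project's data or tooling, so the feature names (hours on the platform, forum posts) and the sample values are assumptions made for this example only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_of(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from the data itself
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [centroid_of(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:  # assignments have stabilised
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical student records: (hours on platform, forum posts).
students = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
            (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(students, k=2)
```

Inspecting which features differ most between the resulting centroids is one simple way to map clusters back to the features of a learning context. The abstract does not say how that mapping was actually done.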
]]>Many universities are involved in the field of Instructional Design and Technology (IDT). The journal Educational Technology Research and Development published an article examining the academic productivity of universities in IDT.
The study counts publications in the top 20 scientific journals in the field of IDT over the period from 2005 to 2014. It shows that EADTU members the Open University UK and the Open University of the Netherlands are the only European institutions among the top 10 publishing institutions.
In addition, the American researchers looked at the number of contributions to the Handbook of Educational Communications Technology (3rd and 4th edition).
Again, only two European universities are listed in the top 7, both EADTU members: KU Leuven and the Open University of the Netherlands.
The post Data-riffic! What does data tell us about UK higher education in 2017? appeared first on Wonkhe.
]]>by András A. Benczúr, Róbert Pálovics (MTA SZTAKI), Márton Balassi (Cloudera), Volker Markl, Tilmann Rabl, Juan Soto (DFKI), Björn Hovstadius, Jim Dowling and Seif Haridi (SICS)
Big data analytics promise to deliver valuable business insights. However, this will be difficult to realise using today’s state-of-the-art technologies, given the flood of data generated from various sources. The European STREAMLINE project develops scalable, fast reacting, and high accuracy machine learning techniques for the needs of European online media companies.
by Mark Cieliebak (Zurich University of Applied Sciences)
Deep Neural Networks (DNNs) can achieve excellent results in text analytics tasks such as sentiment analysis, topic detection and entity extraction. In many cases they even come close to human performance. To achieve this, however, they are highly optimised for one specific task, and a huge amount of human effort is usually needed to design a DNN for a new task. With DeepText, we will develop a software pipeline that can solve arbitrary text analytics tasks with DNNs with minimal human input.
]]>The UK Data Service holds the UK’s largest collection of research data. It’s also an extremely useful source of information about how to use data in your research.
To help researchers get the most out of the service, they run a series of free webinars on what the service is and how to use it, training on topics like data management and data reuse, and introductions to some of their key data sets.
See the full list of webinars with links to registration. You’ll also find recordings of past webinars, a collection that will keep growing, so if you can’t attend live you can always catch up later.
The webinars and resources are available to everyone, but could be of particular interest if you are funded by the Economic and Social Research Council (ESRC). ESRC funds the UK Data Service and provides detailed guidance on data management planning for ESRC researchers.
]]>
As anyone who's tried to analyze real-world data knows, there are any number of problems that may be lurking in the data that can prevent you from being able to fit a useful predictive model:
The vtreat package is designed to counter common data problems like these in a statistically sound manner. It's a data frame preprocessor which applies a number of data cleaning processes to the input data before analysis, using techniques such as impact coding and categorical variable encoding (the methods are described in detail in this paper). Further details can be found on the vtreat GitHub page, where authors John Mount and Nina Zumel note:
Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, and so on) or standard statistical methods (regression, generalized regression, generalized additive models) there are common data issues that can cause modeling to fail. vtreat deals with a number of these in a principled and automated fashion.
One final note: the main function in the package, prepare, is a little like model.matrix in that categorical variables are converted into numeric variables using contrast codings. This means that the output is suitable for many machine-learning functions (like xgboost) that don't accept categorical variables.
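To make the idea of turning a categorical variable into a numeric one concrete, here is a minimal sketch of impact coding in Python. This is not vtreat's implementation (vtreat is an R package) and it omits safeguards such as smoothing and out-of-sample estimation. The function names and sample data are made up for illustration.

```python
from collections import defaultdict


def impact_code(levels, outcomes):
    """Map each category level to (mean outcome for that level) minus
    (overall mean outcome). Naive version with no smoothing."""
    grand_mean = sum(outcomes) / len(outcomes)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, outcomes):
        sums[level] += y
        counts[level] += 1
    return {level: sums[level] / counts[level] - grand_mean
            for level in sums}


def encode(levels, codes):
    """Replace each level with its numeric code. Levels not seen when
    the codes were built get 0.0, i.e. no deviation from the mean."""
    return [codes.get(level, 0.0) for level in levels]


# Hypothetical training data: a categorical column and a numeric outcome.
city = ["a", "a", "b", "b"]
sales = [1, 3, 5, 7]
codes = impact_code(city, sales)
numeric_city = encode(["a", "b", "c"], codes)
```

The single numeric column produced this way can be passed to learners that reject categorical inputs. Because the codes are estimated from the outcome itself, fitting them on the same rows used to train the model risks overfitting.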
The vtreat package is available on CRAN now, and you can find a worked example using vtreat in the blog post linked below.
Win-Vector Blog: vtreat: prepare data
]]>The integration of subject matter learning with reading and writing skills takes place in multiple ways. Students learn to read, interpret, and write texts in the discipline-relevant genres. However, writing can be used not only for the purposes of practice in professional communication, but also as an opportunity to reflect on the learned material. In this paper, we address a writing intervention – Utility Value (UV) intervention – that has been shown to be effective for promoting interest and retention in STEM subjects in laboratory studies and field experiments. We conduct a detailed investigation into the potential of natural language processing technology to support evaluation of such writing at scale: We devise a set of features that characterize UV writing across different genres, present common themes, and evaluate UV scoring models using essays on known and new biology topics. The automated UV scoring results are, we believe, promising, especially for the personal essay genre.
]]>This paper presents an investigation of score prediction based on natural language processing for two targeted constructs within analytic text-based writing: 1) students’ effective use of evidence and 2) their organization of ideas and evidence in support of their claim. With the long-term goal of producing feedback for students and teachers, we designed a task-dependent model, for each dimension, that aligns with the scoring rubric and makes use of the source material. We believe the model will be meaningful and easy to interpret given the writing task. We used two datasets of essays written by students in grades 5–6 and 6–8. Our experimental results show that our task-dependent model (consistent with the rubric) performs as well as, if not better than, competitive baselines. We also show the potential generalizability of the rubric-based model by performing cross-corpus experiments. Finally, we show that the predictive utility of different feature groups in our rubric-based modeling approach is related to how much each feature group covers a rubric’s criteria.
]]>A review is textual feedback provided by a reviewer to the author of a submitted version. Peer reviews are used in academic publishing and in education to assess student work. While reviews are important to e-commerce sites like Amazon and eBay, which use them to assess the quality of products and services, our work focuses on academic reviewing. We seek to help reviewers improve the quality of their reviews. One way to measure review quality is through metareview, or review of reviews. We develop automated metareview software that provides rapid feedback to reviewers on their assessment of authors’ submissions. To measure review quality, we employ metrics such as review content type, review relevance, review’s coverage of a submission, review tone, review volume and review plagiarism (from the submission or from other reviews). We use natural language processing and machine-learning techniques to calculate these metrics. We summarize results from experiments to evaluate our review quality metrics (review content, relevance and coverage) and a study to analyze user perceptions of the importance and usefulness of these metrics. Our approaches were evaluated on data from Expertiza and the Scaffolded Writing and Rewriting in the Discipline (SWoRD) project, which are two collaborative web-based learning applications.
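Two of the metrics named above, review volume and coverage of a submission, can be approximated with very simple lexical measures. The sketch below is only a crude stand-in. The abstract says the authors use natural language processing and machine-learning techniques but does not define the metrics, so both definitions here (unique-token count, shared-vocabulary fraction) are assumptions for illustration.

```python
import re


def tokens(text):
    """Lower-case word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def review_volume(review):
    """Assumed definition: number of distinct tokens in the review."""
    return len(set(tokens(review)))


def review_coverage(review, submission):
    """Assumed definition: fraction of the submission's distinct
    tokens that also appear in the review."""
    submission_vocab = set(tokens(submission))
    if not submission_vocab:
        return 0.0
    shared = submission_vocab & set(tokens(review))
    return len(shared) / len(submission_vocab)


# Hypothetical texts, invented for this example.
submission = "method results"
review = "The method is clear"
volume = review_volume(review)
coverage = review_coverage(review, submission)
```

A lexical overlap score like this cannot tell whether a review engages with the submission's ideas or merely repeats its words, which is presumably one reason the authors turn to NLP and machine learning.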
]]>I wrote this with Darrall Thompson recently for a Pearson ACODE award (for which REVIEW was ‘Highly Commended for Innovation in Technology Enhanced Learning’, although other very worthy projects took the prize), and think it’s worth sharing!
Darrall also put together this fantastic video for the application:
Single-mark or grade indicators are commonplace in describing student performance, leading to a tendency for both students and staff to focus on this single indicator, rather than a more nuanced evaluation of a student’s knowledge and attributes (Thompson, 2006). Moreover, such assessments cannot provide feedback regarding the development of knowledge and other attributes across disciplinary boundaries and years of study.
The REVIEW software is an assessment tool designed to bring both summative and formative feedback together, over time, and across disciplinary boundaries. The tool has been developed to enhance learning through three modes of action:
Led by researchers at the University of Technology Sydney (UTS), the tool has been evaluated against these objectives over a period of 12 years. Early evaluations (Kamvounias & Thompson, 2008; Taylor et al., 2009; Thompson, Treleaven, Kamvounias, Beem, & Hill, 2008) indicated that (1) based on student feedback surveys, they had generally positive experiences in using the tool, specifically that it enhanced the clarity of the assessment expectations, and (2) based on instructor reflections and analysis of unit outline changes, the tool was a driver for change in developing explicit assessment criteria and constructive alignment between assessments and graduate attributes.
Perhaps most significantly, based on 4 semesters of REVIEW self-assessment data, analysis indicates enhancement of student learning through calibration of their self-assessments such that they become more aligned with tutor-judgements over the semesters (Boud, Lawson, & Thompson, 2013), a finding replicated over a shorter period, with varied cohorts, elsewhere (Carroll, 2013). In addition, “There are early signs in student feedback that the visual display of criteria linked to attribute categories and sub-categories is useful in charting progress and presenting to employers in interview contexts. Employers take these charts seriously because they are derived from actual official assessment criteria from a broad range of subjects over time” (Thompson, Forthcoming, p. 19).
REVIEW’s impact is seen across whole cohorts of students, in multiple disciplines, and institutions. From initially being deployed at UTS, the REVIEW software has been adopted by: The University of New South Wales (UNSW); Queensland University of Technology (QUT); The University of Sydney; and – in developing pilot work – schools in both the UK and Australia. This external adoption forms a part of REVIEW’s impact at scale, demonstrating its reproducible impact on enhancing student learning, and (as we discuss further below) providing a sustainable model for its continued development.
Within these institutional contexts, the tool has growing adoption amongst academics. Indeed, its impact can be seen in the – largely organic – growth, with coordinators finding the tool helpful for engaging their tutors in a unified approach to student feedback. For example, at UNSW, a 2011 trial of REVIEW with four courses has expanded to 160 courses using the software each semester across three faculties: “[i]t was found that the use of Review improved learning, teaching and assessment administration and reporting outcomes” (Carroll, 2016). REVIEW facilitates a ‘bottom up’ approach to assessment innovation (Cathcart, Kerr, Fletcher, & Mack, 2008; Thompson, 2009). That is, rather than academics developing individual approaches, or being required to align their existing unit outlines and the activities within them to prescribed graduate attributes, they use a facilitative tool to make explicit the aims underlying their assessment tasks. This process often leads to more scenario-based questions to test the application of knowledge in examinations (Thompson, Forthcoming). Because of its mode of action, instructor (or departmental) adoption has an impact on all students enrolled in their courses – as such, all students in classes which use REVIEW are impacted by the increased focus on making assessment criteria explicit, articulating relationships between assessment criteria and graduate attributes, and drawing constructive alignment between these factors.
Moreover, our experience indicates that students do choose to engage with the self-review components of REVIEW (generally over 2/3 of students, an uptake we intend to investigate more formally). As a result of the benefits of the self-reflection process – highlighted above – attempts have been made to incentivize student engagement further, by providing reward or penalty for engagement and by creating engaging materials that articulate the purpose of the self-assessment process. “In my experience, the most successful method has been an introduction by the unit coordinator in combination with tutors who genuinely value the student gradings and demonstrate this feature by marking a piece of work in a large lecture context. Involving students in this live marking activity engages both them and the tutors in further understanding the criteria…” (Thompson, Forthcoming, p. 12).
The REVIEW tool is explicitly targeted at building capabilities, both of students and of the academic staff and tutors who work with them to develop their graduate attributes. As such, REVIEW is targeted at building capability in criterion-based assessment, and understanding of the application of these criteria – by both students and assessors – towards high-level graduate attributes, which the system foregrounds, thus facilitating change favouring constructive alignment between assessment tasks and these goals.
The system has won adoption through its ease of use and range of visual feedback, alongside – for instructors, and administrators – a range of reports offering value for course mapping, the benchmarking of sets of tutor assessments (e.g. to explore discrepancies in tutor marking), accreditation and assurance purposes, and monitoring changes in subjects over different deliveries. The reports are then used as discussion tools, to support professional development between tutors and instructors (Thompson, Forthcoming, p. 16). In addition, the software has facilitated course-reviews, through providing reporting on the mapping of assessment criteria to graduate attributes. These reports can, for example, reveal that some Course Outcomes are not in fact mapped to assessments, again opening discussion around assessment and outcome designs (Thompson, Forthcoming, p. 19).
Some impetus for use of REVIEW has come in one Faculty from the mandate that graduate attribute development be reported on by course teams, with REVIEW validated as a system to provide such evidence. Moreover, though, engagement goes beyond ‘box ticking’. The software facilitates and enhances an approach to criterion-based and self-assessment, but its implementation has been developed with a set of resources to guide academics in creating discipline-specific language to describe intended learning outcomes and their application to assessment tasks and criteria. It is thus a key facilitator of formative assessment both as an agent for change, and in terms of its scaffolding capabilities – emphasising criterion-assessment, and targeting feedback at those areas in which a student’s self-assessment is least accurate.
A key facilitative feature in the software has been the ‘visual grammar’ which threads through course documentation and the REVIEW software. In DAB, a memorable acronym, colour-set, and symbol have been developed to foreground each category to staff and students: CAPRI. CAPRI comprises the graduate attributes in the faculty: Communication and Groupwork; Attitudes and Values; Practical and Professional; Research and Critique; Innovation and Creativity. These attributes are then foregrounded in REVIEW, which is used to collect marks in the background from the day-to-day marking of assessment criteria linked to both Course Intended Learning Outcomes and the five CAPRI categories.
Top-down directives about graduate attribute integration often involve onerous documentation, alienating busy academics while having minimal impact at the student level. For improvement in feedback to occur, instructors need to be given timesaving strategies and support. Software such as REVIEW must be integrated into the main university system to save time in assessment and reporting processes. The timesaving aspects and ease of use of REVIEW together with its perceived value to staff and students caused it to spread by osmosis, leading to its commercialization by the University of Technology Sydney in 2011.
University technology divisions require highly secure systems that do not compromise their existing services. There are a number of approaches for web-based systems hosted internally by each university or externally by a provider. The developer’s recommendation is for REVIEW to be externally hosted and undergo rigorous penetration testing with every upgrade release. However, an internally hosted option is available. The configuration of the system and Application Programming Interface (API) integration are essential for broad adoption, together with policy approvals by faculty boards, heads of school, and course directors.
REVIEW features are continuously upgraded through a collaborative funding model that enables universities that require a particular feature to pay for it to be included. For example, the Assurance of Learning reporting system illustrated in Figure 9 was funded by the University of New South Wales (UNSW) because of their requirement for Business School accreditation by the AACSB (Association to Advance Collegiate Schools of Business) and EQUIS (European Quality Improvement System). They have used this module in REVIEW extensively for their successful and continuing accreditation processes and maintain that previous methods of collecting and compiling data for these reports were onerous and time-consuming at the most highly pressured times of the year. REVIEW has automated this process with a level of granularity that has assured its adoption across a number of faculties.
The collaborative funding model is a progressive format that enables such Assurance of Learning and other modules to be available for any other user of REVIEW free of charge. Shared or individually funded features are specified, and costs are then estimated by the software developers in Sydney. Extensive modules together with smaller features are implemented with ongoing upgrade versions. There is a REVIEW Users Group (RUG) jointly run by UNSW and UTS as both an academic and technical forum for ideas, feature requests, and upgrade presentations.
The REVIEW tool has been adopted across multiple disciplinary and institutional settings. The tool provides flexibility in terms of the specific functions that are deployed in each setting, and how they are expressed. For example, REVIEW can be used in disciplinary contexts requiring accreditation by professional and educational bodies. In business faculties at three Universities (UTS, UNSW, QUT), an ‘Assurance of Learning (AOL)’ module has been introduced for this purpose.
Multiple institutions have adopted REVIEW, shared their practices, customisation, and ‘wish lists’ for features (see ‘sustainability’). The development of REVIEW features has been driven by users, and is testament to the value academics see in its use. A key set of resources has been developed across this work, to support both students and staff in use of REVIEW, and their understanding of criterion-assessment, peer and self-review, and graduate attributes.
As part of the commercialisation process (see ‘sustainability’), the original REVIEW code was converted from Flash to HTML5 by a small external developer. This development was funded using a collaborative model across institutions, allowing the development of modules (such as the UNSW Assurance of Learning module) that other institutions now have available to them. This model has thus provided a sustainable and reproducible means to achieve enterprise-level implementation of the REVIEW tool.
The commercial website (http://review-edu.com/ see ‘sustainability’) gives some guidance to instructors, although further funding could support the transition of resources to ‘open license’ materials to be shared through a key repository. Similarly, REVIEW continues to be researched and developed, to build its capabilities and ensure that it can be adopted across contexts. Further funding would support this work; for example, a schools pilot is currently being planned in both the UK and Australia. This pilot affords potential for new research and development avenues, while also requiring a different kind of support to the materials already developed. We are also actively planning a project to investigate the quality of the qualitative feedback that students receive, and the quality of their own reflections, when using REVIEW. That research will extend REVIEW to support staff and students in identifying and giving high quality feedback – particularly important given the pedagogic value of students giving feedback in peer-assessment contexts.
Boud, D., Lawson, R., & Thompson, D. G. (2013). Does student engagement in self-assessment calibrate their judgement over time? Assessment & Evaluation in Higher Education, 38(8), 941–956.
Carroll, D. (2013). Benefits for students from achieving accuracy in criteria-based self-assessment. Presented at the ASCILITE, Sydney. Retrieved from https://www.researchgate.net/profile/Danny_Carroll/publication/264041914_Benefits_for_students_from_achieving_accuracy_in_criteria-based_self-_assessment/links/0a85e53c9f80a21617000000.pdf
Carroll, D. (2016, April). Meaningfully embedding program (Degree) learning goals in course work. Presented at the Transforming Assessment. Retrieved from http://transformingassessment.com/events_6_april_2016.php
Cathcart, A., Kerr, G. F., Fletcher, M., & Mack, J. (2008). Engaging staff and students with graduate attributes across diverse curricular landscapes. In QUT Business School; School of Accountancy; School of Advertising, Marketing & Public Relations; School of Management. University of South Australia, Adelaide. Retrieved from http://www.unisa.edu.au/ATNAssessment08/
Kamvounias, P., & Thompson, D. G. (2008). Assessing Graduate Attributes in the Business Law Curriculum. Retrieved from https://opus.lib.uts.edu.au/handle/10453/10516
Taylor, T., Thompson, D., Clements, L., Simpson, L., Paltridge, A., Fletcher, M., … Lawson, R. (2009). Facilitating staff and student engagement with graduate attribute development, assessment and standards in business faculties. Deputy Vice-Chancellor (Academic) – Papers. Retrieved from http://ro.uow.edu.au/asdpapers/527
Thompson, D. (Forthcoming). Marks Should Not Be the Focus of Assessment — But How Can Change Be Achieved? Journal of Learning Analytics.
Thompson, D. (2006). E-Assessment: The Demise of Exams and the Rise of Generic Attribute Assessment for Improved Student Learning. Robert, TS E-Assessment. United States of America: Idea Group Inc. Retrieved from http://www.igi-global.com/chapter/self-peer-group-assessment-learning/28808
Thompson, D. (2009). Successful engagement in graduate attribute assessment using software. Campus-Wide Information Systems, 26(5), 400–412. http://doi.org/10.1108/10650740911004813
Thompson, D., Treleaven, L., Kamvounias, P., Beem, B., & Hill, E. (2008). Integrating Graduate Attributes with Assessment Criteria in Business Education: Using an Online Assessment System. Journal of University Teaching and Learning Practice, 5(1), 35.
]]>This year feels different, perhaps because it’s the first time that I end the year knowing the new one will bring us a new President, one with quite different goals than the current administration’s. Things feel quite uncertain moving forward, despite all the certainty one can supposedly muster from looking back – from looking at the near-term or long-term history and trends. I’m feeling quite tentative about whether or not the insights that I might be able to glean about the year will have much relevance for the business and politics of education technology under Trump. I’m quite frightened that some of the “worst case scenarios” I’ve imagined for education technology – the normalization of surveillance, algorithmic bias, privatization, radical individualization – are poised to be the new reality.
This year feels different too than the previous years in which I’ve written these reviews because education technology – as an industry – sort of floundered in 2016, as I think my series will show. Investment dollars were down, if nothing else. I suppose some analysts would argue education technology, as an industry, “matured” this year – young startup founders were replaced by old white men as chief executives, young startups were acquired by old, established corporations. But all in all, there just isn’t much to speak of this year when it comes to spectacular “innovation” (whatever you take that to mean). Or even when it comes to remarkable “failure” – which I gather we’re supposed to praise these days.
This year’s “Top Ed-Tech Trends” are mostly the same as previous years’, despite marketing efforts to hype certain (largely consumer) products – 3D printing, virtual reality, Pokemon Go, and so on. I’ve written before about “ed-tech’s zombie ideas” – about how monstrous ideas are repeatedly revived – and this year was no different.
We could ask, I suppose, why ed-tech might be in the doldrums – why no sweeping “revolution” despite all the investment and all the enthusiasm. (We can debate what that revolution would look like: institutional change, improved test scores, more or less job security?) Has education technology, or digital technology more broadly, simply become banal as it has become ubiquitous?
And yet, this moment feels anything but banal. Here we are with a President-Elect – a reality TV star – who has been supported by white nationalists, the KKK, Wikileaks, trolls, and Peter Thiel, whose election was facilitated through a massive misinformation campaign spread virally through Facebook. Education technology, and again digital technology more broadly, might not be the progressive, democratizing force that some promised. Go figure.
So we must, I think, look at the more insidious ways in which various technologies are slowly altering our notions of knowledge, expertise, and education (as practices, as institutions, as systems) – and ask who’s invested in the various futures that education technology purports to offer.
Each Friday, I gather all the education and education technology and technology-related news into one article. (I also gather articles that I read about the same topics for a newsletter that I send out each Saturday.) Each month, I calculate all the venture capital investment that’s gone into education technology, noting who’s invested, the type of company, and so on. It’s from these weekly and monthly reports that I start to build my analysis. I listen to stories. I follow the money, and I follow the press releases. I try to verify the wild, wild claims. I look for patterns. It’s based on these patterns that I choose my ten “Top Ed-Tech Trends.”
They’re not all “trends,” really. They’re categories. But I’ve purposefully called this series “trends” because I like to imagine it helps defang some of the bulleted list of crap that other publications churn out, claiming that this or that product is going to “change everything” about how we teach and learn.
A note on the lenses through which I analyze ed-tech: History. Ideology. Labor. Power. Rhetoric. Ethics. Narrative. Networks. Humanities. Culture. Anti-racism. I guess I’ll add anti-fascism from here on out, just to be really clear.
Earlier this fall, Sara M. Watson published a lengthy piece for the Tow Center for Digital Journalism, “Toward a Constructive Technology Criticism.” Even though the opening paragraphs spoke of “loom-smashing Luddites and told-you-so Cassandras,” I didn’t see much of myself in her description of the “technology criticism landscape,” despite the years now that I’ve been a landscaper. Watson’s suggestions for a “constructive technology criticism”: surface ideologies. Ask better questions. Offer alternatives. Be realistic. Be precise. Be generous.
Seven years and hundreds of thousands of words reviewing what’s happening to and through education technology is as generous as I can be right now, I think.
Here’s what I’ve written in previous years. You can decide for yourself how much my criticism has been heeded (hell, even acknowledged):
Icon credits: The Noun Project
]]>The learning sciences of today recognize the tri-dimensional nature of learning as involving cognitive, social and emotional phenomena. However, many computer-supported argumentation systems still fail to address the socio-emotional aspects of group reasoning, perhaps due to the lack of an integrated theoretical vision of how these three dimensions interrelate. This paper presents a multi-dimensional and multi-level model of the role of emotions in argumentation, inspired by a multidisciplinary literature review and extensive previous empirical work on an international corpus of face-to-face student debates. At the crossroads of argumentation studies and research on collaborative learning, employing a linguistic perspective, we specify the social and cognitive functions of emotions in argumentation. The cognitive function of emotions refers to the cognitive and discursive process of schematization (Grize, 1996, 1997). The social function of emotions refers to recognition-oriented behaviors that correspond to engagement in specific types of group talk (e.g. Mercer in Learning and Instruction 6(4), 359–377, 1996). An in-depth presentation of two case studies then enables us to refine the relation between social and cognitive functions of emotions. A first case gives arguments for associating low-intensity emotional framing, on the cognitive side, with cumulative talk, on the social side. A second case shows a correlation between high-intensity emotional framing and disputational talk. We then propose a hypothetical generalization from these two cases, adding an element to the initial model. In conclusion, we discuss how a better understanding of the relations between cognition and social and emotional phenomena can inform pedagogical design for CSCL.
Recognising students’ emotion, affect or cognition is a relatively young field and still a challenging task in the area of intelligent tutoring systems. There are several ways to use the output of these recognition tasks within the system. The approach most often mentioned in the literature is using it for giving feedback to the students. The features used for that approach can be high-level features like linguistic features, which are words related to emotions or affects taken e.g. from written or spoken inputs, or low-level features like log-file features, which are created from information contained in the log-files. In this work we aim at supporting task sequencing by perceived task-difficulty recognition on low-level features easily extracted from the log-file. We analyse these features by statistical tests showing that there are statistically significant feature combinations and hence the presented features are able to describe students’ perceived task-difficulty in intelligent tutoring systems. Furthermore, we apply different classification methods to the log-file features for perceived task-difficulty recognition and present a kind of higher ensemble method for improving the classification performance on the features extracted from a real data set. The presented approach outperforms classical ensemble methods and is able to improve the classification performance substantially, enabling a perceived task-difficulty recognition satisfactory enough for employing its output for components of a real system like task-independent support or task sequencing.
Negotiation mechanisms using conversational agents (chatbots) have been used in Open Learner Models (OLM) to enhance learner model accuracy and provide opportunities for learner reflection. Using chatbots that allow for natural language discussions has shown positive learning gains in students. Traditional OLMs assume a learner to be able to manage their own learning and to be already in an appropriate affective/behavioral state that is conducive for learning. This paper proposes a new perspective of learning that advances the state of the art in fully-negotiated OLMs by exploiting learners’ affective and behavioral states to generate engaging natural language dialogues that train them to enhance their metacognitive skills. In order to achieve this, we have developed the NDLtutor that provides a natural language interface to learners. Our system generates context-aware dialogues automatically to enhance learner participation and reflection. This paper provides details on the design and implementation of the NDLtutor and discusses two evaluation studies. The first evaluation study focuses on the dialogue management capabilities of our system and demonstrates that our dialogue system works satisfactorily to realize meaningful and natural interactions for negotiation. The second evaluation study investigates the effects of our system on the self-assessment and self-reflection of the learners. The results of the evaluations show that the NDLtutor is able to produce significant improvements in the self-assessment accuracy of the learners and also provides adequate support for prompting self-reflection in learners.
This study investigated whether and how students with low prior achievement can carry out and benefit from reflective assessment supported by the Knowledge Connections Analyzer (KCA) to collaboratively improve their knowledge-building discourse. Participants were a class of 20 Grade 11 students with low achievement taking visual art from an experienced teacher. We used multiple methods to analyze the students’ online discourse at several levels of granularity. Results indicated that students with low achievement were able to take responsibility for advancing collective knowledge, as they generated theories and questions, built on each other’s ideas, and synthesized and rose above their community’s ideas. Analysis of qualitative data such as the KCA prompt sheets, student interviews and classroom observations indicated that students were capable of carrying out reflective assessment using the KCA in a knowledge building environment, and that the use of reflective assessment may have helped students to focus on goals of knowledge building. Implications for how students with low achievement collaboratively improve their knowledge-building discourse facilitated by reflective assessment are discussed.
The problem of poor writing skills at the postsecondary level is a large and troubling one. This study investigated the writing skills of low-skilled adults attending college developmental education courses by determining whether variables from an automated scoring system were predictive of human scores on writing quality rubrics. The human-scored measures were a holistic quality rating for a persuasive essay and an analytic quality score for a written summary. Both writing samples were based on text on psychology and sociology topics related to content taught at the introductory undergraduate level. The study is a modified replication of McNamara et al. (Written Communication, 27(1), 57–86 2010), who identified several Coh-Metrix variables from five linguistic classes that reliably predicted group membership (high versus low proficiency) using human quality scores on persuasive essays written by average-achieving college students. When discriminant analyses and ANOVAs failed to replicate the McNamara et al. (Written Communication, 27(1), 57–86 2010) findings, the current study proceeded to analyze all of the variables in the five Coh-Metrix classes. This larger analysis identified 10 variables that predicted human-scored writing proficiency. Essay and summary scores were predicted by different automated variables. Implications for instruction and future use of automated scoring to understand the writing of low-skilled adults are discussed.
Orchestrating collaborative learning in the classroom involves tasks such as forming learning groups with heterogeneous knowledge and making learners aware of the knowledge differences. However, gathering information on which the formation of appropriate groups and the creation of graphical knowledge representations can be based is very effortful for teachers. Tools supporting cognitive group awareness provide such representations to guide students during their collaboration, but mainly rely on specifically created input. Our work is guided by the questions of how the analysis and visualization of cognitive information can be supported by automatic mechanisms (especially using text mining), and what effects a corresponding tool can achieve in the classroom. We systematically compared different methods to be used in a Grouping and Representing Tool (GRT), and evaluated the tool in an experimental field study. Latent Dirichlet Allocation proved successful in transforming the topics of texts into values as a basis for representing cognitive information graphically. The Vector Space Model with Euclidean distance-based clustering proved to be particularly well suited for detecting text differences as a basis for group formation. The subsequent evaluation of the GRT with 54 high school students further confirmed the GRT’s impact on learning support: students who used the tool added twice as many concepts in an essay after discussing as those in the unsupported group. These results show the potential of the GRT to support both teachers and students.
In the last few years thousands of scientific papers have investigated sentiment analysis, several startups that measure opinions on real data have emerged and a number of innovative products related to this theme have been developed. There are multiple methods for measuring sentiments, including lexical-based and supervised machine learning methods. Despite the vast interest in the theme and wide popularity of some methods, it is unclear which one is better for identifying the polarity (i.e., positive or negative) of a message. Accordingly, there is a strong need to conduct a thorough apples-to-apples comparison of sentiment analysis methods, as they are used in practice, across multiple datasets originating from different data sources. Such a comparison is key for understanding the potential limitations, advantages, and disadvantages of popular methods. This article aims at filling this gap by presenting a benchmark comparison of twenty-four popular sentiment analysis methods (which we call the state-of-the-practice methods). Our evaluation is based on a benchmark of eighteen labeled datasets, covering messages posted on social networks, movie and product reviews, as well as opinions and comments in news articles. Our results highlight the extent to which the prediction performance of these methods varies considerably across datasets. Aiming at boosting the development of this research area, we open the methods’ codes and datasets used in this article, deploying them in a benchmark system, which provides an open API for accessing and comparing sentence-level sentiment analysis methods.
But is this truly authentic reflection, or are students swayed by the teacher's thoughts? When we, as teachers, give feedback and then ask students to reflect on this, are we losing some of the student's voice and independent thinking?
Let’s take public speaking, for a moment.
"Public speaking provides a wonderful opportunity for self-reflection," says English Teacher Kerry Hosmer.
Hosmer and I teamed up to leverage the power of technology to enable students to take their self-analysis to the next level, and here was our big goal: we wanted to have students initiate and guide the reflective process. Here’s how we did it.
Through this effort, we wanted the teacher to have a chance to hear a student's self-reflection before making an opinion. As educators, we are always promoting creativity and the importance of students asking the questions instead of always providing them with the prompts—so in this project, we required students to break down and analyze their work independently.
The PVLEGS (Poise, Voice, Life, Eye Contact, Gestures, and Speed) delivery approach (developed by Erik Palmer) serves as the anchor for the Public Speaking course. Poise, Voice, Life, Eye Contact, Gestures, and Speed become fundamental attributes that each speaker must work on, along with the construction of the assigned speech. These components are critical to student success, but processing both information and delivery simultaneously can be quite challenging. So, Mrs. Hosmer explains why self-critique is key:
"Students remain in tune with the internal—their nerves, the key points in their heads that they must address, and usually a time measurement that they have to meet. The self-critique allows a more thorough reflection on performance as students step back from the internal nerves and review the actual delivery. What effective changes in tone took place? Was the pace appropriate for the speech? Did their gestures enhance the delivery or distract from the message? Did they truly connect with their audience through eye contact, or did the speech get delivered to the floor?"
Students in public speaking explored how to capture and maintain audience attention. They learned the importance of a confident and poised delivery. From there, students moved into examining the structural approach to writing an original speech. Students were given the freedom to choose their own topic and speak about something they are passionate about.
Why did Stacey choose Zaption? Here are a few considerations: when surveyed, teachers report that their top three considerations for choosing tools are 1) price, 2) whether the tool was time-saving and simple to integrate into instruction, and 3) whether the tool tailored student tasks and instructions based on individual students’ needs.*
*Teachers Know Best, 2015 Report
After writing and practicing speeches, students were ready to present in front of the class. And—ready for the big secret?—students used their phones to capture a video recording of their in-class presentation. We introduced students to Zaption, a web app that turns a video into an interactive experience by inserting text, quizzes, and discussions. Though the tool is often aimed at teachers looking for a way to engage and monitor students watching instructional videos, we decided to allow students to do the authoring for this exercise.
Using Zaption, students were asked to address the following in their self-critique:
Students' depth of understanding of the PVLEGS approach becomes clear through their comments in the Zaption Self-Critique. They must pinpoint specific moments in the recording in order both to celebrate strengths and to set goals for improvement.
Using Zaption, students took ownership over the process of identifying how to take their speeches to the next level. The assignment promotes critical, independent, and reflective thinking. Consider senior Jessica Vincent’s perspective:
"It was awkward and uncomfortable at first, but really beneficial. You don't usually get to see how you look to other people, especially during a speech in front of the class. You're nervous and focused on what you need to say, so you don't really get to pay attention to what others are seeing. Zaption was easy to use. It forced me to pay attention to the entire video since we had to comment at various points."
In the future, students will use Zaption to review a classmate's speech. A student speaker will insert specific text slides along their speech recording and incorporate targeted questions for review. This next level will open up constructive dialogue between classmates, as well as reinforce the understanding of these critical components of the course.
Reflection offers a critical tool for both students and their teachers who are guiding them—especially if the right technology is involved. Understanding can be even more closely assessed, and the process opens up a one-on-one dialogue that ultimately results in further growth in the classroom.
Today I’m pleased to announce a new major release of text2vec: text2vec 0.4, which is already on CRAN.
For those readers who are not familiar with text2vec, it is an R package which provides an efficient framework with a concise API for text analysis and natural language processing.
With this release I also launched the project homepage, http://text2vec.org, where you can find up-to-date documents and tutorials.
The core functionality at the moment includes:
First of all, I would like to express special thanks to the project contributors: Lincoln Mullen, Qin Wenfeng, Zach Mayer and others (and of course to all of those who reported bugs on the issue tracker!).
A lot of work was done in the last 6 months. Most notable changes are:
- create_* functions modified input objects (in contrast to usual R behavior with copy-on-modify semantics), so I received a lot of bug reports about that. People just didn’t understand why they were getting empty document-term matrices. That was my big mistake: R users assume that a function can’t modify its arguments. So I rewrote the iterators with R6 classes (thanks to @hadley for the suggestion). Learned a lot.
- fit, transform, fit_transform. More details will be available soon in a separate blog post. Stay tuned.
- lda package in next release). It happened that LDA from text2vec is ~2x faster than the original (and ~10x faster than topicmodels!)
- float arithmetic (don’t forget to enable the -ffast-math option for your C++ compiler)
- L1 regularization - our new feature (I didn’t see implementations or papers where researchers tried to add regularization). Higher quality word embeddings for small data sets. More details will be available in a separate blog post. Stay tuned.
Check out the tutorials on text2vec.org, where I’ll be updating documentation on a regular basis.
Below is the updated introduction to text mining with text2vec. No fancy word clouds. No Jane Austen. Enjoy.
Most text mining and NLP modeling uses bag-of-words or bag-of-n-grams methods. Despite their simplicity, these models usually demonstrate good performance on text categorization and classification tasks. But in contrast to their theoretical simplicity and practical efficiency, building bag-of-words models involves technical challenges. This is especially the case in R because of its copy-on-modify semantics.
Let’s briefly review some of the steps in a typical text analysis pipeline:
In this vignette we will primarily discuss the first step. Texts themselves can take up a lot of memory, but vectorized texts usually do not, because they are stored as sparse matrices. Because of R’s copy-on-modify semantics, it is not easy to iteratively grow a DTM. Thus constructing a DTM, even for a small collection of documents, can be a serious bottleneck for analysts and researchers. It involves reading the whole collection of text documents into RAM and processing it as a single vector, which can easily increase memory use by a factor of 2 to 4. The text2vec package solves this problem by providing a better way of constructing a document-term matrix.
Let’s demonstrate the package’s core functionality by applying it to a real-world problem: sentiment analysis.
The text2vec package provides the movie_review dataset. It consists of 5000 movie reviews, each of which is marked as positive or negative. We will also use the data.table package for data wrangling.
First of all let’s split our dataset into two parts: train and test. We will show how to perform data manipulations on the train set and then apply exactly the same manipulations to the test set:
To represent documents in vector space, we first have to create mappings from terms to term IDs. We call them terms instead of words because they can be arbitrary n-grams, not just single words. We represent a set of documents as a sparse matrix, where each row corresponds to a document and each column corresponds to a term. This can be done in 2 ways: using the vocabulary itself or by feature hashing.
Let’s first create a vocabulary-based DTM. Here we collect unique terms from all documents and mark each of them with a unique ID using the create_vocabulary() function. We use an iterator to create the vocabulary.
What was done here?
- We created an iterator over tokens with the itoken() function. All functions prefixed with create_ work with these iterators. R users might find this idiom unusual, but the iterator abstraction allows us to hide most of the details about the input and to process data in memory-friendly chunks.
- We built the vocabulary with the create_vocabulary() function.
Alternatively, we could create a list of tokens and reuse it in further steps. Each element of the list should represent a document, and each element should be a character vector of tokens.
Number of docs: 4000
0 stopwords: ...
ngram_min = 1; ngram_max = 1
Vocabulary:
                terms terms_counts doc_counts
    1:     overturned            1          1
    2: disintegration            1          1
    3:         vachon            1          1
    4:     interfered            1          1
    5:      michonoku            1          1
   ---
35592:        penises            2          2
35593:        arabian            1          1
35594:       personal          102         94
35595:            end          921        743
35596:        address           10         10
Note that text2vec provides a few tokenizer functions (see ?tokenizers). These are just simple wrappers for the base::gsub() function and are not very fast or flexible. If you need something smarter or faster you can use the tokenizers package, which will cover most use cases, or write your own tokenizer using the stringi package.
Now that we have a vocabulary, we can construct a document-term matrix.
Time difference of 0.800817 secs
Now we have a DTM and can check its dimensions.
[1] 4000 35596
[1] TRUE
As you can see, the DTM has rows, equal to the number of documents, and columns, equal to the number of unique terms.
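To make the idea concrete outside of R, here is a minimal Python sketch of a vocabulary-based DTM. It is not text2vec's implementation: the helper names `build_vocabulary` and `build_dtm` are hypothetical, the tokenizer is a plain whitespace split, and the sparse matrix is modelled as one `{term_id: count}` dictionary per document.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each unique term to an integer ID, in order of first appearance."""
    vocab = {}
    for tokens in docs:
        for term in tokens:
            vocab.setdefault(term, len(vocab))
    return vocab


def build_dtm(docs, vocab):
    """Return a sparse DTM: one {term_id: count} dict per document (row)."""
    dtm = []
    for tokens in docs:
        # Terms missing from the vocabulary (e.g. in test data) are skipped.
        counts = Counter(vocab[t] for t in tokens if t in vocab)
        dtm.append(dict(counts))
    return dtm


# Toy corpus (made up for illustration), tokenized by lowercasing and splitting.
texts = ["The movie was good", "The plot was bad bad"]
docs = [text.lower().split() for text in texts]

vocab = build_vocabulary(docs)
dtm = build_dtm(docs, vocab)
print(len(dtm), "documents x", len(vocab), "terms")
```

Only non-zero counts are stored, which is why a vectorized corpus usually needs far less memory than the raw text.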
Now we are ready to fit our first model. Here we will use the glmnet package to fit a logistic regression model with an L1 penalty and 4-fold cross-validation.
Time difference of 3.485586 secs
[1] "max AUC = 0.923"
We have successfully fit a model to our DTM. Now we can check the model’s performance on test data.
Note that we use exactly the same functions for preprocessing and tokenization. We also reuse the same vectorizer, the function which maps terms to indices.
[1] 0.916697
As we can see, performance on the test data is roughly the same as we expect from cross-validation.
We can note, however, that the training time for our model was quite high. We can reduce it and also significantly improve accuracy by pruning the vocabulary.
For example, we can find words “a”, “the”, “in”, “I”, “you”, “on”, etc in almost all documents, but they do not provide much useful information. Usually such words are called stop words. On the other hand, the corpus also contains very uncommon terms, which are contained in only a few documents. These terms are also useless, because we don’t have sufficient statistics for them. Here we will remove pre-defined stopwords, very common and very unusual terms.
Time difference of 0.439589 secs
Time difference of 0.6738439 secs
[1] 4000 6585
Note that the new DTM has many fewer columns than the original DTM. This usually leads to both accuracy improvement (because we removed “noise”) and reduction of the training time.
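The pruning rules described above (drop stop words, very rare terms and terms that occur in almost every document) can be sketched in a few lines of Python. This is an illustration only, not text2vec's code; the function name and the parameters `min_term_count` and `max_doc_proportion` are hypothetical, as are the toy documents and thresholds.

```python
from collections import Counter


def prune_vocabulary(doc_tokens, stopwords=(), min_term_count=1,
                     max_doc_proportion=1.0):
    """Return the sorted list of terms that survive pruning."""
    term_counts = Counter()  # total occurrences of each term
    doc_counts = Counter()   # number of documents containing each term
    for tokens in doc_tokens:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    n_docs = len(doc_tokens)
    kept = [
        term for term in term_counts
        if term not in stopwords
        and term_counts[term] >= min_term_count
        and doc_counts[term] / n_docs <= max_doc_proportion
    ]
    return sorted(kept)


toy_docs = [
    ["the", "film", "was", "great"],
    ["the", "film", "was", "dull"],
    ["the", "acting", "was", "great"],
]
pruned = prune_vocabulary(toy_docs, stopwords={"the"},
                          min_term_count=2, max_doc_proportion=0.7)
print(pruned)
```

In this toy corpus "the" is removed as a stop word, "was" because it appears in every document, and "dull" and "acting" because they occur only once.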
We also need to create a DTM for the test data with the same vectorizer:
[1] 1000 6585
Can we improve the model? Definitely - we can use n-grams instead of words. Here we will use up to 2-grams:
Time difference of 1.47972 secs
Time difference of 2.973802 secs
[1] "max AUC = 0.9217"
Seems that usage of n-grams improved our model a little bit more. Let’s check performance on test dataset:
[1] 0.9268974
Further tuning is left up to the reader.
If you are not familiar with feature hashing (the so-called “hashing trick”) I recommend you start with the Wikipedia article, then read the original paper by a Yahoo! research team. This technique is very fast because we don’t have to perform a lookup over an associative array. Another benefit is that it leads to a very low memory footprint, since we can map an arbitrary number of features into much more compact space. This method was popularized by Yahoo! and is widely used in Vowpal Wabbit.
Here is how to use feature hashing in text2vec.
Time difference of 1.51502 secs
Time difference of 4.494137 secs
[1] "max AUC = 0.8937"
[1] 0.9036685
As you can see our AUC is a bit worse but DTM construction time is considerably lower. On large collections of documents this can be a significant advantage.
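For readers who want to see the hashing trick itself, here is a minimal Python sketch. It is an assumption-laden illustration rather than what text2vec or Vowpal Wabbit do internally: CRC32 from the standard library stands in for a production hash function, and `hash_vectorize` and `n_buckets` are hypothetical names.

```python
import zlib


def hash_vectorize(tokens, n_buckets=16):
    """Turn a token list into a sparse row {column: count} without a vocabulary."""
    row = {}
    for term in tokens:
        # The column index comes straight from the hash, so no lookup table
        # is needed; distinct terms may collide in the same column.
        col = zlib.crc32(term.encode("utf-8")) % n_buckets
        row[col] = row.get(col, 0) + 1
    return row


hashed_row = hash_vectorize(["good", "movie", "good"], n_buckets=16)
print(hashed_row)
```

The number of columns is fixed in advance by `n_buckets`, which is what keeps memory use low; the price is that collisions can merge unrelated terms, one plausible reason for a slightly lower AUC.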
Before doing analysis it is often useful to transform the DTM. For example, the lengths of the documents in a collection can vary significantly. In this case it can be useful to apply normalization.
By “normalization” we mean transforming the rows of the DTM so that values measured on different scales are adjusted to a notionally common scale. When the lengths of the documents vary, we can apply “L1” normalization: we transform the rows so that the sum of the values in each row equals 1:
By this transformation we should improve the quality of data preparation.
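As a small worked example of L1 normalization, the Python sketch below divides every entry of a sparse row by the sum of the row's absolute values. The function name `l1_normalize` and the `{term_id: count}` row representation are assumptions made for illustration.

```python
def l1_normalize(row):
    """Scale a sparse row {term_id: value} so its absolute values sum to 1."""
    total = sum(abs(value) for value in row.values())
    if total == 0:
        return dict(row)  # empty or all-zero document: nothing to scale
    return {term_id: value / total for term_id, value in row.items()}


# A made-up document with four tokens: term 0 twice, terms 3 and 7 once each.
normalized = l1_normalize({0: 2, 3: 1, 7: 1})
print(normalized)
```

After the transformation a long review and a short review with the same mix of terms get the same row.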
Another popular technique is TF-IDF transformation. We can (and usually should) apply it to our DTM. It will not only normalize DTM, but also increase the weight of terms which are specific to a single document or handful of documents and decrease the weight for terms used in most documents:
Note that this is the first time we have touched a model object in text2vec. At this point the user should remember several important things about text2vec models:
- Once a model is passed to the fit() or fit_transform() function, the model will be modified by it.
- A fitted model can be applied to new data with the transform(new_data, fitted_model) method.
A more detailed overview of models and the models API will be available soon in a separate vignette.
Once we have tf-idf reweighted DTM we can fit our linear classifier again:
Time difference of 3.033687 secs
[1] "max AUC = 0.9146"
Let’s check the model performance on the test dataset:
[1] 0.9053246
Usually tf-idf transformation significantly improves performance on most downstream tasks.
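To show what the reweighting does, here is a minimal Python sketch of one common textbook variant of tf-idf: term frequency is the count divided by document length, and idf is log(N / df). This formula is an assumption; text2vec's exact weighting may differ (for example by smoothing), and the function name `tfidf` is hypothetical.

```python
import math


def tfidf(dtm):
    """Reweight a sparse DTM (list of {term_id: count} rows) with tf-idf."""
    n_docs = len(dtm)
    doc_freq = {}
    for row in dtm:
        for term_id in row:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    # Terms that appear in every document get idf = log(1) = 0.
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    weighted = []
    for row in dtm:
        doc_length = sum(row.values())
        weighted.append({t: (count / doc_length) * idf[t]
                         for t, count in row.items()})
    return weighted


# Two made-up documents: term 0 occurs in both, terms 1 and 2 in one each.
toy_dtm = [{0: 1, 1: 1}, {0: 1, 2: 3}]
weights = tfidf(toy_dtm)
print(weights)
```

Term 0, which appears in both documents, ends up with weight 0, while term 2, which is frequent in a single document, gets the largest weight.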
Try text2vec, share your thoughts in the comments. I’m waiting for feedback.
Learning also happens in the process of getting familiar with my new faculty life! It’s extremely exciting for me to join the Learning Technologies (LT) program and the LT Media Lab at UMN. A lot of “first times” have happened during the past three weeks since my official start date: the first program meeting, the first faculty meeting, the first writing group meetup, the first time I ran behind the University shuttle, and the first class meeting… Everything is still fresh!
Amid all this excitement, I am especially excited about the course I am teaching this Spring: Learning Analytics in the Knowledge Age. It is currently offered as a special topics course in our program and may be offered as a regular course in the future. It is the first course I am teaching independently. In the first class meeting last week, I was glad to see a wonderful mix of graduate students bringing in unique expertise and interests from several areas, including educational psychology, STEM education, computer science, pharmacy, and LT. It will be a lot of fun!
This course is designed to be a Knowledge Building course. All participants are seen as equal contributors to the field of learning analytics. All students enrolled in the course become builders of the field! That’s one exciting thing about diving into such a new field.
Backgrounds and interests brought in by students will make the course extremely fun to teach. So I designed two types of groups in which they would shine!
Knowledge Forum will be used as the course environment because it is still the most powerful tool I see to support flexible idea development. We will be able to try some analytic tools in Knowledge Forum – living and exploring the capacity of learning analytics in supporting growth in learning in different domains. We will see whether we could also advance some of the Knowledge Forum analytic tools.
There have been a few awesome courses offered on this topic, mostly in the form of MOOCs:
These courses have been amazing resources for me, both to learn the field and to design the current course. However, the course I am offering is designed to be quite different from them in the following two aspects.
I believe there are many lessons for me to learn as a teacher, co-learner, and knowledge-builder in the class. If you are also interested in this course, feel free to check out its course materials. Suggestions are more than welcome.
(Photo credit: Lana Peterson)
Once writers complete a first draft, they are often encouraged to evaluate their writing and prioritize what to revise. Yet, this process can be both daunting and difficult. This study looks at how students used a semantic concept mapping tool to re-present the content and organization of their initial draft of an informational text. We examine the processes of students at two different schools as they remediated their own texts and how those processes impacted the development of their rhetorical, conceptual, and communicative capacities. Our analysis suggests that students creating visualizations of their completed first drafts scaffolded self-evaluation. The mapping tool aided visualization by converting compositions into discrete persistent visual data elements that represented concepts and connections. This often led to students’ meta-awareness of what was missing or misaligned in their draft. Our findings have implications for how students approach, educators perceive, and designers support the drafting and revision process.
The paper explores what exactly it is that users participate in when being involved in participatory design (PD), relating this discussion to the CSCW perspective on collaborative design work. We argue that a focus on decision-making in design is necessary for understanding participation in design. Referring to Schön, we see design as involving creating choices, selecting among them, concretizing choices and evaluating the choices. We discuss how these kinds of activities have played out in four PD projects that we have participated in. Furthermore, we show that the decisions are interlinked, and discuss the notion of decision linkages. We emphasize the design result as the most important part of PD. Finally, participation is discussed as the sharing of power, asking what the perspective of power and decision-making adds to the understanding of design practices.
The goal of a classification algorithm is to attempt to learn a separator (classifier) that can distinguish the two. There are many ways of doing this, based on various mathematical, statistical, or geometric assumptions:
But when you start looking at real, uncleaned data, one of the first things you notice is that it’s a lot noisier and more imbalanced. Scatterplots of real data often look more like this:
The primary problem is that these classes are imbalanced: the red points are greatly outnumbered by the blue.
Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20%. In reality, datasets can get far more imbalanced than this. Here are some examples:
Many of these domains are imbalanced because they are what I call needle in a haystack problems, where machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases.
When you encounter such problems, you’re bound to have difficulties solving them with standard algorithms. Conventional algorithms are often biased towards the majority class because their loss functions attempt to optimize quantities such as error rate, not taking the data distribution into consideration^{2}. In the worst case, minority examples are treated as outliers of the majority class and ignored. The learning algorithm simply generates a trivial classifier that classifies every example as the majority class.
This might seem like pathological behavior but it really isn’t. Indeed, if your goal is to maximize simple accuracy (or, equivalently, minimize error rate), this is a perfectly acceptable solution. But if we assume that the rare class examples are much more important to classify, then we have to be more careful and more sophisticated about attacking the problem.
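A tiny numeric illustration makes the point. The Python sketch below uses a made-up data set (10 positives, 990 negatives; the counts are assumptions chosen for round numbers) and the trivial classifier that always predicts the majority class.

```python
# Hypothetical labels: 1 = rare (interesting) class, 0 = majority class.
labels = [1] * 10 + [0] * 990

# The trivial classifier: predict the majority class for every example.
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the minority class: the fraction of true positives that were found.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.2f}, minority recall = {recall:.2f}")
```

The classifier is 99% accurate yet finds none of the cases we actually care about, which is why accuracy alone is a poor yardstick here.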
If you deal with such problems and want practical advice on how to address them, read on.
Note: The point of this blog post is to give insight and concrete advice on how to tackle such problems. However, this is not a coding tutorial that takes you line by line through code. I have Jupyter Notebooks (also linked at the end of the post) useful for experimenting with these ideas, but this blog post will explain some of the fundamental ideas and principles.
Learning from imbalanced data has been studied actively for about two decades in machine learning. It’s been the subject of many papers, workshops, special sessions, and dissertations (a recent survey has about 220 references). A vast number of techniques have been tried, with varying results and few clear answers. Data scientists facing this problem for the first time often ask What should I do when my data is imbalanced? This has no definite answer for the same reason that the general question Which learning algorithm is best? has no definite answer: it depends on the data.
That said, here is a rough outline of useful approaches. These are listed approximately in order of effort:
First, a quick detour. Before talking about how to train a classifier well with imbalanced data, we have to discuss how to evaluate one properly. This cannot be overemphasized. You can only make progress if you’re measuring the right thing.
- score^{3} or predict). Instead, get probability estimates via proba or predict_proba.
- sklearn.cross_validation.StratifiedKFold.
- sklearn.calibration.CalibratedClassifierCV)
The two-dimensional graphs in the first bullet above are always more informative than a single number, but if you need a single-number metric, one of these is preferable to accuracy:
The easiest approaches require little change to the processing steps, and simply involve adjusting the example sets until they are balanced. Oversampling randomly replicates minority instances to increase their population. Undersampling randomly downsamples the majority class. Some data scientists (naively) think that oversampling is superior because it results in more data, whereas undersampling throws away data. But keep in mind that replicating data is not without consequence—since it results in duplicate data, it makes variables appear to have lower variance than they do. The positive consequence is that it duplicates the number of errors: if a classifier makes a false negative error on the original minority data set, and that data set is replicated five times, the classifier will make six errors on the new set. Conversely, undersampling can make the independent variables look like they have a higher variance than they do.
Because of all this, the machine learning literature shows mixed results with oversampling, undersampling, and using the natural distributions.
Most machine learning packages can perform simple sampling adjustment. The R package unbalanced implements a number of sampling techniques specific to imbalanced datasets, and scikit-learn.cross_validation has basic sampling algorithms.
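To make the two basic adjustments concrete, here is a minimal sketch in plain Python. The function names random_undersample and random_oversample are mine, not from any of the packages mentioned above, and the code only rebalances the example sets; it does nothing about the variance effects described earlier.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect the examples of each class label into its own list."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return groups


def random_undersample(X, y, seed=0):
    """Randomly drop examples until every class is as small as the rarest one."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_min = min(len(items) for items in groups.values())
    X_out, y_out = [], []
    for label, items in groups.items():
        for xi in rng.sample(items, n_min):  # sampling without replacement
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out


def random_oversample(X, y, seed=0):
    """Randomly replicate examples until every class is as large as the biggest one."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_max = max(len(items) for items in groups.values())
    X_out, y_out = [], []
    for label, items in groups.items():
        # Keep every original example and add random duplicates on top.
        extra = [rng.choice(items) for _ in range(n_max - len(items))]
        for xi in items + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out


# Toy data: 2 minority examples (label 1) and 10 majority examples (label 0).
X = list(range(12))
y = [1] * 2 + [0] * 10
X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(Counter(y_under), Counter(y_over))
```

Whichever variant you use, resample only the training data and keep the test set at its natural distribution.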
Possibly the best theoretical argument of—and practical advice for—class imbalance was put forth in the paper Class Imbalance, Redux, by Wallace, Small, Brodley and Trikalinos^{4}. They argue for undersampling the majority class. Their argument is mathematical and thorough, but here I’ll only present an example they use to make their point.
They argue that two classes must be distinguishable in the tail of some distribution of some explanatory variable. Assume you have two classes with a single explanatory variable, x. Each class is generated by a Gaussian with a standard deviation of 1. The mean of class 1 is 1 and the mean of class 2 is 2. We shall arbitrarily call class 2 the majority class. They look like this:
Given an x value, what threshold would you use to determine which class it came from? It should be clear that the best separation line between the two is at their midpoint, x=1.5, shown as the vertical line: if a new example x falls under 1.5 it is probably Class 1, else it is Class 2. When learning from examples, we would hope that a discrimination cutoff at 1.5 is what we would get, and if the classes are evenly balanced this is approximately what we should get. The dots on the x axis show the samples generated from each distribution.
But we’ve said that Class 1 is the minority class, so assume that we have 10 samples from it and 50 samples from Class 2. It is likely we will learn a shifted separation line, like this:
We can do better by down-sampling the majority class to match that of the minority class. The problem is that the separating lines we learn will have high variability (because the samples are smaller), as shown here (ten samples are shown, resulting in ten vertical lines):
So a final step is to use bagging to combine these classifiers. The entire process looks like this:
This technique has not been implemented in Scikit-learn, though a file called blagging.py (balanced bagging) is available that implements a BlaggingClassifier, which balances bootstrapped samples prior to aggregation.
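The following is a rough sketch of the downsample-then-bag process on the one-dimensional Gaussian example. It is not the code in blagging.py. The base learner (a cutoff halfway between the two sample means) and all function names are assumptions chosen to keep the example short.

```python
import random
from statistics import mean


def fit_threshold(minority, majority):
    """Toy base learner (an assumption): cut halfway between the two class means."""
    return (mean(minority) + mean(majority)) / 2.0


def balanced_bagging_thresholds(minority, majority, n_estimators=10, seed=0):
    """Learn one cutoff per balanced sample of the training data."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_estimators):
        # Bootstrap the minority class, then draw an equally sized
        # random subset of the majority class so the sample is balanced.
        boot_minority = [rng.choice(minority) for _ in minority]
        sub_majority = rng.sample(majority, len(minority))
        thresholds.append(fit_threshold(boot_minority, sub_majority))
    return thresholds


def predict(x, thresholds):
    """Majority vote: Class 1 if x falls under most of the cutoffs, else Class 2."""
    votes_for_class_1 = sum(x < t for t in thresholds)
    return 1 if votes_for_class_1 > len(thresholds) / 2 else 2


# Class 1 (minority): mean 1, 10 samples. Class 2 (majority): mean 2, 50 samples.
rng = random.Random(42)
minority = [rng.gauss(1.0, 1.0) for _ in range(10)]
majority = [rng.gauss(2.0, 1.0) for _ in range(50)]
thresholds = balanced_bagging_thresholds(minority, majority)
print(sorted(round(t, 2) for t in thresholds))
```

Each individual cutoff varies from sample to sample; the vote over all of them is what stabilizes the final decision.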
Over- and undersampling selects examples randomly to adjust their proportions. Other approaches examine the instance space carefully and decide what to do based on their neighborhoods.
For example, Tomek links are pairs of instances of opposite classes that are each other’s nearest neighbors. In other words, they are pairs of opposing instances that are very close together.
Tomek’s algorithm looks for such pairs and removes the majority instance of the pair. The idea is to clarify the border between the minority and majority classes, making the minority region(s) more distinct. The diagram above shows a simple example of Tomek link removal. The R package unbalanced implements Tomek link removal, along with a number of other sampling techniques specific to imbalanced datasets. Scikit-learn has no built-in modules for doing this, though there are some independent packages (e.g., TomekLink).
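Here is a small brute-force sketch of Tomek link detection and removal, assuming Euclidean distance and points given as tuples. The function names are illustrative and do not come from the packages mentioned above. The quadratic nearest-neighbor search is fine for a demonstration but too slow for large datasets.

```python
import math


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return min(others, key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j), i < j, of opposite-class points that are each other's nearest neighbor."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_tomek_majority(points, labels, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = set()
    for i, j in tomek_links(points, labels):
        drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (5.0, 5.0)]
labels = ["min", "maj", "maj", "maj", "maj"]
links = tomek_links(points, labels)
clean_points, clean_labels = remove_tomek_majority(points, labels, "maj")
print(links, clean_labels)
```

Only the first two points form a Tomek link here: the third and fourth are mutual nearest neighbors too, but they belong to the same class.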
Another direction of research has involved not resampling of examples, but synthesis of new ones. The best known example of this approach is Chawla’s SMOTE (Synthetic Minority Oversampling TEchnique) system. The idea is to create new minority examples by interpolating between existing ones. The process is basically as follows. Assume we have a set of majority and minority examples, as before:
SMOTE was generally successful and led to many variants, extensions, and adaptations to different concept learning algorithms. SMOTE and variants are available in R in the unbalanced package and in Python in the UnbalancedDataset package.
It is important to note a substantial limitation of SMOTE. Because it operates by interpolating between rare examples, it can only generate examples within the body of available examples—never outside. Formally, SMOTE can only fill in the convex hull of existing minority examples, but not create new exterior regions of minority examples.
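The interpolation idea, and the limitation just described, can be seen in a few lines. This is a simplified sketch of the SMOTE idea rather than the reference implementation: the function name, the default k, and the brute-force neighbor search are my assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (excluding itself).
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Four minority points at the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=20)
print(new_points[:3])
```

Every synthetic point lies on a segment between two existing minority points, so none of them can fall outside the unit square spanned by the originals.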
Many machine learning toolkits have ways to adjust the “importance” of classes. Scikit-learn, for example, has many classifiers that take an optional class_weight parameter that can be set higher than one. Here is an example, taken straight from the scikit-learn documentation, showing the effect of increasing the minority class’s weight to ten. The solid black line shows the separating border when using the default settings (both classes weighted equally), and the dashed line shows it after the class_weight parameter for the minority (red) class is changed to ten.
As you can see, the minority class gains in importance (its errors are considered more costly than those of the other class) and the separating hyperplane is adjusted to reduce the loss.
It should be noted that adjusting class importance usually only has an effect on the cost of class errors (False Negatives, if the minority class is positive). It will adjust a separating surface to decrease these accordingly. Of course, if the classifier makes no errors on the training set then no adjustment may occur, so altering class weights may have no effect.
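A toy illustration of both points, using a one-dimensional threshold classifier instead of a real scikit-learn estimator. The functions below are hypothetical helpers, and the class_weight dictionary only mimics the spirit of the scikit-learn parameter: each misclassified example costs the weight of its true class.

```python
def weighted_errors(threshold, xs, ys, class_weight):
    """Predict the positive (minority) class when x < threshold; sum the weights of the mistakes."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = 1 if x < threshold else 0
        if predicted != y:
            total += class_weight[y]
    return total


def best_threshold(xs, ys, class_weight):
    """Pick the candidate cutoff with the lowest weighted training error."""
    candidates = sorted(set(xs)) + [max(xs) + 1.0]
    return min(candidates, key=lambda t: weighted_errors(t, xs, ys, class_weight))


# Overlapping classes: minority (1) at 0.5, 1.0, 1.8; majority (0) elsewhere.
xs = [0.5, 1.0, 1.8, 1.2, 1.5, 2.0, 2.5, 3.0]
ys = [1, 1, 1, 0, 0, 0, 0, 0]
equal = best_threshold(xs, ys, {0: 1.0, 1: 1.0})
weighted = best_threshold(xs, ys, {0: 1.0, 1: 10.0})

# Perfectly separable classes: no training errors, so the weights change nothing.
xs_sep, ys_sep = [0.0, 1.0, 5.0, 6.0], [1, 1, 0, 0]
sep_equal = best_threshold(xs_sep, ys_sep, {0: 1.0, 1: 1.0})
sep_weighted = best_threshold(xs_sep, ys_sep, {0: 1.0, 1: 10.0})
print(equal, weighted, sep_equal, sep_weighted)
```

With equal weights the cutoff accepts one false negative; with the minority weight at ten it moves to the right and trades that false negative for two cheaper false positives. On the separable data both settings choose the same cutoff.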
This post has concentrated on relatively simple, accessible ways to learn classifiers from imbalanced data. Most of them involve adjusting data before or after applying standard learning algorithms. It’s worth briefly mentioning some other approaches.
Learning from imbalanced classes remains an active area of research in machine learning, with new algorithms introduced every year. Before concluding I’ll mention a few recent algorithmic advances that are promising.
In 2014 Goh and Rudin published a paper Box Drawings for Learning with Imbalanced Data^{5} which introduced two algorithms for learning from data with skewed examples. These algorithms attempt to construct “boxes” (actually axis-parallel hyper-rectangles) around clusters of minority class examples:
Their goal is to develop a concise, intelligible representation of the minority class. Their equations penalize the number of boxes and the penalties serve as a form of regularization.
They introduce two algorithms, one of which (Exact Boxes) uses mixed-integer programming to provide an exact but fairly expensive solution; the other (Fast Boxes) uses a faster clustering method to generate the initial boxes, which are subsequently refined. Experimental results show that both algorithms perform very well among a large set of test datasets.
Earlier I mentioned that one approach to solving the imbalance problem is to discard the minority examples and treat it as a single-class (or anomaly detection) problem. One recent anomaly detection technique has worked surprisingly well for just that purpose. Liu, Ting and Zhou introduced a technique called Isolation Forests^{6} that attempted to identify anomalies in data by learning random forests and then measuring the average number of decision splits required to isolate each particular data point. The resulting number can be used to calculate each data point’s anomaly score, which can also be interpreted as the likelihood that the example belongs to the minority class. Indeed, the authors tested their system using highly imbalanced data and reported very good results. A follow-up paper by Bandaragoda, Ting, Albrecht, Liu and Wells^{7} introduced Nearest Neighbor Ensembles as a similar idea that was able to overcome several shortcomings of Isolation Forests.
As a final note, this blog post has focused on situations of imbalanced classes under the tacit assumption that you’ve been given imbalanced data and you just have to tackle the imbalance. In some cases, as in a Kaggle competition, you’re given a fixed set of data and you can’t ask for more.
But you may face a related, harder problem: you simply don’t have enough examples of the rare class. None of the techniques above are likely to work. What do you do?
In some real world domains you may be able to buy or construct examples of the rare class. This is an area of ongoing research in machine learning. If rare data simply needs to be labeled reliably by people, a common approach is to crowdsource it via a service like Mechanical Turk. Reliability of human labels may be an issue, but work has been done in machine learning to combine human labels to optimize reliability. Finally, Claudia Perlich in her Strata talk All The Data and Still Not Enough gives examples of how problems with rare or non-existent data can be finessed by using surrogate variables or problems, essentially using proxies and latent variables to make seemingly impossible problems possible. Related to this is the strategy of using transfer learning to learn one problem and transfer the results to another problem with rare examples, as described here.
Here, I have attempted to distill most of my practical knowledge into a single post. I know it was a lot, and I would value your feedback. Did I miss anything important? Any comments or questions on this blog post are welcome.
Gaussians.ipynb.
blagging.py. It is a simple fork of the existing bagging implementation of sklearn, specifically ./sklearn/ensemble/bagging.py.
ImbalancedClasses.ipynb. It loads up several domains and compares blagging with other methods under different distributions.
Thanks to Chloe Mawer for her Jupyter Notebook design work.
^{1. Natalie Hockham makes this point in her talk Machine learning with imbalanced data sets, which focuses on imbalance in the context of credit card fraud detection.↩}
^{2. By definition there are fewer instances of the rare class, but the problem comes about because the cost of missing them (a false negative) is much higher.↩}
^{3. The details in courier are specific to Python’s Scikit-learn.↩}
^{4. “Class Imbalance, Redux”. Wallace, Small, Brodley and Trikalinos. IEEE Conf on Data Mining. 2011.↩}
^{5. “Box Drawings for Learning with Imbalanced Data.” Siong Thye Goh and Cynthia Rudin. KDD-2014, August 24–27, 2014, New York, NY, USA.↩}
^{6. “Isolation-Based Anomaly Detection”. Liu, Ting and Zhou. ACM Transactions on Knowledge Discovery from Data, Vol. 6, No. 1. 2012.↩}
^{7. “Efficient Anomaly Detection by Isolation Using Nearest Neighbour Ensemble.” Bandaragoda, Ting, Albrecht, Liu and Wells. ICDM-2014↩}
The post Learning from Imbalanced Classes appeared first on Silicon Valley Data Science.
]]>Table of contents:
Unsupervisedly learned word embeddings have been exceptionally successful in many NLP tasks and are frequently seen as something akin to a silver bullet. In fact, in many NLP architectures, they have almost completely replaced traditional distributional features such as Brown clusters and LSA features.
Proceedings of last year's ACL and EMNLP conferences have been dominated by word embeddings, with some people musing that Embedding Methods in Natural Language Processing was a more fitting name for EMNLP. This year's ACL features not one but two workshops on word embeddings.
Semantic relations between word embeddings seem nothing short of magical to the uninitiated and Deep Learning NLP talks frequently prelude with the notorious \(king - man + woman \approx queen \) slide, while a recent article in Communications of the ACM hails word embeddings as the primary reason for NLP's breakout.
This post will be the first in a series that aims to give an extensive overview of word embeddings showcasing why this hype may or may not be warranted. In the course of this review, we will try to connect the disparate literature on word embedding models, highlighting many models, applications and interesting features of word embeddings, with a focus on multilingual embedding models and word embedding evaluation tasks in later posts.
This first post lays the foundations by presenting current word embeddings based on language modelling. While many of these models have been discussed at length, we hope that investigating and discussing their merits in the context of past and current research will provide new insights.
A brief note on nomenclature: In the following we will use the currently prevalent term word embeddings to refer to dense representations of words in a low-dimensional vector space. Interchangeable terms are word vectors and distributed representations. We will particularly focus on neural word embeddings, i.e. word embeddings learned by a neural network.
Since the 1990s, vector space models have been used in distributional semantics. During this time, many models for estimating continuous representations of words have been developed, including Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Have a look at this blog post for a more detailed overview of distributional semantics history in the context of word embeddings.
Bengio et al. coin the term word embeddings in 2003 and train them in a neural language model jointly with the model's parameters. First to show the utility of pre-trained word embeddings were arguably Collobert and Weston in 2008. Their landmark paper A unified architecture for natural language processing not only establishes word embeddings as a useful tool for downstream tasks, but also introduces a neural network architecture that forms the foundation for many current approaches. However, the eventual popularization of word embeddings can be attributed to Mikolov et al. in 2013 who created word2vec, a toolkit that allows the seamless training and use of pre-trained embeddings. In 2014, Pennington et al. released GloVe, a competitive set of pre-trained word embeddings, signalling that word embeddings had reached the mainstream.
Word embeddings are one of the few currently successful applications of unsupervised learning. Their main benefit arguably is that they don't require expensive annotation, but can be derived from large unannotated corpora that are readily available. Pre-trained embeddings can then be used in downstream tasks that use small amounts of labeled data.
Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as Embedding Layer.
The main difference between such a network that produces word embeddings as a by-product and a method such as word2vec whose explicit goal is the generation of word embeddings is its computational complexity. Generating word embeddings with a very deep architecture is simply too computationally expensive for a large vocabulary. This is the main reason why it took until 2013 for word embeddings to explode onto the NLP stage; computational complexity is a key trade-off for word embedding models and will be a recurring theme in our review.
Another difference is the training objective: word2vec and GloVe are geared towards producing word embeddings that encode general semantic relationships, which are beneficial to many downstream tasks; notably, word embeddings trained this way won't be helpful in tasks that do not rely on these kinds of relationships. In contrast, regular neural networks typically produce task-specific embeddings that are only of limited use elsewhere. Note that a task that relies on semantically coherent representations such as language modelling will produce similar embeddings to word embedding models, which we will investigate in the next chapter.
As a side-note, word2vec and GloVe might be said to be to NLP what VGGNet is to vision, i.e. a common weight initialisation that provides generally helpful features without the need for lengthy training.
To facilitate comparison between models, we assume the following notational standards: We assume a training corpus containing a sequence of \(T\) training words \(w_1, w_2, w_3, \cdots, w_T\) that belong to a vocabulary \(V\) whose size is \(|V|\). Our models generally consider a context of \( n \) words. We associate every word with an input embedding \( v_w \) (the eponymous word embedding in the Embedding Layer) with \(d\) dimensions and an output embedding \( v'_w \) (another word representation whose role will soon become clearer). We finally optimize an objective function \(J_\theta\) with regard to our model parameters \(\theta\) and our model outputs some score \(f_\theta(x)\) for every input \( x \).
Word embedding models are quite closely intertwined with language models. The quality of language models is measured based on their ability to learn a probability distribution over words in \( V \). In fact, many state-of-the-art word embedding models try to predict the next word in a sequence to some extent. Additionally, word embedding models are often evaluated using perplexity, a cross-entropy based measure borrowed from language modelling.
Before we get into the gritty details of word embedding models, let us briefly talk about some language modelling fundamentals.
Language models generally try to compute the probability of a word \(w_t\) given its \(n - 1\) previous words, i.e. \(p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1})\). By applying the chain rule together with the Markov assumption, we can approximate the probability of a whole sentence or document by the product of the probabilities of each word given its \(n - 1\) previous words:
\(p(w_1 , \cdots , w_T) = \prod\limits_i p(w_i \: | \: w_{i-1} , \cdots , w_{i-n+1}) \).
In n-gram based language models, we can calculate a word's probability based on the frequencies of its constituent n-grams:
\( p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1}) = \dfrac{count(w_{t-n+1}, \cdots , w_{t-1},w_t)}{count({w_{t-n+1}, \cdots , w_{t-1}})}\).
Setting \(n = 2\) yields bigram probabilities, while \(n = 5\) together with Kneser-Ney smoothing leads to smoothed 5-gram models that have been found to be a strong baseline for language modelling. For more details, you can refer to these slides from Stanford.
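As a quick illustration of the count-based estimate, here is an unsmoothed sketch in plain Python. The function name and the toy corpus are made up for this example, and a real language model would add smoothing such as Kneser-Ney rather than return zero for unseen contexts.

```python
from collections import Counter


def ngram_probability(tokens, context, word):
    """Maximum-likelihood estimate: count(context + word) / count(context)."""
    context = tuple(context)
    n = len(context) + 1
    # Counts of all n-grams and of all (n-1)-gram contexts in the corpus.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    denominator = contexts[context]
    if denominator == 0:
        return 0.0  # unseen context; a smoothed model would back off instead
    return ngrams[context + (word,)] / denominator


corpus = "the cat sat on the mat the cat ran".split()
p_cat = ngram_probability(corpus, ["the"], "cat")          # bigram, n = 2
p_sat = ngram_probability(corpus, ["the", "cat"], "sat")   # trigram, n = 3
print(p_cat, p_sat)
```

In the toy corpus "the" occurs three times and is followed by "cat" twice, so the bigram estimate is 2/3; "the cat" occurs twice and is followed by "sat" once, so the trigram estimate is 1/2.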
In neural networks, we achieve the same objective using the well-known softmax layer:
\(p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1}) = \dfrac{\text{exp}({h^\top v'_{w_t}})}{\sum_{w_i \in V} \text{exp}({h^\top v'_{w_i}})} \).
The inner product \( h^\top v'_{w_t} \) computes the (unnormalized) log-probability of word \( w_t \), which we exponentiate and normalize by the sum of the exponentiated scores of all words in \( V \). \(h\) is the output vector of the penultimate network layer (the hidden layer in the feed-forward network in Figure 1), while \(v'_w\) is the output embedding of word \(w \), i.e. its representation in the weight matrix of the softmax layer. Note that even though \(v'_w\) represents the word \(w\), it is learned separately from the input word embedding \(v_w\), as the multiplications both vectors are involved in differ (\(v_w\) is multiplied with an index vector, \(v'_w\) with \(h\)).
Note that we need to calculate the probability of every word \( w \) at the output layer of the neural network. To do this efficiently, we perform a matrix multiplication between \(h\) and a weight matrix whose rows consist of \(v'_w\) of all words \(w\) in \(V\). The resulting vector has \(|V|\) dimensions and is often referred to as the logits, i.e. the output of a previous layer that is not yet a probability. We then feed it into the softmax layer, which "squashes" it into a probability distribution over the words in \(V\).
Note that the softmax layer (in contrast to the previous n-gram calculations) only implicitly takes into account \(n\) previous words: LSTMs, which are typically used for neural language models, encode these in their state \(h\), while Bengio's neural language model, which we will see in the next chapter, feeds the previous \(n\) words through a feed-forward layer.
Keep this softmax layer in mind, as many of the subsequent word embedding models will use it in some fashion.
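For readers who prefer code to notation, the softmax layer can be sketched as follows. The vectors are plain Python lists, the function names are mine, and subtracting the maximum logit is a standard numerical-stability trick that does not change the result.

```python
import math


def softmax(logits):
    """Turn a vector of unnormalized scores into a probability distribution."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(h, output_embeddings):
    """Compute p(w | context) for every word from h and the rows v'_w."""
    # One inner product h^T v'_w per vocabulary word gives the logits.
    logits = [sum(hi * vi for hi, vi in zip(h, row)) for row in output_embeddings]
    return softmax(logits)


h = [1.0, 0.0]                    # hidden state, d = 2
output_embeddings = [[2.0, 0.0],  # v'_w for a toy vocabulary with |V| = 3
                     [0.0, 5.0],
                     [1.0, 1.0]]
probs = next_word_distribution(h, output_embeddings)
print(probs)
```

The loop over all rows of the output embedding matrix is exactly the cost that grows with \(|V|\).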
Using this softmax layer, the model tries to maximize the probability of predicting the correct word at every timestep \( t \). The whole model thus tries to maximize the averaged log probability of the whole corpus:
\(J_\theta = \frac{1}{T} \text{log} \space p(w_1 , \cdots , w_T)\).
Analogously, through application of the chain rule, it is usually trained to maximize the average of the log probabilities of all words in the corpus given their previous \( n \) words:
\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1})\).
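Given the probability the model assigned to each observed word, the objective is a one-liner. The sketch below also derives perplexity from it, assuming natural logarithms, so that perplexity is the exponential of the negative average log probability. The function names are illustrative.

```python
import math


def average_log_likelihood(word_probs):
    """J = (1/T) * sum_t log p(w_t | previous words)."""
    return sum(math.log(p) for p in word_probs) / len(word_probs)


def perplexity(word_probs):
    """Perplexity = exp(-J); lower is better."""
    return math.exp(-average_log_likelihood(word_probs))


# Probabilities a hypothetical model assigned to the four words of a tiny corpus.
word_probs = [0.25, 0.25, 0.25, 0.25]
print(average_log_likelihood(word_probs), perplexity(word_probs))
```

A model that always spreads its probability uniformly over four words has a perplexity of four, which matches the intuition of perplexity as an effective branching factor.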
To sample words from the language model at test time, we can either greedily choose the word with the highest probability \(p(w_t \: | \: w_{t-1} \cdots w_{t-n+1})\) at every time step \( t \) or use beam search. We can do this for instance to generate arbitrary text sequences as in Karpathy's Char-RNN or as part of a sequence prediction task, where an LSTM is used as the decoder.
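The difference between the two decoding strategies shows up even on a tiny example. The sketch below assumes a hypothetical next_probs function that returns the model's distribution over next words for a given sequence; greedy decoding is simply beam search with a beam width of one.

```python
import math


def beam_search(next_probs, start, steps, beam_width):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [(0.0, [start])]  # (log probability, sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for word, p in next_probs(seq).items():
                if p > 0:
                    candidates.append((score + math.log(p), seq + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]


# Hypothetical toy model: the next-word distribution depends only on the last word.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
}


def toy_next_probs(seq):
    return TOY_MODEL[seq[-1]]


greedy = beam_search(toy_next_probs, "<s>", steps=2, beam_width=1)
beam = beam_search(toy_next_probs, "<s>", steps=2, beam_width=2)
print(greedy, beam)
```

Greedy decoding commits to "a" because it is the most probable first word and ends with a sequence of probability 0.33, while the wider beam keeps "b" alive and finds the sequence with probability 0.36.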
The classic neural language model proposed by Bengio et al. [^{1}] in 2003 consists of a one-hidden layer feed-forward neural network that predicts the next word in a sequence as in Figure 2.
Their model maximizes what we've described above as the prototypical neural language model objective (we omit the regularization term for simplicity):
\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space f(w_t , w_{t-1} , \cdots , w_{t-n+1})\).
\( f(w_t , w_{t-1} , \cdots , w_{t-n+1}) \) is the output of the model, i.e. the probability \( p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1}) \) as computed by the softmax, where \(n \) is the number of previous words fed into the model.
Bengio et al. are among the first to introduce what we now refer to as a word embedding, a real-valued word feature vector in \(\mathbb{R}^d\). Their architecture forms very much the prototype upon which current approaches have gradually improved. The general building blocks of their model, however, are still found in all current neural language and word embedding models. These are:
Additionally, Bengio et al. identify two issues that lie at the heart of current state-of-the-art-models:
Finding ways to mitigate the computational cost associated with computing the softmax over a large vocabulary [^{9}] is thus one of the key challenges both in neural language models as well as in word embedding models.
After Bengio et al.'s first steps in neural language models, research in word embeddings stagnated as computing power and algorithms did not yet allow the training of a large vocabulary.
Collobert and Weston [^{4}] (thus C&W) showcase in 2008 that word embeddings trained on a sufficiently large dataset carry syntactic and semantic meaning and improve performance on downstream tasks. They elaborate upon this in their 2011 paper [^{8}].
Their solution to avoid computing the expensive softmax is to use a different objective function: Instead of the cross-entropy criterion of Bengio et al., which maximizes the probability of the next word given the previous words, Collobert and Weston train a network to output a higher score \(f_\theta\) for a correct word sequence (a probable word sequence in Bengio's model) than for an incorrect one. For this purpose, they use a pairwise ranking criterion, which looks like this:
\(J_\theta\ = \sum\limits_{x \in X} \sum\limits_{w \in V} \text{max} \lbrace 0, 1 - f_\theta(x) + f_\theta(x^{(w)}) \rbrace \).
They sample correct windows \(x\) containing \(n\) words from the set of all possible windows \(X\) in their corpus. For each window \(x\), they then produce a corrupted, incorrect version \(x^{(w)}\) by replacing \(x\)'s centre word with another word \(w\) from \(V\). Their objective now maximises the distance between the scores output by the model for the correct and the incorrect window with a margin of \(1\). Their model architecture, depicted in Figure 3 without the ranking objective, is analogous to Bengio et al.'s model.
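The ranking criterion for a single window can be written down directly from the formula. In the sketch below the scoring function is a hypothetical stand-in for the network \(f_\theta\), and the sum runs over every word in \(V\) exactly as in the formula, which means the original centre word itself contributes a constant term of 1.

```python
def pairwise_ranking_loss(score, window, vocab):
    """Sum over w in V of max(0, 1 - f(x) + f(x^(w))) for one correct window x."""
    centre = len(window) // 2
    correct_score = score(window)
    loss = 0.0
    for w in vocab:
        corrupted = list(window)
        corrupted[centre] = w  # replace the centre word to corrupt the window
        loss += max(0.0, 1.0 - correct_score + score(corrupted))
    return loss


def toy_score(window):
    """Hypothetical scorer: prefers the one window it has 'seen' in the corpus."""
    return 1.2 if window == ["the", "cat", "sat"] else 0.5


vocab = ["cat", "dog", "the"]
loss = pairwise_ranking_loss(toy_score, ["the", "cat", "sat"], vocab)
print(loss)
```

The loss only reaches zero for a corrupted window once the correct window outscores it by at least the margin of 1; here the gap is 0.7, so each truly corrupted window still contributes 0.3.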
The resulting language model produces embeddings that already possess many of the relations word embeddings have become known for, e.g. countries are clustered close together and syntactically similar words occupy similar locations in the vector space. While their ranking objective eliminates the complexity of the softmax, they keep the intermediate fully-connected hidden layer (2.) of Bengio et al. around (the HardTanh layer in Figure 3), which constitutes another source of expensive computation. Partially due to this, their full model trains for seven weeks in total with \(|V| = 130000\).
Let us now introduce arguably the most popular word embedding model, the model that launched a thousand word embedding papers: word2vec, the subject of two papers by Mikolov et al. in 2013. As word embeddings are a key building block of deep learning models for NLP, word2vec is often assumed to belong to the same group. Technically, however, word2vec is not considered to be part of deep learning, as its architecture is neither deep nor uses non-linearities (in contrast to Bengio's model and the C&W model).
In their first paper [^{2}], Mikolov et al. propose two architectures for learning word embeddings that are computationally less expensive than previous models. In their second paper [^{3}], they improve upon these models by employing additional strategies to enhance training speed and accuracy.
These architectures offer two main benefits over the C&W model and Bengio's language model:
As we will later show, the success of their model is not only due to these changes, but especially due to certain training strategies.
In the following, we will look at both of these architectures:
While a language model is only able to look at the past words for its predictions, as it is evaluated on its ability to predict each next word in the corpus, a model that just aims to generate accurate word embeddings does not suffer from this restriction. Mikolov et al. thus use both the \(n\) words before and after the target word \( w_t \) to predict it as depicted in Figure 4. They call this continuous bag-of-words (CBOW), as it uses continuous representations whose order is of no importance.
The objective function of CBOW in turn is only slightly different than the language model one:
\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space p(w_t \: | \: w_{t-n} , \cdots , w_{t-1}, w_{t+1}, \cdots , w_{t+n})\).
Instead of feeding \( n \) previous words into the model, the model receives a window of \( n \) words on either side of the target word \( w_t \) at each time step \( t \).
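To see what the model is actually fed, here is a sketch that builds CBOW training examples from a token list and combines the context embeddings by averaging, which is why word order does not matter. The function names and the toy embeddings are illustrative assumptions; windows are simply truncated at the corpus boundaries.

```python
def cbow_examples(tokens, n):
    """Build (context, target) pairs with up to n words on each side of the target."""
    examples = []
    for t, target in enumerate(tokens):
        context = tokens[max(0, t - n):t] + tokens[t + 1:t + 1 + n]
        examples.append((context, target))
    return examples


def average_embedding(context, embeddings):
    """Average the input embeddings of the context words (order-independent)."""
    dim = len(next(iter(embeddings.values())))
    return [sum(embeddings[w][k] for w in context) / len(context) for k in range(dim)]


tokens = ["the", "quick", "brown", "fox"]
examples = cbow_examples(tokens, n=1)

# Toy 2-dimensional input embeddings v_w.
embeddings = {"the": [1.0, 0.0], "brown": [0.0, 1.0]}
h = average_embedding(["the", "brown"], embeddings)
print(examples, h)
```

The averaged vector plays the role of \(h\) in the softmax defined earlier.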
While CBOW can be seen as a precognitive language model, skip-gram turns the language model objective on its head: Instead of using the surrounding words to predict the centre word as with CBOW, skip-gram uses the centre word to predict the surrounding words as can be seen in Figure 5.
The skip-gram objective thus sums the log probabilities of the surrounding \( n \) words to the left and to the right of the target word \( w_t \) to produce the following objective:
\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \sum\limits_{-n \leq j \leq n , j \neq 0} \text{log} \space p(w_{t+j} \: | \: w_t)\).
To gain a better intuition of how the skip-gram model computes \( p(w_{t+j} \: | \: w_t) \), let's recall the definition of our softmax:
\(p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1}) = \dfrac{\text{exp}({h^\top v'_{w_t}})}{\sum_{w_i \in V} \text{exp}({h^\top v'_{w_i}})} \).
Instead of computing the probability of the target word \( w_t \) given its previous words, we calculate the probability of the surrounding word \( w_{t+j} \) given \( w_t \). We can thus simply replace these variables in the equation:
\(p(w_{t+j} \: | \: w_t ) = \dfrac{\text{exp}({h^\top v'_{w_{t+j}}})}{\sum_{w_i \in V} \text{exp}({h^\top v'_{w_i}})} \).
As the skip-gram architecture does not contain a hidden layer that produces an intermediate state vector \(h\), \(h\) is simply the word embedding \(v_{w_t}\) of the input word \(w_t\). This also makes it clearer why we want to have different representations for input embeddings \(v_w\) and output embeddings \(v'_w\), as we would otherwise multiply the word embedding by itself. Replacing \(h \) with \(v_{w_t}\) yields:
\(p(w_{t+j} \: | \: w_t ) = \dfrac{\text{exp}({v^\top_{w_t} v'_{w_{t+j}}})}{\sum_{w_i \in V} \text{exp}({v^\top_{w_t} v'_{w_i}})} \).
Note that the notation in Mikolov's paper differs slightly from ours, as they denote the centre word with \( w_I \) and the surrounding words with \( w_O \). If we replace \( w_t \) with \( w_I \), \( w_{t+j} \) with \( w_O \), and swap the vectors in the inner product due to its commutativity, we arrive at the softmax notation in their paper:
\(p(w_O|w_I) = \dfrac{\text{exp}(v'^\top_{w_O} v_{w_I})}{\sum^V_{w=1}\text{exp}(v'^\top_{w} v_{w_I})}\).
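Putting the pieces together, a naive skip-gram sketch needs only two functions: one that enumerates (centre, surrounding) word pairs and one that evaluates the softmax above with separate input and output embeddings. The names and the toy vectors are assumptions for illustration, and the full sum over the vocabulary is exactly the expensive part that later approximations avoid.

```python
import math


def skipgram_pairs(tokens, n):
    """Enumerate (centre word, surrounding word) pairs for -n <= j <= n, j != 0."""
    pairs = []
    for t, centre in enumerate(tokens):
        for j in range(-n, n + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((centre, tokens[t + j]))
    return pairs


def skipgram_probability(centre, context, v_in, v_out):
    """p(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / sum_w exp(v'_w^T v_{w_I})."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = {w: dot(v_out[w], v_in[centre]) for w in v_out}
    m = max(scores.values())  # for numerical stability
    normalizer = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[context] - m) / normalizer


pairs = skipgram_pairs(["a", "b", "c"], n=1)

# Toy input embeddings v_w and output embeddings v'_w with d = 2.
v_in = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
v_out = {"a": [0.5, 0.5], "b": [2.0, 0.0], "c": [0.0, 2.0]}
p_b_given_a = skipgram_probability("a", "b", v_in, v_out)
print(pairs, p_b_given_a)
```

Because "b" has the output embedding that aligns best with the input embedding of "a", it receives the largest share of the probability mass.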
In the next post, we will discuss different ways to approximate the expensive softmax as well as key training decisions that account for much of skip-gram's success. We will also introduce GloVe [^{5}], a word embedding model based on matrix factorisation and discuss the link between word embeddings and methods from distributional semantics.
Did I miss anything? Let me know in the comments below.
If you want to learn more about word embeddings, these other blog posts on word embeddings are also available:
This blog post has been translated into the following languages:
- Chinese
1. Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3, 1137–1155. Retrieved from http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf ↩
2. Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12. ↩
3. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS, 1–9. ↩
4. Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing. Proceedings of the 25th International Conference on Machine Learning - ICML ’08, 20(1), 160–167. http://doi.org/10.1145/1390156.1390177 ↩
5. Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543. http://doi.org/10.3115/v1/D14-1162 ↩
6. Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-Aware Neural Language Models. AAAI. Retrieved from http://arxiv.org/abs/1508.06615 ↩
7. Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. Retrieved from http://arxiv.org/abs/1602.02410 ↩
8. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12 (Aug), 2493–2537. Retrieved from http://arxiv.org/abs/1103.0398 ↩
9. Chen, W., Grangier, D., & Auli, M. (2015). Strategies for Training Large Vocabulary Neural Language Models, 12. Retrieved from http://arxiv.org/abs/1512.04906 ↩
Credit for the post image goes to Christopher Olah.
]]>Note: If you are looking for a review paper, this blog post is also available as an article on arXiv.
Update 15.06.2017: Added derivations of AdaMax and Nadam.
Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne's, caffe's, and keras' documentation). These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.
This blog post aims at providing you with intuitions towards the behaviour of different algorithms for optimizing gradient descent that will help you put them to use. We are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training. Subsequently, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules. We will also take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting. Finally, we will consider additional strategies that are helpful for optimizing gradient descent.
Gradient descent is a way to minimize an objective function \(J(\theta)\) parameterized by a model's parameters \(\theta \in \mathbb{R}^d \) by updating the parameters in the opposite direction of the gradient of the objective function \(\nabla_\theta J(\theta)\) w.r.t. the parameters. The learning rate \(\eta\) determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley. If you are unfamiliar with gradient descent, you can find a good introduction on optimizing neural networks here.
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.
Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters \(\theta\) for the entire training dataset:
\(\theta = \theta - \eta \cdot \nabla_\theta J( \theta)\).
As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model online, i.e. with new examples on-the-fly.
In code, batch gradient descent looks something like this:
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
For a pre-defined number of epochs, we first compute the gradient vector params_grad of the loss function for the whole dataset w.r.t. our parameter vector params. Note that state-of-the-art deep learning libraries provide automatic differentiation that efficiently computes the gradient w.r.t. some parameters. If you derive the gradients yourself, then gradient checking is a good idea. (See here for some great tips on how to check gradients properly.)
We then update our parameters in the direction of the gradients with the learning rate determining how big of an update we perform. Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
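To make the fragment above concrete, here is a minimal runnable sketch. It assumes a toy least-squares problem and estimates the gradient with central differences, which is also the idea behind gradient checking. The names mse_loss and batch_gradient_descent, the toy data, and the hyperparameter values are our own illustrative choices; evaluate_gradient is just one possible way to fill in the helper used above.

```python
def mse_loss(data, params):
    """Mean squared error of the line y = w * x + b over (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def evaluate_gradient(loss_function, data, params, h=1e-6):
    """Central-difference estimate of the gradient of the loss w.r.t. params."""
    grad = []
    for j in range(len(params)):
        up = params[:j] + [params[j] + h] + params[j + 1:]
        down = params[:j] + [params[j] - h] + params[j + 1:]
        grad.append((loss_function(data, up) - loss_function(data, down)) / (2 * h))
    return grad


def batch_gradient_descent(data, params, learning_rate=0.1, nb_epochs=500):
    for _ in range(nb_epochs):
        # One gradient over the whole dataset, then a single update.
        params_grad = evaluate_gradient(mse_loss, data, params)
        params = [p - learning_rate * g for p, g in zip(params, params_grad)]
    return params


# Toy data generated from y = 2x + 1 (an assumption for illustration).
data = [(x, 2.0 * x + 1.0) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
w, b = batch_gradient_descent(data, [0.0, 0.0])
print(w, b)
```

In practice you would replace the finite-difference estimate with automatic differentiation and keep it only as a check.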
Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example \(x^{(i)}\) and label \(y^{(i)}\):
\(\theta = \theta - \eta \cdot \nabla_\theta J( \theta; x^{(i)}; y^{(i)})\).
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online.
SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily as in Image 1.
While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.
Its code fragment simply adds a loop over the training examples and evaluates the gradient w.r.t. each example. Note that we shuffle the training data at every epoch as explained in this section.
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of \(n\) training examples:
\(\theta = \theta - \eta \cdot \nabla_\theta J( \theta; x^{(i:i+n)}; y^{(i:i+n)})\).
This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence; and b) can make use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient. Common mini-batch sizes range between 50 and 256, but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually used even when mini-batches are employed. Note: In modifications of SGD in the rest of this post, we leave out the parameters \(x^{(i:i+n)}; y^{(i:i+n)}\) for simplicity.
In code, instead of iterating over examples, we now iterate over mini-batches of size 50:
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
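The get_batches helper is not defined in the fragment above. One plausible way to write it, assuming the data is an in-memory list and that a shorter final batch is acceptable, is sketched below. The toy data is an assumption for illustration.

```python
import random


def get_batches(data, batch_size=50):
    """Yield consecutive slices of `data`; the last slice may be shorter."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


# Toy "dataset" of 120 integers (an assumption for illustration).
data = list(range(120))
random.shuffle(data)
batches = list(get_batches(data, batch_size=50))
print([len(batch) for batch in batches])
```

Because the data is shuffled before slicing, every epoch sees a different partition into mini-batches.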
Vanilla mini-batch gradient descent, however, does not guarantee good convergence, but offers a few challenges that need to be addressed:
Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
Learning rate schedules [^{11}] try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics [^{10}].
Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [^{19}] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
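Returning to the learning rate schedules in the list above, the two annealing variants mentioned there can be sketched as follows. The function names and all constants (drop factor, interval, threshold) are hypothetical and would need tuning per dataset, which is exactly the limitation noted above.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Pre-defined schedule: multiply the rate by `drop` every few epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


def anneal_on_plateau(lr, prev_loss, loss, threshold=1e-4, factor=0.5):
    """Shrink the rate when the objective improved by less than `threshold`."""
    if prev_loss - loss < threshold:
        return lr * factor
    return lr


for epoch in (0, 10, 25):
    print(epoch, step_decay(0.1, epoch))
```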
In the following, we will outline some algorithms that are widely used by the deep learning community to deal with the aforementioned challenges. We will not discuss algorithms that are infeasible to compute in practice for high-dimensional data sets, e.g. second-order methods such as Newton's method.
SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another [^{1}], which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum as in Image 2.
Momentum [^{2}] is a method that helps accelerate SGD in the relevant direction and dampens oscillations as can be seen in Image 3. It does this by adding a fraction \(\gamma\) of the update vector of the past time step to the current update vector:
\(
\begin{align}
\begin{split}
v_t &= \gamma v_{t-1} + \eta \nabla_\theta J( \theta) \\
\theta &= \theta - v_t
\end{split}
\end{align}
\)
Note: Some implementations exchange the signs in the equations. The momentum term \(\gamma\) is usually set to 0.9 or a similar value.
Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. \(\gamma < 1\)). The same thing happens to our parameter updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
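As a rough illustration of the update rule, the sketch below runs plain gradient descent and momentum on a small ravine-shaped quadratic. The objective J, the starting point, and the values of eta and the step count are assumptions chosen for illustration; setting gamma to 0 recovers the plain update.

```python
def J(theta):
    """Ravine-shaped toy objective: much steeper in theta[1] than in theta[0]."""
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)


def grad_J(theta):
    return [theta[0], 25.0 * theta[1]]


def sgd_momentum(theta, eta=0.03, gamma=0.9, steps=100):
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad_J(theta)
        # v_t = gamma * v_{t-1} + eta * gradient
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]
        # theta = theta - v_t
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta


start = [10.0, 1.0]
plain = sgd_momentum(start, gamma=0.0)
with_momentum = sgd_momentum(start, gamma=0.9)
print(J(plain), J(with_momentum))
```

On this toy problem the momentum run ends with a lower objective value after the same number of steps, because it makes faster progress along the shallow direction.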
However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We'd like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.
Nesterov accelerated gradient (NAG) [^{7}] is a way to give our momentum term this kind of prescience. We know that we will use our momentum term \(\gamma v_{t-1}\) to move the parameters \(\theta\). Computing \( \theta - \gamma v_{t-1} \) thus gives us an approximation of the next position of the parameters (the gradient is missing for the full update), a rough idea where our parameters are going to be. We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters \(\theta\) but w.r.t. the approximate future position of our parameters:
\(
\begin{align}
\begin{split}
v_t &= \gamma v_{t-1} + \eta \nabla_\theta J( \theta - \gamma v_{t-1} ) \\
\theta &= \theta - v_t
\end{split}
\end{align}
\)
Again, we set the momentum term \(\gamma\) to a value of around 0.9. While Momentum first computes the current gradient (small blue vector in Image 4) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector), NAG first makes a big jump in the direction of the previous accumulated gradient (brown vector), measures the gradient and then makes a correction (red vector), which results in the complete NAG update (green vector). This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks [^{8}].
Refer to here for another explanation about the intuitions behind NAG, while Ilya Sutskever gives a more detailed overview in his PhD thesis [^{9}].
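The only change relative to momentum is where the gradient is evaluated. A minimal sketch, assuming the same kind of toy quadratic objective (our own choice, as are the hyperparameter values), could look like this:

```python
def J(theta):
    """Ravine-shaped toy objective (an assumption for illustration)."""
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)


def grad_J(theta):
    return [theta[0], 25.0 * theta[1]]


def nag(theta, eta=0.03, gamma=0.9, steps=100):
    v = [0.0] * len(theta)
    for _ in range(steps):
        # Look ahead to the approximate future position theta - gamma * v_{t-1}.
        lookahead = [ti - gamma * vi for ti, vi in zip(theta, v)]
        g = grad_J(lookahead)
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta


start = [10.0, 1.0]
end = nag(start)
print(J(start), J(end))
```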
Now that we are able to adapt our updates to the slope of our error function and speed up SGD in turn, we would also like to adapt our updates to each individual parameter to perform larger or smaller updates depending on their importance.
Adagrad [^{3}] is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data. Dean et al. [^{4}] have found that Adagrad greatly improved the robustness of SGD and used it for training large-scale neural nets at Google, which -- among other things -- learned to recognize cats in YouTube videos. Moreover, Pennington et al. [^{5}] used Adagrad to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.
Previously, we performed an update for all parameters \(\theta\) at once as every parameter \(\theta_i\) used the same learning rate \(\eta\). As Adagrad uses a different learning rate for every parameter \(\theta_i\) at every time step \(t\), we first show Adagrad's per-parameter update, which we then vectorize. For brevity, we set \(g_{t, i}\) to be the gradient of the objective function w.r.t. the parameter \(\theta_i\) at time step \(t\):
\(g_{t, i} = \nabla_\theta J( \theta_{t, i} )\).
The SGD update for every parameter \(\theta_i\) at each time step \(t\) then becomes:
\(\theta_{t+1, i} = \theta_{t, i} - \eta \cdot g_{t, i}\).
In its update rule, Adagrad modifies the general learning rate \(\eta\) at each time step \(t\) for every parameter \(\theta_i\) based on the past gradients that have been computed for \(\theta_i\):
\(\theta_{t+1, i} = \theta_{t, i} - \dfrac{\eta}{\sqrt{G_{t, ii} + \epsilon}} \cdot g_{t, i}\).
\(G_{t} \in \mathbb{R}^{d \times d} \) here is a diagonal matrix where each diagonal element \(i, i\) is the sum of the squares of the gradients w.r.t. \(\theta_i\) up to time step \(t\) [^{25}], while \(\epsilon\) is a smoothing term that avoids division by zero (usually on the order of \(10^{-8}\)). Interestingly, without the square root operation, the algorithm performs much worse.
As \(G_{t}\) contains the sum of the squares of the past gradients w.r.t. all parameters \(\theta\) along its diagonal, we can now vectorize our implementation by performing an element-wise matrix-vector multiplication \(\odot\) between \(G_{t}\) and \(g_{t}\):
\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{G_{t} + \epsilon}} \odot g_{t}\).
One of Adagrad's main benefits is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 and leave it at that.
Adagrad's main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. The following algorithms aim to resolve this flaw.
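A minimal sketch of the per-parameter update may help here. It stores only the diagonal of \(G_t\) as a list and assumes a toy quadratic objective of our own choosing. The printed effective rates show the shrinking described above.

```python
import math


def grad_J(theta):
    """Gradient of the toy objective 0.5 * (theta[0]**2 + 25 * theta[1]**2)."""
    return [theta[0], 25.0 * theta[1]]


def adagrad(theta, eta=0.01, eps=1e-8, steps=1):
    G = [0.0] * len(theta)  # diagonal of G_t: running sum of squared gradients
    for _ in range(steps):
        g = grad_J(theta)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        theta = [ti - eta / math.sqrt(Gi + eps) * gi
                 for ti, Gi, gi in zip(theta, G, g)]
    return theta, G


one_step, _ = adagrad([10.0, 1.0], steps=1)
later, G = adagrad([10.0, 1.0], steps=100)
effective_rates = [0.01 / math.sqrt(Gi + 1e-8) for Gi in G]
print(one_step, effective_rates)
```

Note that the very first step moves both parameters by roughly \(\eta\) even though their gradients differ by a large factor, which is the per-parameter scaling at work.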
Adadelta [^{6}] is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size \(w\).
Instead of inefficiently storing \(w\) previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average \(E[g^2]_t\) at time step \(t\) then depends (as a fraction \(\gamma \) similarly to the Momentum term) only on the previous average and the current gradient:
\(E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g^2_t \).
We set \(\gamma\) to a similar value as the momentum term, around 0.9. For clarity, we now rewrite our vanilla SGD update in terms of the parameter update vector \( \Delta \theta_t \):
\( \begin{align} \begin{split} \Delta \theta_t &= - \eta \cdot g_{t} \\ \theta_{t+1} &= \theta_t + \Delta \theta_t \end{split} \end{align} \)
The parameter update vector of Adagrad that we derived previously thus takes the form:
\( \Delta \theta_t = - \dfrac{\eta}{\sqrt{G_{t} + \epsilon}} \odot g_{t}\).
We now simply replace the diagonal matrix \(G_{t}\) with the decaying average over past squared gradients \(E[g^2]_t\):
\( \Delta \theta_t = - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t}\).
As the denominator is just the root mean squared (RMS) error criterion of the gradient, we can replace it with the criterion short-hand:
\( \Delta \theta_t = - \dfrac{\eta}{RMS[g]_{t}} g_t\).
The authors note that the units in this update (as well as in SGD, Momentum, or Adagrad) do not match, i.e. the update should have the same hypothetical units as the parameter. To realize this, they first define another exponentially decaying average, this time not of squared gradients but of squared parameter updates:
\(E[\Delta \theta^2]_t = \gamma E[\Delta \theta^2]_{t-1} + (1 - \gamma) \Delta \theta^2_t \).
The root mean squared error of parameter updates is thus:
\(RMS[\Delta \theta]_{t} = \sqrt{E[\Delta \theta^2]_t + \epsilon} \).
Since \(RMS[\Delta \theta]_{t}\) is unknown, we approximate it with the RMS of parameter updates until the previous time step. Replacing the learning rate \(\eta \) in the previous update rule with \(RMS[\Delta \theta]_{t-1}\) finally yields the Adadelta update rule:
\( \begin{align} \begin{split} \Delta \theta_t &= - \dfrac{RMS[\Delta \theta]_{t-1}}{RMS[g]_{t}} g_{t} \\ \theta_{t+1} &= \theta_t + \Delta \theta_t \end{split} \end{align} \)
With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.
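The two running averages translate directly into code. The sketch below is an illustration under our own assumptions: a toy quadratic objective, \(\gamma = 0.9\), and \(\epsilon = 10^{-6}\). With both averages initialized to zero, the first updates are tiny (on the order of \(\sqrt{\epsilon}\)), so progress on this toy problem is slow at first.

```python
import math


def J(theta):
    """Toy quadratic objective (an assumption for illustration)."""
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)


def grad_J(theta):
    return [theta[0], 25.0 * theta[1]]


def adadelta(theta, gamma=0.9, eps=1e-6, steps=200):
    avg_sq_grad = [0.0] * len(theta)    # E[g^2]
    avg_sq_update = [0.0] * len(theta)  # E[delta_theta^2]
    for _ in range(steps):
        g = grad_J(theta)
        avg_sq_grad = [gamma * a + (1 - gamma) * gi * gi
                       for a, gi in zip(avg_sq_grad, g)]
        # RMS of previous updates over RMS of gradients replaces the learning rate.
        delta = [-math.sqrt(u + eps) / math.sqrt(a + eps) * gi
                 for u, a, gi in zip(avg_sq_update, avg_sq_grad, g)]
        avg_sq_update = [gamma * u + (1 - gamma) * d * d
                         for u, d in zip(avg_sq_update, delta)]
        theta = [ti + d for ti, d in zip(theta, delta)]
    return theta


start = [10.0, 1.0]
end = adadelta(start)
print(J(start), J(end))
```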
RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera Class.
RMSprop and Adadelta have both been developed independently around the same time stemming from the need to resolve Adagrad's radically diminishing learning rates. RMSprop in fact is identical to the first update vector of Adadelta that we derived above:
\(
\begin{align}
\begin{split}
E[g^2]_t &= 0.9 E[g^2]_{t-1} + 0.1 g^2_t \\
\theta_{t+1} &= \theta_{t} - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t}
\end{split}
\end{align}
\)
RMSprop also divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests \(\gamma\) to be set to 0.9, while a good default value for the learning rate \(\eta\) is 0.001.
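In code, the RMSprop update is a two-line change to plain SGD. The sketch below uses the default values quoted above. The toy objective and the step counts are our own assumptions.

```python
import math


def J(theta):
    """Toy quadratic objective (an assumption for illustration)."""
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)


def grad_J(theta):
    return [theta[0], 25.0 * theta[1]]


def rmsprop(theta, eta=0.001, gamma=0.9, eps=1e-8, steps=1):
    avg_sq_grad = [0.0] * len(theta)  # E[g^2]
    for _ in range(steps):
        g = grad_J(theta)
        avg_sq_grad = [gamma * a + (1 - gamma) * gi * gi
                       for a, gi in zip(avg_sq_grad, g)]
        theta = [ti - eta / math.sqrt(a + eps) * gi
                 for ti, a, gi in zip(theta, avg_sq_grad, g)]
    return theta


start = [10.0, 1.0]
first = rmsprop(start, steps=1)
end = rmsprop(start, steps=500)
print(first, J(end))
```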
Adaptive Moment Estimation (Adam) [^{15}] is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients \(v_t\) like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients \(m_t\), similar to momentum:
\(
\begin{align}
\begin{split}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\end{split}
\end{align}
\)
\(m_t\) and \(v_t\) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As \(m_t\) and \(v_t\) are initialized as vectors of 0's, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. \(\beta_1\) and \(\beta_2\) are close to 1).
They counteract these biases by computing bias-corrected first and second moment estimates:
\( \begin{align} \begin{split} \hat{m}_t &= \dfrac{m_t}{1 - \beta^t_1} \\ \hat{v}_t &= \dfrac{v_t}{1 - \beta^t_2} \end{split} \end{align} \)
They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule:
\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\).
The authors propose default values of 0.9 for \(\beta_1\), 0.999 for \(\beta_2\), and \(10^{-8}\) for \(\epsilon\). They show empirically that Adam works well in practice and compares favorably to other adaptive learning-method algorithms.
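Putting the moment estimates, the bias correction, and the update rule together gives the following sketch. It uses the default values quoted above together with a learning rate of 0.001, and it assumes a toy quadratic objective of our own choosing. It is meant as an illustration of the equations, not as a reference implementation.

```python
import math


def J(theta):
    """Toy quadratic objective (an assumption for illustration)."""
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)


def grad_J(theta):
    return [theta[0], 25.0 * theta[1]]


def adam(theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1):
    m = [0.0] * len(theta)  # first moment estimate
    v = [0.0] * len(theta)  # second moment estimate
    for t in range(1, steps + 1):
        g = grad_J(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction counteracts the zero initialization of m and v.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [ti - eta / (math.sqrt(vh) + eps) * mh
                 for ti, vh, mh in zip(theta, v_hat, m_hat)]
    return theta


start = [10.0, 1.0]
first = adam(start, steps=1)
end = adam(start, steps=1000)
print(first, J(end))
```

Thanks to the bias correction, the very first step already has magnitude close to \(\eta\) in every dimension, regardless of the gradient's scale.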
The \(v_t\) factor in the Adam update rule scales the gradient inversely proportionally to the \(\ell_2\) norm of the past gradients (via the \(v_{t-1}\) term) and current gradient \(|g_t|^2\):
\(v_t = \beta_2 v_{t-1} + (1 - \beta_2) |g_t|^2\)
We can generalize this update to the \(\ell_p\) norm. Note that Kingma and Ba also parameterize \(\beta_2\) as \(\beta^p_2\):
\(v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p\)
Norms for large \(p\) values generally become numerically unstable, which is why \(\ell_1\) and \(\ell_2\) norms are most common in practice. However, \(\ell_\infty\) also generally exhibits stable behavior. For this reason, the authors propose AdaMax (Kingma and Ba, 2015) and show that \(v_t\) with \(\ell_\infty\) converges to the following more stable value. To avoid confusion with Adam, we use \(u_t\) to denote the infinity norm-constrained \(v_t\):
\(
\begin{align}
\begin{split}
u_t &= \beta_2^\infty v_{t-1} + (1 - \beta_2^\infty) |g_t|^\infty\\
& = \max(\beta_2 \cdot v_{t-1}, |g_t|)
\end{split}
\end{align}
\)
We can now plug this into the Adam update equation by replacing \(\sqrt{\hat{v}_t} + \epsilon\) with \(u_t\) to obtain the AdaMax update rule:
\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{u_t} \hat{m}_t\)
Note that as \(u_t\) relies on the \(\max\) operation, it is not as suggestible to bias towards zero as \(m_t\) and \(v_t\) in Adam, which is why we do not need to compute a bias correction for \(u_t\). Good default values are again \(\eta = 0.002\), \(\beta_1 = 0.9\), and \(\beta_2 = 0.999\).
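A sketch of the AdaMax update under the defaults above is given below. Here we read the \(\max\) recursion as acting on the previous value of \(u_t\), i.e. the running maximum is carried forward from step to step, which is our assumption about how the equation is meant to be implemented. The toy objective is likewise our own choice.

```python
def grad_J(theta):
    """Gradient of the toy objective 0.5 * (theta[0]**2 + 25 * theta[1]**2)."""
    return [theta[0], 25.0 * theta[1]]


def adamax(theta, eta=0.002, beta1=0.9, beta2=0.999, steps=1):
    m = [0.0] * len(theta)  # first moment estimate, as in Adam
    u = [0.0] * len(theta)  # exponentially weighted infinity norm
    for t in range(1, steps + 1):
        g = grad_J(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        u = [max(beta2 * ui, abs(gi)) for ui, gi in zip(u, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        # No bias correction for u. This sketch assumes u is non-zero,
        # i.e. each parameter has seen a non-zero gradient.
        theta = [ti - eta / ui * mh for ti, ui, mh in zip(theta, u, m_hat)]
    return theta


first = adamax([10.0, 1.0], steps=1)
later = adamax([10.0, 1.0], steps=100)
print(first, later)
```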
As we have seen before, Adam can be viewed as a combination of RMSprop and momentum: RMSprop contributes the exponentially decaying average of past squared gradients \(v_t\), while momentum accounts for the exponentially decaying average of past gradients \(m_t\). We have also seen that Nesterov accelerated gradient (NAG) is superior to vanilla momentum.
Nadam (Nesterov-accelerated Adaptive Moment Estimation) [^{24}] thus combines Adam and NAG. In order to incorporate NAG into Adam, we need to modify its momentum term \(m_t\).
First, let us recall the momentum update rule using our current notation:
\(
\begin{align}
\begin{split}
g_t &= \nabla_{\theta_t}J(\theta_t)\\
m_t &= \gamma m_{t-1} + \eta g_t\\
\theta_{t+1} &= \theta_t - m_t
\end{split}
\end{align}
\)
where \(J\) is our objective function, \(\gamma\) is the momentum decay term, and \(\eta\) is our step size. Expanding the third equation above yields:
\(\theta_{t+1} = \theta_t - ( \gamma m_{t-1} + \eta g_t)\)
This demonstrates again that momentum involves taking a step in the direction of the previous momentum vector and a step in the direction of the current gradient.
NAG then allows us to perform a more accurate step in the gradient direction by updating the parameters with the momentum step before computing the gradient. We thus only need to modify the gradient \(g_t\) to arrive at NAG:
\(
\begin{align}
\begin{split}
g_t &= \nabla_{\theta_t}J(\theta_t - \gamma m_{t-1})\\
m_t &= \gamma m_{t-1} + \eta g_t\\
\theta_{t+1} &= \theta_t - m_t
\end{split}
\end{align}
\)
Dozat proposes to modify NAG the following way: Rather than applying the momentum step twice -- one time for updating the gradient \(g_t\) and a second time for updating the parameters \(\theta_{t+1}\) -- we now apply the look-ahead momentum vector directly to update the current parameters:
\(
\begin{align}
\begin{split}
g_t &= \nabla_{\theta_t}J(\theta_t)\\
m_t &= \gamma m_{t-1} + \eta g_t\\
\theta_{t+1} &= \theta_t - (\gamma m_t + \eta g_t)
\end{split}
\end{align}
\)
Notice that rather than utilizing the previous momentum vector \(m_{t-1}\) as in the equation of the expanded momentum update rule above, we now use the current momentum vector \(m_t\) to look ahead. In order to add Nesterov momentum to Adam, we can thus similarly replace the previous momentum vector with the current momentum vector. First, recall that the Adam update rule is the following (note that we do not need to modify \(\hat{v}_t\)):
\(
\begin{align}
\begin{split}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t\\
\hat{m}_t & = \frac{m_t}{1 - \beta^t_1}\\
\theta_{t+1} &= \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
\end{split}
\end{align}
\)
Expanding the third equation with the definitions of \(\hat{m}_t\) and \(m_t\) in turn gives us:
\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} (\dfrac{\beta_1 m_{t-1}}{1 - \beta^t_1} + \dfrac{(1 - \beta_1) g_t}{1 - \beta^t_1})\)
Note that \(\dfrac{m_{t-1}}{1 - \beta^t_1}\) is approximately the bias-corrected estimate of the momentum vector of the previous time step (strictly, \(\hat{m}_{t-1}\) divides by \(1 - \beta^{t-1}_1\) rather than \(1 - \beta^t_1\)). We can thus replace it with \(\hat{m}_{t-1}\):
\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} (\beta_1 \hat{m}_{t-1} + \dfrac{(1 - \beta_1) g_t}{1 - \beta^t_1})\)
This equation again looks very similar to our expanded momentum update rule above. We can now add Nesterov momentum just as we did previously by simply replacing this bias-corrected estimate of the momentum vector of the previous time step \(\hat{m}_{t-1}\) with the bias-corrected estimate of the current momentum vector \(\hat{m}_t\), which gives us the Nadam update rule:
\(\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} (\beta_1 \hat{m}_t + \dfrac{(1 - \beta_1) g_t}{1 - \beta^t_1})\)
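The Nadam rule differs from Adam only in the term inside the parentheses. A minimal sketch follows. The toy objective and the learning rate of 0.002 are assumptions for illustration.

```python
import math


def J(theta):
    """Toy quadratic objective (an assumption for illustration)."""
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)


def grad_J(theta):
    return [theta[0], 25.0 * theta[1]]


def nadam(theta, eta=0.002, beta1=0.9, beta2=0.999, eps=1e-8, steps=1):
    m = [0.0] * len(theta)
    v = [0.0] * len(theta)
    for t in range(1, steps + 1):
        g = grad_J(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        # Nesterov-style term: current m_hat plus the bias-corrected gradient.
        step = [beta1 * mh + (1 - beta1) * gi / (1 - beta1 ** t)
                for mh, gi in zip(m_hat, g)]
        theta = [ti - eta / (math.sqrt(vh) + eps) * si
                 for ti, vh, si in zip(theta, v_hat, step)]
    return theta


start = [10.0, 1.0]
first = nadam(start, steps=1)
end = nadam(start, steps=500)
print(first, J(end))
```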
The following two animations (Image credit: Alec Radford) provide some intuitions towards the optimization behaviour of the presented optimization algorithms. Also have a look here for a description of the same images by Karpathy and another concise overview of the algorithms discussed.
In Image 5, we see their behaviour on the contours of a loss surface (the Beale function) over time. Note that Adagrad, Adadelta, and RMSprop almost immediately head off in the right direction and converge similarly fast, while Momentum and NAG are led off-track, evoking the image of a ball rolling down the hill. NAG, however, is quickly able to correct its course due to its increased responsiveness by looking ahead and heads to the minimum.
Image 6 shows the behaviour of the algorithms at a saddle point, i.e. a point where one dimension has a positive slope, while the other dimension has a negative slope, which poses a difficulty for SGD as we mentioned before. Notice here that SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope.
As we can see, the adaptive learning-rate methods, i.e. Adagrad, Adadelta, RMSprop, and Adam, are most suitable and provide the best convergence for these scenarios.
So, which optimizer should you now use? If your input data is sparse, then you will likely achieve the best results using one of the adaptive learning-rate methods. An additional benefit is that you won't need to tune the learning rate but will likely achieve the best results with the default value.
In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numerator of the update rule. Adam, finally, adds bias-correction and momentum to RMSprop. As such, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Kingma et al. [^{15}] show that its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Adam might therefore be the best overall choice.
Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually manages to find a minimum, but it might take significantly longer than with some of the optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and are training a deep or complex neural network, you should choose one of the adaptive learning rate methods.
Given the ubiquity of large-scale data solutions and the availability of commodity clusters, distributing SGD to speed it up further is an obvious choice.
SGD by itself is inherently sequential: Step-by-step, we progress further towards the minimum. Running it provides good convergence but can be slow particularly on large datasets. In contrast, running SGD asynchronously is faster, but suboptimal communication between workers can lead to poor convergence. Additionally, we can also parallelize SGD on one machine without the need for a large computing cluster. The following are algorithms and architectures that have been proposed to optimize parallelized and distributed SGD.
Niu et al. [^{23}] introduce an update scheme called Hogwild! that allows performing SGD updates in parallel on CPUs. Processors are allowed to access shared memory without locking the parameters. This only works if the input data is sparse, as each update will only modify a fraction of all parameters. They show that in this case, the update scheme achieves an almost optimal rate of convergence, as it is unlikely that processors will overwrite useful information.
Downpour SGD is an asynchronous variant of SGD that was used by Dean et al. [^{4}] in their DistBelief framework (predecessor to TensorFlow) at Google. It runs multiple replicas of a model in parallel on subsets of the training data. These models send their updates to a parameter server, which is split across many machines. Each machine is responsible for storing and updating a fraction of the model's parameters. However, as replicas don't communicate with each other e.g. by sharing weights or updates, their parameters are continuously at risk of diverging, hindering convergence.
McMahan and Streeter [^{12}] extend AdaGrad to the parallel setting by developing delay-tolerant algorithms that not only adapt to past gradients, but also to the update delays. This has been shown to work well in practice.
TensorFlow [^{13}] is Google's recently open-sourced framework for the implementation and deployment of large-scale machine learning models. It is based on their experience with DistBelief and is already used internally to perform computations on a large range of mobile devices as well as on large-scale distributed systems. For distributed execution, a computation graph is split into a subgraph for every device and communication takes place using Send/Receive node pairs. However, the open source version of TensorFlow currently does not support distributed functionality (see here). Update 13.04.16: A distributed version of TensorFlow has been released.
Zhang et al. [^{14}] propose Elastic Averaging SGD (EASGD), which links the parameters of the workers of asynchronous SGD with an elastic force, i.e. a center variable stored by the parameter server. This allows the local variables to fluctuate further from the center variable, which in theory allows for more exploration of the parameter space. They show empirically that this increased capacity for exploration leads to improved performance by finding new local optima.
Finally, we introduce additional strategies that can be used alongside any of the previously mentioned algorithms to further improve the performance of SGD. For a great overview of some other common tricks, refer to [^{22}].
Generally, we want to avoid providing the training examples in a meaningful order to our model as this may bias the optimization algorithm. Consequently, it is often a good idea to shuffle the training data after every epoch.
On the other hand, for some cases where we aim to solve progressively harder problems, supplying the training examples in a meaningful order may actually lead to improved performance and better convergence. The method for establishing this meaningful order is called Curriculum Learning [^{16}].
Zaremba and Sutskever [^{17}] were only able to train LSTMs to evaluate simple programs using Curriculum Learning and show that a combined or mixed strategy is better than the naive one, which sorts examples by increasing difficulty.
To facilitate learning, we typically normalize the initial values of our parameters by initializing them with zero mean and unit variance. As training progresses and we update parameters to different extents, we lose this normalization, which slows down training and amplifies changes as the network becomes deeper.
Batch normalization [^{18}] reestablishes these normalizations for every mini-batch and changes are back-propagated through the operation as well. By making normalization part of the model architecture, we are able to use higher learning rates and pay less attention to the initialization parameters. Batch normalization additionally acts as a regularizer, reducing (and sometimes even eliminating) the need for Dropout.
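The normalization step itself is short. The sketch below covers only the training-time forward pass for a mini-batch given as a list of rows; the learned scale `gamma`, shift `beta`, and the small constant `eps` follow the usual formulation, while the function name and the plain-list representation are assumptions made for illustration. Running averages for inference and the backward pass are left out.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    n = len(batch)
    n_features = len(batch[0])
    # Per-feature mean and (biased) variance over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(n_features)
    ]
    return [
        [
            gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
            for j in range(n_features)
        ]
        for row in batch
    ]
```

With `gamma` set to ones and `beta` to zeros, every output feature has approximately zero mean and unit variance regardless of the scale of the inputs.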
According to Geoff Hinton: "Early stopping (is) beautiful free lunch" (NIPS 2015 Tutorial slides, slide 63). You should thus always monitor error on a validation set during training and stop (with some patience) if your validation error does not improve enough.
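One possible way to implement "stop with some patience" is sketched below. The function name, the `patience` and `min_delta` parameters, and the list of validation errors are all illustrative; in practice the errors would arrive one epoch at a time during training.

```python
def early_stopping_epoch(val_errors, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation errors."""
    best = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best - min_delta:
            # Validation error improved enough: remember this epoch.
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            # No sufficient improvement for `patience` epochs: stop here.
            return epoch, best_epoch
    return len(val_errors) - 1, best_epoch


stop, best = early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69])
```

Here training would stop at epoch 5 and the parameters saved at epoch 2 would be restored; the later improvement at epoch 6 is never seen, which is the price of a short patience.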
Neelakantan et al. [^{21}] add noise that follows a Gaussian distribution \(N(0, \sigma^2_t)\) to each gradient update:
\(g_{t, i} = g_{t, i} + N(0, \sigma^2_t)\).
They anneal the variance according to the following schedule:
\( \sigma^2_t = \dfrac{\eta}{(1 + t)^\gamma} \).
They show that adding this noise makes networks more robust to poor initialization and helps training particularly deep and complex networks. They suspect that the added noise gives the model more chances to escape and find new local minima, which are more frequent for deeper models.
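The two formulas above translate almost directly into code. In this sketch the default values for `eta` and `gamma` are placeholders chosen for illustration, and the gradient is a plain list of floats.

```python
import random


def noise_variance(t, eta=0.3, gamma=0.55):
    """Annealed variance sigma_t^2 = eta / (1 + t)^gamma."""
    return eta / (1 + t) ** gamma


def add_gradient_noise(grads, t, rng, eta=0.3, gamma=0.55):
    """Add independent N(0, sigma_t^2) noise to every gradient component."""
    sigma = noise_variance(t, eta, gamma) ** 0.5  # gauss() expects a std. deviation
    return [g + rng.gauss(0.0, sigma) for g in grads]


rng = random.Random(0)
noisy = add_gradient_noise([0.5, -0.2], t=0, rng=rng)
```

Because the variance shrinks with the time step `t`, the noise matters most early in training and fades as optimization proceeds.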
In this blog post, we have initially looked at the three variants of gradient descent, among which mini-batch gradient descent is the most popular. We have then investigated algorithms that are most commonly used for optimizing SGD: Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, as well as different algorithms to optimize asynchronous SGD. Finally, we've considered other strategies to improve SGD such as shuffling and curriculum learning, batch normalization, and early stopping.
I hope that this blog post was able to provide you with some intuitions towards the motivation and the behaviour of the different optimization algorithms. Are there any obvious algorithms to improve SGD that I've missed? What tricks are you using yourself to facilitate training with SGD? Let me know in the comments below.
Thanks to Denny Britz and Cesar Salgado for reading drafts of this post and providing suggestions.
This blog post is also available as an article on arXiv, in case you want to refer to it later.
In case you found it helpful, consider citing the corresponding arXiv article as:
Sebastian Ruder (2016). An overview of gradient descent optimisation algorithms. arXiv preprint arXiv:1609.04747.
Update 21.06.16: This post was posted to Hacker News. The discussion provides some interesting pointers to related work and other techniques.
[1] Sutton, R. S. (1986). Two problems with backpropagation and other steepest-descent learning procedures for networks. Proc. 8th Annual Conf. Cognitive Science Society.
[2] Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks: The Official Journal of the International Neural Network Society, 12(1), 145–151. http://doi.org/10.1016/S0893-6080(98)00116-6
[3] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159. Retrieved from http://jmlr.org/papers/v12/duchi11a.html
[4] Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., … Ng, A. Y. (2012). Large Scale Distributed Deep Networks. NIPS 2012: Neural Information Processing Systems, 1–11. http://doi.org/10.1109/ICDAR.2011.95
[5] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543. http://doi.org/10.3115/v1/D14-1162
[6] Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. Retrieved from http://arxiv.org/abs/1212.5701
[7] Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady ANSSSR (translated as Soviet.Math.Docl.), vol. 269, pp. 543–547.
[8] Bengio, Y., Boulanger-Lewandowski, N., & Pascanu, R. (2012). Advances in Optimizing Recurrent Networks. Retrieved from http://arxiv.org/abs/1212.0901
[9] Sutskever, I. (2013). Training Recurrent Neural Networks. PhD Thesis.
[10] Darken, C., Chang, J., & Moody, J. (1992). Learning rate schedules for faster stochastic gradient search. Neural Networks for Signal Processing II Proceedings of the 1992 IEEE Workshop, (September), 1–11. http://doi.org/10.1109/NNSP.1992.253713
[11] Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
[12] McMahan, H. B., & Streeter, M. (2014). Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning. Advances in Neural Information Processing Systems (Proceedings of NIPS), 1–9. Retrieved from http://papers.nips.cc/paper/5242-delay-tolerant-algorithms-for-asynchronous-distributed-online-learning.pdf
[13] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … Zheng, X. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.
[14] Zhang, S., Choromanska, A., & LeCun, Y. (2015). Deep learning with Elastic Averaging SGD. Neural Information Processing Systems Conference (NIPS 2015), 1–24. Retrieved from http://arxiv.org/abs/1412.6651
[15] Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.
[16] Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. Proceedings of the 26th Annual International Conference on Machine Learning, 41–48. http://doi.org/10.1145/1553374.1553380
[17] Zaremba, W., & Sutskever, I. (2014). Learning to Execute, 1–25. Retrieved from http://arxiv.org/abs/1410.4615
[18] Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv Preprint arXiv:1502.03167v3.
[19] Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, 1–14. Retrieved from http://arxiv.org/abs/1406.2572
[20] Sutskever, I., & Martens, J. (2013). On the importance of initialization and momentum in deep learning. http://doi.org/10.1109/ICASSP.2013.6639346
[21] Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., & Martens, J. (2015). Adding Gradient Noise Improves Learning for Very Deep Networks, 1–11. Retrieved from http://arxiv.org/abs/1511.06807
[22] LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). Efficient BackProp. Neural Networks: Tricks of the Trade, 1524, 9–50. http://doi.org/10.1007/3-540-49430-8_2
[23] Niu, F., Recht, B., Ré, C., & Wright, S. J. (2011). Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 1–22.
[24] Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop, (1), 2013–2016.
[25] Duchi et al. [3] give this matrix as an alternative to the full matrix containing the outer products of all previous gradients, as the computation of the matrix square root is infeasible even for a moderate number of parameters \(d\).
Image credit for cover photo: Karpathy's beautiful loss functions tumblr
]]>
Dataset | ng | fastText time | fastText acc | vw time | vw acc |
ag news | 1 | | 91.5 | 2s | 91.9 |
ag news | 2 | 3s | 92.5 | 5s | 92.3 |
amazon full | 1 | | 55.8 | 47s | 53.6 |
amazon full | 2 | 33s | 60.2 | 69s | 56.6 |
amazon polarity | 1 | | 91.2 | 46s | 91.3 |
amazon polarity | 2 | 52s | 94.6 | 68s | 94.2 |
dbpedia | 1 | | 98.1 | 8s | 98.4 |
dbpedia | 2 | 8s | 98.6 | 17s | 98.7 |
sogou news | 1 | | 93.9 | 25s | 93.6 |
sogou news | 2 | 36s | 96.8 | 30s | 96.9 |
yahoo answers | 1 | | 72.0 | 30s | 70.6 |
yahoo answers | 2 | 27s | 72.3 | 48s | 71.0 |
yelp full | 1 | | 60.4 | 16s | 56.9 |
yelp full | 2 | 18s | 63.9 | 37s | 60.0 |
yelp polarity | 1 | | 93.8 | 10s | 93.6 |
yelp polarity | 2 | 15s | 95.7 | 20s | 95.5 |
Basically the only flags to vw are (1) telling it to do multiclass classification with one-against-all, (2) telling it to use 25 bits (not tuned), and (3) telling it to use either unigrams or bigrams. [Comparison note: this means vw is using 33m hash bins (2^25); fastText used 10m for unigram models and 100m for bigram models.]
% cat run.sh
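#!/bin/bash
# $1: dataset directory containing train.csv, test.csv and classes.txt
d=$1
for ngram in 1 2 ; do
  # Train: one-against-all over the number of classes, 25-bit hash, unigrams or bigrams.
  cat "$d/train.csv" | ./csv2vw.pl | \
    time vowpal_wabbit/vowpalwabbit/vw --oaa $(cat "$d/classes.txt" | wc -l) \
    -b25 --ngram $ngram -f "$d/model.$ngram"
  # Test: load the saved model and predict on the test set without learning (-t).
  cat "$d/test.csv" | ./csv2vw.pl | \
    time vowpal_wabbit/vowpalwabbit/vw -t -i "$d/model.$ngram"
done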
There are two exceptions where I did slightly more data munging. The datasets released for dbpedia and Sogou were not properly shuffled, which makes online learning hard. I preprocessed the training data by randomly shuffling it. This took 2.4s for dbpedia and 12s for Sogou.
% cat csv2vw.pl
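#!/usr/bin/perl -w
# Convert CSV lines of the form "label","text",... into vw input lines "label | text".
use strict;
while (<>) {
    chomp;
    if (/^"*([0-9]+)"*,"(.+)"*$/) {
        print $1 . ' | ';            # class label, then vw's feature separator
        $_ = lc($2);                 # lowercase the text
        s/","/ /g;                   # join the remaining CSV fields with a space
        s/""/"/g;                    # unescape doubled quotes
        s/([^a-z0-9 -\\]+)/ $1 /g;   # put spaces around runs of characters outside this class
        s/:/C/g;                     # ':' is special in vw input, so replace it
        s/\|/P/g;                    # '|' is special in vw input, so replace it
        print $_ . "\n";
    } else {
        die "malformed line '$_'";
    }
}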
This means we end up with a data set that is in a long, skinny format instead of a wide format. Tidy data sets are easier to work with, and this is no less true when one starts to work with text. Most of the tooling and infrastructure needed for text mining with tidy data frames already exists in packages like dplyr, broom, tidyr, and ggplot2. Our goal in writing the tidytext package is to provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. We got a great start on our work when we were at the unconference and recently finished getting the first release ready; this week, tidytext was released on CRAN!
One of the important functions we knew we needed was something to unnest text by some chosen token. In my previous blog posts, I did this with a for loop first and then with a function that involved several dplyr bits. In the tidytext package, there is a function unnest_tokens which has this functionality; it converts a dataframe with a text column into a one-token-per-row format. Let’s look at an example using Jane Austen’s novels.
(What?! Surely you’re not tired of them yet?)
The janeaustenr package has a function austen_books that returns a tidy dataframe of all of the novels. Let’s use that, annotate a linenumber quantity to keep track of lines in the original format, use a regex to find where all the chapters are, and then unnest_tokens.
This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, sentences, lines, paragraphs, or separation around a regex pattern.
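For readers who do not use R, the one-token-per-row idea can be sketched in a few lines of Python. This is not how tidytext or the tokenizers package work internally; the helper name `one_token_per_row`, the crude regex tokenizer, and the sample rows are all assumptions made for illustration.

```python
import re


def one_token_per_row(rows, text_key="text", token_key="word"):
    """Turn rows holding a line of text into one row per lowercase word."""
    out = []
    for row in rows:
        # Crude word tokenizer: runs of letters, digits and apostrophes.
        for token in re.findall(r"[a-z0-9']+", row[text_key].lower()):
            new_row = {k: v for k, v in row.items() if k != text_key}
            new_row[token_key] = token  # every other column is carried along
            out.append(new_row)
    return out


rows = [
    {"book": "Emma", "linenumber": 1,
     "text": "Emma Woodhouse, handsome, clever, and rich"},
]
tokens = one_token_per_row(rows)
```

Each output row keeps `book` and `linenumber`, which is what makes the later grouping and joining steps possible.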
Now that the data is in one-word-per-row format, the TIDY DATA MAGIC can happen and we can manipulate it with tidy tools like dplyr. For example, we can remove stop words (kept in the tidytext dataset stop_words) with an anti_join.
Then we can use count to find the most common words in all of Jane Austen’s novels as a whole.
Sentiment analysis can be done as an inner join. Three sentiment lexicons are in the tidytext package in the sentiment dataset. Let’s examine how sentiment changes during each novel. Let’s find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in defined sections of each novel.
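The join-then-count logic described here can be sketched outside R as well. In the Python below, the four-word lexicon is a toy stand-in (not the Bing lexicon), and the function name and the `lines_per_section` parameter are hypothetical; the section size is whatever you choose.

```python
from collections import Counter


def sentiment_by_section(tokens, lexicon, lines_per_section):
    """Net sentiment (positive minus negative words) per (book, section)."""
    scores = Counter()
    for row in tokens:
        label = lexicon.get(row["word"])
        if label is None:
            continue  # inner join: words missing from the lexicon are dropped
        section = row["linenumber"] // lines_per_section
        scores[(row["book"], section)] += 1 if label == "positive" else -1
    return dict(scores)


toy_lexicon = {"happy": "positive", "good": "positive",
               "miserable": "negative", "bad": "negative"}
toy_tokens = [
    {"book": "Toy", "linenumber": 0, "word": "happy"},
    {"book": "Toy", "linenumber": 0, "word": "good"},
    {"book": "Toy", "linenumber": 1, "word": "the"},
    {"book": "Toy", "linenumber": 1, "word": "bad"},
    {"book": "Toy", "linenumber": 2, "word": "miserable"},
    {"book": "Toy", "linenumber": 2, "word": "miserable"},
]
scores = sentiment_by_section(toy_tokens, toy_lexicon, lines_per_section=2)
```

The integer division of the line number is what defines a "section"; larger sections smooth the trajectory, smaller ones make it noisier.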
Now we can plot these sentiment scores across the plot trajectory of each novel.
This is similar to some of the plots I have made in previous posts, but the effort and time required to make it is drastically less. More importantly, the thinking required to make it comes much more easily because it all falls so naturally out of joins and other dplyr verbs.
Lots of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that
I am not having a good day.
is a negative sentence, not a positive one, because of negation. The Stanford CoreNLP tools and the sentimentr R package (currently available on Github but not CRAN) are examples of such sentiment analysis algorithms. For these, we may want to tokenize text into sentences.
Let’s look at just one.
The sentence tokenizing does seem to have a bit of trouble with UTF-8 encoded text, especially with sections of dialogue; it does much better with punctuation in ASCII.
Near the beginning of this vignette, we used a regex to find where all the chapters were in Austen’s novels. We can use tidy text analysis to ask questions such as what are the most negative chapters in each of Jane Austen’s novels? First, let’s get the list of negative words from the Bing lexicon. Second, let’s make a dataframe of how many words are in each chapter so we can normalize for the length of chapters. Then, let’s find the number of negative words in each chapter and divide by the total words in each chapter. Which chapter has the highest proportion of negative words?
These are the chapters with the most negative words in each book, normalized for number of words in the chapter. What is happening in these chapters? In Chapter 29 of Sense and Sensibility Marianne finds out by letter what an awful jerk Willoughby is, and in Chapter 34 of Pride and Prejudice Mr. Darcy proposes for the first time (so badly!). Chapter 45 of Mansfield Park is almost the end, when Tom is sick with consumption and Mary is revealed as all greedy and a gold-digger, Chapter 15 of Emma is when horrifying Mr. Elton proposes, and Chapter 27 of Northanger Abbey is a short chapter where Catherine gets a terrible letter from her inconstant friend Isabella. Chapter 21 of Persuasion is when Anne’s friend tells her all about Mr. Elliot’s immoral past.
Interestingly, many of those chapters are very close to the ends of the novels; things tend to get really bad for Jane Austen’s characters before their happy endings, it seems. Also, these chapters largely involve terrible revelations about characters through letters or conversations about past events, rather than some action happening directly in the plot. All that, just with dplyr verbs, because the data is tidy.
Another function in tidytext is pair_count, which counts pairs of items that occur together within a group. Let’s count the words that occur together in the lines of Pride and Prejudice.
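The idea of counting co-occurring pairs within a group can be sketched as follows. This is an illustration in Python rather than a description of how pair_count is implemented; in particular, counting each unordered pair at most once per group is an assumption, and the toy groups are made up.

```python
from collections import Counter
from itertools import combinations


def pair_counts(groups):
    """Count how many groups (e.g. lines) each unordered word pair shares."""
    counts = Counter()
    for words in groups:
        # Deduplicate within the group and sort so (a, b) == (b, a).
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


lines = [
    ["darcy", "elizabeth", "said"],
    ["elizabeth", "darcy"],
    ["said", "elizabeth"],
]
counts = pair_counts(lines)
```

The resulting pairs and counts are exactly the edge list and edge weights one would feed to a network plot.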
This can be useful, for example, to plot a network of co-occurring words with the igraph and ggraph packages.
Ten/five/whatever thousand pounds a year!
Let’s do another one!
Lots of proper nouns are showing up in these network plots (Box Hill, Frank Churchill, Lady Catherine de Bourgh, etc.), and it is easy to pick out the main characters (Elizabeth, Emma). This type of network analysis is mainly showing us the important people and places in a text, and how they are related.
A common task in text mining is to look at word frequencies and to compare frequencies across different texts. We can do this using tidy data principles pretty smoothly. We already have Jane Austen’s works; let’s get two more sets of texts to compare to. Dave has just put together a new package to search and download books from Project Gutenberg through R; we’re going to use that because this is a better way to follow Project Gutenberg’s rules for robot access. And it is SO nice to use! First, let’s look at some science fiction and fantasy novels by H.G. Wells, who lived in the late 19th and early 20th centuries. Let’s get The Time Machine, The War of the Worlds, The Invisible Man, and The Island of Doctor Moreau.
Just for kicks, what are the most common words in these novels of H.G. Wells?
Now let’s get some well-known works of the Brontë sisters, whose lives overlapped with Jane Austen’s somewhat but who wrote in a bit of a different style. Let’s get Jane Eyre, Wuthering Heights, The Tenant of Wildfell Hall, Villette, and Agnes Grey.
What are the most common words in these novels of the Brontë sisters?
Well, Jane Austen is not going around talking about people’s HEARTS this much; I can tell you that right now. Those Brontë sisters, SO DRAMATIC. Interesting that “time” and “door” are in the top 10 for both H.G. Wells and the Brontë sisters. “Door”?!
Anyway, let’s calculate the frequency for each word for the works of Jane Austen, the Brontë sisters, and H.G. Wells.
I’m using str_extract here because the UTF-8 encoded texts from Project Gutenberg have some examples of words with underscores around them to indicate emphasis (you know, like italics). The tokenizer treated these as words but I don’t want to count “_any_” separately from “any”. Now let’s plot.
Words that are close to the line in these plots have similar frequencies in both sets of texts, for example, in both Austen and Brontë texts (“miss”, “time”, “lady”, “day” at the upper frequency end) or in both Austen and Wells texts (“time”, “day”, “mind”, “brother” at the high frequency end). Words that are far from the line are words that are found more in one set of texts than another. For example, in the Austen-Brontë plot, words like “elizabeth”, “emma”, “captain”, and “bath” (all proper nouns) are found in Austen’s texts but not much in the Brontë texts, while words like “arthur”, “dark”, “dog”, and “doctor” are found in the Brontë texts but not the Austen texts. In comparing H.G. Wells with Jane Austen, Wells uses words like “beast”, “guns”, “brute”, and “animal” that Austen does not, while Austen uses words like “family”, “friend”, “letter”, and “agreeable” that Wells does not.
Overall, notice that the words in the Austen-Brontë plot are closer to the zero-slope line than in the Austen-Wells plot and also extend to lower frequencies; Austen and the Brontë sisters use more similar words than Austen and H.G. Wells. Also, you might notice the percent frequencies for individual words are different in one plot when compared to another because of the inner join; not all the words are found in all three sets of texts so the percent frequency is a different quantity.
Let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Austen and the Brontë sisters, and between Austen and Wells?
The relationship between the word frequencies is different between these sets of texts, as it appears in the plots.
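As a sketch of what such a comparison computes, the Python below correlates the frequencies of the words two sets of texts share. It assumes a Pearson correlation, and the function names and frequency values are made up for illustration; they are not results from the novels.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def frequency_correlation(freq_a, freq_b):
    """Correlate word frequencies over the words both corpora contain."""
    shared = sorted(set(freq_a) & set(freq_b))  # like an inner join on word
    return pearson([freq_a[w] for w in shared], [freq_b[w] for w in shared])


toy_a = {"time": 0.02, "day": 0.01, "miss": 0.03, "emma": 0.05}
toy_b = {"time": 0.04, "day": 0.02, "miss": 0.06, "door": 0.01}
r = frequency_correlation(toy_a, toy_b)
```

Words that appear in only one corpus ("emma", "door" here) drop out before the correlation is computed, just as they do in the joined data behind the plots.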
There is another whole set of functions in tidytext for converting to and from document-term matrices. Many existing text mining data sets are in document-term matrices, or you might want such a matrix for a specific machine learning application. The tidytext package has tidy functions for objects from the tm and quanteda packages so you can convert back and forth. (For more on the tidy verb, see the broom package). This allows, for example, a workflow with easy reading, filtering, and processing to be done using dplyr and other tidy tools, after which the data can be converted into a document-term matrix for machine learning applications. For examples of working with objects from other text mining packages using tidy data principles, see the tidytext vignette on converting to and from document-term matrices.
Many thanks to rOpenSci for hosting the unconference where we started work on the tidytext package, and to Gabriela de Queiroz, who contributed to the package while we were at the unconference. I am super happy to have collaborated with Dave; it has been a delightful experience. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!
The Life-Changing Magic of Tidying Text was originally published by Julia Silge at data science ish on April 29, 2016.
]]>A central question in text mining and natural language processing is how to quantify what a document is about. Can we do this by looking at the words that make up the document? One measure of how important a word may be is its term frequency (tf), how frequently a word occurs in a document. There are words in a document, though, that occur many times but may not be important; in English, these are probably words like “the”, “is”, “of”, and so forth. You might take the approach of adding words like these to a list of stop words and removing them before analysis, but it is possible that some of these words might be more important in some documents than others. A list of stop words is not a sophisticated approach to adjusting term frequency for commonly used words.
Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf, the frequency of a term adjusted for how rarely it is used. It is intended to measure how important a word is to a document in a collection (or corpus) of documents. It is a rule-of-thumb or heuristic quantity; while it has proved useful in text mining, search engines, etc., its theoretical foundations are considered less than firm by information theory experts. The inverse document frequency for any given term is defined as
\( idf(\text{term}) = \ln\left(\dfrac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right) \).
We can use tidy data principles to approach tf-idf analysis and use consistent, effective tools to quantify how important various terms are in a document that is part of a collection.
Let’s start by looking at the published novels of Jane Austen and examine first term frequency, then tf-idf. We can start just by using dplyr verbs such as group_by and join. What are the most commonly used words in Jane Austen’s novels? (Let’s also calculate the total words in each novel here, for later use.)
The usual suspects are here, “the”, “and”, “to”, and so forth. Let’s look at the distribution of n/total for each novel, the number of times a word appears in a novel divided by the total number of terms (words) in that novel. This is exactly what term frequency is.
There are very long tails to the right for these novels (those extremely common words!) that I have not shown in these plots. These plots exhibit similar distributions for all the novels, with many words that occur rarely and fewer words that occur frequently. The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the group of Jane Austen’s novels as a whole. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Let’s do that now.
Notice that idf and thus tf-idf are zero for these extremely common words. These are all words that appear in all six of Jane Austen’s novels, so the idf term (which will then be the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection. Let’s look at terms with high tf-idf in Jane Austen’s works.
Here we see all proper nouns, names that are in fact important in these novels. None of them occur in all of the novels, and they are important, characteristic words for each text. Some of the values for idf are the same for different terms because there are 6 documents in this corpus and we are seeing the numerical value for ln(6/1), ln(6/2), etc. Let’s look at a visualization for these high tf-idf words.
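To see where values like ln(6/1) and a tf-idf of exactly zero come from, here is a small Python sketch of the calculation with tf as n/total and idf as the natural log of documents over documents containing the term. The function name and the three toy "documents" are invented for illustration and have nothing to do with the actual novels.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a document name to its list of tokens.

    Returns {(name, word): (tf, idf, tf_idf)}.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document
    result = {}
    for name, tokens in docs.items():
        total = len(tokens)
        for word, n in Counter(tokens).items():
            tf = n / total                            # term frequency
            idf = math.log(n_docs / doc_freq[word])   # natural log, as above
            result[(name, word)] = (tf, idf, tf * idf)
    return result


toy_docs = {
    "a": ["the", "elinor", "the"],
    "b": ["the", "emma"],
    "c": ["the", "emma", "anne"],
}
scores = tf_idf(toy_docs)
```

In this toy corpus "the" appears in every document, so its idf is ln(3/3) = 0 and its tf-idf vanishes no matter how often it occurs, while "elinor" appears in only one document and gets the largest possible idf, ln(3/1).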
Let’s look at the novels individually.
Still all proper nouns! These words are, as measured by tf-idf, the most important to each novel and most readers would likely agree.
Let’s work with another corpus of documents, to see what terms are important in a different set of works. In fact, let’s leave the world of fiction and narrative entirely. My background is in physics, so let’s download some classic physics texts from Project Gutenberg and see what terms are important in these works, as measured by tf-idf. Let’s download Discourse on Floating Bodies by Galileo Galilei, Treatise on Light by Christiaan Huygens, Experiments with Alternate Currents of High Potential and High Frequency by Nikola Tesla, and Relativity: The Special and General Theory by Albert Einstein.
This is a pretty diverse bunch. They may all be physics classics, but they were written across a 300-year timespan, and some of them were first written in other languages and then translated to English. Perfectly homogeneous these are not, but that doesn’t stop this from being an interesting exercise!
Here we see just the raw counts, and of course these documents are all very different lengths. Let’s go ahead and calculate tf-idf.
Nice! Let’s look at each text individually.
Very interesting indeed. One thing I saw here that I wanted to understand was what was going on with “gif” in the Einstein text?!
Some cleaning up of the text might be in order. The same thing is true for “eq”, obviously here. “K1” is the name of a coordinate system for Einstein:
Also notice that in this line we have “co-ordinate”, which explains why there are separate “co” and “ordinate” items in the high tf-idf words for the Einstein text. “AB”, “RC”, and so forth are names of rays, circles, angles, and so forth for Huygens.
Let’s remove some of these less meaningful words to make a better plot to end on.
I feel like we don’t hear enough about ramparts or things being ethereal in physics today.
Other notable new functionality in tidytext 0.1.1 includes the ability to tidy LDA objects and approach topic modeling using tidy data principles; check out the topic modeling vignette that is included in the new release for a sad tale of a vandal breaking into a library and tearing apart books. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!
Term Frequency and tf-idf Using Tidy Data Principles was originally published by Julia Silge at data science ish on June 27, 2016.
]]>