Strata World x DSSG - The Data Science Revolutions

This December, we are proud to close the year with our biggest meetup yet, at Strata x Hadoop SG! We have invited an all-star team of Strata speakers to share real-world applications of data science. It is a free event, open to all (no need for Strata x Hadoop tickets).

Agenda

• 6.45pm-7pm: Networking session.

• 7.00pm-7.20pm: PyTextRank: Graph algorithms for enhanced natural language processing by Paco Nathan (O’Reilly Media). In this talk, Paco illustrates PyTextRank use cases in media and learning to enable semisupervised word sense disambiguation, move from natural language parsing to natural language understanding, and implement AI-based video search and approximation algorithms for content recommendation based on semantic similarity.

• 7.20pm-7.40pm: From Peer Review to Production: Implementing Research Papers in the Wild by Arwen Griffioen & Chris Hausler (Zendesk). When researching new approaches to an industrial problem, we turn to recently published scientific papers for ideas. However, applying the same model to our real-world data rarely yields the same quality of results. This is often the case when trying small improvements to an existing approach: we find that adding these small variants to our existing models doesn’t produce the boost that was seen in the research setting. Why is that? What is happening? We will discuss the differences we found between academic and industrial research during the development of Answer Bot.

• 7.40pm-8pm: Considerations for Data at Speed by Ted Malaska (Blizzard). There are a lot of details that go into an IoT big data system: what is a respectable latency until data access, how to handle multiple regions, where to store the data, how to know what data you have, and where stream processing fits in. In this session, Ted will walk through his experiences and lessons learned from seeing implementations in the wild.

• 8.00pm-8.20pm: Magenta by Wolff Dobson (Google). Magenta is a Google Brain project, built on TensorFlow, to ask and answer the questions, “Can we use machine learning to create compelling art and music? If so, how? If not, why not?”

• 8.20pm-8.40pm: Deep Learning on Anonymized Datasets by Yufeng Guo (Google). As more data is collected, privacy becomes an increasingly important consideration in machine learning. Bringing value to users while maintaining privacy can be a challenge, but done right, it can strike an amazing balance between privacy and machine learning insight. In this talk, we’ll cover some of the current techniques available to do this, and briefly showcase one set of techniques for performing deep learning on an anonymized dataset.

• 8.40pm-9pm: R for Everything by Jared P. Lander (Lander Analytics). Everyone knows I love R. So much that I never want to leave the friendly environs of R and RStudio. Want to download a file? Use download.file. Want to create a directory? Use dir.create. Sending an email? gmailr. Using Git? git2r. Building this slideshow? rmarkdown. Writing a book? knitr. Let’s take a look at everyday activities that can be done in R.

• 9pm-9.30pm: Networking session.

About the speakers

Paco Nathan leads the Learning Group (https://www.oreilly.com/learning) at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark (http://spark.apache.org/), Apache Mesos (http://mesos.apache.org/), and Cascading (http://www.cascading.org/). Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners (http://amplifypartners.com/) and was cited in 2015 as one of the top 30 people in big data and analytics (http://www.kdnuggets.com/2015/02/top-30-people-big-data-analytics.html) by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Chris Hausler leads the data science team at Zendesk, a role he describes as turning lots of data into magic, which he does with the help of machine learning, Python, Hadoop, graphs galore, and amazing colleagues. Over his career, he’s held the titles of data scientist, data engineer, researcher, PhD student, consultant, and programmer.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Wolff Dobson is a developer programs engineer at Google specializing in machine learning and games. Previously, he worked as a game developer, where his projects included writing AI for the NBA 2K series and helping design the Wii Motion Plus. Wolff holds a PhD in artificial intelligence from Northwestern University.

Yufeng Guo is a developer advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning.

Jared P. Lander is the Chief Data Scientist of Lander Analytics, author of R for Everyone and an adjunct professor at Columbia University. With an M.A. from Columbia University in statistics and a B.S. from Muhlenberg College in mathematics, he has experience in both academic research and industry. His writings on statistics can be found at jaredlander.com and his work has been featured in publications such as Forbes and the Wall Street Journal.
