A coworker recently asked how people keep up with everything that’s going on in Big Data, and I ended up writing a lot, which I thought I’d post here.
For reference, I do daily work as a data scientist mostly in Python these days, but I’ve done R and Hadoop in the past.
I see Big Data as a subset of data engineering, which is really just a subset of overall development concepts. So while I try to read articles specific to new data technologies like Spark and Cassandra, I also read as many general development sources as I can to better understand systems.
1) Hacker News - I read Hacker News on a daily basis. It’s the best source for news and commentary on everything going on in the tech world in the United States today. I click specifically on articles related to data and check out the comments to see what practical experience people have. I also try to read about things I’m not familiar with.
For example, right now on the front page, there is a post called “An APL Compiler targeting a Typed Array Intermediate Language”. I have no idea what most of this means, but I’ll click through anyway and check out the source code on GitHub to see how it’s structured. Then I’ll google around to find out what’s special about typed array intermediate languages. Then I’ll check out the GitHub repo author’s homepage. After doing this kind of internet wandering for months upon months, you start to become familiar with the tech scene and the terminology around it.
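If you wanted to automate the "click on articles related to data" part of that habit, a minimal sketch might look like the following. It assumes you already have a list of headline strings from somewhere. The keyword list and function names are made up for illustration, and two of the three sample titles are invented.

```python
import re

# Hypothetical keyword list; swap in whatever topics you care about.
DATA_KEYWORDS = ("spark", "cassandra", "hadoop", "database", "sql", "pandas")


def is_data_related(title, keywords=DATA_KEYWORDS):
    """Return True if any keyword appears as a whole word in the title."""
    words = set(re.findall(r"[a-z0-9]+", title.lower()))
    return any(keyword in words for keyword in keywords)


def filter_titles(titles, keywords=DATA_KEYWORDS):
    """Keep only the titles that mention at least one keyword."""
    return [title for title in titles if is_data_related(title, keywords)]


if __name__ == "__main__":
    # The first title is the one mentioned above; the other two are invented.
    sample_titles = [
        "An APL Compiler targeting a Typed Array Intermediate Language",
        "Tuning Spark jobs for large joins",
        "Show HN: A tiny SQL engine in 500 lines",
    ]
    for title in filter_titles(sample_titles):
        print(title)
```

A filter like this only surfaces the familiar topics, so it would complement the habit of clicking on unfamiliar posts rather than replace it.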
2) Newsletters - I receive several data science newsletters and pick out relevant links to read and cross-reference with Hacker News. A few favorites:
3) Twitter - I find Twitter to be really noisy and not as enjoyable as it used to be even as little as two years ago. But there are still a few people who tweet good links and generate industry discussion:
- @thepracticaldev is one of the best Twitter accounts to get a feel for the zeitgeist of software development today and has lots of good links
- @randal_olson is really good at catching all the hot data science links of the day and at generating a community of commentary around those links
- @b0rk is probably one of the friendliest public faces in tech today, very generous with her knowledge, and contagious in her enthusiasm for the topic.
4) Reddit - A couple of subreddits have discussion that matches the quality of Hacker News, and maybe even exceeds it, because they focus less on abstract ideas and more on actual examples.
- r/learnpython - for nuances you missed the first time around
- r/python - After you’re done learning and have general questions
- r/programming - Sometimes the same posts as Hacker News, but often more technical than startup-y, with broader discussion
- r/statistics - Sometimes good discussions that really get into the nitty gritty
- r/machinelearning - Getting to be a more active community with lots of good links and pointers
- r/cscareerquestions - More for junior devs just starting in their careers, but has a lot of good discussion about salary negotiations, different work environments, etc.
For entertainment as well as education:
- r/sysadmin and r/talesfromtechsupport - The best groan-worthy stories online. Also you learn a lot about how not to do devops.
- r/programmerhumor - Self-explanatory
5) arXiv - A popular repository of freely accessible academic papers. It has sections that list the latest papers that have been published or are close to publication in a particular domain of knowledge. I check out what’s new under the fields of Databases and Statistics, and under keywords relevant to data engineering systems.
6) I’m part of a data engineering Slack where people talk about issues they’ve had at their company and how they’re solving them.
7) I go to local meetups and ask what people are using in their big data/data engineering stacks. Two really good ones I went to recently were Papers We Love Philly and the Philly Area Scala Enthusiasts lecture on DataFrames in Spark. I also love DataPhilly and PhillyPUG and attend as much as family life allows.
I also talk to my friends in the industry. What are they using? What are they not doing? What’s in? What’s out? Which vendors are they evaluating?
A helpful but hard way to find your way around the market is to go on job interviews. When I was interviewing last year, I got to know a LOT of local company tech stacks and common problem-solving patterns. I recommend this if you’re actually looking for a job…otherwise, informational interviews could work.
8) I subscribe to mailing lists of development projects I’m interested in. For example, I’m on the Spark mailing list, as well as the scikit-learn one. I’ve learned a lot about people’s various use cases and issues from these mailing lists, as well as common architecture patterns. I archive them all in my Gmail so I can reference them later.
9) GitHub - I surf it from time to time to see what’s popular, and to see how to write good code and code documentation. It’s particularly useful to see how to structure specific blocks of code in real live projects.
10) Stack Overflow - Excellent source for answers, but I also sometimes just browse Cross Validated, the stats sister site, to see what’s popular and what people are answering.
11) O’Reilly and Manning emails - I bought books from them at one point and still get the emails. The latest book announcements are a good signal of what tech is in demand. They also sometimes have interesting free webinars.
12) Podcasts - I’ll binge-listen to a bunch every few months. There are starting to be some really high-quality data-driven podcasts out there. The trick is finding ones where the creators have a good back-and-forth and get into the tech right away as opposed to chatting about the weekend for 40 minutes.
13) KDnuggets - I don’t always read it, but I’ll skim from time to time.