I... am... just baffled at why Docker is used here. Are any of these not normal Unix utilities? What dependencies are you pulling in beyond GNU/Linux... which is where CONTAINERS come from? I'm only even putting "GNU" here because it's literally every fucking Linux distro on the planet that contains the utilities referenced. Is it commonplace nowadays to bloat a Docker image with text data or something? Is Docker only for the csvlookup tool?
Not trying to be a supermassive asshole, but I do feel like I'm losing my mind reading this... In order to use GNU/Linux one must install Linux, set up a container runtime, and then! Use the same utilities already available to them in the runtime... from a container?
I feel like if this is the state of a tutorial for someone new to these utilities we're all pretty fucked and should probably take a step back to re-evaluate the mess we've made. I'll go crawl back into my hole now to do just that.
Some programs have different flags and behaviors between macOS and Linux. This has bitten me quite a few times. Even on Linux there are often competing versions (the BSD and GNU versions might have different flags, for example).
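A classic example of this divergence is sed's in-place flag: GNU sed treats the backup suffix for `-i` as optional, while BSD sed (the macOS default) requires it as a separate argument. A minimal sketch, with a hypothetical file name:

```shell
# GNU sed (most Linux distros): -i takes an optional suffix, so this works
sed -i 's/foo/bar/' file.txt

# BSD sed (macOS default): -i requires an explicit (possibly empty) suffix,
# so the same edit needs an extra '' argument
sed -i '' 's/foo/bar/' file.txt
```

Running the GNU form on a Mac fails with an error about an invalid command, which is exactly the kind of cross-platform surprise being described.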
If you are interested in platform independence, why even bother with the Unix utilities? You can accomplish the same with Python, have results that are more readable to the non-Unix crowd, and not rely upon sometimes esoteric knowledge of the Unix utilities (and the inconsistencies between vendors).
That isn't to say that the Unix utilities are without merit. I have used them extensively for similar tasks. That being said, it is in environments where everyone used some version of Unix (typically Linux, Solaris, or Mac OS X).
Docker 'can' be used as a way to provide a shell which has a consistent set of programs installed. (As a bonus, using Docker this way lets you avoid 'polluting' the host OS with various packages.)
Sure, in this case, where it's just coreutils + csvlookup, the benefit of the added abstraction isn't so great. Using Docker like this makes more sense when there are more dependencies. (Maybe it's just easier to say "use Docker" than to explain how to install a dependency on different OSes?)
In general, I think it's fine for people to be able to solve problems without having a full understanding of the mechanisms involved in getting the tool to work. If they need to know it later, they'll learn it then. This is a top-down approach -- But, naturally, there'll be some awkward solutions on the way.
No. I'm sorry, but I disagree in this case. These tools are literally installed on every POSIX machine on the planet. The whole point here is that there are no dependencies; this is the "system." It's like trying to explain how an engine works by only looking at the turbo: "Oh yeah, gotta have a turbo or your car won't drive." ... You mean I gotta have an engine or I can't even use a turbo?
The whole point of my post is that running a container orchestration system to accomplish any of this, for someone new, is insane. It is absolutely 100% only adding complexity and making things more confusing. How many people will not realize these utilities are on every computer they use? How long until someone poisons these repackaged, uselessly opaque versions? This makes things easier for no one and serves only to complicate things by adding unnecessary complexity.
If someone is on a Mac, then they are now running a Linux VM, then running containers, then logging into said fucked-up environment and then finally... running... awk?
macOS -> Linux -> container runtime -> SSH -> finally use awk, sed, head, or anything else. Why stop there? We can probably then start up QEMU from a container, boot macOS, and then simply repeat the process for maximum usefulness.
> Just because the complexity is hidden away behind a pretty interface doesn’t mean it isn’t there.
In the context of making things hard for a tutorial or complex for a beginner, it not being visible is the key point. Just like the complexities of how your CPU actually works aren't relevant for this tutorial.
This is a very strong response to what seems like an obvious point clearly laid out at the start.
This is a tutorial and set of examples, and it comes with a fully working environment with example data set up exactly to work. No difference between gsort and sort, no version issues, no question of whether someone is using brew or MacPorts or anything else. No Python environment problems, no pip vs pip3, or whatever other random crap you need to deal with when trying to do something simple across the varied setups people have.
> This is a very strong response to what seems like an obvious point clearly laid out at the start.
The point is invalid and clearly untrue.
> This is a tutorial and set of examples, and it comes with a fully working environment with example data exactly setup to work. No difference between gsort and sort, no version issues or whether someone is using brew or macports or anything else. No python environment problems or pip Vs pip3 or whatever other random crap you need to deal with when trying to do something simple with lots of varied setups people have.
I really don't see how having to get Docker going alleviates any of the above headaches, but I'm far removed from these tools being "new to me." Setting up virtualization to run a process on my local machine, which I can run without virtualization or containerization in the very same way, seems crazy to me. It seems like an awful amount of complexity to push onto beginners for what is likely to be zero benefit, and it gives them only what they had at the start...
> You are entirely free to not use docker.
>> In order to use GNU/Linux one must
> Nope. You can. You don't have to.
Yeah, that's my point exactly... you don't have to at all; in fact, Docker offers literally nothing but opaque complexity. I was saying you MUST have Linux if you want containers, because containers are a Linux feature and don't exist on macOS, but all them sweet utilities are already there save the weird csv tool. I don't think the complexity of Docker is warranted for simply handling a CSV tool, and I find it to be a disservice to anyone who is learning.
What point is invalid? That it's a prepackaged setup for all users to avoid having these kinds of problems? It's pointed out right at the beginning, and a longer description is linked explaining how the writer hit exactly this kind of problem because macOS has different basic utilities and their sed worked differently.
> I really don't see how having to get Docker going is alleviating any of the above headaches
Because it is a single thing to get working that then leaves all users of your tutorial, on all major platforms, dealing with the exact same environment. Have you ever tried to get end users up to speed only to discover that one is using Python 2.7, another has Anaconda but pip is pointing to the wrong place, pip3 is required for someone else, and one user is on a Mac so needs to install brew and then replace every standard call to sort with gsort? Then dateutils appears as some conflicting dependency...
> Setting up virtualization to run a process on my local machine which I can run without virtualization or containerization the very same way seems crazy to me
Then don't; you don't have to. Go and set up an identical environment. For those who have Docker this becomes a single command to run, and those who don't can get set up quickly. If they're heading into data science, they're likely to benefit from having at least a basic understanding of Docker.
> It seems like an awful amount of complexity to push onto beginners for what is likely to be zero benefit and gives them only what they had at the start...
It doesn't give zero benefit; it removes a whole bunch of differences in setup that are entirely irrelevant to the tutorial.
> Docker offers most literally nothing but opaque complexity.
It offers a standard environment that puts all users in the same place.
> but all them sweet utilities are already there save the weird csv tool
No, they are not: not on a Mac and not on Windows.
> I don't think the complexity of Docker is warranted for simply handling a CSV tool
What complexity? 'docker run --rm -it X bash' is literally all I need to do.
csvlook/csvkit, apparently; something I had never heard of before, and which is actually not installed by default on Ubuntu. He also included the data itself, which is pretty helpful, but which could more usefully have gone into a git repo somewhere.
Also, "kids these days" all reach for Docker. Otherwise any sane person would have just used cat instead of csvlook.
Upvoted the kid's article, because command line text cleaning is of TOWERING IMPORTANCE to any DS person worth their salt. But he is definitely an inexperienced kid.
I think the csvlook utility comes from the Command Line Data Science book the author cites. I read some of that book, and there was good info in there, but ultimately it was a bit annoying because its author based a lot of things on utilities he had written himself (like csvlook), which in many cases just did trivial things you could easily have done with shell commands or coreutils.
It seems lots of people's knowledge of awk is limited to printing fields, and they'll happily chain awk with a bunch of grep and sed when a single awk invocation would do the job without fuss. For instance, TFA uses
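The specific pipeline from the article isn't quoted here, but the general pattern looks something like this (hypothetical log file and field positions):

```shell
# The common habit: filter with grep, then extract a field with awk
grep 'ERROR' app.log | awk '{print $3}'

# The same job as a single awk invocation: the pattern selects the
# lines and the action prints the field, no extra processes needed
awk '/ERROR/ {print $3}' app.log
```

The second form also scales more gracefully: extra conditions, counters, or per-line transformations can be added inside the one awk program instead of bolting more sed/grep stages onto the pipe.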
In an afternoon, you can learn enough to be more than dangerous.
The presenter drafted a set of (easy, yet practical) companion exercises and a part-2 video with his own answers. I cannot recommend this talk enough.
Although I haven't had the opportunity to put the knowledge to use and have since forgotten it (I don't work in tech), I find comfort in knowing I can reacquire the power of awk even sooner than the already short first time around.
For the code here, which I used on an actual dataset (~500 NYT headlines, 1965-1974), the efficiency gains of a Go / Python rewrite are ... slight.
On an early-2015 Android tablet running Termux.
I've thrown multimillion-row datasets at awk (usually gawk, occasionally mawk, nawk on OS X, and, hell, busybox on occasion) without any practical performance issues. I'm virtually always writing for one-off or project-based analyses, not live web-scale realtime processing. A second or even ten won't be missed.
I'm also aware that building a pipeline out of grep / cut / sed / tr / sort / uniq / awk is often conceptually nearer at hand. It almost always mirrors how I start exploring some dataset.
But a quick translation to straight awk gives cleaner code, more power, easier conversion to a script, and access to a small library of awk-based utilities I've accumulated.
All with far-more-than-adequate performance.
We've spent far more time discussing this than coding, let alone running, it.
> For the code here, which I used on an actual dataset (~500 NYT headlines, 1965-1974), the efficiency gains of a Go / Python rewrite are ... slight.
500 records isn't a large dataset. Not even close. Large would be on the order of millions to billions. And yes, I have had to munge datasets that large on many occasions.
> But a quick translation to straight awk gives cleaner code, more power, easier conversion to a script, and access to a small library of awk-based utilities I've accumulated.
- cleaner: only if you find awk readable. Plenty of people don't. Plenty of people find Go or Python more readable.
- more powerful: again, it depends. You wouldn't have multithreading built in, the way pipelines do. And Python and Go are undoubtedly more powerful than awk. I'm not knocking awk here, just being pragmatic.
- easier conversion to a script: at which point you might as well skip awk entirely and jump straight to Go or Python (or any other programming language)
- and access to a small library of awk-based utilities I've accumulated: that only benefits you. If you're having to sell the benefits of awk to someone, then odds are they don't have that small library already to hand ;)
> We've spent far more time discussing this than coding, let alone running, it.
Some of the larger datasets I've had to process have definitely taken longer to munge than my reply here has taken to type :)
Disclaimer: I've honestly not got a problem with awk; I used to use it heavily 20 years ago. But these days its value is diminishing, and a lot of the awk evangelists seem to miss the point of why awk isn't well represented in blog posts any more. It's both more verbose than pipelining to coreutils and less powerful than a programming language -- it's that weird middle ground that doesn't provide much value to most people aside from those who are already invested in the awk language. Some might see that as a loss, but personally I see it as demonstrating the strength of all the other tools we have at our disposal these days.
When you have such a large awk script, you may as well also demonstrate that Awk is a “real scripting language” by putting it into its own script with a `#!/usr/bin/awk -f` shebang line. Much easier to edit than trying to get your shell to do it. (Unless you’re using a fancy editing shell like Emacs’ shell mode; but at that point, why would you need Awk? You probably know far better how to use your own shell-runtime’s text-manipulation primitives.)
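A minimal sketch of the shebang approach described above, with a hypothetical script name:

```shell
# count-fields.awk: a standalone awk script, runnable directly
# once made executable with chmod +x
cat > count-fields.awk <<'EOF'
#!/usr/bin/awk -f
# For every input line, print its field count followed by the line itself
{ print NF, $0 }
EOF
chmod +x count-fields.awk

# Usage: feed it a file or a pipe like any other command
printf 'a b c\n' | ./count-fields.awk
```

Compared with embedding the program in a shell string, the script form avoids quoting headaches and is easy to version-control alongside the rest of a project.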
Totally agree. I think mlr is a wonderful CLI tool. It is very robust, and can handle/convert multiple tabular formats including csv, tsv, json, fixed-format, etc. It has a pretty decent text output formatting, with the --opprint flag.
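A small sketch of the mlr usage mentioned above, assuming Miller (mlr) is installed and using a hypothetical data.csv:

```shell
# Pretty-print a CSV as aligned columns: read CSV in, pprint out
mlr --icsv --opprint cat data.csv

# The same data converted to JSON instead
mlr --icsv --ojson cat data.csv
```

The `cat` here is mlr's own built-in verb (pass records through unchanged), so format conversion is just a matter of swapping the `--o...` flag.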
I used to be very comfortable using awk/sed/perl/sort/uniq/tr/tail/head from the CLI for the sort of data cleaning this article is talking about. However, over the past year I've found I use VisiData https://github.com/saulpw/visidata for interactive work.
If I need to clean up the data first, I'll use mlr or jq as input to Visidata. If my data is too dirty for mlr, then I'll use Unix toolbox tools mentioned as input to mlr, jq or VisiData.
VisiData provides some ability to script, but when possible I prefer to have the shell do the scripting with all the tools mentioned as input to Visidata.
I'll probably catch a lot of flak, but there's a lot of criticism in these comments for an author who actually finished something. It's easy to sit in the cheap seats and cast stones, but this guy made something that works, and you can always make your own if you know better, without tearing down the work of others.