For the best quality, the general consensus is to shoot in the D-Log color profile. This will produce a rather bland, washed-out result, but it also gives you the most flexibility in post-processing. Other settings (e.g. "vivid") essentially do the post-processing in-camera.
Look up some videos on "color grading" and you will see how many different "looks" you can get by grading your footage.
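The term "cinematic" is generally used to mean a "film" look, which is the opposite of the sharp photographic frames these cameras can produce. When you shoot at a very high shutter speed, each frame is very sharp with plenty of detail. This is great for photography, but considered (by some) bad for video. Played back, the motion looks choppy and stuttery instead of smooth. (People sometimes call this the "soap opera effect," though that term more properly describes the overly smooth look of high frame rates or TV motion smoothing.)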
The rule of thumb (often called the "180-degree shutter rule") is to set your shutter speed to one over double your frame rate. Shooting at 1080p60? Aim for a shutter speed of 1/120 sec. Shooting at 1080p30? Aim for 1/60 sec. That will produce some "motion blur" on moving objects in the frame. When played back as video, it will look more like film and thus more "cinematic."
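If it helps to see the arithmetic, here is a minimal sketch of the rule of thumb above. The function name target_shutter is just made up for illustration, and the result is the ideal value; your camera may only offer the nearest step (for example 1/50 instead of 1/48 at 24 fps).

```python
from fractions import Fraction

def target_shutter(frame_rate):
    """Shutter time in seconds for the 'double your frame rate' rule of thumb.

    frame_rate is a whole number of frames per second (hypothetical helper,
    not part of any camera software).
    """
    return Fraction(1, 2 * frame_rate)

if __name__ == "__main__":
    for fps in (24, 30, 60):
        print(f"{fps} fps -> aim for {target_shutter(fps)} sec")
```

So 30 fps gives 1/60 sec and 60 fps gives 1/120 sec, matching the examples above.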
Getting slower shutter speeds is difficult to impossible in bright conditions. That is where ND (neutral density) filters come in. They cut the amount of light reaching the sensor, which lets you use a slower shutter without overexposing.
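To get a rough idea of how strong an ND filter you would need, you can count the stops between the shutter speed the camera wants in bright light and the one you are aiming for. This is only a sketch: the function names are made up, the 1/1000 sec metered value is a hypothetical example, and it assumes ISO and aperture stay fixed. It uses the common labeling where ND2 cuts 1 stop, ND4 cuts 2 stops, ND8 cuts 3 stops, and so on.

```python
import math

def nd_stops_needed(metered_denominator, target_denominator):
    """Stops of light to cut to slow the shutter from 1/metered to 1/target sec.

    Assumes ISO and aperture are left unchanged (illustrative helper).
    """
    return math.log2(metered_denominator / target_denominator)

def nearest_nd_filter(stops):
    """Round to a whole stop and return the usual label (ND2 = 1 stop, ND4 = 2, ...)."""
    whole = max(0, round(stops))
    return f"ND{2 ** whole}" if whole else "no filter"

if __name__ == "__main__":
    # Hypothetical case: camera meters 1/1000 sec, you want 1/60 sec for 30 fps.
    stops = nd_stops_needed(1000, 60)
    print(f"about {stops:.1f} stops, closest filter: {nearest_nd_filter(stops)}")
```

In that made-up case the gap is roughly 4 stops, so an ND16 would get you close.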
ND filters are good, but they aren't completely necessary. I would suggest starting off by shooting in D-Log and learning to color grade your footage. When you have a better handle on that process, you can work on exposure settings with ND filters.