www.webdeveloper.com

Thread: Free way to publish crawlable txt/doc/pdf files on the internet, organized in folders?

  1. #1
    Join Date
    Nov 2017
    Posts
    4

    Question: Free way to publish crawlable txt/doc/pdf files on the internet, organized in folders?

    Hello,

    What is a free way to publish txt/doc/pdf files on the internet, organized in folders, so that they can be crawled?

    Basically, I want to publish such a folder of txt/doc/pdf files on the internet for free, and I want them to be crawlable by Googlebot.

    Thank you

  2. #2
    Join Date
    Mar 2007
    Location
    localhost
    Posts
    5,425
    You put the files up, you tell your web host to submit your site to the search engines... Job Done.
    --> JavaScript Frameworks like JQuery, Angular, Node <--
    ... and please remember to wrap code with forum BBCode tags:-

    [CODE]...[/CODE] [HTML]...[/HTML] [PHP]...[/PHP]

    If you can't think outside the box, you will be trapped forever with no escape...

  3. #3
    Join Date
    Nov 2017
    Posts
    4
    Quote Originally Posted by \\.\ View Post
    You put the files up, you tell your web host to submit your site to the search engines... Job Done.
    Thank you. Since I mentioned "free", and so as not to open another thread (this is related):

    does anyone know of a free host where I could lodge my 21 GB bundle of documents to be crawled by Google?

    P.S. After inquiring around some more, I also found the relevant info that free hosts usually don't allow server-settings changes; if they did, I would have to enable directory indexing and fiddle with robots.txt.

  4. #4
    Join Date
    Mar 2012
    Posts
    3,961
    Hi and welcome to the site. The advice above is correct, but it might help to clarify a bit:

    1. You need hosting whether you build a conventional web site, or just dump a load of files in the root. You may be able to find free hosting, but it may have conditions (such as allowing ads to be displayed).

    2. As far as I'm aware, the bots will only search the root for such files. If you want the files to be in an organised structure of folders, to get the bots to index the files you need to either:

    a) Provide a file "index.html" in the root, with links to the files in the respective folders. The file will need to be in HTML format, as it will act as a basic one-page site. Or...

    b) Provide a file "sitemap.txt" in the root, which lists the folders that you want the bots to crawl.
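    A minimal example of option (a): an "index.html" in the root that links into the folders. The folder and file names below are made up:

```html
<!DOCTYPE html>
<html>
<head><title>Document index</title></head>
<body>
  <h1>Documents</h1>
  <ul>
    <li><a href="reports/summary.pdf">reports/summary.pdf</a></li>
    <li><a href="notes/todo.txt">notes/todo.txt</a></li>
    <li><a href="manuals/setup.doc">manuals/setup.doc</a></li>
  </ul>
</body>
</html>
```

    For option (b), note that the plain-text sitemap format Google accepts lists full URLs, one per line, rather than folder names.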

  5. #5
    Join Date
    Nov 2017
    Posts
    4
    Quote Originally Posted by jedaisoul View Post
    Hi and welcome to the site. The advice above is correct, but it might help to clarify a bit:
    ...

    2. As far as I'm aware, the bots will only search the root for such files. If you want the files to be in an organised structure of folders, to get the bots to index the files you need to ...

    b) Provide a file "sitemap.txt" in the root, which lists the folders that you want the bots to crawl.
    Thank you, that goes straight to the point of maintaining a published, crawlable folder structure.

    Now, for that: does anyone know of an automatic robots.txt generator that works from files and folders?

    P.S. The point is that if someone searches for something in Google and my file appears relevant, they would get a link to the file, and perhaps to the site. Later I may want Google to somehow indicate that it's a listing of files, not just one file.

    Another point is that the same person, or someone else who finds my files relevant, should also be able to orient themselves easily using the folder structure, which provides tag-like info about the files.

  6. #6
    Join Date
    Mar 2007
    Location
    localhost
    Posts
    5,425
    When you have your host submit your site to the search engines, your site is crawled, and as long as those files are linked from pages, the bot will find them.

  7. #7
    Join Date
    Mar 2012
    Posts
    3,961
    Robots.txt will not help. It only contains exclusions, i.e. what to ignore. Sitemap.txt lists files to include that the bots would otherwise miss, e.g. files that are not linked from index.html or its children.
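    To illustrate the distinction, a sketch of a robots.txt (the URLs and folder names are placeholders):

```
# robots.txt -- exclusions only; also a conventional place to point at the sitemap
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.txt
```

    A plain-text sitemap.txt then contains nothing but full URLs, one per line (e.g. https://example.com/reports/summary.pdf); comments are not allowed in that file.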

  8. #8
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    22,214
    Make a public GitHub repo of your files?
    "Well done....Consciousness to sarcasm in five seconds!" ~ Terry Pratchett, Night Watch

    How to Ask Questions the Smart Way (not affiliated with this site, but well worth reading)

    My Blog
    cwrBlog: simple, no-database PHP blogging framework

  9. #9
    Join Date
    Nov 2017
    Posts
    4
    In the meantime I've gathered this useful info:

    [08:45] <daemon> stick em all on a directory with index enabled
    [08:45] <daemon> and in robots.txt allow google
    [09:46] <angry_pidgeon> daemon, thanks, will users be able to browse around the folders in any GUI way? I seem to remember browsing directories of files but don't know if they were crawlable
    [09:47] <daemon> yes
    [09:48] <daemon> http://ftp.freebsd.org/pub/FreeBSD/
    [09:48] <daemon> here is an example of one of them
    [09:48] <daemon> nginx and apache have their own modules for doing it
    [09:48] <daemon> index fancyindex ... plus others
    [10:04] <pokk_> What I'd do is find something that can generate a static index for your files, I'm sure there's a node.js/npm modules to generate such index
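    The nginx module daemon mentions is "autoindex"; a minimal server block enabling it might look like this (server name and paths are hypothetical):

```
server {
    listen 80;
    server_name example.com;
    root /var/www/docs;

    location / {
        autoindex on;               # generate browsable directory listings
        autoindex_exact_size off;   # show human-readable file sizes
    }
}
```

    The Apache equivalent is "Options +Indexes" (mod_autoindex); "fancyindex" is a third-party nginx module that produces themed listings.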
    Also, Google can crawl PDF files:
    https://www.thewebmaster.com/seo/201...ther-web-page/

    Also, I found an index generator based on the file/folder hierarchy (Google: site index generator from directory):
    https://www.google.com/search?client...fe_rd=cr&dcr=0

    And perhaps a useful tool for a menu (site index generator from file folders):
    https://www.google.com/search?client...fe_rd=cr&dcr=0
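    As a sketch of pokk_'s suggestion above (generate a static index from the directory tree), here is a small script; it is written in Python rather than node.js, and it is not any specific existing tool:

```python
import html
import os

def build_index(root, out_name="index.html"):
    """Walk `root` and write a simple index.html linking every file,
    grouped by folder, so a crawler can reach files in subdirectories."""
    lines = ["<!DOCTYPE html>",
             "<html><head><title>File index</title></head><body>"]
    for dirpath, _dirnames, filenames in os.walk(root):
        # Use forward slashes in links regardless of OS
        rel = os.path.relpath(dirpath, root).replace(os.sep, "/")
        names = sorted(n for n in filenames if n != out_name)
        if not names:
            continue
        lines.append("<h2>%s</h2>" % html.escape(rel))
        lines.append("<ul>")
        for name in names:
            href = name if rel == "." else "%s/%s" % (rel, name)
            lines.append('<li><a href="%s">%s</a></li>'
                         % (html.escape(href, quote=True), html.escape(name)))
        lines.append("</ul>")
    lines.append("</body></html>")
    with open(os.path.join(root, out_name), "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```

    Upload the generated index.html to the document root and the crawler can follow its links into every folder.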

    P.S. GitHub has a 1 GB space limit

  10. #10
    Join Date
    Mar 2007
    Location
    localhost
    Posts
    5,425
    If you have directory listing set to ON, then anyone who can read your site code can start poking around your folders.

    If you have directory listing set to OFF, then no one can see any folders other than what is visible in URLs; so if you have a folder with sub-folders, the only thing that is exposed is the path, not the folder contents.

    You can also change the properties of the folders so that files are not visible to external viewers but remain accessible to the server for serving up. That makes the files impossible for the public to see, even with directory listing turned on: if file permissions are set on a folder, they take priority over a general directive to list directory contents.

    robots.txt won't really help; it is a voluntary code of practice, and if a web bot wants to crawl your site, it can ignore anything requested to be ignored. The only real fireproof method is to set directory permissions on the folders and files you want secured, turn off directory listing, and expose as little to the outside world as possible.

    Use .htaccess rewrite rules to monitor for direct file and directory access and redirect that action as you see fit.
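    A sketch of that idea in .htaccess form (Apache, assuming mod_rewrite is enabled; the folder name is made up):

```
# Turn off directory listing
Options -Indexes

RewriteEngine On
# Redirect any direct request into /private/ back to the home page
RewriteRule ^private/ / [R=302,L]
```

    The [R=302,L] flags send a temporary redirect and stop further rule processing; you could instead return 403 with the F flag.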

  11. #11
    Join Date
    Mar 2007
    Location
    localhost
    Posts
    5,425
    ... also... Use the HTML5 video player; it's built in and does not require an out-of-date Flash media player to be installed. I am not sure which browsers support Flash these days; the last I remember was the big security blunder, after which people dropped Flash Player and stopped using Flash because of the security risk it poses.
