Extract URLs from unformatted text

2

I have only found examples of extracting substrings from formatted text such as HTML files, but in my case I need to output a list of URLs, for example:

... 
https://twitter.com/user1/status/xyza 
https://twitter.com/user1/status/xyzb
https://twitter.com/user1/status/xyzc
https://twitter.com/user2/status/xyza
https://twitter.com/user2/status/xyzb
...

from a very large (100+ MB) unstructured file. This is what my input looks like:

n          3\n        \n      \n  \n    \n      \n      Retweeted\n    \n      \n        \n          3\n        \n      \n  \n\n      \n  \n    \n      \n        \n      \n      Like\n    \n      \n        \n          5\n        \n      \n  \n    \n      \n        \n      \n      Liked\n    \n      \n        \n          5\n        \n      \n  \n\n      \n\n        \n    \n  \n      \n        \n        More\n      \n  \n  \n  \n    \n    \n  \n  \n    \n      \n        Copy link to Tweet\n      \n      \n        Embed Tweet\n      \n        \n  \n\n\n\n\n  \n\n    \n\n      \n\n      \n        \n  \n    \n      \n  \n\n      \n    \n\n  \n\n\n      \n\n\n    \n      \n          \n\n    \n        \n          \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n        \n        \n  \n    \n  \n      \n\n    \n        \n\n    \n\n          Back to top ↑\n\n  \n\n\n    \n  \n    \n  \n\n\n  \n\n\n    \n  \n    Loading seems to be taking a while.\n    \n      Twitter may be over capacity or experiencing a momentary hiccup. 
Try again or visit Twitter Status for more information.\n    \n  \n\n\n\n      \n    \n  \n\n      \n    \n\n\n\n\n\n  \n\n\n  \n    \n      Suggested by Twitter\n      \n        \n      \n    \n   \n\n    \n  \n    \n  \n    \n    false\n  \n  \n    \n    \n  \n\n  \n\n\n\n  \n      \n  \n    \n      \n        © 2015 Twitter\n        About\n        Help\n        Terms\n        Privacy\n        Cookies\n        Ads info\n      \n    \n  \n\n\n  \n\n\n\n      \n    \n  \n\n\n    \n  \n  \n\n\n\n    \n    \n  \n\n  \n\n  \n\n    \n  \n\n  \n    \n\n\",\"meta_tags\":[{},{\"content\":\"0; URL=https://mobile.twitter.com/i/nojs_router?path=%2FTerriBauman%2Fstatus%2F680996161843380224\"},{\"name\":\"robots\",\"content\":\"NOODP\"},{\"name\":\"msapplication-TileImage\",\"content\":\"//abs.twimg.com/favicons/win8-tile-144.png\"},{\"name\":\"msapplication-TileColor\",\"content\":\"#00aced\"},{\"name\":\"swift-page-name\",\"content\":\"permalink\"},{\"content\":\"article\"},{\"content\":\"https://twitter.com/TerriBauman/status/680996161843380224\"},{\"content\":\"Terri Bauman on Twitter\"},{\"content\":\"https://pbs.twimg.com/media/BcaVtMKCEAAyz9f.jpg:large\"},{\"content\":\"true\"},{\"content\":\"“Social Media Jobs: https://t.co/NDDK4WaRA4 Please Retweet to spread words #OnlineJobs 
#Jobs”\"},{\"content\":\"Twitter\"},{\"content\":\"2231777543\"}],\"links\":[\"https://twitter.com/\",\"https://twitter.com/about\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/#supported_languages\",\"https://twitter.com/?lang=id\",\"https://twitter.com/?lang=msa\",\"https://twitter.com/?lang=cs\",\"https://twitter.com/?lang=da\",\"https://twitter.com/?lang=de\",\"https://twitter.com/?lang=en-gb\",\"https://twitter.com/?lang=es\",\"https://twitter.com/?lang=fil\",\"https://twitter.com/?lang=fr\",\"https://twitter.com/?lang=it\",\"https://twitter.com/?lang=hu\",\"https://twitter.com/?lang=nl\",\"https://twitter.com/?lang=no\",\"https://twitter.com/?lang=pl\",\"https://twitter.com/?lang=pt\",\"https://twitter.com/?lang=ro\",\"https://twitter.com/?lang=fi\",\"https://twitter.com/?lang=sv\",\"https://twitter.com/?lang=vi\",\"https://twitter.com/?lang=tr\",\"https://twitter.com/?lang=el\",\"https://twitter.com/?lang=ru\",\"https://twitter.com/?lang=uk\",\"https://twitter.com/?lang=he\",\"https://twitter.com/?lang=ar\",\"https://twitter.com/?lang=fa\",\"https://twitter.com/?lang=mr\",\"https://twitter.com/?lang=hi\",\"https://twitter.com/?lang=bn\",\"https://twitter.com/?lang=gu\",\"https://twitter.com/?lang=ta\",\"https://twitter.com/?lang=kn\",\"https://twitter.com/?lang=th\",\"https://twitter.com/?lang=ko\",\"https://twitter.com/?lang=ja\",\"https://twitter.com/?lang=zh-cn\",\"https://twitter.com/?lang=zh-tw\",\"https://twitter.com/login\",\"https://twitter.com/account/begin_password_reset\",\"https://tw
itter.com/signup\",\"https://twitter.com/TerriBauman\",\"https://pbs.twimg.com/profile_images/598412523734310913/t3ettYkj.jpg\",\"https://pbs.twimg.com/profile_images/598412523734310913/t3ettYkj.jpg\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/hashtag/Entrepreneur?src=hash\",\"https://twitter.com/hashtag/SocialMediaExpert?src=hash\",\"https://twitter.com/hashtag/SocialMediaMarketer?src=hash\",\"https://twitter.com/hashtag/BusinessOwner?src=hash\",\"https://twitter.com/hashtag/InternetMarketer?src=hash\",\"https://twitter.com/hashtag/SocialMediaJobs?src=hash\",\"https://t.co/ZciT91kZwP\",\"https://twitter.com/about\",\"http:////support.twitter.com\",\"https://twitter.com/tos\",\"https://twitter.com/privacy\",\"http:////support.twitter.com/articles/20170514\",\"http:////support.twitter.com/articles/20170451\",\"https://twitter.com/#\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"http://support.twitter.com/forums/2681
0/entries/78525\",\"http:////dev.twitter.com/docs/embedded-tweets\",\"http:////dev.twitter.com/docs/embedded-tweets\",\"https://twitter.com/account/begin_password_reset\",\"https://twitter.com/signup\",\"https://twitter.com/signup\",\"https://twitter.com/login\",\"http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code\",\"https://twitter.com/TerriBauman/status/680996164058001408\",\"https://twitter.com/TerriBauman/status/680977383365578752\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://t.co/NDDK4WaRA4\",\"https://twitter.com/hashtag/OnlineJobs?src=hash\",\"https://twitter.com/hashtag/Jobs?src=hash\",\"https://t.co/SJvkM1yWUI\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/cakafete\",\"https://twitter.com/KassemAlYateem\",\"https://twitter.com/Worldspacetech1\",\"https://twitter.com/ElisaBW\",\"https://twitter.com/patrickarrelle\",\"https://twitter.com/AcousticsPro1\",\"https://twitter.com/#\",\"http://status.twitter.com\",\"https://twitter.com/about\",\"http:////support.twitter.com\",\"https://twitter.com/tos\",\"https://twitter.com/privacy\",\"http:////support.twitter.com/articles/20170514\",\"http:////support.twitter.com/articles/20170451\"]}"},{"url":"http://status.twitter.com/page/2","result":"{\"date_crawled\":\"2015-12-27T10:01:58Z\",\"title\":\"Twitter Status\",\"lossyHTML\":\"\n\n\r\n\r\n    \r\n        \r\n        \r\n        \r\n        \r\n            \r\n        \r\n        \r\n        \r\n        \r\n        \r\n        \r\n        \r\n        \r\n        \r\n        \r\n        \r\n        \r\n                \r\n        \r\n\r\n        \r\n        Twitter Status\r\n        \n\r\n        \r\n         \r\n\r\n        \r\n\r\n    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n    \r\n\r\n\r\n\r\n\r\n        \r\n\r\n\r\n\r\n    \r\n    \r\n        \r\n            \r\n       
         \r\n                    Updates on the status of the Twitter service.\r\n\r\n\r\n\r\n\r\nRelated Links\r\nOfficial Company Blog\r\n\r\nOfficial Help Documents\r\n\r\nDeveloper Community\r\n\r\n\r\n\r\n                    Archive\r\n\r\n\r\n\r\n \r\n                    Powered by Tumblr\r\n                \r\n\r\n                \r\n            \r\n            \r\n\r\n\r\n            \r\n                \r\n                    \r\n       

I have been trying:

grep 'https://' input.txt | grep 'status' >> output.txt

I have seen examples that use sed and awk, but besides being extremely hard to understand, they are almost always based on selecting columns, which is not possible in my case.

    
by J. Bend 27.12.2015 / 16:59

1 answer

3

Try this with GNU grep to get URLs with exactly two slashes after the colon:

grep -o 'http[s]*://[^/][^\]*' file

URLs with two or more slashes:

grep -o 'http[s]*://[^\]*' file
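The difference between the two patterns can be seen on a short sample. This is only a sketch: the sample string and the variable names are made up for illustration, modeled on the escaped JSON in the question, and GNU grep is assumed.

```shell
# Hypothetical sample mimicking the escaped JSON from the question.
sample='\"links\":[\"https://twitter.com/about\",\"http:////support.twitter.com\"]'

# Pattern 1: requires a non-slash character right after "://",
# so the malformed "http:////..." entry is skipped.
two_slashes=$(printf '%s\n' "$sample" | grep -o 'http[s]*://[^/][^\]*')

# Pattern 2: accepts any characters after "://", including extra slashes.
two_or_more=$(printf '%s\n' "$sample" | grep -o 'http[s]*://[^\]*')

printf 'pattern 1:\n%s\n' "$two_slashes"
printf 'pattern 2:\n%s\n' "$two_or_more"
```

With -o, grep prints each match on its own line instead of the whole input line, which is what makes this usable on input that has no meaningful line structure.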

Recommended reading: Stack Overflow Regular Expressions FAQ

[s]*: the star quantifier (*) means that the preceding expression can match zero or more times. Here the preceding expression is a character class (marked with brackets) that contains only an s, so it is simpler to write s*.

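[^\]*: matches any character except a backslash, zero or more times. In a POSIX bracket expression, as used by grep, the backslash is literal, so [^\] works as written. Writing it as [^\\] also works in grep and avoids confusion in regex flavors where a single backslash would escape the ].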
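To get the list the question asks for (only the status URLs, each one once), the grep -o output can be filtered and deduplicated. The following is a minimal sketch, not part of the original answer: the function name extract_status_urls is hypothetical, and it assumes every URL of interest is followed by a backslash, as in the sample input.

```shell
# Hypothetical helper: reads text on stdin, prints unique status URLs.
extract_status_urls() {
  grep -o 'http[s]*://[^/][^\]*' |  # one URL per output line
    grep '/status/' |               # keep only tweet permalinks
    sort -u                         # drop duplicates
}
```

It could then be called as extract_status_urls < input.txt > output.txt. Because sort -u only sees the extracted URLs, not the full 100 MB input, the deduplication step stays cheap.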

    
by 27.12.2015 / 17:37